Removed the resets to INT_MAX in `etc/tune.c` #453

czurnieden · 2019-11-07T20:55:02Z

Restarted the work on Toom-Cook 4- and 5-way and got an irritating first result from tune.c:

#define MP_DEFAULT_MUL_KARATSUBA_CUTOFF 101
#define MP_DEFAULT_SQR_KARATSUBA_CUTOFF 128 <- MP_MAX_COMBA/2 (60 bit)
#define MP_DEFAULT_MUL_TOOM_CUTOFF      132
#define MP_DEFAULT_SQR_TOOM_CUTOFF      128 <- MP_MAX_COMBA/2 (60 bit)
#define MP_DEFAULT_MUL_TOOM_4_CUTOFF    210
#define MP_DEFAULT_SQR_TOOM_4_CUTOFF    128 <- MP_MAX_COMBA/2 (60 bit)
#define MP_DEFAULT_MUL_TOOM_5_CUTOFF    244
#define MP_DEFAULT_SQR_TOOM_5_CUTOFF    128 <- MP_MAX_COMBA/2 (60 bit)

Switching off COMBA gave a more reasonable result:

#define MP_DEFAULT_MUL_KARATSUBA_CUTOFF 59
#define MP_DEFAULT_SQR_KARATSUBA_CUTOFF 51
#define MP_DEFAULT_MUL_TOOM_CUTOFF      74
#define MP_DEFAULT_SQR_TOOM_CUTOFF      47
#define MP_DEFAULT_MUL_TOOM_4_CUTOFF    107
#define MP_DEFAULT_SQR_TOOM_4_CUTOFF    52
#define MP_DEFAULT_MUL_TOOM_5_CUTOFF    129
#define MP_DEFAULT_SQR_TOOM_5_CUTOFF    65

(Yes, it is reasonable that T-C 3-way can be faster than Karatsuba, at least according to Paul Zimmermann and Marco Bodrato.)
These values are all below MP_MAX_COMBA/2 for squaring and below MP_MAX_COMBA for multiplication. COMBA is faster than Karatsuba up to the WARRAY limit, hence the curious results in the very first listing here.
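The effect can be pictured with a size-based dispatcher. The following is only a minimal sketch under stated assumptions, not LTM's actual `mp_mul`: the type and function names (`mul_tuning`, `pick_mul_algo`, `tuning_all_disabled`) are hypothetical, and it assumes that the cut-offs are checked from the asymptotically fastest algorithm downwards, with Comba as the fallback only while the operand fits its fixed-size array.

```c
#include <limits.h>

/* Hypothetical names, for illustration only. */
typedef enum { ALGO_SCHOOL, ALGO_COMBA, ALGO_KARATSUBA, ALGO_TOOM3 } mul_algo;

typedef struct {
    int karatsuba_cutoff; /* use Karatsuba at or above this many limbs */
    int toom3_cutoff;     /* use Toom-Cook 3-way at or above this many limbs */
    int max_comba;        /* Comba only works up to this many limbs */
} mul_tuning;

/* A cut-off of INT_MAX effectively disables an algorithm ("resetting"). */
mul_tuning tuning_all_disabled(int max_comba)
{
    mul_tuning t = { INT_MAX, INT_MAX, max_comba };
    return t;
}

/* Pick an algorithm for operands of the given size (in limbs). */
mul_algo pick_mul_algo(int limbs, const mul_tuning *t)
{
    if (limbs >= t->toom3_cutoff)     return ALGO_TOOM3;
    if (limbs >= t->karatsuba_cutoff) return ALGO_KARATSUBA;
    if (limbs <= t->max_comba)        return ALGO_COMBA;
    return ALGO_SCHOOL; /* Comba no longer fits: steep jump in run time */
}
```

With every cut-off disabled, the first operand size above `max_comba` falls through to the schoolbook method. A tuner that compares a faster algorithm against that fallback will therefore tend to report a cut-off right at Comba's upper limit, which would match the repeated 128 (MP_MAX_COMBA/2 for squaring) in the first listing.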

The original tuning also kept the slower algorithms active at their benchmarked cut-off points, that is, it worked without the resetting (as in this PR). With that method and COMBA switched back on:

#define MP_DEFAULT_MUL_KARATSUBA_CUTOFF 104
#define MP_DEFAULT_SQR_KARATSUBA_CUTOFF 128
#define MP_DEFAULT_MUL_TOOM_CUTOFF      164
#define MP_DEFAULT_SQR_TOOM_CUTOFF      239
#define MP_DEFAULT_MUL_TOOM_4_CUTOFF    707
#define MP_DEFAULT_SQR_TOOM_4_CUTOFF    997 <- maxed out
#define MP_DEFAULT_MUL_TOOM_5_CUTOFF    666
#define MP_DEFAULT_SQR_TOOM_5_CUTOFF    821

These are just the cut-offs, and they are relative values; COMBA is still faster within its limits! I haven't measured the actual timings yet to get a better picture. The new functions might have some potential for optimization (they are not usable otherwise), and there is #447, too.

This PR gives more realistic results for Toom-Cook 3-way (ignoring the new ones) but doesn't fully solve the underlying problem.
Your opinion?

sjaeckel · 2019-11-08T01:06:16Z

TBH I've had the impression for a while that the tuning doesn't really do what it promises... Until now I only suspected it, but check for yourself: I executed make tune, committed the updated tommath_cutoffs.h, and that's one of the results after running the benchmarking tool

but I'm a bit surprised by a lot of the timing diagrams I've seen today... maybe just another case of "wer viel misst, misst viel Mist"!?

czurnieden · 2019-11-08T03:07:15Z

So it's slower now, if I understand your graphic correctly and it's not even a small difference?
Great.
*sigh*

But I still haven't made any absolute timings with the benchmark in tune.c to get a couple of curves to overlay (with and w/o Comba, INT_MAX, etc.). For the famous "squint&eyeball" technique ;-)

The problem is that the Comba multiplication has an upper limit, all of the faster ones have lower limits (besides FFT which has both, but that's not in here). There is also a steep jump at Comba's upper limit. That together with the cascading down of the T-C algorithms makes the behaviour a bit chaotic overall.
I think.
The cut-offs without Comba, both with and without resetting, are well in line with the theory. The numbers from the ones with resetting are even in the same ballpark as the results of Zimmermann and Bodrato.

> maybe just another case of "wer viel misst, misst viel Mist"!?

[English: "The one who measures a lot measures a lot of crap"]

There's a very good chance that that is true ;-)

But the original run I made, with all the squaring functions having a 128-limb cut-off as a result, was a bit…uhm…suspicious, to say the least, and the single thing that changed was the resetting.

Why did I decide against resetting? All of our fast multiplication/squaring algorithms are recursive and call a lower (or equal) T-C/school/Comba algo chosen by mp_mul. They also have only one cut-off, a lower one. So I thought it would make the most sense to keep the evaluated cut-off values of the slower algorithms when determining the cut-offs for the faster ones. We don't want to know when, e.g., T-C 4-way is faster than school; we want to know the point where it is faster than T-C 3-way.
Is that wrong?
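The difference between the two strategies can be illustrated with a toy model. This is a sketch under loud assumptions: the cost functions and their constants are invented (they are not measurements and ignore recursion), and the names `find_karatsuba_cutoff` and `find_toom3_cutoff` are hypothetical, not taken from tune.c. Passing `INT_MAX` as the Karatsuba cut-off plays the role of the reset, so Toom-Cook 3-way is compared against schoolbook; passing the tuned value compares it against Karatsuba instead.

```c
#include <limits.h>

/* Toy cost models in arbitrary units. The constants are made up for
 * illustration only and do not describe any real implementation. */
long long cost_school(int n)    { return (long long)n * n; }
long long cost_karatsuba(int n) { return 40LL * n + 3LL * n * n / 4; }
long long cost_toom3(int n)     { return 100LL * n + 5LL * n * n / 9; }

/* Cost of whatever is enabled below Toom-3 for the given Karatsuba cut-off. */
long long cost_baseline(int n, int karatsuba_cutoff)
{
    return (n >= karatsuba_cutoff) ? cost_karatsuba(n) : cost_school(n);
}

/* Smallest size in [2, max_n] where Karatsuba beats schoolbook. */
int find_karatsuba_cutoff(int max_n)
{
    for (int n = 2; n <= max_n; n++) {
        if (cost_karatsuba(n) < cost_school(n)) return n;
    }
    return INT_MAX; /* never faster within the tested range */
}

/* Smallest size in [2, max_n] where Toom-3 beats the current baseline. */
int find_toom3_cutoff(int max_n, int karatsuba_cutoff)
{
    for (int n = 2; n <= max_n; n++) {
        if (cost_toom3(n) < cost_baseline(n, karatsuba_cutoff)) return n;
    }
    return INT_MAX;
}
```

In this toy model the Karatsuba cut-off comes out at 161. With the reset (`find_toom3_cutoff(1000, INT_MAX)`) the Toom-3 cut-off is 226, without it (`find_toom3_cutoff(1000, 161)`) it is 309, because Toom-3 then has to beat Karatsuba rather than schoolbook. Only the direction of the shift is meant to carry over, not the numbers.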

I'll make some "pretty pictures" this evening with some curves of the absolute run times of the individual functions to get a rough overview. It might help.

And if it doesn't help we can still put it on the strawberries ;-)


czurnieden · 2019-11-08T23:07:59Z

It seems as if the Toom-Cook 3-way functions are fast enough to replace the Karatsuba functions. (They are also bigger, but I don't think that anybody will use them if they are a bit short on memory.)
So if you, or me in this case, raise the cut-offs of TC, it will act more as ballast than as an accelerator. It was happily doing its work on its own before but has to share with Karatsuba now.

But is that the reason for the large gap you measured? The most significant differences between the smaller variations of cut-off values are the placements and sizes of the stairs, the rest is hidden inside the noise—see samples below.

I don't know what to do with it for now (just put a note in the documentation?) but I found a wrong printout when the cut-offs are given at the command-line so it was not all a waste here ;-)

Multiplying (all AMD 64-bit with 60-bit limbs; 2,000 was a bit excessive, admittedly).

The area around the Comba cut-off.

It is a bit more obvious with squaring when the cut-off values are way off.

Again: the area around the Comba cut-off.

And because I already have it and don't want to waste it:

minad · 2019-11-08T23:27:33Z

I think you should do profiling (e.g. using perf) to see the hot spots of the different implementations. In theory toom-2==karatsuba, right? Or is there some other structural difference in our implementation?
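For reference, Toom-2 and Karatsuba coincide because both evaluate the two half-size polynomials at three points (0, 1 and infinity) and recover the product from three multiplications instead of four. The following is a toy sketch on machine words, assuming 16-bit halves of 32-bit operands; the function name `karatsuba_mul32` is hypothetical and this is not LTM's implementation.

```c
#include <stdint.h>

/* Multiply two 32-bit values by splitting each into 16-bit halves and
 * using three half-size multiplications (Karatsuba, i.e. Toom-2). */
uint64_t karatsuba_mul32(uint32_t a, uint32_t b)
{
    uint64_t a0 = a & 0xFFFFu, a1 = a >> 16;
    uint64_t b0 = b & 0xFFFFu, b1 = b >> 16;

    uint64_t lo  = a0 * b0;               /* value at x = 0 */
    uint64_t hi  = a1 * b1;               /* value at x = infinity */
    uint64_t mid = (a0 + a1) * (b0 + b1); /* value at x = 1 */

    /* Interpolation: the middle coefficient a0*b1 + a1*b0. */
    uint64_t cross = mid - lo - hi;

    return (hi << 32) + (cross << 16) + lo;
}
```

Any structural difference in a real implementation would then sit in how the splitting, the temporary storage and the recursion are handled, which is what a profiler would show.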

These higher toom variants should interpolate in-between until FFT becomes faster finally? Are you seeing this for larger bitsizes? I have to admit I cannot read your plots (I only see a difference between school and something else).

minad · 2019-11-08T23:33:49Z

Btw, would it not make sense to avoid the issue you are describing by calling comba from toom-3, toom-4, etc. instead of mp_mul (which could then again do karatsuba)? Or is a karatsuba nested inside toom-3 still toom-3? If you want to avoid such nesting, this should give fixed cut-off points, or does it not? Hmm, according to what I read, this nesting should indeed occur. Then I think you are right with not doing the reset.

And furthermore we should try to lift the comba limit by using a larger W if needed (as a first step get rid of WARRAY somehow, #447).

czurnieden · 2019-11-09T02:36:50Z

> These higher toom variants should interpolate in-between until FFT becomes faster finally?

Yes.
I don't know how many are needed; GMP goes up to 8-way with several asymmetric variants in between. See $GMP/mpn/generic/toom* if you have some time at hand.
I don't plan to do all of them ;-)

You need that really fast multiplication for Newton division. I tried it with the current LTM versions, but the cut-off is so far out that I first thought I had made an error while porting it from my old fork.

And then there is the question if LTM needs all of that in the first place.
As nice as it is to have all of that, at the end somebody has to maintain all of it!

> I have to admit I cannot read your plots (I only see a difference between school and something else).

Argh, I didn't make them safe, sorry for that!
Will upload the data if I still have all of the files, but it will take a day to redo them if some are lost (overwritten).

> Then I think you are right with not doing the reset.

Yes, that's the theory. And it would work. Normally. But LTM is not really "normal" in that sense ;-)
We have Comba, which isn't as much of a problem as I thought it was; the main problem seems to be that T-C 3-way is about as fast as Karatsuba (on my machine, so YMMV!), especially the squaring. The very simple algorithm in tune.c does not recognize that.

I have no quick solution at hand besides putting a note in the documentation, but I will work on it, especially by testing other architectures or at least LTM's other limb sizes. If it is just my old machine, I'll put a note in the documentation; if it happens elsewhere too, it gets interesting.

> And furthermore we should try to lift the comba limit by using a larger W if needed

I thought there is a hard technical limit?

minad · 2019-11-09T05:38:00Z

> I don't know how much are needed, GMP goes up to 8-way with several asymmetric variants in between. See $GMP/mpn/generic/toom* if you have some time at hand.
> I don't plan to do all of them ;-)

I think such things should be generated. Only write the generator once. Also Karatsuba could be generated in the same way then.

> the main problem seems to be that T-C 3-way is about as fast as Karatsuba (on my machine, so YMMV!), especially the squaring.

So TC3 is as fast as TC2 (Karatsuba) even for small numbers and will get faster than TC2 for larger numbers?

> I thought there is a hard technical limit?

I lifted the limit in #447 as an example. This is how things are done in tfm. And after lifting the limit, both full-width and smaller digits could be used.

czurnieden · 2019-11-10T21:51:15Z

Me: Oh, nice, a free weekend! Time for working on the non-biz to-do lists!
Fate: BWAHAHAHAHAH! Nope!

But back to the to-do lists: while working on the accessibility of the graphics for @minad I came to the conclusion that there must be a fundamental logic flaw in tune.c with regard to the COMBA multiplication. The original method (without resetting) is also (roughly) the method described in GMP, although they added the caveat:

> This rule is attractive because it's got a basis in reason and is fairly easy to implement, but no work has been done to actually compare it in absolute terms to other possibilities

Well…

But I don't give up that easily, I'll find a solution before I start drawing my pension! ;-)

> I think such things should be generated. Only write the generator once. Also Karatsuba could be generated in the same way then.

Generating the function from Maths or from the PARI/GP code?
Generating from Maths would work, but there are a lot of manual optimizations involved; I don't see a quick and easy way to automate that.

> So TC3 is as fast as TC2 (Karatsuba) even for small numbers and will get faster than TC2 for larger numbers?

It seems so, yes. One of the reasons is COMBA. Without COMBA only TC3-squaring is as fast as TC2-squaring. (But TC3-multiplying is not that far off in that case).

And because you are currently working on it, @minad, it would be nice if COMBA had a programmatically changeable cut-off (because it might be able to replace Karatsuba if COMBA has no upper limit).

Oh, and before I forget it, the graphics:
They don't have that much information in it but I thought they might help to get a rough overview.

No COMBA:

And one with squaring. Interesting is the part without Karatsuba and the values evaluated with resetting (5000, 5000, 125, 128) and without (5000, 5000, 166, 246).

czurnieden · 2023-04-05T13:35:26Z

That whole mess is indeed caused by the Comba multiplication/squaring. I tried a couple of higher TCs (TC-6 to TC-9; the coefficients get too big with TC-10) and found that choosing coefficients that are powers of two alone makes a significant difference. But tweaking those is a lot of tedious manual work and more than can be shoveled into a single weekend.
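One plausible reason why power-of-two coefficients help (an assumption here, the comment above does not spell it out) is that multiplying by a power of two is a plain shift, so evaluating at such a point needs no real multiplications. A minimal sketch on machine words, with hypothetical function names and small coefficients so that nothing overflows:

```c
#include <stddef.h>
#include <stdint.h>

/* Horner evaluation of c[0] + c[1]*x + ... + c[n-1]*x^(n-1) at x = 2^k,
 * using only shifts and additions. */
uint64_t eval_at_pow2(const uint32_t *c, size_t n, unsigned k)
{
    uint64_t acc = 0;
    for (size_t i = n; i-- > 0; ) {
        acc = (acc << k) + c[i];
    }
    return acc;
}

/* The same evaluation at an arbitrary point x needs one multiplication
 * per coefficient. */
uint64_t eval_at(const uint32_t *c, size_t n, uint64_t x)
{
    uint64_t acc = 0;
    for (size_t i = n; i-- > 0; ) {
        acc = acc * x + c[i];
    }
    return acc;
}
```

On multi-limb integers the gap would be larger than on machine words, since a shift is a single cheap pass over the limbs, while other small constants need a multiplication (or, during interpolation, an exact division) per pass.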

So closing this PR for good with the hope to open a new one with properly constructed Toom-Cook algorithms.

sjaeckel force-pushed the bugfix_tune branch from 2b7b9a9 to 729b5a0 Compare November 7, 2019 23:18

czurnieden added 2 commits November 9, 2019 00:01

removed the resets to INT_MAX

6c9919e

reinstated resetting and correctied printout in case of cut-offs give…

dedbd25

…n at command-line

czurnieden force-pushed the bugfix_tune branch from 729b5a0 to dedbd25 Compare November 8, 2019 23:03

czurnieden closed this Apr 5, 2023
