remove W array from s_mp_mul_comba and s_mp_sqr_comba #447
5c3d362 to de0aa5f
@sjaeckel Thanks! Which code did you use? If I continue to experiment with this PR, I could use these benchmarks. The question is what slowdown would be acceptable, if any. I would like to get rid of the stack allocations, since this would allow us to generalize things and enable full digits. At this point we might win or lose again. But maybe we could add some unrolled comba code from tfm in the end.
https://github.com/libtom/libtomprofile and libtom/libtomcrypt#520
Good question. How about keeping the current "fast" version as the default and adding this version as an option (maybe enabled via ...)? TBH I'd like to only accept it as soon as it's equally fast or faster ;)
IMO it's worth a try
btw. @czurnieden you mentioned that with full digits your Fourier-transform mul/sqr would have to be rewritten... AFAIU FT only makes sense for really, really big MPIs, so wouldn't it be an option to reshape the full-digit MPI into a 28/60-bit MPI before the FT operations and back afterwards?
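To make the reshaping idea concrete, here is a minimal sketch of repacking little-endian limbs from one digit width to another, e.g. full 32-bit digits into 28-bit limbs and back. It is not libtommath code: `repack_limbs` and its parameters are hypothetical names, and it assumes 32-bit storage words with every input limb already smaller than `2^in_bits`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libtommath): repack little-endian limbs
 * that carry in_bits each into limbs that carry out_bits each.
 * Returns the number of limbs written, or SIZE_MAX if out is too small. */
static size_t repack_limbs(uint32_t *out, size_t out_cap, unsigned out_bits,
                           const uint32_t *in, size_t n_in, unsigned in_bits)
{
    uint64_t acc = 0;      /* bit buffer, least significant bits first */
    unsigned have = 0;     /* number of valid bits currently in acc */
    size_t n_out = 0;

    assert(in_bits >= 1 && in_bits <= 32 && out_bits >= 1 && out_bits <= 32);
    const uint64_t out_mask = (1ull << out_bits) - 1u;

    for (size_t i = 0; i < n_in; i++) {
        /* have < out_bits <= 32 here, so the shifted limb fits in 64 bits */
        acc |= (uint64_t)in[i] << have;
        have += in_bits;
        while (have >= out_bits) {
            if (n_out == out_cap) return SIZE_MAX;
            out[n_out++] = (uint32_t)(acc & out_mask);
            acc >>= out_bits;
            have -= out_bits;
        }
    }
    /* Flush the remaining high bits, skipping an all-zero top limb. */
    if (acc != 0) {
        if (n_out == out_cap) return SIZE_MAX;
        out[n_out++] = (uint32_t)acc;
    }
    return n_out;
}
```

The conversion is linear in the number of limbs, so for operands large enough to justify a Fourier-transform multiplication the two repacking passes should be a small part of the total cost. Whether that holds in practice would need measuring.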
FYI I've updated the graphs; they also include TFM with the following modifications to speed up ECC and support at least 4096-bit RSA
Everybody has their own use-case ;-)
I use the 28-bit limbs for the large numbers not only because it is faster (...) Nuh, don't worry, just go on, I'll find a solution.
I'm not that strict and would give it a bit more leeway. 1-2% is OK, there is no free lunch, but as it is now it is a tad too much, sorry.
@sjaeckel I will do some experiments at some point regarding full-width digits. I will also do some profiling to see the hotspots. Regarding a merge of this PR: it would be nice to do things step by step and simplify things as we go. But I agree that it doesn't make sense to merge something which leads to a big slowdown. Regarding your LOWMEN idea, I don't want to include a half-baked other implementation next to the current one.
Interestingly, rsa_verify_hash gets faster here, so maybe no additional allocations are performed for that benchmark.
c5030d3 to 1bec79e
remove calls to comba from s_mp_mul and s_mp_mul_high
TODO:
* Remove remaining W arrays
* Replace mp_exch/mp_clear pairs by mp_clear/copy
* Check if more mp_init* calls can be replaced by MP_ALIAS/mp_init_size/mp_grow optimization
this is how it is done in tfm
I remembered vaguely that there is a variation of COMBA that might fit our case. It took me a moment with Google, but I found it. It shuffles the operations around to make the algorithm more parallelizable, which made it interesting for me at the time. Two problems: they carefully calculated the complexity but gave no proof of correctness (that's why I didn't use it in the end), and they hadn't implemented a parallel version when they benchmarked their algorithm against GMP. It's worth a look, yes, but I don't know if it's worth more than that.
@czurnieden The version I added here is very slow though. Maybe it can be done in a better way.
Another quick attempt at #441. This time comba will also work if the output aliases the inputs (at the cost of an allocation).
Advantages:
Disadvantage:
This needs benchmarking. Maybe we can also make a helper function for this mp_init_size/mp_grow prologue and the mp_exch/mp_clear epilogue to make this nicer.
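As a rough illustration of the approach described above, here is a toy column-wise (comba) multiplication that writes each result digit straight into the output instead of into an intermediate stack array, and only allocates a temporary when the output aliases an input. This is a sketch and not the code from this PR: `comba_mul`, `DIGIT_BIT` and `DIGIT_MASK` are hypothetical names, it works on raw arrays rather than `mp_int`, and it assumes 28-bit digits with a 64-bit accumulator, which limits the shorter operand to 255 digits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DIGIT_BIT  28
#define DIGIT_MASK ((uint32_t)((1u << DIGIT_BIT) - 1u))

/* Hypothetical toy routine (not libtommath API): out[0 .. na+nb-1] = a * b.
 * out must hold na + nb digits. It may be exactly a or exactly b (partial
 * overlap is not handled); in that case a temporary buffer is allocated.
 * Returns 0 on success, -1 if the allocation fails. */
static int comba_mul(uint32_t *out, const uint32_t *a, size_t na,
                     const uint32_t *b, size_t nb)
{
    size_t n = na + nb;
    uint32_t *dst = out;
    uint64_t acc = 0;      /* running column sum plus carry */

    /* 255 products below 2^56 plus a carry still fit into 64 bits */
    assert(na > 0 && nb > 0 && (na <= 255 || nb <= 255));

    /* Prologue: writing in place would clobber digits we still need. */
    if (out == a || out == b) {
        dst = malloc(n * sizeof *dst);
        if (dst == NULL) return -1;
    }

    for (size_t col = 0; col < n; col++) {
        /* all index pairs (i, col - i) with i < na and col - i < nb */
        size_t lo = (col < nb) ? 0 : col - nb + 1;
        size_t hi = (col < na) ? col : na - 1;
        for (size_t i = lo; i <= hi; i++) {
            acc += (uint64_t)a[i] * b[col - i];
        }
        dst[col] = (uint32_t)(acc & DIGIT_MASK);  /* store digit directly */
        acc >>= DIGIT_BIT;                        /* carry into next column */
    }

    /* Epilogue: move the result into place and drop the temporary. */
    if (dst != out) {
        memcpy(out, dst, n * sizeof *dst);
        free(dst);
    }
    return 0;
}
```

In the non-aliased case no copy and no extra buffer are needed, which is the case the benchmarks would hopefully favor. The aliased case pays for one allocation and one copy, so squaring in place or `a = a * b` style calls are where a slowdown would be expected to show up.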
Furthermore there are still some W arrays left in other functions, which I didn't remove yet. But things can be done similarly.
@sjaeckel Could you please try the RSA benchmark again? Maybe I should also run the timing tool...