remove W array from s_mp_mul_comba and s_mp_sqr_comba #447

minad · 2019-11-06T00:08:02Z

Another quick attempt at #441. This time comba will also work if the output aliases the inputs (at the cost of an allocation).

Advantages:

No huge stack allocation (better for embedded, bare metal, etc.)
No fixed MP_WARRAY constant and a simpler comba criterion (we only have to check that the mp_word doesn't overflow)
s_mp_mul and s_mp_mul_comba behave the same (no allocation if there is no aliasing, an allocation otherwise)
Code is safer, since buffer overflows in heap allocations are easier to detect with valgrind etc.
Code is more consistent with the rest. These W arrays are a special case; heap allocations are used everywhere else.
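
To make the simplified criterion concrete, here is a minimal sketch, assuming 28-bit digits and a 64-bit accumulator word. The names `digit_t`, `word_t`, `MAX_COMBA` and `comba_ok` are hypothetical and only illustrate the idea; they are not the identifiers used in this PR.

```c
#include <limits.h>
#include <stdint.h>

/* Assumed configuration: 28-bit digits stored in 32-bit words,
 * accumulated in a 64-bit word (hypothetical names). */
typedef uint32_t digit_t;
typedef uint64_t word_t;
#define DIGIT_BIT 28

/* One digit product needs 2*DIGIT_BIT bits. The remaining bits of the
 * accumulator bound how many products one column may sum up:
 * 1 << (64 - 56) = 256 in this configuration. */
#define MAX_COMBA (1 << ((int)(sizeof(word_t) * CHAR_BIT) - 2 * DIGIT_BIT))

/* Returns 1 if a column-wise (comba) multiplication of operands with
 * a_used and b_used digits cannot overflow the accumulator. A column
 * sums at most min(a_used, b_used) products plus the incoming carry. */
static int comba_ok(int a_used, int b_used)
{
    int min_used = (a_used < b_used) ? a_used : b_used;
    return min_used <= MAX_COMBA;
}
```

With no W array on the stack, a check of this kind is the only remaining limit. The size of the result no longer matters, only the length of the shorter operand.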

Disadvantages:

A heap allocation is necessary when the output aliases an input
mp_init_size/mp_grow/MP_ALIAS boilerplate needed
Maybe a little bit slower when the output aliases an input (but no difference if the output doesn't alias the inputs)

This needs benchmarking. Maybe we can also make a helper function for this mp_init_size/mp_grow prologue and the mp_exch/mp_clear epilogue to make this nicer.
Furthermore, there are still some W arrays left in other functions which I haven't removed yet, but they can be handled similarly.
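
The prologue/epilogue pattern described above can be sketched as follows. This is a self-contained illustration under assumptions, not the code of this PR: `big_t`, `big_init_size`, `big_grow`, `big_clear` and `big_mul_comba` are hypothetical stand-ins for mp_int and its helpers, signs are ignored, and the operands are assumed to be short enough that the 64-bit accumulator cannot overflow.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t digit_t;              /* holds one 28-bit digit */
typedef uint64_t word_t;               /* column accumulator */
#define DIGIT_BIT  28
#define DIGIT_MASK ((digit_t)((1u << DIGIT_BIT) - 1u))

/* Hypothetical minimal big integer (unsigned only), not the real mp_int. */
typedef struct { digit_t *dp; int used, alloc; } big_t;

static int big_init_size(big_t *a, int size)
{
    if (size < 1) size = 1;
    a->dp = calloc((size_t)size, sizeof(digit_t));
    if (a->dp == NULL) return -1;
    a->used = 0;
    a->alloc = size;
    return 0;
}

static void big_clear(big_t *a)
{
    free(a->dp);
    a->dp = NULL;
    a->used = a->alloc = 0;
}

static int big_grow(big_t *a, int size)
{
    if (a->alloc < size) {
        digit_t *p = realloc(a->dp, (size_t)size * sizeof(digit_t));
        if (p == NULL) return -1;
        memset(p + a->alloc, 0, (size_t)(size - a->alloc) * sizeof(digit_t));
        a->dp = p;
        a->alloc = size;
    }
    return 0;
}

/* c = a * b, computed column by column without a temporary W array. */
static int big_mul_comba(const big_t *a, const big_t *b, big_t *c)
{
    int digs = a->used + b->used;
    int aliased = (c == a) || (c == b);
    big_t tmp, *dst = c;
    word_t acc = 0;

    /* Prologue: allocate only if the output aliases an input. */
    if (aliased) {
        if (big_init_size(&tmp, digs) != 0) return -1;
        dst = &tmp;
    } else if (big_grow(c, digs) != 0) {
        return -1;
    }

    for (int k = 0; k < digs; k++) {
        int lo = (k < b->used) ? 0 : k - b->used + 1;
        int hi = (k < a->used) ? k : a->used - 1;
        for (int i = lo; i <= hi; i++) {
            acc += (word_t)a->dp[i] * (word_t)b->dp[k - i];
        }
        dst->dp[k] = (digit_t)(acc & DIGIT_MASK); /* store the column */
        acc >>= DIGIT_BIT;                         /* carry to the next */
    }

    /* Zero stale high digits of a reused destination, then clamp. */
    for (int k = digs; k < dst->used; k++) dst->dp[k] = 0;
    dst->used = digs;
    while (dst->used > 0 && dst->dp[dst->used - 1] == 0) dst->used--;

    /* Epilogue: exchange the temporary into c and free the old buffer. */
    if (aliased) {
        big_t old = *c;
        *c = tmp;
        big_clear(&old);
    }
    return 0;
}
```

Calling `big_mul_comba(&a, &b, &c)` with three distinct objects takes the allocation-free path (apart from growing `c`), while `big_mul_comba(&a, &a, &a)` goes through the temporary. A shared helper for the prologue and epilogue would hide the `aliased` branches from each caller.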

@sjaeckel Could you please try the RSA benchmark again? Maybe I should also run the timing tool...

sjaeckel · 2019-11-07T17:03:47Z

minad · 2019-11-07T17:20:25Z

@sjaeckel Thanks! Which code did you use? If I continue to experiment with this PR, I could use these benchmarks. The question is what slowdown would be acceptable, if any.

I would like to get rid of the stack allocations, since this would allow us to generalize things and enable full digits. At this point we might win or lose again. But maybe we could add some unrolled comba code from tfm in the end.

sjaeckel · 2019-11-07T18:32:50Z

> Thanks! which code did you use? If I continue to experiment with this PR, I could use these benchmarks.

https://github.com/libtom/libtomprofile and libtom/libtomcrypt#520

> The question is what slowdown would be acceptable, if any.

Good question. How about keeping the current "fast" version as the default and adding this version as an option (maybe enabled via MP_LOW_MEM!?) until it is as fast as the version with the stack allocation?

TBH I'd like to only accept it as soon as it's equally fast or faster ;)

> I would like to get rid of the stack allocations, since this would allow to generalize things and enable full digits. At this point we might win or lose again. But maybe we could then add some unrolled comba code in the end from tfm.

IMO it's worth a try

sjaeckel · 2019-11-07T18:36:00Z

Btw. @czurnieden you mentioned that with full digits your Fourier-transform mul/sqr would have to be rewritten... AFAIU FT only makes sense for really, really big MPIs, so wouldn't it be an option to reshape the full-digit MPI into a 28/60-bit MPI before the FT operations and back afterwards?
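
A rough sketch of what such a reshaping could look like, assuming little-endian limb order and limb widths of at most 32 bits. The function `repack_limbs` is a hypothetical helper written for illustration; nothing like it exists in the library.

```c
#include <stddef.h>
#include <stdint.h>

/* Repack n_in little-endian limbs of in_bits bits each into limbs of
 * out_bits bits each. Both widths must be in 1..32. The caller provides
 * room for (n_in * in_bits + out_bits - 1) / out_bits output limbs.
 * Returns the number of limbs written; the top limb may be zero. */
static size_t repack_limbs(const uint32_t *in, size_t n_in, unsigned in_bits,
                           uint32_t *out, unsigned out_bits)
{
    const uint64_t in_mask  = (UINT64_C(1) << in_bits) - 1;
    const uint64_t out_mask = (UINT64_C(1) << out_bits) - 1;
    uint64_t buf = 0;      /* bit buffer, never holds more than 63 bits */
    unsigned have = 0;     /* number of valid bits in buf */
    size_t n_out = 0;

    for (size_t i = 0; i < n_in; i++) {
        buf |= ((uint64_t)in[i] & in_mask) << have;
        have += in_bits;
        while (have >= out_bits) {
            out[n_out++] = (uint32_t)(buf & out_mask);
            buf >>= out_bits;
            have -= out_bits;
        }
    }
    if (have > 0) {
        out[n_out++] = (uint32_t)(buf & out_mask); /* leftover high bits */
    }
    return n_out;
}
```

For example, the two 32-bit limbs `{0xFFFFFFFF, 0x1}` repack into the 28-bit limbs `{0xFFFFFFF, 0x1F, 0}`, and repacking those with the widths swapped restores the original value. The conversion is linear in the number of limbs, so for operands large enough to benefit from an FT multiplication its cost should be small compared to the multiplication itself.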

sjaeckel · 2019-11-07T19:32:02Z

FYI, I've updated the graphs. They now also include TFM, with the following modifications to speed up ECC and to support at least 4096-bit RSA:

diff --git a/src/headers/tfm.h b/src/headers/tfm.h
index c47b826..af48248 100644
--- a/src/headers/tfm.h
+++ b/src/headers/tfm.h
@@ -103,5 +103,11 @@
 #ifndef FP_MAX_SIZE
-   #define FP_MAX_SIZE           (4096+(8*DIGIT_BIT))
+   #define FP_MAX_SIZE           (8192+(8*DIGIT_BIT))
 #endif
 
+#define TFM_ECC192
+#define TFM_ECC224
+#define TFM_ECC256
+#define TFM_ECC384
+#define TFM_ECC521
+
 /* will this lib work? */

czurnieden · 2019-11-07T19:40:17Z

> FT only makes sense for really really big MPI's,

Everybody has their own use-case ;-)

> so wouldn't it be an option to re-shape the full-digit MPI into a 28/60-bit MPI before the FT operations and back afterwards?

I use the 28-bit limbs for the large numbers not only because it is faster (mp_word is a native data type on 64-bit CPUs, which makes quite some difference once the numbers get big enough) but also because a 28-bit limb is easily split into two 14-bit limbs, which gives a large range before I have to cut the numbers into smaller chunks again (with a variation of mp_balance). 32 bits would result in either 2x16 bits (too large), 4x8 bits (too small), 3x11 bits plus padding (too complicated), restricting it to x86 and using long double (too…uhm…restrictive), or something I haven't even thought of yet.
Or I finally go with NTT, which doesn't have that problem although it has a much larger cut-off point.
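
As an illustration of the 14-bit split, here is a minimal sketch that assumes the FT works on double-precision values. `split_limbs` and `join_limbs` are hypothetical names, and the join step assumes the pieces have already been rounded and carry-normalized back into the range [0, 2^14).

```c
#include <stddef.h>
#include <stdint.h>

#define HALF_BIT  14
#define HALF_MASK ((1u << HALF_BIT) - 1u)

/* Split each 28-bit limb into two 14-bit pieces, stored as doubles so
 * they can be fed to a floating-point FT. out must hold 2*n entries. */
static void split_limbs(const uint32_t *in, size_t n, double *out)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = (double)(in[i] & HALF_MASK);
        out[2 * i + 1] = (double)((in[i] >> HALF_BIT) & HALF_MASK);
    }
}

/* Inverse of split_limbs: join pairs of 14-bit pieces into 28-bit limbs.
 * Every piece must be an integer in [0, 2^14). in holds 2*n entries. */
static void join_limbs(const double *in, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint32_t)in[2 * i] | ((uint32_t)in[2 * i + 1] << HALF_BIT);
    }
}
```

The product of two 14-bit pieces needs 28 bits, which leaves roughly 25 bits of a 53-bit double mantissa for accumulating a convolution sum, before rounding error is taken into account. Two 16-bit pieces would leave only about 21 bits, which is one way to read the "too large" remark above.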

Nuh, don't worry, just go on, I'll find a solution.

> TBH I'd like to only accept it as soon as it's equally fast or faster

I'm not that strict and would give it a bit more leeway. 1-2% are OK, there is no free lunch, but as it is now it is a tad bit too much, sorry.

minad · 2019-11-07T19:44:02Z

@sjaeckel I will do some experiments at some point regarding full width digits. I will also do some profiling to see the hotspots.

Regarding a merge of this PR: it would be nice to do things step by step and simplify things as we go. But I agree that it doesn't make sense to merge something which leads to a big slowdown.

Regarding your MP_LOW_MEM idea, I don't want to include a half-baked second implementation next to the current one.

minad · 2019-11-07T19:47:47Z

Interestingly rsa_verify_hash gets faster here, so maybe for that benchmark no additional allocations are performed.

czurnieden · 2019-11-10T21:51:10Z

I vaguely remembered that there is a variation of COMBA that might fit our case. It took me a moment with Google, but I found it. It shuffles the operations around to make it more parallelizable, which made it interesting for me at that time. Two problems: they carefully calculated the complexity but gave no proof of correctness (that's why I didn't use it in the end), and they hadn't implemented a parallel version when they benchmarked their algorithm against GMP.

It's worth a look, yes, but I don't know if it's worth more than that.

minad · 2019-11-10T21:55:44Z

@czurnieden The version I added here is very slow though. Maybe it can be done in a better way.

minad requested a review from sjaeckel November 6, 2019 00:08

minad added the feedback required label Nov 6, 2019

minad force-pushed the remove-warray branch 2 times, most recently from 5c3d362 to de0aa5f Compare November 6, 2019 07:49

minad added the work in progress label Nov 6, 2019

minad force-pushed the remove-warray branch from de0aa5f to 827e6d1 Compare November 6, 2019 07:57

minad requested a review from czurnieden November 6, 2019 07:58

minad force-pushed the remove-warray branch from 827e6d1 to b2ae673 Compare November 6, 2019 08:16

czurnieden mentioned this pull request Nov 7, 2019

Removed the resets to INT_MAX in etc/tune.c #453

Closed

minad force-pushed the remove-warray branch 2 times, most recently from c5030d3 to 1bec79e Compare November 9, 2019 05:47

minad added 3 commits November 10, 2019 16:14

remove W array from s_mp_mul_comba and s_mp_sqr_comba

defc68b

remove calls to comba from s_mp_mul and s_mp_mul_high
TODO:
* Remove remaining W arrays
* Replace mp_exch/mp_clear pairs by mp_clear/copy
* Check if more mp_init* calls can be replaced by MP_ALIAS/mp_init_size/mp_grow optimization

lift comba limit for s_mp_mul_comba

c421004

this is how it is done in tfm

lift comba limit for s_mp_sqr_comba

c3fe1be

minad force-pushed the remove-warray branch from 1bec79e to c3fe1be Compare November 10, 2019 15:47

minad mentioned this pull request Jan 15, 2020

tfm/ltm relation #429

Closed

minad closed this Jan 16, 2020

sjaeckel mentioned this pull request Oct 27, 2022

add MP_SMALL_STACK_SIZE option #538

Merged
