Skip to content

remove W array from s_mp_mul_comba and s_mp_sqr_comba #447

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

minad
Copy link
Member

@minad minad commented Nov 6, 2019

Another quick attempt at #441. This time comba will also work if the output aliases the inputs (at the cost of an allocation).

Advantages:

  • No huge stack allocation (better for embedded, bare metal, etc)
  • No fixed MP_WARRAY constant, simplification of the comba criterion (we only have to check that the mp_word doesn't overflow)
  • s_mp_mul and s_mp_mul_comba behave the same (no allocation if no aliasing, allocation otherwise)
  • Code is safer since buffer overflow in heap allocations is easier to detect by valgrind etc
  • Code is more consistent with the rest. These W arrays are a special case. Heap allocations are used everywhere else.

Disadvantage:

  • Allocation on the heap for aliased inputs/outputs necessary
  • mp_init_size/mp_grow/MP_ALIAS boilerplate needed
  • Maybe a little bit slower for aliased output/inputs (but no difference if output doesn't alias inputs)

This needs benchmarking. Maybe we can also make a helper function for this mp_init_size/mp_grow prologue and the mp_exch/mp_clear epilogue to make this nicer.
Furthermore there are still some W arrays left in other functions, which I didn't remove yet. But things can be done similarly.

@sjaeckel Could you please try the RSA benchmark again? Maybe I should also run the timing tool...

@minad minad requested a review from sjaeckel November 6, 2019 00:08
@minad minad force-pushed the remove-warray branch 2 times, most recently from 5c3d362 to de0aa5f Compare November 6, 2019 07:49
@minad minad requested a review from czurnieden November 6, 2019 07:58
@sjaeckel
Copy link
Member

sjaeckel commented Nov 7, 2019

ecc_decrypt_key
ecc_encrypt_key
ecc_sign_hash
ecc_verify_hash
rsa_decrypt_key
rsa_encrypt_key
rsa_sign_hash
rsa_verify_hash

@minad
Copy link
Member Author

minad commented Nov 7, 2019

@sjaeckel Thanks! which code did you use? If I continue to experiment with this PR, I could use these benchmarks. The question is what slowdown would be acceptable, if any.

I would like to get rid of the stack allocations, since this would allow to generalize things and enable full digits. At this point we might win or lose again. But maybe we could add some unrolled comba code in the end from tfm.

@sjaeckel
Copy link
Member

sjaeckel commented Nov 7, 2019

Thanks! which code did you use? If I continue to experiment with this PR, I could use these benchmarks.

https://github.com/libtom/libtomprofile and libtom/libtomcrypt#520

The question is what slowdown would be acceptable, if any.

good question, how about keeping the current "fast" as default and add this version as an option (maybe enabled via MP_LOW_MEM!?) until it is as fast as the version with the stack allocation?

TBH I'd like to only accept it as soon as it's equally fast or faster ;)

I would like to get rid of the stack allocations, since this would allow to generalize things and enable full digits. At this point we might win or lose again. But maybe we could then add some unrolled comba code in the end from tfm.

IMO it's worth a try

@sjaeckel
Copy link
Member

sjaeckel commented Nov 7, 2019

btw. @czurnieden you mentioned that with full digits your fourrier-transform mul/sqr would have to be re-written... AFAIU FT only makes sense for really really big MPI's, so wouldn't it be an option to re-shape the full-digit MPI into a 28/60-bit MPI before the FT operations and back afterwards?

@sjaeckel
Copy link
Member

sjaeckel commented Nov 7, 2019

FYI I've updated the graphs, they also include TFM with the following modifications to speed up ECC and have at least 4096bit RSA

diff --git a/src/headers/tfm.h b/src/headers/tfm.h
index c47b826..af48248 100644
--- a/src/headers/tfm.h
+++ b/src/headers/tfm.h
@@ -103,5 +103,11 @@
 #ifndef FP_MAX_SIZE
-   #define FP_MAX_SIZE           (4096+(8*DIGIT_BIT))
+   #define FP_MAX_SIZE           (8192+(8*DIGIT_BIT))
 #endif
 
+#define TFM_ECC192
+#define TFM_ECC224
+#define TFM_ECC256
+#define TFM_ECC384
+#define TFM_ECC521
+
 /* will this lib work? */

@czurnieden
Copy link
Contributor

FT only makes sense for really really big MPI's,

Everybody has their own use-case ;-)

so wouldn't it be an option to re-shape the full-digit MPI into a 28/60-bit MPI before the FT operations and back afterwards?

I use the 28-bit limbs for the large numbers not only because it is faster (mp_word is a native data type on 64-bit CPUs which makes quite some difference once the numbers get big enough) but it is also easily parted into two 14-bit limbs which gives a large range before I have to cut the numbers into smaller chunks (with a variation of mp_balance) again. 32 bit would result in either 2x16 bit (too large) or 4x8 bit (too small) or 3x11 bit plus padding (too complicated) or restricting it to x86 and use long double (too…uhm…restrictive) or something I haven't even thought of yet.
Or I finally go with NTT, which doesn't have that problem although it has a much larger cut-off point.

Nuh, don't worry, just go on, I'll find a solution.

TBH I'd like to only accept it as soon as it's equally fast or faster

I'm not that strict and would give it a bit more leeway. 1-2% are OK, there is no free lunch, but as it is now it is a tad bit too much, sorry.

@minad
Copy link
Member Author

minad commented Nov 7, 2019

@sjaeckel I will do some experiments at some point regarding full width digits. I will also do some profiling to see the hotspots.

Regarding a merge of this PR - it would be nice to do things step by step and simplifying things as we go. But I agree that it doesn't make sense to merge something which leads to a big slowdown.

Regarding your LOWMEN idea, I don't want to include a half baked other implementation next to the current one.

@minad
Copy link
Member Author

minad commented Nov 7, 2019

Interestingly rsa_verify_hash gets faster here, so maybe for that benchmark no additional allocations are performed.

@minad minad force-pushed the remove-warray branch 2 times, most recently from c5030d3 to 1bec79e Compare November 9, 2019 05:47
remove calls to comba from s_mp_mul and s_mp_mul_high

TODO:
* Remove remaining W arrays
* Replace mp_exch/mp_clear pairs by mp_clear/copy
* Check if more mp_init* calls can be replaced by MP_ALIAS/mp_init_size/mp_grow optimization
this is how it is done in tfm
@czurnieden
Copy link
Contributor

I remembered vaguely that there is a variation of COMBA that might fit our case. Took me a moment with Google but I found it. It shuffles the operations around to make it more parallelizable what made it interesting for me at that time. Two problems: they carefully calculated the complexity but gave no proof of correctness (That's why I didn't use it at the end) and they haven't implemented a parallel version when they benchmarked their algorithm against GMP.

It's worth a look, yes, but I don't know if it's worth more than that.

@minad
Copy link
Member Author

minad commented Nov 10, 2019

@czurnieden The version I added here is very slow though. Maybe it can be done in a better way.

@minad minad mentioned this pull request Jan 15, 2020
@minad minad closed this Jan 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants