-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Monero format issues and room for improvement #5668
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Well, not exactly in
Before I started with this format a few days ago, this system was giving:
|
Actually, no, if the latter fails I think we'll also crash on trying to find and read |
With this change as well:
|
and check for memory allocation errors. Fixes openwall#5668
I'm not going to bother with this now. Someone else may. Also, someone else may add support for AES-CE on ARM. |
Indeed, there is https://github.com/monero-project/monero/blob/master/src/crypto/slow-hash.c which appears to implement other variants as well (for newer wallets?) and has ARM-specific optimizations. But it's 1887 lines and would be harder to integrate in here. Our code is currently smaller:
Interestingly, they try to allocate a huge page for the 2 MiB area. Our huge page threshold for our We could want to benchmark our Monero format with/without huge page allocation, although on modern Linux we may be getting transparent huge pages anyway. |
Did that. Usually there's no change, but on one system there's up to a 30% speedup (curiously, the exact same speedup is occasionally also seen in rare lucky runs without huge pages, so it may have more to do with data layout in L3 cache). Have to pre-allocate huge pages via
This also happens, per |
Now I know this was pure luck. Groestl is only used by the last test vector added separately via #4732 0c96d95 by @patrickd- in 2021. So e.g. our 1.9.0-jumbo-1's self-test of Monero was unfortunately incomplete (no known bug there, just lack of self-test, so e.g. a miscompile could silently result in 1/4 false negatives). |
With AES-NI, higher default |
There's 8x parallelism in the initial and final loops, which the current code does not expose. It could, also amortizing the cost of AES round key loads. This isn't expected to result in great speedup as maybe 95% of time is spent in the middle loop, but it should provide some speedup nevertheless. I'll re-open this issue for these additional optimizations to implement later. |
Implemented as part of #5677. On one system, this made no performance difference (probably the CPU was almost fully exploiting the parallelism anyway via out-of-order execution and register renaming). On another, there's a 1% to 2% speedup. |
to 2 MiB for most relevant formats, and 12 MiB for those using scrypt. 2 MiB threshold is important for Monero slow hash; Between 8 and 16 MiB was previously found good for yescrypt without ROM (upstream has it at 32 MiB to have RAM and ROM use different TLBs). See openwall#5668, openwall#5678
Speedup on our "super" with tuned thread count and configured huge pages is ~25x compared to what we had a month ago.
Current code with AES-NI forcibly disabled:
Current code:
Yes, 18 threads is optimal (only for fast new code with AES-NI, otherwise 32 was optimal) when huge pages are configured. Otherwise 16 is optimal, giving 698 c/s with defaults or up to ~730 c/s with increased For direct comparison against what we had at 32 threads:
and without huge pages (not important when running 2 threads/core, which compensates for latencies):
and without affinity:
That's still an ~18x speedup if someone doesn't tune anything. |
I inadvertently ran the benchmarks on "super" (which is 2x E5-2670v1 8x DDR3-1600) above with CPUs' turbo boost disabled (I keep it disabled lately to conserve power - there's in fact large difference - and I have shell scripts to turn it on/off as root). I now re-ran some of the above with turbo on (should have gone from 2.6 to 3.0 GHz all-core). 18 is still the optimal thread count (16, 17, 19, or 32 are still slower). Old code (from a month ago):
Current code:
32 threads:
No huge pages:
No tuning at all - 32 threads, no huge pages, no affinity, no
Max tuned speedup is 889/35.4 = ~25.1x, min speedup without new code tuning is ~17x (with affinity for old code) to ~20x (without). |
That was a system where my asm code didn't improve performance yet (it would improve it a lot if hacked to use smaller
|
This asm change also provides some speedup on "super" benchmarked above, at optimal settings (huge pages, 18 threads, affinity):
(was 876, 889) However, at 32 threads there's a regression:
(was ~600) Apparently, with 2 threads/core the latency reduction wasn't as important as avoiding duplicate computation, even when we access some RAM instead of L3. I worry that there may also be a similar regression on newer systems with more L3 cache per core (perhaps AMD), where 2 threads/core is actually optimal for this algorithm. |
Similar behavior for this latest asm change on our "well" (i7-4770K): 2% speedup for 4 threads (optimal), maybe slight slowdown for 8 threads (hard to tell how much since performance is less stable at unoptimal thread count). |
Indeed. Here's a blog post on early state of this code and optimizations: https://da-data.blogspot.com/2014/08/minting-money-with-monero-and-cpu.html It says 100x speedup was achieved relative to code (there's a git commit link) very similar to what we had before my optimizations. The only difference is that de-optimized (the word used by the blogger) version also had |
This has historical mining speeds for what I think is the same algorithm (CryptoNight v0): https://monero.stackexchange.com/questions/41/which-type-of-hardware-is-the-most-efficient-for-mining-monero Those speeds are similar to what I got here, e.g. 283 H/s on i7-4770K at 4 threads (I am now getting ~295 with huge pages), 443 H/s on one E5-2670 with 8 threads (just under half of what I'm getting on two such CPUs, albeit at 9 threads/CPU), but there's also a line showing 965 H/s on 2x E5-2670 (unclear if that's at 16 or 20 threads, two columns in the table say different things), which is 6% higher than my best result here. |
The mining speeds I found above also include those on GPUs (are on par with CPUs, but allowed for higher speeds per chassis, especially with concurrent use of CPUs+GPUs), and also "The Bitmain Antminer X3 is U$1255, ships Aug. 21-31 2018, and weighs 7 kg, it does 220000 H/s using 465 W +7% (2.27 J/kH +7%)". That ASIC was obviously quickly "broken" for its mining use by Monero hard-forks (first switching to algorithm variants, then to Random-X), but I wonder if it's maybe (or maybe not) reusable for Monero wallet password cracking (and could still be acquired, perhaps from former miners). This depends on its implementation detail. Someone with a large Monero wallet with a lost password could want to investigate this possibility. |
Two new related optimization ideas:
Not a lot of speedup is expected from either/both of these because relatively little time is spent in the initial/final loops. |
Non-temporal loads for the final loop sound reasonable in all cases (we're reading this data one final time and aren't going to use it again), but testing (just) them on my 3 test systems made no performance difference. +++ b/src/slow_hash_plug.c
@@ -111,7 +111,7 @@ static inline void sum_half_blocks(block *a, const block *b) {
static inline void xor_blocks(block *a, const block *b) {
#if 1 && MBEDTLS_AESNI_HAVE_CODE == 2
- a->v = _mm_xor_si128(a->v, b->v);
+ a->v = _mm_xor_si128(a->v, _mm_stream_load_si128((void *)b));
#else
a->u64[0] ^= b->u64[0];
a->u64[1] ^= b->u64[1]; (when using inline asm for the middle loop, |
"CUDA-accelerated Monero wallet password cracker based on John the Ripper" by @cynicalpeace https://github.com/cynicalpeace/monero_gpu_cracker 1900 h/s on RTX 4090, based on code from the middle of my work on this issue, much room for further improvement - ideally, I'd like this to become a monero-opencl format in John. |
I figured out why 18 and not 20 threads was optimal on "super" in my earlier tests. I wasn't setting the affinity right to distribute 20 threads across physical sockets equally. 20 is supposed to be optimal according to L3 cache size (we have 40 MiB total across 2 sockets), and with proper affinity setting actually is:
|
Running |
Here are some issues I noticed while optimizing the "slow hash" for our Monero format:
set_key
for a great multi-salt speedup, I guess.oaes_*
functions, butoaes_alloc
andoaes_key_import_data
mail fail on memory allocation. If the former fails, we'll just crash, but if the latter fails we'll get false negatives.Also, perhaps optimized versions of this code exist somewhere already. Isn't it almost the same algorithm they were using as PoW? If so, miners from that era probably contain optimized implementations.
An observation is that the final 1 of 4 hash functions used depends on the password. Luckily (or intentionally?), our 6 test vectors do cover all 4.
The text was updated successfully, but these errors were encountered: