Implement memory-mapped IO and multi-threading for BLAKE3 hashing #12676

Merged on May 4, 2025 (3 commits)

silvanshade (Member) commented Mar 18, 2025

This PR implements memory-mapped IO and multi-threading for BLAKE3 hashing.

Performance with these changes is now on par with the proposed Rust interop: #12416.

Benchmarks

NOTE: These numbers are taken from https://github.com//pull/12416 based on the Rust implementation, and so are only an estimate. In practice they should (for the BLAKE3 results) be nearly identical, based on my local testing and original testing for the `libblake3` TBB feature upstream.

NOTE: The non-BLAKE3 results may be faster by a small margin with this PR than what is stated in the benchmarks since they will also make use of the memory-mapping changes.

Config

CPU: AMD Ryzen 9 7950X 16-Core @ 5.88 GHz
RAM: 96GB @ 6400 MT/s
OS: CachyOS February 2025 release w/ bpfland scx

Input files created with:

head -c <size> /dev/urandom > ~/<size>.bin

example:

head -c 1G /dev/urandom > 1G.bin

Benchmarks all used the following:

hyperfine --warmup 3 './outputs/out/bin/nix hash file --type <algo> <file>'

100K file

BLAKE3 (original)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/100K.bin
  Time (mean ± σ):       9.5 ms ±   0.1 ms    [User: 5.6 ms, System: 3.6 ms]
  Range (min … max):     9.1 ms …  10.0 ms    299 runs

BLAKE3 (memory-mapping + tbb)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/100K.bin
  Time (mean ± σ):       9.6 ms ±   0.9 ms    [User: 5.9 ms, System: 3.3 ms]
  Range (min … max):     9.2 ms …  24.7 ms    285 runs

SHA256

Benchmark 1: ./outputs/out/bin/nix hash file --type sha256 ~/100K.bin
  Time (mean ± σ):       9.6 ms ±   0.9 ms    [User: 5.7 ms, System: 3.5 ms]
  Range (min … max):     9.0 ms …  23.1 ms    290 runs

SHA512

Benchmark 1: ./outputs/out/bin/nix hash file --type sha512 ~/100K.bin
  Time (mean ± σ):       9.6 ms ±   0.9 ms    [User: 5.8 ms, System: 3.3 ms]
  Range (min … max):     9.0 ms …  23.3 ms    288 runs

10M file

BLAKE3 (original)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/10M.bin
  Time (mean ± σ):      11.9 ms ±   2.6 ms    [User: 6.7 ms, System: 4.5 ms]
  Range (min … max):    11.0 ms …  32.8 ms    240 runs

BLAKE3 (memory-mapping + tbb)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/10M.bin
  Time (mean ± σ):      12.9 ms ±   1.1 ms    [User: 9.9 ms, System: 22.7 ms]
  Range (min … max):    11.3 ms …  16.6 ms    215 runs

SHA256

Benchmark 1: ./outputs/out/bin/nix hash file --type sha256 ~/10M.bin
  Time (mean ± σ):      13.9 ms ±   0.4 ms    [User: 9.4 ms, System: 4.1 ms]
  Range (min … max):    13.3 ms …  16.3 ms    201 runs

SHA512

Benchmark 1: ./outputs/out/bin/nix hash file --type sha512 ~/10M.bin
  Time (mean ± σ):      18.2 ms ±   0.5 ms    [User: 13.3 ms, System: 4.4 ms]
  Range (min … max):    17.4 ms …  21.6 ms    162 runs

100M file

BLAKE3 (original)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/100M.bin
  Time (mean ± σ):      26.2 ms ±   0.8 ms    [User: 17.0 ms, System: 8.7 ms]
  Range (min … max):    24.9 ms …  29.1 ms    111 runs

BLAKE3 (memory-mapping + tbb)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/100M.bin
  Time (mean ± σ):      17.5 ms ±   1.6 ms    [User: 33.9 ms, System: 41.7 ms]
  Range (min … max):    15.2 ms …  23.8 ms    128 runs

SHA256

Benchmark 1: ./outputs/out/bin/nix hash file --type sha256 ~/100M.bin
  Time (mean ± σ):      54.1 ms ±   0.5 ms    [User: 44.0 ms, System: 9.5 ms]
  Range (min … max):    53.4 ms …  55.5 ms    55 runs

SHA512

Benchmark 1: ./outputs/out/bin/nix hash file --type sha512 ~/100M.bin
  Time (mean ± σ):      96.1 ms ±   0.9 ms    [User: 85.8 ms, System: 9.4 ms]
  Range (min … max):    95.1 ms …  98.5 ms    31 runs

300M file

BLAKE3 (original)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/300M.bin
  Time (mean ± σ):      59.2 ms ±   0.9 ms    [User: 37.7 ms, System: 20.8 ms]
  Range (min … max):    57.8 ms …  61.9 ms    49 runs

BLAKE3 (memory-mapping + tbb)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/300M.bin
  Time (mean ± σ):      26.0 ms ±   1.6 ms    [User: 85.6 ms, System: 65.8 ms]
  Range (min … max):    22.8 ms …  31.1 ms    104 runs

SHA256

Benchmark 1: ./outputs/out/bin/nix hash file --type sha256 ~/300M.bin
  Time (mean ± σ):     139.6 ms ±   0.8 ms    [User: 116.4 ms, System: 22.5 ms]
  Range (min … max):   138.1 ms … 141.0 ms    21 runs

SHA512

Benchmark 1: ./outputs/out/bin/nix hash file --type sha512 ~/300M.bin
  Time (mean ± σ):     263.5 ms ±   3.0 ms    [User: 238.0 ms, System: 22.9 ms]
  Range (min … max):   260.4 ms … 269.5 ms    11 runs

1G file

BLAKE3 (original)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/1G.bin
  Time (mean ± σ):     190.9 ms ±   1.5 ms    [User: 113.6 ms, System: 76.1 ms]
  Range (min … max):   188.8 ms … 194.3 ms    15 runs

BLAKE3 (memory-mapping + tbb)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/1G.bin
  Time (mean ± σ):      52.5 ms ±   5.0 ms    [User: 304.9 ms, System: 114.4 ms]
  Range (min … max):    50.0 ms …  88.9 ms    58 runs

SHA256

Benchmark 1: ./outputs/out/bin/nix hash file --type sha256 ~/1G.bin
  Time (mean ± σ):     465.0 ms ±   4.5 ms    [User: 384.8 ms, System: 77.4 ms]
  Range (min … max):   461.8 ms … 477.0 ms    10 runs

SHA512

Benchmark 1: ./outputs/out/bin/nix hash file --type sha512 ~/1G.bin
  Time (mean ± σ):     877.5 ms ±   8.9 ms    [User: 795.5 ms, System: 77.3 ms]
  Range (min … max):   870.8 ms … 900.8 ms    10 runs

20G file

BLAKE3 (original)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/20G.bin
  Time (mean ± σ):      3.155 s ±  0.009 s    [User: 2.236 s, System: 0.914 s]
  Range (min … max):    3.143 s …  3.168 s    10 runs

BLAKE3 (memory-mapping + tbb)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/20G.bin
  Time (mean ± σ):     574.1 ms ±   9.8 ms    [User: 8339.7 ms, System: 1430.7 ms]
  Range (min … max):   563.6 ms … 596.4 ms    10 runs

SHA256

Benchmark 1: ./outputs/out/bin/nix hash file --type sha256 ~/20G.bin
  Time (mean ± σ):      8.756 s ±  0.011 s    [User: 7.812 s, System: 0.933 s]
  Range (min … max):    8.737 s …  8.767 s    10 runs

SHA512

Benchmark 1: ./outputs/out/bin/nix hash file --type sha512 ~/20G.bin
  Time (mean ± σ):     17.280 s ±  0.077 s    [User: 16.301 s, System: 0.954 s]
  Range (min … max):   17.220 s … 17.395 s    10 runs

64G file

BLAKE3 (original)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/64G.bin
  Time (mean ± σ):     17.145 s ±  0.086 s    [User: 7.154 s, System: 9.578 s]
  Range (min … max):   17.018 s … 17.276 s    10 runs

BLAKE3 (memory-mapping + tbb)

Benchmark 1: ./outputs/out/bin/nix hash file --type blake3 ~/64G.bin
  Time (mean ± σ):      1.822 s ±  0.011 s    [User: 26.895 s, System: 4.875 s]
  Range (min … max):    1.802 s …  1.832 s    10 runs

SHA256

Benchmark 1: ./outputs/out/bin/nix hash file --type sha256 ~/64G.bin
  Time (mean ± σ):     27.455 s ±  0.066 s    [User: 24.323 s, System: 3.072 s]
  Range (min … max):   27.343 s … 27.554 s    10 runs

SHA512

Benchmark 1: ./outputs/out/bin/nix hash file --type sha512 ~/64G.bin
  Time (mean ± σ):     53.807 s ±  0.212 s    [User: 50.615 s, System: 3.118 s]
  Range (min … max):   53.446 s … 54.187 s    10 runs
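
For a quick read of the BLAKE3 rows above, the sketch below divides the "original" mean by the "memory-mapping + tbb" mean for each file size. The struct and function names are illustrative only (nothing here comes from the nix code base), and the inputs are simply the hyperfine means copied from this section, so the ratios carry the same "estimate" caveat as the numbers themselves.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper type: one BLAKE3 benchmark pair from the tables above.
// Means are in milliseconds (the 20G and 64G rows are converted from seconds).
struct BenchRow {
    const char * file;
    double originalMs; // BLAKE3 (original)
    double mmapTbbMs;  // BLAKE3 (memory-mapping + tbb)
};

constexpr BenchRow kBlake3Rows[] = {
    {"100K", 9.5, 9.6},
    {"10M", 11.9, 12.9},
    {"100M", 26.2, 17.5},
    {"300M", 59.2, 26.0},
    {"1G", 190.9, 52.5},
    {"20G", 3155.0, 574.1},
    {"64G", 17145.0, 1822.0},
};

// Wall-clock speedup: values above 1 mean the new path is faster.
double speedup(const BenchRow & row)
{
    assert(row.mmapTbbMs > 0.0);
    return row.originalMs / row.mmapTbbMs;
}

// Print one line per file size.
void printSpeedups()
{
    for (const BenchRow & row : kBlake3Rows)
        std::printf("%-5s %6.2fx\n", row.file, speedup(row));
}
```

On these numbers the ratio is slightly below 1 for the 100K and 10M files and grows with file size, reaching roughly 9.4x for the 64G file.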

Motivation

This PR adds additional functionality to the existing BLAKE3 implementation in nix to bring the performance on par with b3sum.

The performance difference between the two is due to b3sum using the Rust BLAKE3 implementation, which uses both memory-mapped IO and multi-threading.

Until recently, multi-threading was not available for the C-based libblake3 but is now supported in release 1.7.0.

Context

This PR is a follow up to #12379 (comment).

Related: NixOS/nixpkgs#390458

Design Considerations

This PR implements memory-mapped IO via boost::iostreams::mapped_file, which adds a boost component dependency for iostreams.

Enabling multi-threading for libblake3 also adds a dependency on tbb of at least version 2021_11.

Memory-mapping is performed in:

void readFile(const Path & path, Sink & sink, bool memory_map = true);

A new optional parameter, memory_map, controls whether memory-mapping is skipped in favor of normal file reading. (If memory-mapping fails, normal file reading is used as the fallback.)
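
A minimal sketch of the map-then-fall-back behavior described above. It is not the code from this PR: it uses POSIX mmap directly instead of boost::iostreams::mapped_file so that it compiles without Boost, and ChunkSink, tryReadMapped, readBuffered, and readFileSketch are hypothetical names standing in for Nix's Sink and readFile.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <functional>
#include <ios>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical stand-in for Nix's Sink: receives consecutive chunks of file data.
using ChunkSink = std::function<void(std::string_view)>;

// Unmaps the region on scope exit, including when the sink throws.
struct Mapping {
    void * addr;
    std::size_t len;
    ~Mapping() { ::munmap(addr, len); }
};

// Try to hand the whole file to `sink` as one read-only mapping.
// Returns false, without calling `sink`, if the file cannot be mapped.
static bool tryReadMapped(const std::string & path, const ChunkSink & sink)
{
    int fd = ::open(path.c_str(), O_RDONLY | O_CLOEXEC);
    if (fd == -1) return false;

    struct stat st;
    if (::fstat(fd, &st) == -1 || !S_ISREG(st.st_mode) || st.st_size == 0) {
        ::close(fd);
        return false; // zero-length mappings are invalid; let the caller fall back
    }

    auto len = static_cast<std::size_t>(st.st_size);
    void * addr = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd); // the mapping remains valid after the descriptor is closed
    if (addr == MAP_FAILED) return false;

    Mapping mapping{addr, len};
    sink(std::string_view(static_cast<const char *>(addr), len));
    return true;
}

// Normal file reading: stream the file through a fixed-size buffer.
static void readBuffered(const std::string & path, const ChunkSink & sink)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open file: " + path);

    std::vector<char> buf(64 * 1024);
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) || in.gcount() > 0)
        sink(std::string_view(buf.data(), static_cast<std::size_t>(in.gcount())));
}

// Memory-map by default; use normal reading when asked to, or when mapping fails.
void readFileSketch(const std::string & path, const ChunkSink & sink, bool memoryMap = true)
{
    assert(sink); // an empty std::function would throw when called
    if (memoryMap && tryReadMapped(path, sink)) return;
    readBuffered(path, sink);
}
```

In this sketch the mapped path delivers the file as a single contiguous chunk, which is the shape a multi-threaded hasher can split across cores, while the buffered path delivers 64 KiB chunks. Both paths feed the same bytes to the sink, so callers do not need to know which one ran.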

This makes memory-mapping the default, which likely has performance implications beyond hashing. I would expect this to often be more performant than the alternative given available resources and modern hardware but haven't tested beyond hashing.

It may be appropriate to only enable memory-mapping when explicitly requested and/or gate memory-mapping behind an experimental feature. I can make those changes if requested.

@Ericson2314 @edolstra

silvanshade requested a review from Mic92 on March 19, 2025.

Ericson2314 (Member) commented Mar 19, 2025

@silvanshade In NixOS/nixpkgs#390458 (comment) I left instructions on how to backport things to Nixpkgs 24.11. Then we can bump the Nixpkgs in the flake (to a newer version of 24.11), and can do things more simply here.

silvanshade (Member, Author) commented Mar 20, 2025

I've created 24.11 backport PRs for the tbb and libblake3 versions needed here:

NixOS/nixpkgs#391413
NixOS/nixpkgs#391418

Mic92 (Member) commented Mar 24, 2025

Do you by chance have some examples I could use to check the performance?

Mic92 (Member) commented Mar 24, 2025

I added some patches that I needed to get this to build with current nixpkgs unstable. However, I think we are currently ending up with two tbb versions somehow? At least I now get crashes on macOS during early initialization (not just in this pull request but also on master with nixpkgs-unstable).

silvanshade (Member, Author) commented Mar 24, 2025

> Do you by chance have some examples I could use to check the performance?

I just added a benchmarks section to the original post that gives some details on this.

Performance is an estimate (since it's using the Rust numbers) but practically identical based on my testing locally and original testing upstream for the libblake3 tbb feature.

Also note that if you are testing on macOS, the difference likely won't be as significant due to the lower relative performance per-core of the NEON implementation versus the AVX implementation.

Mic92 (Member) commented Mar 27, 2025

Please update to this nixpkgs revision once it's merged and in the channel: NixOS/nixpkgs#393691

Then we can get rid of our overrides.

silvanshade force-pushed the blake3-tbb branch 2 times, most recently from 9bad31c to 60b15a6, on March 31, 2025.

silvanshade (Member, Author) commented

@Ericson2314 @Mic92 I've updated the nixpkgs input with the libblake3 backport and removed the overrides.

silvanshade (Member, Author) commented

Reviving this PR now that I think most of the libblake3 build related fixes are in place.

We want to rebase one more time on updated nixpkgs once the latest 24.11 backports (with the MinGW, FreeBSD, and i686 fixes) reach the release channel.

Ericson2314 (Member) left a comment

Happy all those TBB build system changes are indeed no longer needed!

silvanshade (Member, Author) commented

I updated the nixpkgs input and rebased; libblake3 now includes the most recent patches (which fix MinGW, FreeBSD, and 32-bit builds).

I believe this resolves all the known issues with failing builds and this PR should be ready for final review/merge.

Ericson2314 (Member) left a comment

OK!

Ericson2314 merged commit 2676ae7 into NixOS:master on May 4, 2025 (12 checks passed).
edolstra mentioned this pull request on May 9, 2025.

nixos-discourse commented

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nix-2-29-0-released/64609/1
