rpc : update README for cache usage #12620

Merged
merged 1 commit into from Mar 28, 2025

Conversation

rgerganov
Collaborator

ref: #10095

@rgerganov rgerganov merged commit ef03229 into master Mar 28, 2025
2 checks passed
@steampunque

This functionality is a welcome addition. It speeds up loading a 32B model to one RPC server by about a factor of 4 over a 1 Gbit local LAN:

rpc-server -c : real 0m24.977s
rpc-server: real 1m43.777s

There is much more potential speedup available, however, because I believe the current approach always requires all the weights to be loaded and the hashes to be computed on the master computer loading the model. This is expensive and will still be slow for very big models with many RPC servers.

It should be possible to reduce RPC load time significantly by not reading, on the master computer, any weights that will be sent to an RPC server; instead, just send the GGUF name and tensor index over RPC and implement a protocol handler for it. An intermediate workaround would be a cache of precomputed tensor hashes for GGUFs on the master computer: when a tensor is to be loaded and hashed, first check the local precomputed hash cache, and skip both reading the tensor weights and computing the hash if a match is found. This would fit into the existing design more smoothly. It would also remove the need to check tensor size: all tensors could be loaded efficiently through the precomputed hash cache mechanism, since no time is spent computing hashes or reading weights from the GGUF on the master computer.
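
For illustration only, here is a minimal C++ sketch of the kind of host-side lookup described above, with hypothetical names (precomputed_hash_cache, lookup); it is a sketch of the idea, not existing llama.cpp code:

#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// hypothetical table of precomputed tensor hashes, keyed by GGUF path + tensor name
struct precomputed_hash_cache {
    std::unordered_map<std::string, uint64_t> hashes;

    // returns the precomputed hash if it is known, otherwise nullopt
    std::optional<uint64_t> lookup(const std::string & gguf_path,
                                   const std::string & tensor_name) const {
        auto it = hashes.find(gguf_path + ":" + tensor_name);
        if (it == hashes.end()) {
            return std::nullopt;
        }
        return it->second;
    }
};

// When lookup() succeeds, the host could send just the hash to the RPC server
// and skip both reading the tensor weights and hashing them locally.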

@rgerganov
Collaborator Author

Thanks for the feedback.

Yes, the current approach requires the main host to read the entire model and compute the hashes. We can put precomputed hashes in the GGUF as metadata, but we'll still need changes in the backend interface to leverage this, e.g.:

void (*load_tensor)  (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, uint64_t hash);
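
As a hedged illustration of the flow such a hook could enable (this is not the actual RPC backend code, and server_has_tensor / load_tensor_by_hash are made-up names): the client sends the hash first, and only reads and transfers the tensor data on a cache miss.

#include <cstdint>
#include <cstdio>
#include <functional>
#include <unordered_set>
#include <vector>

// stand-in for the server-side tensor cache, keyed by content hash
static std::unordered_set<uint64_t> server_cache;

// returns true if the server already has data with this hash
static bool server_has_tensor(uint64_t hash) {
    return server_cache.count(hash) > 0;
}

static void load_tensor_by_hash(uint64_t hash,
                                const std::function<std::vector<uint8_t>()> & read_data) {
    if (server_has_tensor(hash)) {
        // cache hit: nothing to read from disk and nothing to transfer
        std::printf("cache hit for %016llx\n", (unsigned long long) hash);
        return;
    }
    // cache miss: only now read the tensor data and transfer it
    std::vector<uint8_t> data = read_data();
    server_cache.insert(hash);
    std::printf("cache miss for %016llx, sent %zu bytes\n",
                (unsigned long long) hash, data.size());
}

int main() {
    auto reader = [] { return std::vector<uint8_t>(1024, 0xab); };
    load_tensor_by_hash(0x1234, reader); // miss: data is read and transferred
    load_tensor_by_hash(0x1234, reader); // hit: only the hash is sent
}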

@steampunque

Thanks for the feedback.

Yes, the current approach requires the main host to read the entire model and compute the hashes. We can put precomputed hashes in the GGUF as metadata, but we'll still need changes in the backend interface to leverage this, e.g.:

void (*load_tensor)  (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, uint64_t hash);

I think that would be a good approach. Hashes stored in GGUF metadata are tied to the GGUF, so there is no issue with them getting out of sync and no need to maintain and index a messy local precompute cache. If I understand correctly, this would make RPC loading as fast as it could possibly be, since the host no longer needs to read or hash the tensors going out to RPC.

@ggerganov
Member

I think it would be good to profile what part of the current implementation is spent on hashing and what part on loading the data. If hashing is the bottleneck, it could be sped up a lot.

@steampunque

I think it would be good to profile what part of the current implementation is spent on hashing and what part on loading the data. If hashing is the bottleneck, it could be sped up a lot.

The bottleneck is almost certainly loading the data:

Experiment 1, compute an MD5 of the 32B model:

echo 3 > /proc/sys/vm/drop_caches && time md5sum QwQ-32B.IQ4_XS.gguf
6116cf3feedccc505bef8feafab389ed QwQ-32B.IQ4_XS.gguf

real 1m10.109s
user 0m31.193s
sys 0m3.720s

Experiment 2, read the 32B model off disk:

echo 3 > /proc/sys/vm/drop_caches && time cat QwQ-32B.IQ4_XS.gguf >/dev/null

real 1m9.894s
user 0m0.076s
sys 0m4.060s

The MD5 computation adds negligible overhead.

Experiment 3, load the 32B model into CPU:

echo 3 > /proc/sys/vm/drop_caches && time NGL=0 DRAFT=-1 ll_start qwq 107

real 1m16.130s
user 0m6.153s
sys 0m4.950s

Only 5s slower than reading the raw file.

My models are stored on a 16 TB Exos external HD connected via USB 3. Results will clearly change a lot with an NVMe SSD, but I think USB 3 to a spinning HD is potentially a very widespread use case.

Once the file is cached into system RAM things get much faster:

time md5sum QwQ-32B.IQ4_XS.gguf
6116cf3feedccc505bef8feafab389ed QwQ-32B.IQ4_XS.gguf

real 0m28.267s
user 0m26.780s
sys 0m1.485s

time cat QwQ-32B.IQ4_XS.gguf >/dev/null

real 0m1.504s
user 0m0.013s
sys 0m1.490s

This shows the MD5 is quite expensive (9900k machine).

time NGL=0 DRAFT=-1 ll_start qwq 107

real 0m4.068s
user 0m6.255s
sys 0m1.693s

This shows an extremely fast reload once the model is in the system RAM cache.

The conclusion is that RPC model loading could be sped up by a factor of approximately N, where N is the total number of machines (main host plus RPC servers): 2x for 2 machines, 3x for 3 machines, etc. This is achieved by not loading all the model weights on the main host, and instead just reading a precomputed hash for the tensor from the GGUF metadata and shipping it.

The reload speedup after the model is cached in RAM will be more than this factor of N (the 4x I found was for a reload, i.e. the RPC server had the hashes cached in RAM and the host had the full model cached in RAM, so there was no transfer of the model from disk). With the precomputed hashes I would expect very fast reload times, much more than the 4x.

@rgerganov
Collaborator Author

@steampunque I have created #12954 to track this effort

@jukofyork
Collaborator

jukofyork commented Apr 15, 2025

This shows the MD5 is quite expensive (9900k machine).

Yeah, MD5 is definitely not a good hash to use for this.

https://github.com/Cyan4973/xxHash

is much faster and really simple to implement:

https://create.stephan-brumme.com/xxhash/

Also, do we really need to hash the full tensor? Wouldn't just the beginning and end bytes, along with the length XORed into the hash, likely provide a pretty reliable hash?
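
A hedged sketch of that partial-hash idea, assuming the single-header xxhash.h (with XXH_INLINE_ALL) from the repo linked above is available; SAMPLE_BYTES and partial_tensor_hash are made-up names, not anything from llama.cpp:

#define XXH_INLINE_ALL   // header-only usage of xxhash.h
#include "xxhash.h"

#include <algorithm>
#include <cstddef>
#include <cstdint>

// hash only the first and last SAMPLE_BYTES of the tensor, then fold the length in
static uint64_t partial_tensor_hash(const uint8_t * data, size_t size) {
    constexpr size_t SAMPLE_BYTES = 64 * 1024;   // arbitrary sample size

    const size_t head = std::min(size, SAMPLE_BYTES);
    const size_t tail = std::min(size - head, SAMPLE_BYTES);

    uint64_t h = XXH64(data, head, 0);           // hash the beginning
    h = XXH64(data + size - tail, tail, h);      // chain the end, seeded by the first hash
    return h ^ (uint64_t) size;                  // XOR the total length into the result
}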

@steampunque

Also, do we really need to hash the full tensor? Wouldn't just the beginning and end bytes, along with the length XORed into the hash, likely provide a pretty reliable hash?

I think the full tensor is needed due to the possibility of fine-tunes, which might change only parts of the data. If any byte anywhere in the data is different, it needs to produce a different hash to be reliable.

@rgerganov
Collaborator Author

Yeah, MD5 is definitely not a good hash to use for this.

We are using FNV-1a hash

Also do we really need to hash the full tensor?

If we don't have to read the whole tensor for computing the hash, then we can achieve a significant speed up when using mmap for loading the model (which is the default). Hashing only the first N bytes of the tensor gives faster load times even with local RPC server.

I think full tensor is needed due to possibility of fine tunes which might change only parts of data. If any byte anywhere is different in the data it needs to have a different hash to be reliable.

Unfortunately that might be the case.
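
For reference, a textbook 64-bit FNV-1a restricted to (at most) the first n_bytes of a tensor looks like the sketch below; this only illustrates the idea discussed above and is not a copy of the hash code in the RPC backend:

#include <algorithm>
#include <cstddef>
#include <cstdint>

static uint64_t fnv1a_prefix(const uint8_t * data, size_t size, size_t n_bytes) {
    uint64_t h = 0xcbf29ce484222325ULL;          // FNV-1a 64-bit offset basis
    const size_t n = std::min(size, n_bytes);    // hash only the first n bytes
    for (size_t i = 0; i < n; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;                   // FNV-1a 64-bit prime
    }
    return h;
}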

@jukofyork
Collaborator

jukofyork commented Apr 17, 2025

Yeah, MD5 is definitely not a good hash to use for this.

We are using FNV-1a hash

It's probably worth checking out the xxHash algorithm then as it's probably an order of magnitude faster - AFAIK, it's the fastest non-cryptographic hash to date:

[screenshot: hash speed benchmark table]

Also do we really need to hash the full tensor?

If we don't have to read the whole tensor for computing the hash, then we can achieve a significant speed up when using mmap for loading the model (which is the default). Hashing only the first N bytes of the tensor gives faster load times even with local RPC server.

I think full tensor is needed due to possibility of fine tunes which might change only parts of data. If any byte anywhere is different in the data it needs to have a different hash to be reliable.

Unfortunately that might be the case.

I think it's fairly unlikely, but if you want to reduce the chances even more then instead of just hashing the beginning and end, you could sample a small fraction using a low-discrepancy sequence:

  1. Start with $i_1 = 0$
  2. Then $i_{t + 1} = (i_t + n(\phi - 1)) \bmod (n - 1)$

where $\phi$ is the Golden Ratio (see the "Additive recurrence" section of the wiki page).

There is almost zero chance that a fine-tuned model would change the data without touching any of the bytes sampled by this sequence, but I understand if this still seems too risky :)
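
A hedged sketch of that sampling idea, using the fractional form of the additive recurrence with $\alpha = \phi - 1$ to pick byte positions and a simple FNV-1a over the sampled bytes (sampled_hash and n_samples are made-up names, not part of any existing code):

#include <cmath>
#include <cstddef>
#include <cstdint>

static uint64_t sampled_hash(const uint8_t * data, size_t size, size_t n_samples) {
    const double alpha = 0.6180339887498949;     // phi - 1
    uint64_t h = 0xcbf29ce484222325ULL;          // FNV-1a 64-bit offset basis

    double s = 0.0;                              // additive recurrence state in [0, 1)
    for (size_t t = 0; t < n_samples && size > 0; ++t) {
        const size_t pos = (size_t)(s * (double) size) % size;  // low-discrepancy index
        h ^= data[pos];
        h *= 0x100000001b3ULL;                   // FNV-1a 64-bit prime
        s += alpha;
        s -= std::floor(s);                      // keep only the fractional part
    }
    return h ^ (uint64_t) size;                  // fold the total length in as well
}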

@steampunque

There is almost zero chance that a fine-tuned model would change the data without touching any of the bytes sampled by this sequence, but I understand if this still seems too risky :)

Interesting idea. It is hard (impossible?) to get to 100%, though, without looking at every byte; even the hashes themselves are many-to-one if one wants to get pedantic. However, the entire concern about the hash goes away if the hashes are stored as metadata in the GGUF. There they can also act as a self-check of data integrity during reads (where, again, every byte is wanted), even if the RPC feature doesn't make use of them. If used for the self-check you again want a fast hash, but if not, the hash compute time becomes essentially a don't-care and even SHA-256 could be used.

@steampunque

Yeah, MD5 is definitely not a good hash to use for this.

We are using FNV-1a hash

It's probably worth checking out the xxHash algorithm then as it's probably an order of magnitude faster - AFAIK, it's the fastest non-cryptographic hash to date:

Just for fun. Llama-4 Scout Q3_K_M thoughts:

lm recommend the best non-cryptographic hash to use for tensors in a ML model file format supporting a wide range of both old and new hardware

Choosing a non-cryptographic hash function for tensors in a ML model file format that needs to support a wide range of both old and new hardware requires careful consideration. The hash function should be efficient, have a low collision rate, and be compatible with various architectures. Here are some factors to consider and a recommendation:

Requirements:

  1. Speed: The hash function should be fast to compute, as it will be applied to large tensors.
  2. Low collision rate: The hash function should have a low probability of producing the same hash value for different inputs.
  3. Hardware compatibility: The hash function should be compatible with a wide range of hardware, including old and new architectures.

Candidates:

  1. FNV-1a: A widely used, non-cryptographic hash function designed for fast performance. It's available in various sizes (32-bit, 64-bit, 128-bit, etc.).
  2. MurmurHash: Another popular, non-cryptographic hash function designed for performance and low collision rates. It's available in various sizes (32-bit, 64-bit, etc.).
  3. xxHash: A fast, non-cryptographic hash function designed for performance and low collision rates. It's available in various sizes (32-bit, 64-bit, 128-bit, etc.).

Recommendation:

Based on the requirements and candidates, I recommend using xxHash (specifically, xxHash64). Here's why:

  1. Speed: xxHash is designed for performance and is often faster than FNV-1a and MurmurHash.
  2. Low collision rate: xxHash has a low collision rate, which is essential for ensuring data integrity.
  3. Hardware compatibility: xxHash is designed to be compatible with a wide range of hardware, including x86, ARM, and PowerPC architectures.
  4. Wide support: xxHash has a large community and is widely used in various projects, including Google's TensorFlow.

xxHash64 specifics:

  • Hash size: 64 bits (8 bytes)
  • Seed: Optional, but recommended to reduce collision rates
  • Endianness: Supports both little-endian and big-endian architectures

Example use case:

In Python, you can use the xxhash library to compute the xxHash64 of a tensor:

import xxhash
import numpy as np

tensor = np.random.rand(10, 10).astype(np.float32)
hash_value = xxhash.xxh64(tensor.tobytes()).hexdigest()
print(hash_value)

In conclusion, xxHash64 is a suitable choice for a non-cryptographic hash function in a ML model file format, offering a good balance between speed, low collision rate, and hardware compatibility.
