rpc : update README for cache usage #12620
Conversation
This functionality is a welcome addition. It speeds up loading a 32B model to one RPC server by about a factor of 4 on a 1 Gb/s local LAN:

`rpc-server -c`: real 0m24.977s

There is much more potential speedup available, however, because I believe the current approach always requires all the weights to be loaded and hashes computed on the master computer loading the model. This is expensive and will still be slow on very big models with many RPC servers.

It should be possible to reduce RPC load time significantly by not reading any weights on the master computer for tensors that will be sent to RPC; instead, just send the gguf name and tensor index over RPC and implement a protocol handler for it.

An intermediate workaround would be a cache of precomputed hashes for the tensors of each gguf on the master computer: when a tensor is about to be loaded and hashed, first check the local precomputed hash cache and skip both reading the tensor weights and computing the hash if a match is found. This would fit into the existing design more smoothly. It would also remove the need to check tensor size: all tensors can be loaded efficiently through the precomputed hash cache mechanism, since no time is spent computing hashes or reading weights from the gguf on the master computer.
Thanks for the feedback. Yes, the current approach requires the main host to read the entire model and compute the hashes. We can put precomputed hashes in the GGUF as metadata, but we'll still need changes in the backend interface to leverage this, e.g.:

`void (*load_tensor) (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, uint64_t hash);`
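To make that direction concrete, here is a minimal sketch of the server-side piece such a hash-taking `load_tensor` would ultimately consult: a cache of previously received tensor data keyed by hash, so a cache hit means the main host never has to read or send the bytes. All names here are invented for illustration; this is not the actual ggml/llama.cpp RPC code.

```cpp
// Sketch only (hypothetical, not the real ggml-rpc implementation): a server-side
// cache keyed by tensor hash. On a hit the main host can skip reading and sending data.
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Stand-in for the RPC server's local cache of previously received tensor data.
static std::unordered_map<uint64_t, std::vector<uint8_t>> tensor_cache;

// Hypothetical handler for a "load tensor by hash" request: returns true (cache hit)
// and copies the cached bytes into dst, or false so the client falls back to sending data.
static bool load_tensor_by_hash(uint64_t hash, std::vector<uint8_t> & dst) {
    auto it = tensor_cache.find(hash);
    if (it == tensor_cache.end()) {
        return false;   // miss: client must read the GGUF and send the bytes as usual
    }
    dst = it->second;   // hit: no transfer of the weights over the network is needed
    return true;
}

int main() {
    tensor_cache[0x1234] = {1, 2, 3, 4};      // pretend this tensor was cached earlier
    std::vector<uint8_t> out;
    printf("hit:  %d\n", load_tensor_by_hash(0x1234, out)); // prints 1
    printf("miss: %d\n", load_tensor_by_hash(0x9999, out)); // prints 0
}
```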
I think that would be a good approach. Hashes are tied to the gguf if stored in metadata, so there is no issue with them getting out of sync or with maintaining and indexing a messy local precompute cache. If I understand correctly, this would make RPC loading about as fast as it could possibly be, since the host no longer needs to read or hash the tensors going out to RPC.
I think it would be good to profile how much of the current implementation's time is spent on hashing versus loading the data. If hashing is the bottleneck, it could be sped up a lot.
The bottleneck is almost certainly loading the data.

Experiment 1, compute an MD5 of the 32B model:

`echo 3 > /proc/sys/vm/drop_caches && time md5sum QwQ-32B.IQ4_XS.gguf` → real 1m10.109s

Experiment 2, read the 32B model off disk:

`echo 3 > /proc/sys/vm/drop_caches && time cat QwQ-32B.IQ4_XS.gguf >/dev/null` → real 1m9.894s

So the MD5 compute adds negligible overhead.

Experiment 3, load the 32B model into CPU:

`echo 3 > /proc/sys/vm/drop_caches && time NGL=0 DRAFT=-1 ll_start qwq 107` → real 1m16.130s

Only 5s slower than reading the raw file. My models are stored on a 16TB Exos external HD connected via USB3. Clearly the results will change a lot with an NVMe SSD, but I think USB3 to HD is potentially a very widespread use case.

Once the file is cached in system RAM, things get much faster:

`time md5sum QwQ-32B.IQ4_XS.gguf` → real 0m28.267s
`time cat QwQ-32B.IQ4_XS.gguf >/dev/null` → real 0m1.504s

This shows the MD5 itself is quite expensive (9900K machine).

`time NGL=0 DRAFT=-1 ll_start qwq 107` → real 0m4.068s

This shows an extremely fast reload once the model is in the system RAM cache.

The conclusion is that RPC model loading could be sped up by a factor of approximately N, where N is the total number of computers (main host plus RPC servers: 2x for 2 machines, 3x for 3 machines, etc.), by not loading all the model weights on the main host and instead just reading a precomputed hash for each tensor from the gguf metadata and shipping that. The reload speedup after the model is cached in RAM will be more than this factor of N (the 4x I measured was for a reload, i.e. the RPC server had cached the tensors and the host had cached the full model in RAM, so no model data came off disk). I would expect very fast reload times, much better than 4x, with precomputed hashes.
@steampunque I have created #12954 to track this effort
Yeah, MD5 is definitely not a good hash to use for this. https://github.com/Cyan4973/xxHash is much faster and really simple to implement: https://create.stephan-brumme.com/xxhash/

Also, do we really need to hash the full tensor? Would hashing just the beginning and end bytes, with the length XORed into the hash, not provide a pretty reliable hash?
I think the full tensor is needed due to the possibility of fine-tunes which might change only parts of the data. If any byte anywhere in the data is different, it needs to produce a different hash to be reliable.
We are using the FNV-1a hash.
If we don't have to read the whole tensor for computing the hash, then we can achieve a significant speed up when using
Unfortunately that might be the case.
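For reference, here is a minimal sketch of the standard 64-bit FNV-1a algorithm mentioned above, using the standard FNV offset basis and prime. It is an illustration of the algorithm, not a copy of the ggml-rpc implementation.

```cpp
#include <cstdint>
#include <cstdio>

// Standard 64-bit FNV-1a: XOR each byte into the state, then multiply by the FNV prime.
static uint64_t fnv1a_64(const uint8_t * data, size_t len) {
    uint64_t hash = 0xcbf29ce484222325ULL;      // FNV-1a 64-bit offset basis
    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= 0x100000001b3ULL;               // FNV-1a 64-bit prime
    }
    return hash;
}

int main() {
    const uint8_t buf[] = { 0x01, 0x02, 0x03, 0x04 };
    printf("%016llx\n", (unsigned long long) fnv1a_64(buf, sizeof(buf)));
}
```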
It's probably worth checking out the xxHash algorithm then, as it's probably an order of magnitude faster. AFAIK, it's the fastest non-cryptographic hash to date.
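If xxHash were adopted, hashing a tensor's bytes would be roughly as simple as the sketch below. It assumes the `xxhash.h` header from the xxHash repository linked above is on the include path; this is not current llama.cpp code.

```cpp
// Sketch: hashing a buffer with XXH64 from the xxHash library linked above.
#include <cstdio>
#include <vector>
#include "xxhash.h"

int main() {
    std::vector<float> tensor(1024, 1.5f);              // stand-in for tensor data
    const XXH64_hash_t h = XXH64(tensor.data(),
                                 tensor.size() * sizeof(float),
                                 0 /* seed */);
    printf("%016llx\n", (unsigned long long) h);
}
```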
I think it's fairly unlikely, but if you want to reduce the chances even more, then instead of just hashing the beginning and end you could sample a small fraction of the bytes using a low-discrepancy sequence. There is almost zero chance that a fine-tuned model would change only the bytes of such a small sample of the sequence, but I understand if this still seems too risky :)
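The exact sequence proposed in the comment above is not preserved here; the following is a minimal sketch of one common choice, the additive golden-ratio (Kronecker) sequence, used to pick which byte offsets to feed into an FNV-1a hash, with the length XORed in as suggested earlier. The function name and parameters are illustrative assumptions.

```cpp
// Sketch: hash only a small sample of bytes chosen by the additive golden-ratio
// sequence (a common low-discrepancy sequence), plus the tensor length.
// Illustrative only; not the sequence or code proposed in the discussion.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

static uint64_t sampled_fnv1a(const uint8_t * data, size_t len, size_t samples) {
    const double phi_frac = 0.6180339887498949;  // fractional part of the golden ratio
    uint64_t hash = 0xcbf29ce484222325ULL;       // FNV-1a offset basis
    double x = 0.0;
    for (size_t k = 0; k < samples; k++) {
        x += phi_frac;
        x -= std::floor(x);                      // keep the fractional part in [0, 1)
        const size_t idx = (size_t)(x * (double) len);
        hash ^= data[idx];
        hash *= 0x100000001b3ULL;                // FNV-1a prime
    }
    // XOR in the length so tensors of different sizes never collide trivially.
    return hash ^ (uint64_t) len;
}

int main() {
    std::vector<uint8_t> tensor(1 << 20, 0xAB);  // 1 MiB stand-in for tensor data
    printf("%016llx\n",
           (unsigned long long) sampled_fnv1a(tensor.data(), tensor.size(), 64));
}
```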
Interesting idea. It's hard (impossible?) to get to 100% certainty without looking at every byte, though. Even the hashes themselves are many-to-one if one wants to get pedantic. However, the entire concern about the hash goes away if the hashes are stored as metadata in the gguf. There they can also act as a self-check of data integrity during reads (where again every byte is wanted), even if the RPC feature doesn't make use of them. If they are used for the self-check you still want a fast hash, but if not, then hash compute time becomes essentially a don't-care and even SHA-256 could be used.
Just for fun, Llama-4 Scout Q3_K_M's thoughts on the prompt `lm recommend the best non-cryptographic hash to use for tensors in a ML model file format supporting a wide range of both old and new hardware`:

Choosing a non-cryptographic hash function for tensors in a ML model file format that needs to support a wide range of both old and new hardware requires careful consideration. The hash function should be efficient, have a low collision rate, and be compatible with various architectures. Here are some factors to consider and a recommendation:

Requirements:

Candidates:

Recommendation: Based on the requirements and candidates, I recommend using xxHash (specifically, xxHash64). Here's why:

xxHash64 specifics:

Example use case: In Python, you can use the xxhash library:

```python
import xxhash
import numpy as np

tensor = np.random.rand(10, 10).astype(np.float32)
hash_value = xxhash.xxh64(tensor.tobytes()).hexdigest()
print(hash_value)
```

In conclusion, xxHash64 is a suitable choice for a non-cryptographic hash function in a ML model file format, offering a good balance between speed, low collision rate, and hardware compatibility.
ref: #10095