Feature Request: Improve model load time when using the RPC backend #12954
Comments
This would really help big models like Llama 4 load faster across a large number of RPC servers. Reloads should also be near-instant once the hashes and the main model's weights are cached in system RAM, so nothing needs to be transferred from disk at all. Also, I don't know if it's possible, but some form of automatic per-model cache directory hierarchy management is really needed. This is how I currently handle it manually. Script to launch the RPC server (ll_startrpc) remotely on a machine via ssh:
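A minimal sketch of what such a launcher could look like (the actual script was not posted; the binary path, port, and the use of LLAMA_CACHE to pick the per-model cache directory are assumptions, see the LLAMA_CACHE tip further down):

```sh
#!/bin/sh
# ll_startrpc (sketch): start rpc-server on a remote machine with its local
# tensor cache enabled (-c), pointing the cache at a per-model directory.
# Binary path, port, and exact option names are assumptions and may differ.

MODEL_TAG="$1"   # e.g. DSR132B.IQ4_XS (model short name + quant)
RPC_HOST="$2"    # remote machine that will run rpc-server

ssh "$RPC_HOST" "LLAMA_CACHE=~/rpccache/$MODEL_TAG \
    ~/llama.cpp/build/bin/rpc-server -c --host 0.0.0.0 --port 50052"
```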
Then on the master, as part of the model loading script:
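And a sketch of the master side, assuming the launcher above plus hypothetical hosts, paths, and ports:

```sh
#!/bin/sh
# Sketch of the master-side launch: start the remote rpc-server instances,
# passing the per-model tag so each one uses its own cache directory, then
# load the model with the RPC backends attached.

MODEL_TAG="DSR132B.IQ4_XS"
MODEL_PATH="$HOME/models/$MODEL_TAG.gguf"

./ll_startrpc "$MODEL_TAG" worker1 &
./ll_startrpc "$MODEL_TAG" worker2 &
sleep 5   # crude: give the remote servers a moment to come up

~/llama.cpp/build/bin/llama-server -m "$MODEL_PATH" \
    --rpc worker1:50052,worker2:50052 -ngl 99
```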
A cache directory is then created automatically when a model loads. I use model short names along with quant extensions so that each model/quant combination maps to a unique directory.
The hashes for each model are stored in their individual directories, so they can easily be deleted when a model is updated or made obsolete by newer models. Without some hierarchy management like this, the RPC cache directory will grow without bound very quickly and become unmanageable. I'm not sure it's possible to do this automatically in the RPC code, but if it is, it would make launching much less messy.
You can override the cache dir with the LLAMA_CACHE environment variable:

```sh
$ export LLAMA_CACHE=/path/to/rpccache/DSR132B.IQ4_XS
$ rpc-server -c
```
Much cleaner, thanks for the tip!
I tried the cache and it was great. Thanks!
Feature Description
Load models faster when using one or several RPC servers.
Motivation
The local cache of the rpc-server made things better, but there is still room for improvement.
Possible Implementation
We may explore storing pre-computed hashes in GGUF and avoid loading the entire model on the main host.
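To make the idea concrete, the hashes could be computed once, offline, and shipped with (or embedded in) the GGUF, so the main host would not have to read every tensor at load time just to derive them. A rough sketch of that offline step using the existing llama-gguf-hash tool, purely as an illustration; the exact options, output format, and how the result would be embedded into GGUF metadata are assumptions, not a proposed design:

```sh
# One-time, offline step (sketch): compute per-tensor hashes for a model so
# they can be reused instead of being recomputed from the full tensor data on
# every load. Paths and options are illustrative and may differ by version.
~/llama.cpp/build/bin/llama-gguf-hash --xxh64 \
    ~/models/DSR132B.IQ4_XS.gguf > ~/models/DSR132B.IQ4_XS.gguf.xxh64
```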