Feature Request: Improve model load time when using the RPC backend #12954


Open
rgerganov opened this issue Apr 15, 2025 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@rgerganov
Collaborator

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Load models faster when using one or more RPC servers.

Motivation

The local cache of the rpc-server made things better, but there is still room for improvement.

Possible Implementation

We may explore storing pre-computed hashes in GGUF and avoid loading the entire model on the main host.

@rgerganov rgerganov added the enhancement New feature or request label Apr 15, 2025
@rgerganov rgerganov self-assigned this Apr 15, 2025
@steampunque

steampunque commented Apr 15, 2025

We may explore storing pre-computed hashes in GGUF and avoid loading the entire model on the main host.

This would really help big models like Llama 4 load faster across a large number of RPC servers. Reloads should also be nearly instant once the hashes and the main model weights are cached in system RAM, so they don't need to be transferred from disk at all.

Also, I don't know if it's possible, but some form of automatic per-model cache directory hierarchy management is really necessary. This is how I am currently handling it manually:

Script to launch RPC (ll_startrpc) remotely on a machine via ssh:

if [[ -z $AIDIR ]]; then
   AIDIR=~/.ai
fi

# Get machine config
. "$AIDIR/ll_start.conf"

# Verify the model directories (MODEL_ROOT is a colon-separated list)
IFS=':' read -ra MODEL_PATHS <<< "$MODEL_ROOT"

HAVE_PATH=0
for path in "${MODEL_PATHS[@]}"; do
   if ! test -d "$path"; then
      echo "$path not found"
   else
      HAVE_PATH=1
      break
   fi
done

if [ "$HAVE_PATH" -eq 0 ]; then
   echo "No model path in \"$MODEL_ROOT\" found"
   exit 1
fi

if [ $# -gt 0 ]; then
   RPCMODEL=$1
   echo "RPC CACHE DIR $path/rpccache/$RPCMODEL"
   mkdir -p "$path/rpccache/$RPCMODEL"
   rm -f ~/.cache/llama.cpp/rpc
   ln -sf "$path/rpccache/$RPCMODEL" ~/.cache/llama.cpp/rpc
else
   echo "RPC CACHE DIR $path/rpccache"
   rm -f ~/.cache/llama.cpp/rpc
   ln -sf "$path/rpccache" ~/.cache/llama.cpp/rpc
fi

killall -9 llama-server 2>/dev/null
killall -9 rpc-server 2>/dev/null

rpc-server -c -H 0.0.0.0 -p 50052 2>/dev/null | tee /dev/null &

Then on the master, as part of the model loading script:

.
.
      # Launch server on remote host
      ssh $RPC0 -t "nohup ll_startrpc $MODEL_ID.$MODEL_QUANT"
.
.

A cache directory is then automatically created when models load. I use model shortnames along with quant suffixes to map each model/quant combination to a unique directory.

ls rpccache
DSR132B.IQ4_XS  LL4SI108B.Q2_K_M  QWQ32B.IQ4_XS

The hashes for each model are stored in their individual directories, so they can easily be deleted when a model is updated or obsoleted by newer models. Without some hierarchy management like this, the RPC cache directory is going to grow without bound very quickly and become unmanageable. I'm not sure it's possible to do this automatically in the RPC code, but if it is, it would make launching much less messy.
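One partial mitigation for unbounded cache growth is a periodic cleanup of per-model cache directories that haven't been touched recently. A minimal sketch, assuming a per-model layout like the one above (the `rpccache` path and the 30-day threshold are illustrative, not part of llama.cpp):

```shell
# Hypothetical cleanup: remove per-model cache dirs untouched for 30+ days.
# CACHE_ROOT is an example path; adjust to your own layout.
CACHE_ROOT="$HOME/rpccache"
find "$CACHE_ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
```

This could be run from cron on each RPC host so stale model caches are reclaimed without manual bookkeeping.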

@rgerganov
Collaborator Author

The hashes for each model are stored in their individual directories and they can now be easily deleted when either updated or obsoleted by newer models. Without some hierarchy manager like this the rpc cache directory is going to explode to infinity very quickly and become unmanageable. I'm not sure its possible to do this automatically in rpc code but if possible it would make launching much less messy.

You can override the cache dir with the LLAMA_CACHE env variable (this is documented in the rpc-server README). So you can do:

$ export LLAMA_CACHE=/path/to/rpccache/DSR132B.IQ4_XS
$ rpc-server -c
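With LLAMA_CACHE, the symlink juggling in the launcher script above can be dropped entirely. A minimal per-model launcher sketch (the cache path and the MODEL_TAG argument are illustrative):

```shell
# Hypothetical per-model rpc-server launcher using LLAMA_CACHE.
# MODEL_TAG is a model shortname + quant suffix, e.g. DSR132B.IQ4_XS.
MODEL_TAG="${1:-DSR132B.IQ4_XS}"
export LLAMA_CACHE="$HOME/rpccache/$MODEL_TAG"   # example path
mkdir -p "$LLAMA_CACHE"
echo "using cache dir: $LLAMA_CACHE"
# rpc-server -c -H 0.0.0.0 -p 50052
```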

@steampunque

You can override the cache dir with the LLAMA_CACHE env variable (this is documented in the rpc-server README). So you can do:

$ export LLAMA_CACHE=/path/to/rpccache/DSR132B.IQ4_XS
$ rpc-server -c

Much cleaner, thanks for the tip!

@segmond

segmond commented Apr 22, 2025

I tried the cache and it was great. Thanks!
