Feature Request: Improve model load time when using the RPC backend #12954


Open
rgerganov opened this issue Apr 15, 2025 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@rgerganov
Collaborator

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Load models faster when using one or more RPC servers.

Motivation

The local cache of the rpc-server made things better, but there is still room for improvement.

Possible Implementation

We may explore storing pre-computed hashes in GGUF and avoid loading the entire model on the main host.

@rgerganov rgerganov added the enhancement New feature or request label Apr 15, 2025
@rgerganov rgerganov self-assigned this Apr 15, 2025
@steampunque

steampunque commented Apr 15, 2025

We may explore storing pre-computed hashes in GGUF and avoid loading the entire model on the main host.

This would really help big models like Llama 4 load faster across a large number of RPC servers. Reloads should also be nearly instant once the hashes and the main model weights are cached in system RAM, so they don't need to be transferred from disk at all.

Also, I don't know if it's possible, but some form of automatic per-model cache directory hierarchy management is really necessary. This is how I am currently handling it manually:

Script to launch RPC (ll_startrpc) remotely on a machine via ssh:

if [[ -z $AIDIR ]]; then
   AIDIR=~/.ai
fi

# Get machine config
. "$AIDIR/ll_start.conf"

# Verify the model directories (MODEL_ROOT is a colon-separated list)
IFS=':' read -ra MODEL_PATHS <<< "$MODEL_ROOT"

HAVE_PATH=0
for path in "${MODEL_PATHS[@]}"; do
   if ! test -d "$path"; then
      echo "$path not found"
   else
      HAVE_PATH=1
      break
   fi
done

if [ "$HAVE_PATH" -eq 0 ]; then
   echo "No model path in \"$MODEL_ROOT\" found"
   exit 1
fi

if [ $# -gt 0 ]; then
   RPCMODEL=$1
   echo "RPC CACHE DIR $path/rpccache/$RPCMODEL"
   mkdir -p "$path/rpccache/$RPCMODEL"
   rm -f ~/.cache/llama.cpp/rpc
   ln -sf "$path/rpccache/$RPCMODEL" ~/.cache/llama.cpp/rpc
else
   echo "RPC CACHE DIR $path/rpccache"
   rm -f ~/.cache/llama.cpp/rpc
   ln -sf "$path/rpccache" ~/.cache/llama.cpp/rpc
fi

killall -9 llama-server 2>/dev/null
killall -9 rpc-server 2>/dev/null

rpc-server -c -H 0.0.0.0 -p 50052 2>/dev/null | tee /dev/null &

Then on the master, as part of the model loading script:

.
.
      # Launch server on remote host
      ssh $RPC0 -t "nohup ll_startrpc $MODEL_ID.$MODEL_QUANT"
.
.

A cache directory is then automatically created when models load. I use model shortnames along with quant suffixes to map each model/quant combination to a unique directory.

ls rpccache
DSR132B.IQ4_XS  LL4SI108B.Q2_K_M  QWQ32B.IQ4_XS

The hashes for each model are stored in their individual directories, so they can easily be deleted when a model is updated or obsoleted by newer models. Without some hierarchy management like this, the RPC cache directory is going to grow without bound very quickly and become unmanageable. I'm not sure it's possible to do this automatically in the RPC code, but if it is, it would make launching much less messy.
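One partial mitigation for unbounded cache growth is a periodic cleanup of per-model cache directories that haven't been touched recently. A minimal sketch, assuming a per-model layout like the one above (the `rpccache` path and the 30-day threshold are illustrative, not part of llama.cpp):

```shell
# Hypothetical cleanup: remove per-model cache dirs untouched for 30+ days.
# CACHE_ROOT is an example path; adjust to your own layout.
CACHE_ROOT="$HOME/rpccache"
find "$CACHE_ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
```

This could be run from cron on each RPC host so stale model caches are reclaimed without manual bookkeeping.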

@rgerganov
Collaborator Author

The hashes for each model are stored in their individual directories and they can now be easily deleted when either updated or obsoleted by newer models. Without some hierarchy manager like this the rpc cache directory is going to explode to infinity very quickly and become unmanageable. I'm not sure its possible to do this automatically in rpc code but if possible it would make launching much less messy.

You can override the cache dir with the LLAMA_CACHE env variable (this is documented in the rpc-server README). So you can do:

$ export LLAMA_CACHE=/path/to/rpccache/DSR132B.IQ4_XS
$ rpc-server -c
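With LLAMA_CACHE, the symlink juggling in the launcher script above can be dropped entirely. A minimal per-model launcher sketch (the cache path and the MODEL_TAG argument are illustrative):

```shell
# Hypothetical per-model rpc-server launcher using LLAMA_CACHE.
# MODEL_TAG is a model shortname + quant suffix, e.g. DSR132B.IQ4_XS.
MODEL_TAG="${1:-DSR132B.IQ4_XS}"
export LLAMA_CACHE="$HOME/rpccache/$MODEL_TAG"   # example path
mkdir -p "$LLAMA_CACHE"
echo "using cache dir: $LLAMA_CACHE"
# rpc-server -c -H 0.0.0.0 -p 50052
```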

@steampunque

You can override the cache dir with the LLAMA_CACHE env variable (this is documented in the rpc-server README). So you can do:

$ export LLAMA_CACHE=/path/to/rpccache/DSR132B.IQ4_XS
$ rpc-server -c

Much cleaner, thanks for the tip!

@segmond

segmond commented Apr 22, 2025

I tried the cache and it was great. Thanks!
