[Feature]: Composite model loading using AutoWeightsLoader for all models #15697
Comments
@DarkLight1337 I can try to process several models.

Can you indicate which models you are working on, to avoid others duplicating your work?

I haven't started yet; do you have any recommendations for simpler models?

Most language models should work in pretty much the same way (except SSMs like Mamba, I guess). You can go in alphabetical order.

Thanks ~
Tip: these are the unimplemented models. I used a shell script to count them; this is the approximate result:
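(For illustration, a rough Python equivalent of such a count; the commenter used a shell script, and the exact criterion they counted by is not stated. This sketch assumes it is run from the vLLM repository root and that already-migrated models mention `AutoWeightsLoader` somewhere in their source file.)

```python
from pathlib import Path

# List model files that do not yet reference AutoWeightsLoader.
# Run from the vLLM repository root.
MODELS_DIR = Path("vllm/model_executor/models")

pending = [
    path.name
    for path in sorted(MODELS_DIR.glob("*.py"))
    if "AutoWeightsLoader" not in path.read_text(encoding="utf-8")
]
print(f"{len(pending)} model files still lack AutoWeightsLoader:")
print("\n".join(pending))
```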
The multi-modal models don't need this change; can you remove them from the list?

Hi, I could take on a few models, e.g. baichuan, gpt_neox, and mpt.

I’ll take on a few more models next week.
@DarkLight1337 I can add two new skip fields, e.g. in `vllm/model_executor/models/utils.py` (line 87 at `e9528f6`).

Unless you also need

Issue record: #16548 (comment)
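(For context, a minimal sketch of the existing skip mechanism this proposal builds on; the two new fields are defined in #16548 and are not shown here, since their exact names live in that issue.)

```python
from torch import nn

from vllm.model_executor.models.utils import AutoWeightsLoader

# Stand-in for a real vLLM model class; AutoWeightsLoader only needs an
# nn.Module plus the skip configuration.
model = nn.Linear(8, 8)

# Weights whose checkpoint names start with "lm_head." are ignored.
loader = AutoWeightsLoader(model, skip_prefixes=["lm_head."])
```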
🚀 The feature, motivation and pitch
#9160 first introduced `AutoWeightsLoader` to recursively call `load_weights` on sub-modules. This lets composite models (most notably multi-modal models) use language backbones (`*Model` classes such as `LlamaModel`) without having to repeat their weight loading logic.

Currently, `load_weights` is only implemented in a few language backbones. It would be great to standardize this approach and apply it to all language backbones in vLLM. The steps to do this are pretty straightforward:

1. Move the `load_weights` function from `*ForCausalLM` to `*Model`.
2. Add a `load_weights` function in `*ForCausalLM` that loads the weights using `AutoWeightsLoader`.
3. Move any logic in `*Model.load_weights` that only applies to `*ForCausalLM` back to `*ForCausalLM.load_weights`. Usually, this involves `lm_head`.

For reference, you can look at the implementation for models such as Llama, Gemma2/3, Qwen2 and ChatGLM. A condensed sketch of these steps is shown below.
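A minimal sketch of the pattern, using a hypothetical `FooModel`/`FooForCausalLM` pair with toy layers; exact signatures (e.g. whether `load_weights` returns the set of loaded parameter names) vary slightly between vLLM versions:

```python
from typing import Iterable, Set, Tuple

import torch
from torch import nn

from vllm.model_executor.models.utils import AutoWeightsLoader


class FooModel(nn.Module):
    """Hypothetical language backbone (step 1: weight loading lives here)."""

    def __init__(self) -> None:
        super().__init__()
        self.embed_tokens = nn.Embedding(1000, 64)
        self.layers = nn.ModuleList(nn.Linear(64, 64) for _ in range(2))

    def load_weights(self,
                     weights: Iterable[Tuple[str, torch.Tensor]]) -> Set[str]:
        # Backbone-specific logic (stacked-parameter mapping, etc.) goes
        # here; this sketch just copies tensors into matching parameters.
        # AutoWeightsLoader strips the "model." prefix before calling this.
        params_dict = dict(self.named_parameters())
        loaded: Set[str] = set()
        for name, weight in weights:
            params_dict[name].data.copy_(weight)
            loaded.add(name)
        return loaded


class FooForCausalLM(nn.Module):
    """Steps 2 and 3: delegate to AutoWeightsLoader and keep
    *ForCausalLM-only concerns (lm_head) at this level."""

    def __init__(self, tie_word_embeddings: bool = False) -> None:
        super().__init__()
        self.tie_word_embeddings = tie_word_embeddings
        self.model = FooModel()
        self.lm_head = nn.Linear(64, 1000, bias=False)

    def load_weights(self,
                     weights: Iterable[Tuple[str, torch.Tensor]]) -> Set[str]:
        # AutoWeightsLoader recursively dispatches to sub-modules that
        # define load_weights (self.model here). With tied embeddings there
        # is no separate lm_head weight in the checkpoint, so skip it.
        loader = AutoWeightsLoader(
            self,
            skip_prefixes=(["lm_head."]
                           if self.tie_word_embeddings else None),
        )
        return loader.load_weights(weights)
```

Keeping the `lm_head` handling out of `FooModel.load_weights` is the point of step 3: it lets a multi-modal wrapper reuse the backbone without dragging in decoder-only concerns.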
To avoid scope creep, I suggest opening a PR for updating only a few models at a time.
Alternatives
No response
Additional context
No response