[Bug]: Huge memory overhead with V1 (multiprocessing) when handling several multimodal inputs #16185
Comments
Can you see if
cc @ywang96 @robertgshaw2-redhat maybe it's related to transferring the multimodal inputs between processes?
We have some in-progress improvements for this aspect.
Can you try out #16273 and see if it improves the memory overhead?
It's much better... but it's unfortunately still pretty much unusable. With multiprocessing, handling ~64 1Mpix images used to exceed 52GB and OOM at that point (on a 64GB system...). So while that's likely a 2x improvement, it's still very slow in the preparation phase and the memory use is still very significant. ~16GB of memory to process ~128MB worth of data seems pretty steep.
How much memory does vLLM use if you disable the preprocessor cache?
If the preprocessor cache is disabled, the patch has no effect. There may be some difference at that ~4GB level, since I was mostly looking at the multi-GB consumption.
Quite surprising that disabling the cache actually increases the memory overhead by that much. I guess @njhill 's WIP should address that. Thanks for reporting the results!
Well, the patch depends on the items being in the cache, no? So I would assume that if the cache is disabled, it reverts to the pre-patch behavior, basically.
Yes, you're correct. So the main problem is still about how the data is being transferred in multiprocessing mode.
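(For reference, the cache-disabled runs discussed above correspond to an engine configuration roughly like the sketch below. The `disable_mm_preprocessor_cache` argument name is an assumption and may differ between vLLM versions.)

```python
# Minimal sketch of running with the multimodal preprocessor cache disabled;
# the argument name is assumed and may vary by vLLM version.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct-AWQ",
    limit_mm_per_prompt={"image": 32},
    disable_mm_preprocessor_cache=True,
)
```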
Oh, okay, it seems that serialization itself is not really that big of a problem; the actual issue was in encoding the multimodal args properly. This POC seems to fix the problem:
Thanks @p88h, aren't encoding and serialization essentially the same thing? :) This is similar to what I was planning - it just needs to be a bit more general. Oh, and also to do this in a way that avoids using pickle, since we're planning to disable the use of pickle by default (that was actually a secondary motivation). cc @russellb
In this case they really aren't the same - just as in the previous similar workaround for bare torch.Tensor, serializing it via pickle uses print format, not bytes. I agree something more generic that can just do this in zero-copy fashion (and without pickles) would be preferable. I will have a look at your PR; I updated mine to handle NestedTensors properly in case it helps.
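(To illustrate the bytes-vs-pickle point, here is a minimal sketch - not the code from either PR, and the helper names are made up - of sending a tensor across a process boundary as metadata plus raw bytes instead of pickling it.)

```python
# Minimal sketch: encode a torch.Tensor as metadata + raw bytes so it can be
# sent between processes without pickle, then rebuild it on the other side.
import numpy as np
import torch

def encode_tensor(t: torch.Tensor) -> tuple[dict, bytes]:
    arr = t.detach().cpu().contiguous().numpy()
    meta = {"dtype": arr.dtype.str, "shape": arr.shape}
    return meta, arr.tobytes()  # one copy here; a memoryview would avoid even that

def decode_tensor(meta: dict, buf: bytes) -> torch.Tensor:
    arr = np.frombuffer(buf, dtype=np.dtype(meta["dtype"])).reshape(meta["shape"])
    # np.frombuffer returns a read-only view, which is the kind of thing that
    # triggers the read-only tensor warnings mentioned later in this thread;
    # copy() makes the array writable before wrapping it.
    return torch.from_numpy(arr.copy())
```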
I completed a simple benchmark to quantify the problem along with the impact of the PRs affecting this. Results below. All times are the best of 2 runs each, and memory is averaged. Note that the 32-image run has 50% duplication and the 64-image run has 75% duplication (the image files are repeated 2/4 times). Each image is about 1Mpix and generates ~1000 tokens. Model used: Qwen/Qwen2.5-VL-7B-Instruct-AWQ. Output limited to 1024 tokens. Times marked with * mean the model hallucinates and actually hits the limit.
Note that disabling the cache now effectively disables the effect of #16273 when multiprocessing. All of these changes have no impact on single-process mode. #16279 is built on top of #13790 now, so it can't be deployed separately. With the combined communication improvements this is now usable, but there is still something generating a ton of memory use that the single-process version does not.
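(For context, the averaged-memory numbers above could be produced with a helper along these lines - just a sketch, not the benchmark that was actually used; `psutil` is assumed to be available.)

```python
# Sample total RSS of this process and its children in the background,
# then report the average and peak after the run under test finishes.
import os
import threading
import time

import psutil  # assumed available; not part of vLLM

samples: list[int] = []
stop = threading.Event()

def sample_rss(interval: float = 0.5) -> None:
    """Record total RSS of this process and its children every `interval` seconds."""
    me = psutil.Process(os.getpid())
    while not stop.is_set():
        total = 0
        for p in [me, *me.children(recursive=True)]:
            try:
                total += p.memory_info().rss
            except psutil.NoSuchProcess:
                pass  # a worker exited between listing and sampling
        samples.append(total)
        time.sleep(interval)

t = threading.Thread(target=sample_rss, daemon=True)
t.start()
# ... run the multimodal generation under test here ...
stop.set()
t.join()
print(f"avg {sum(samples) / len(samples) / 2**30:.1f} GiB, "
      f"peak {max(samples) / 2**30:.1f} GiB")
```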
Just to be sure, what are you referring to by "baseline" here? Is it main branch?
@DarkLight1337 baseline = main branch @24f6b9a71397539a3d02c801963220b0e9a2aef9 (=yesterday, before #16273 was applied)
Thanks @p88h, this is great! It's expected that #13790 by itself doesn't change much, but I expect that the current combination of #13790 with your PR should give a significant reduction over your original PR without it. I understand that to test that you'd have to revert to an older version of the PR, though. An easy way to measure the "zero copy" benefits would be to disable that by changing this line from `if not obj.shape or obj.nbytes < INLINE_BUF_SIZE_THRESHOLD:` to `if True:`
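(For readers following along, the check being referenced looks roughly like the sketch below - a paraphrase for illustration, not the exact PR code, and the threshold value is made up. Small or zero-dim tensors are copied inline into the message, larger ones are handed over as separate buffers; replacing the condition with `if True:` forces everything down the inline path.)

```python
import torch

INLINE_BUF_SIZE_THRESHOLD = 256  # bytes; illustrative value only

def encode(obj: torch.Tensor, aux_buffers: list) -> dict:
    if not obj.shape or obj.nbytes < INLINE_BUF_SIZE_THRESHOLD:
        # Small (or zero-dim) tensors: copy the values into the message itself.
        return {"inline": obj.tolist(), "dtype": str(obj.dtype)}
    # Large tensors: pass the underlying buffer out-of-band and only put an
    # index plus metadata into the message (the "zero copy" path).
    aux_buffers.append(memoryview(obj.contiguous().numpy()))
    return {"buf": len(aux_buffers) - 1, "dtype": str(obj.dtype),
            "shape": list(obj.shape)}
```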
Interesting... With inline mode, I do get warnings about read-only Tensors. But that's not the interesting bit. It seems that the overall memory usage is significantly lower with inline mode, and performance is back to single-thread levels, at least for the one test case I've run so far. I'll add some benchmark results to the PR and figure out a way to make this work efficiently by default.
I added benchmark results to the PR.
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
This should be reproducible with Qwen2.5-VL using the `vision_language_multi_image.py` offline inference example. When configured to use several images as input (e.g. just multiply `IMAGE_URLS` 8-16 times), the CPU memory usage spikes dramatically. With just around 20 or so images, vLLM will try to consume around 20-30GB of RAM; with 40 it gets into the 50-60GB range. Interestingly, the number of passed elements seems to be the problem, while the total size causes fewer problems (i.e. when merging multiple images into one, without scaling, it's possible to send ~60 images via 4 collages and still fit within 30GB). This happens regardless of whether qwen-vl-utils is installed to resize the images.

At the same time, request preprocessing starts to slow down. Profiling was not super useful due to multiprocessing, but it did help in that the next step was disabling it: with `VLLM_ENABLE_V1_MULTIPROCESSING=0` the issue completely disappears, the processing delays are gone, and even with ~100 input files the memory usage stays in the low GBs. I haven't tried profiling MessageQueue yet, but perhaps someone else will also run into this, or maybe already has?
BTW - the environment above was collected running within WSL, but the OS doesn't seem to be a significant factor here - while WSL does run into memory issues earlier than raw Linux, running on raw Linux exhibits the same behavior.