-
-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Qwen2-VL incoherent output with OpenAI API #9732
Comments
@alex-jw-brooks can you add this model to your test suite to check whether the current model implementation is ok? And try to debug any issues (see if your test architecture can be easily debugged in practice) |
I tried to follow Qwen's repo instructions: pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
pip install accelerate
pip install qwen-vl-utils
# Change to your CUDA version
CUDA_VERSION=cu121
pip install 'vllm==0.6.1' --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION} (installing the qwen-vl-utils for latest vLLM did not resolve this current issue)
Although it 'works', it's extremely slow, for the 7B it took 3.5s to generate Maybe this insight helps here, I'd love to use qwen2-vl with the latest vLLM version with fastest inference :) |
Edit: Only large images (like 2k or 4k) take several seconds of processing (8 seconds for a 4k image), smaller images take under 0.2s. I don't know if this is due to the vision encoder running sequentially or something else and if that's also true for their HF implementation |
The majority of the time is spent on HF preprocessing. We have plans to move preprocessing out of the critical path to improve the performance. |
Cool! Would this mean one would get almost the same performance as with smaller images or something like a 2x performance gain? |
You can see in #9238 that preprocessing dominates the overall execution time. Even if we move it out of the critical path so that other processes in vLLM can run at the same time as this preprocessing step, we still have to wait for preprocessing to finish before the inputs can be fed into the model. So probably not much gain in this particular case. The best case is when the preprocessing takes around the same time as the other processes. |
@DarkLight1337 Thank you for the insight. What I mean is, would it be worth it to spend time optimizing the preprocessor? Because if so, I'd like to tackle it |
Looking at the profiler output in #9238, you can see that much time is taken up by the preprocessor, so speeding that up would definitely help. However, since most of the preprocessing code is defined inside HuggingFace repo, this is outside our control. See huggingface/transformers#34272 |
Right, sorry for the oversight, it also seems that huggingface/transformers#33810 is working on a fix |
I just posted a similar issue but with totally different params. I wonder if related at all: issue |
@osilverstein I don't think so as any other VLM works for me and it only happens with image input |
@DarkLight1337 Regarding the incoherence of Qwen: Even when supplying 4 smaller images which would easily add up to more tokens than the single big image, it works flawlessly. Something seems to be off with large image processing |
I hear you, but it seems coincidental both issues occur with large inputs and only on the latest version. Too coincidental? I'll ask you this, if you feed in 8k context and ask it how its doing without image input, is it coherent? Then try the same on openrouter. Would help isolate the issue |
@osilverstein Your hypothesis seems to be correct. Throwing an 11k token code (qwens
Very similar incoherence. I'd still like to further discuss this specific problem in your issue Although I am not sure what this implies as my original issue is not context length depended, 5 smaller images (which have way more tokens than one big) works, but one single slightly bigger image produces incoherent output. |
It might not help at all, but the incoherence looks similar to when I was apply the wrong ROPE scaling (at least that was the case when experimenting with exllamav2) |
Interesting finding: |
@SinanAkkoyun What does python 3.10.15 should install, seemly I meet the same issue, thanks a lot!! |
@Wiselnn570 I installed it in python 3.11, I commented in your issue but I am uncertain why you can't build xformers |
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
When running OpenAI API inference, Qwen2-vl-7B-instruct and 2B produce incoherent output as soon as an image is attached. Other VLM models seem to be working fine.
text-only seems to work with Qwen2VL, but introducing images results in
output like this:
``` Q: What is this? Model output: ThisNam: delimited
't screenshot
The
Or: The a is a isScreenshot
]
.py
clipse]
’m are tool):_p this a (ed:
_rate a is_F
V_on/}
ol,}" are can_t lot
The text was updated successfully, but these errors were encountered: