# [Bug]: Regression ~~for AWQ marlin kernels~~ from v0.6.2 to v0.6.3 when using CUDA Graphs #9417
## Comments
We got the same issue, but it works with enforce-eager too.

cc @mgoin to route
Thanks for reporting. @ElizaWszola, do you have any ideas? We added awq_marlin for dense models quite a while ago (#6612), so I wouldn't expect changes between 0.6.2 and 0.6.3. It is possible that some of the binary-size-reduction work introduced a dynamic case, but I don't remember this applying to non-MoE. @joennlae @leangermany, any minimal reproducible example would be appreciated! I tried the following with and without enforce_eager using Llama 3.1 8B AWQ, and found that the output was completely unaffected.

Script:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", enforce_eager=True)
messages = [{"role": "user", "content": "Prove that the difference between two consecutive cubes cannot be divisible by 5, using the fact that the only possible remainders when a cube is divided by 5 are 0, 1, and -1."}]
outputs = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=1000))
print(outputs[0].outputs[0].text)
```

Output: […]
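One way to extend that check is to parameterize `enforce_eager` and diff the two runs. A minimal sketch (not from the thread: the `--eager` flag and the diff workflow are my assumptions; the model name is taken from the script above):

```python
# compare_eager.py -- a sketch: run once with --eager and once without,
# then diff the two outputs. CUDA graphs are only used when enforce_eager=False.
import sys

from vllm import LLM, SamplingParams

eager = "--eager" in sys.argv
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    enforce_eager=eager,
)
messages = [{
    "role": "user",
    "content": "Prove that the difference between two consecutive cubes "
               "cannot be divisible by 5.",
}]
outputs = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=1000))
print(outputs[0].outputs[0].text)
```

Running `python compare_eager.py --eager > eager.txt`, then `python compare_eager.py > graphs.txt`, then `diff eager.txt graphs.txt` isolates the CUDA-graph path: with greedy sampling, any divergence between the two files implicates graph capture rather than sampling noise.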
So, unfortunately, I cannot share the exact prompt as we use sensitive user data.
Maybe it is not a marlin issue (sorry for the call out): #9448
Quantization config: […]

Non-working example, where the answer is not correct:

> Answer: Based onfigured SF- SF- SF- SF- SF- <| | | […]

Working example, with slightly fewer tokens in the context:

> Answer: Based on the provided content, the most common cancer type at baseline is Breast cancer, with 314 participants (31.9%) having this type of cancer.
If I understand correctly, passing […]

Personally, I've run into some odd repetitions when using AWQ models with the flashinfer backend during old vs. new beam search tests. That attention backend even seemed to impact the generation for […]
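For anyone who wants to test the backend angle: vLLM selects the attention backend from the `VLLM_ATTENTION_BACKEND` environment variable. The snippet below is a sketch of an A/B check (model name reused from the script above; backend value names may vary by vLLM version):

```python
# backend_check.py -- a sketch: pin the attention backend via the standard
# vLLM override, then run a greedy generation to compare across backends.
# The env var must be set before the engine is constructed.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # or "FLASH_ATTN", "XFORMERS"

from vllm import LLM, SamplingParams

llm = LLM(model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4")
out = llm.generate(["The capital of France is"], SamplingParams(temperature=0.0, max_tokens=32))
print(out[0].outputs[0].text)
```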
Closing as fixed by #9549
### Your current environment
First of all: fantastic project :-) Thank you for everything.

I would like to fix this bug myself, but I just do not have the capacity right now, so I thought I would at least try to write a good bug report.
### Model Input Dumps

_No response_
### 🐛 Describe the bug

If I run this model in `v0.6.2`: all works well and good :-)

If I run it in `v0.6.3`: all works well and good with `enforce-eager` :-)

If I drop `enforce-eager`, I get random repetition on large prompts (6000+ tokens). Or, if I do multiple requests in parallel, I get:

`CUDA: illegal memory access`

My guess is that there is something dynamic in the updated `awq_marlin` kernels. My hunch (this is untested): #8973, but I do not fully understand how my non-MoE model would be affected by this.
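Since the original model is not named in this excerpt, here is a hedged sketch of what a reproduction along these lines might look like (the model name is a placeholder; the long prompt and parallel requests mirror the two failure modes described above):

```python
# repro_sketch.py -- a sketch of the reported failure modes: CUDA graphs
# enabled (enforce_eager defaults to False), a 6000+ token prompt, and
# several requests in one batch. The model name is a placeholder for an
# AWQ checkpoint; vLLM routes AWQ models to awq_marlin on supported GPUs.
from vllm import LLM, SamplingParams

llm = LLM(model="some-org/Some-Model-AWQ")  # placeholder AWQ model

long_prompt = "word " * 6500   # roughly 6000+ tokens
prompts = [long_prompt] * 8    # multiple requests in parallel
params = SamplingParams(temperature=0.0, max_tokens=200)

# Reported behaviour on v0.6.3: repeated tokens in the output, or a
# "CUDA: illegal memory access" crash; v0.6.2 (or enforce_eager=True) is clean.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text[:120])
```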
### Before submitting a new issue...