The output of the given model (mit-han-lab/Llama-3-8B-QServe-g128) is incorrect #21

Open
haichuan1221 opened this issue Jul 2, 2024 · 4 comments

Comments

@haichuan1221

Here is the output I get! I checked the comments in the code, and it seems that the code is unfinished, right?

{'id': 0, 'text': 'Hello, my name is\tmsgfinished': True}
{'id': 1, 'text': 'The president of the United States is\tmsgfinished': True}
{'id': 2, 'text': 'The capital of France is\tmsg!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!', 'finished': True}
{'id': 3, 'text': 'The future of AI is\tmsgfinished': True}

@ys-2020
Contributor

ys-2020 commented Jul 13, 2024

Hi @haichuan1221, thanks for your interest in QServe. This is not the expected behavior. The code is finished for Llama-3. Could you please provide more details, such as how you launched the e2e generation script?

@Patrick-Lew

Patrick-Lew commented Aug 7, 2024

[UPDATE] I think I have found the reason: the e2e benchmark only supports the Llama-3-8B-Instruct-QServe model; it doesn't work with Llama-3-8B-QServe.
Hi @ys-2020, I ran into the same situation while running the e2e test on the Llama-3-8B-QServe model. The output followed the same pattern: '{one valid token}!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'. I also found that the result of attn_output = fused_attention.single_query_attention was all NaN:
tensor([[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]], device='cuda:0',
dtype=torch.float16)
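
One way to narrow down where NaNs like these first appear is to attach forward hooks to every submodule and report the first one whose output is not finite. The following is only a minimal sketch using standard PyTorch hooks, not part of QServe: `find_first_nonfinite`, `BadLayer`, and `toy` are hypothetical names used for illustration, and the sketch assumes the model can be called directly on the inputs. Since `fused_attention.single_query_attention` is a function call rather than a module, a hook would flag the enclosing attention module, not the kernel itself.

```python
import torch
import torch.nn as nn


def find_first_nonfinite(model: nn.Module, *inputs):
    """Run one forward pass and return the name of the first submodule
    whose output contains NaN or Inf, or None if all outputs are finite."""
    first_bad = []
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            # This sketch inspects only the first element of tuple/list outputs.
            out = output[0] if isinstance(output, (tuple, list)) else output
            if torch.is_tensor(out) and not torch.isfinite(out).all():
                if not first_bad:
                    first_bad.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        # Always remove the hooks so the model is left unchanged.
        for handle in handles:
            handle.remove()
    return first_bad[0] if first_bad else None


class BadLayer(nn.Module):
    """Hypothetical layer that deliberately produces non-finite values."""

    def forward(self, x):
        return x / torch.zeros_like(x)  # division by zero gives inf or NaN


# Toy model standing in for a real network; the second layer is the culprit.
toy = nn.Sequential(nn.Linear(4, 4), BadLayer(), nn.Linear(4, 4))
print(find_first_nonfinite(toy, torch.ones(1, 4)))
```

Hooks fire in the order in which modules finish their forward pass, so the reported name is the earliest module to emit a non-finite value rather than a downstream layer that merely propagates it.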

@YudiZh

YudiZh commented Jan 11, 2025

Hi @Patrick-Lew, have you solved this problem? I ran into the same situation.

@Patrick-Lew

> Hi @Patrick-Lew, have you solved this problem? I ran into the same situation.

Hi @YudiZh, I haven't solved it yet. I actually moved to another project to try out my ideas shortly after I ran into this.
I didn't dive deeply into this project, and I'd be very happy to learn the cause if you can find it.

Hope you can solve that!
