
Add qwen2 support #1679

Open · pass-lin opened this issue Jun 28, 2024 · 6 comments
Labels: help wanted · stat:contributions welcome · type:feature

@pass-lin

The Qwen2 model is the SOTA on the HF open-LLM leaderboard, and compared with the Llama model the only difference is one extra bias in the QKV dense layers of the MHA block (see the sketch below). Therefore, only a few modifications are required to achieve compatibility with this high-quality model.
Similarly, the Yi model is also a powerful Chinese LLM. Its performance is comparable to that of Qwen2, and it fully adopts the Llama architecture.
Therefore, in theory, making keras_nlp compatible with these two models should not take much time. I hope compatibility with them can be added in the future.
https://huggingface.co/Qwen
https://huggingface.co/01-ai
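
For illustration, a minimal sketch of that one difference in plain Keras (the sizes and variable names here are hypothetical, not keras_nlp's actual API):

```python
import keras

# Hypothetical sizes for illustration only.
num_heads, head_dim = 16, 64

# Llama-style attention: the Q/K/V projections carry no bias.
llama_q_proj = keras.layers.Dense(num_heads * head_dim, use_bias=False)

# Qwen2: identical, except each Q/K/V projection adds a bias term.
qwen2_q_proj = keras.layers.Dense(num_heads * head_dim, use_bias=True)
```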

@mattdangerw added the help wanted and stat:contributions welcome labels on Jul 18, 2024
@mattdangerw (Member)

This is open for contributions if anyone would like to take this up!

@mattdangerw changed the title from "Will keras_nlp support qwen2 model in future?" to "Add qwen2 support" on Sep 17, 2024
@SAIREDDY07

Hi @mattdangerw, @pass-lin, @sachinprasadhs,

I would love to contribute to this issue. I'm curious about the integration process and eager to help implement the Qwen2 model in Keras Hub. Could you please provide more details or guidance on how I can get started?

Thank you!

@pass-lin (Author) commented on Dec 2, 2024

> @SAIREDDY07: I would love to contribute to this issue. Could you please provide more details or guidance on how I can get started?

Thank you! The Qwen model and the Llama model have almost exactly the same network structure. In fact, the only difference is that the Q, K, and V projections in the attention layer each carry an additional bias term. Also, Qwen 0.5B and Qwen 1.8B share the same layer for the input and output embeddings (tied weights). The more troublesome parts are actually the tokenizer and weight loading. The tokenizer could be handled by wrapping the HF implementation, and weight loading can follow how the Llama converter reads HF weights (see the sketch below).
The implementation itself is not complex. The open question is whether the Qwen model should be grouped with the Llama3 model and distinguished by config and keywords, or set up as a separate class. Personally, I lean towards the former, because Qwen and Llama are so similar that splitting them into two classes seems unnecessary.
I plan to work on this implementation during the Chinese New Year holiday, which is about two months away. But if you are interested in contributing, you are welcome to take on this task.
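
To make the weight-loading point concrete, here is a hedged sketch of the extra step a Qwen2 converter would need on top of the existing Llama conversion logic. The HF names follow the Qwen2 checkpoint layout; the Keras-side attribute names are hypothetical:

```python
def port_qwen2_attention(keras_attention, hf_state_dict, i):
    """Port one HF Qwen2 attention block into a Llama-style Keras layer."""
    prefix = f"model.layers.{i}.self_attn"
    for name in ("q_proj", "k_proj", "v_proj"):
        layer = getattr(keras_attention, name)  # hypothetical attribute names
        # HF stores Linear weights as (out, in); Keras Dense expects (in, out).
        kernel = hf_state_dict[f"{prefix}.{name}.weight"].numpy().T
        # The bias is the one weight that Llama checkpoints do not have.
        bias = hf_state_dict[f"{prefix}.{name}.bias"].numpy()
        layer.set_weights([kernel, bias])
```

The tied input/output embedding of the 0.5B and 1.8B checkpoints maps naturally onto keras_nlp's ReversibleEmbedding layer with tie_weights=True.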

@pass-lin (Author) commented on Dec 3, 2024

> @SAIREDDY07: I would love to contribute to this issue. Could you please provide more details or guidance on how I can get started?

Additionally, I would not recommend adding support for the Qwen model in the near future, as I have found some issues with the Llama implementation in Keras Hub:
#1993

@SAIREDDY07

Hi @pass-lin, thank you for sharing your insights and for pointing out issue #1993 early.
I'd like to understand more about the challenges you've identified. Could you elaborate on the specific causes of the issue with the Llama implementation in Keras Hub? Do you think the problem stems from the way the attention mechanism is implemented, or are there other architectural or integration factors at play? Your guidance would help clarify the situation and assist in addressing the root cause.

@pass-lin (Author) commented on Dec 3, 2024

> @SAIREDDY07: Could you elaborate on the specific causes of the issue with the Llama implementation in Keras Hub? Do you think the problem stems from the way the attention mechanism is implemented, or are there other factors at play?

Actually, I'm not sure why there is such a significant difference in accuracy, because in my own implementation of Llama the average numerical difference from HF is clearly smaller than that of the keras_hub implementation; it is roughly on the order of 1e-2. Even so, it still affects the actual performance of the model, for example making it very prone to repetition during decoding. Forgive me for not having the ability and energy to pinpoint the exact implementation differences layer by layer.
The differences between my implementation and the HF implementation are roughly as follows:

```
tensor(3.9688, device='cuda:0', dtype=torch.bfloat16, grad_fn=<MaxBackward1>)
tensor(0., device='cuda:0', dtype=torch.bfloat16, grad_fn=<MinBackward1>)
tensor(0.1221, device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<ToCopyBackward0>)
tensor(0.1689, device='cuda:0', dtype=torch.bfloat16, grad_fn=<StdBackward0>)
tensor([[0.6172, 0.5156, 3.9688, 1.3750, 1.1875, 1.1094, 1.2031, 0.5625, 0.6406,
         0.4375, 0.6172, 1.6328, 0.8438, 0.5703, 0.9453, 0.7109, 0.6641, 0.9062,
         0.6875, 0.3906, 0.4062, 0.9062]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<AmaxBackward0>)
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
       device='cuda:0', dtype=torch.bfloat16, grad_fn=<AminBackward0>)
tensor([[0.1079, 0.0747, 0.6094, 0.1748, 0.1484, 0.1562, 0.1099, 0.0796, 0.0835,
         0.0684, 0.0854, 0.1167, 0.0962, 0.0776, 0.0889, 0.1006, 0.0767, 0.0723,
         0.1138, 0.0547, 0.0515, 0.1348]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>)
tensor([[0.0688, 0.0569, 0.4688, 0.1367, 0.1143, 0.1240, 0.0889, 0.0618, 0.0659,
         0.0532, 0.0664, 0.0972, 0.0762, 0.0613, 0.0742, 0.0811, 0.0598, 0.0591,
         0.0859, 0.0422, 0.0400, 0.1055]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<StdBackward0>)
```
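
For reference, a sketch of how per-position statistics like the ones above could be produced; keras_logits and hf_logits are hypothetical [1, seq_len, vocab_size] logits from the two implementations on the same prompt:

```python
import torch

def report_diff(keras_logits: torch.Tensor, hf_logits: torch.Tensor) -> None:
    # Absolute element-wise difference between the two implementations.
    diff = (keras_logits - hf_logits).abs()
    print(diff.max())          # global max
    print(diff.min())          # global min
    print(diff.mean())         # global mean
    print(diff.std())          # global std
    print(diff.amax(dim=-1))   # per-token max over the vocab axis
    print(diff.amin(dim=-1))   # per-token min
    print(diff.mean(dim=-1))   # per-token mean
    print(diff.std(dim=-1))    # per-token std
```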
