
Add qwen2 support #1679

Open · pass-lin opened this issue Jun 28, 2024 · 6 comments
Labels: help wanted · stat:contributions welcome · type:feature

@pass-lin

The Qwen2 model is the SOTA on the HF open-LLM leaderboard, and compared with the Llama model the only difference is one extra bias in the QKV dense layers of the MHA block (see the sketch below). Therefore, only a few modifications are required to achieve compatibility with this high-quality model.
Similarly, the Yi model is also a powerful Chinese LLM. Its performance is comparable to that of Qwen2, and it fully adopts the Llama architecture.
Therefore, in theory, making keras_nlp compatible with these two models should not take much time. I hope compatibility with them can be added in the future.
https://huggingface.co/Qwen
https://huggingface.co/01-ai
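
For illustration, a minimal sketch of that one difference in plain Keras (the sizes and variable names here are hypothetical, not keras_nlp's actual API):

```python
import keras

# Hypothetical sizes for illustration only.
num_heads, head_dim = 16, 64

# Llama-style attention: the Q/K/V projections carry no bias.
llama_q_proj = keras.layers.Dense(num_heads * head_dim, use_bias=False)

# Qwen2: identical, except each Q/K/V projection adds a bias term.
qwen2_q_proj = keras.layers.Dense(num_heads * head_dim, use_bias=True)
```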

@mattdangerw added the help wanted and stat:contributions welcome labels on Jul 18, 2024
@mattdangerw (Member)

This is open for contributions if anyone would like to take this up!

@mattdangerw changed the title from "Will keras_nlp support qwen2 model in future?" to "Add qwen2 support" on Sep 17, 2024
@SAIREDDY07

Hi @mattdangerw, @pass-lin, @sachinprasadhs,

I would love to contribute to this issue. I'm curious about the integration process and eager to help implement the Qwen2 model in Keras Hub. Could you please provide more details or guidance on how I can get started?

Thank you!

@pass-lin (Author) commented on Dec 2, 2024

> @SAIREDDY07: I would love to contribute to this issue. Could you please provide more details or guidance on how I can get started?

Thank you! The Qwen model and the Llama model have almost exactly the same network structure. In fact, the only difference is that the Q, K, and V projections in the attention layer each carry an additional bias term. Also, Qwen 0.5B and Qwen 1.8B share the same layer for the input and output embeddings (tied weights). The more troublesome parts are actually the tokenizer and weight loading. The tokenizer could be handled by wrapping the HF implementation, and weight loading can follow how the Llama converter reads HF weights (see the sketch below).
The implementation itself is not complex. The open question is whether the Qwen model should be grouped with the Llama3 model and distinguished by config and keywords, or set up as a separate class. Personally, I lean towards the former, because Qwen and Llama are so similar that splitting them into two classes seems unnecessary.
I plan to work on this implementation during the Chinese New Year holiday, which is about two months away. But if you are interested in contributing, you are welcome to take on this task.
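
To make the weight-loading point concrete, here is a hedged sketch of the extra step a Qwen2 converter would need on top of the existing Llama conversion logic. The HF names follow the Qwen2 checkpoint layout; the Keras-side attribute names are hypothetical:

```python
def port_qwen2_attention(keras_attention, hf_state_dict, i):
    """Port one HF Qwen2 attention block into a Llama-style Keras layer."""
    prefix = f"model.layers.{i}.self_attn"
    for name in ("q_proj", "k_proj", "v_proj"):
        layer = getattr(keras_attention, name)  # hypothetical attribute names
        # HF stores Linear weights as (out, in); Keras Dense expects (in, out).
        kernel = hf_state_dict[f"{prefix}.{name}.weight"].numpy().T
        # The bias is the one weight that Llama checkpoints do not have.
        bias = hf_state_dict[f"{prefix}.{name}.bias"].numpy()
        layer.set_weights([kernel, bias])
```

The tied input/output embedding of the 0.5B and 1.8B checkpoints maps naturally onto keras_nlp's ReversibleEmbedding layer with tie_weights=True.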

@pass-lin (Author) commented on Dec 3, 2024

> @SAIREDDY07: I would love to contribute to this issue. Could you please provide more details or guidance on how I can get started?

Additionally, I would not recommend adding support for the Qwen model in the near future, as I have found some issues with the Llama implementation in Keras Hub:
#1993

@SAIREDDY07

Hi @pass-lin, thank you for sharing your insights and for pointing out issue #1993 early.
I'd like to understand more about the challenges you've identified. Could you elaborate on the specific causes of the issue with the Llama implementation in Keras Hub? Do you think the problem stems from the way the attention mechanism is implemented, or are there other architectural or integration factors at play? Your guidance would help clarify the situation and assist in addressing the root cause.

@pass-lin (Author) commented on Dec 3, 2024

> @SAIREDDY07: Could you elaborate on the specific causes of the issue with the Llama implementation in Keras Hub? Do you think the problem stems from the way the attention mechanism is implemented, or are there other factors at play?

Actually, I'm not sure why there is such a significant difference in accuracy, because in my own implementation of Llama the average numerical difference from HF is clearly smaller than that of the keras_hub implementation; it is roughly on the order of 1e-2. Even so, it still affects the actual performance of the model, for example making it very prone to repetition during decoding. Forgive me for not having the ability and energy to pinpoint the exact implementation differences layer by layer.
The differences between my implementation and the HF implementation are roughly as follows:

```
tensor(3.9688, device='cuda:0', dtype=torch.bfloat16, grad_fn=<MaxBackward1>)
tensor(0., device='cuda:0', dtype=torch.bfloat16, grad_fn=<MinBackward1>)
tensor(0.1221, device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<ToCopyBackward0>)
tensor(0.1689, device='cuda:0', dtype=torch.bfloat16, grad_fn=<StdBackward0>)
tensor([[0.6172, 0.5156, 3.9688, 1.3750, 1.1875, 1.1094, 1.2031, 0.5625, 0.6406,
         0.4375, 0.6172, 1.6328, 0.8438, 0.5703, 0.9453, 0.7109, 0.6641, 0.9062,
         0.6875, 0.3906, 0.4062, 0.9062]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<AmaxBackward0>)
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
       device='cuda:0', dtype=torch.bfloat16, grad_fn=<AminBackward0>)
tensor([[0.1079, 0.0747, 0.6094, 0.1748, 0.1484, 0.1562, 0.1099, 0.0796, 0.0835,
         0.0684, 0.0854, 0.1167, 0.0962, 0.0776, 0.0889, 0.1006, 0.0767, 0.0723,
         0.1138, 0.0547, 0.0515, 0.1348]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>)
tensor([[0.0688, 0.0569, 0.4688, 0.1367, 0.1143, 0.1240, 0.0889, 0.0618, 0.0659,
         0.0532, 0.0664, 0.0972, 0.0762, 0.0613, 0.0742, 0.0811, 0.0598, 0.0591,
         0.0859, 0.0422, 0.0400, 0.1055]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<StdBackward0>)
```
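
For reference, a sketch of how per-position statistics like the ones above could be produced; keras_logits and hf_logits are hypothetical [1, seq_len, vocab_size] logits from the two implementations on the same prompt:

```python
import torch

def report_diff(keras_logits: torch.Tensor, hf_logits: torch.Tensor) -> None:
    # Absolute element-wise difference between the two implementations.
    diff = (keras_logits - hf_logits).abs()
    print(diff.max())          # global max
    print(diff.min())          # global min
    print(diff.mean())         # global mean
    print(diff.std())          # global std
    print(diff.amax(dim=-1))   # per-token max over the vocab axis
    print(diff.amin(dim=-1))   # per-token min
    print(diff.mean(dim=-1))   # per-token mean
    print(diff.std(dim=-1))    # per-token std
```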
