Add qwen2 support #1679
This is open for contributions if anyone would like to take this up!
Hi @mattdangerw, @pass-lin, @sachinprasadhs, I would love to contribute to this issue. I'm curious about the integration process and eager to help implement the Qwen 2 model in Keras Hub. Could you please provide more details or guidance on how I can get started? Thank you!
Thank you. The Qwen model and the Llama model have almost exactly the same network structure; in fact, the only difference is that the Q, K, and V projections in the attention layer each carry an additional bias term. Moreover, the input embedding and output embedding of Qwen 0.5B and Qwen 1.8B share the same layer (tied weights). The more troublesome parts are actually the tokenizer and weight loading. The tokenizer can essentially be handled by wrapping the HF implementation, and weight loading can follow how the Llama model reads HF weights.
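To make that structural delta concrete, here is a minimal, hypothetical Keras 3 sketch (toy shapes and layer names are assumptions of this comment, not the actual keras_hub code): the only change a Qwen-style attention block needs over a Llama-style one is a bias on the Q/K/V projections.

```python
import keras
from keras import layers

# Toy dimensions for illustration only.
hidden_dim, num_heads, head_dim = 64, 4, 16

def make_qkv_projections(use_qkv_bias):
    # EinsumDense is the kind of layer keras_hub attention blocks use for the
    # per-head projections. use_qkv_bias=True mimics Qwen, False mimics Llama.
    bias_axes = "de" if use_qkv_bias else None
    return [
        layers.EinsumDense(
            "abc,cde->abde",  # (batch, seq, hidden) -> (batch, seq, heads, head_dim)
            output_shape=(None, num_heads, head_dim),
            bias_axes=bias_axes,
            name=name,
        )
        for name in ("query", "key", "value")
    ]

llama_style_qkv = make_qkv_projections(use_qkv_bias=False)
qwen_style_qkv = make_qkv_projections(use_qkv_bias=True)

x = keras.random.normal((2, 8, hidden_dim))  # (batch, seq, hidden)
print([tuple(p(x).shape) for p in qwen_style_qkv])  # each: (2, 8, 4, 16)

# The tied input/output embedding mentioned above corresponds to something like
# keras_hub's ReversibleEmbedding with tie_weights=True.
```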
Additionally, I would not recommend adding support for the Qwen model in the near future, as I have found that there are some issues with the Llama implementation in Keras Hub.
Hi @pass-lin, thank you for sharing your insights and for pointing out issue #1993 early.
Actually, I'm not sure why there is such a significant difference in accuracy: in my own implementation of Llama, the average numerical difference is clearly smaller than with the keras_hub implementation, roughly on the order of 1e-2. Even so, it still affects the actual behavior of the model, for example making it very prone to repetition when decoding. Forgive me for not having the ability and energy to pinpoint the exact implementation differences layer by layer. Here is the difference dump (max, min, mean, and std, first over the whole tensor and then per position):

```
tensor(3.9688, device='cuda:0', dtype=torch.bfloat16, grad_fn=<MaxBackward1>)
tensor(0., device='cuda:0', dtype=torch.bfloat16, grad_fn=<MinBackward1>)
tensor(0.1221, device='cuda:0', dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>)
tensor(0.1689, device='cuda:0', dtype=torch.bfloat16, grad_fn=<StdBackward0>)
tensor([[0.6172, 0.5156, 3.9688, 1.3750, 1.1875, 1.1094, 1.2031, 0.5625, 0.6406,
         0.4375, 0.6172, 1.6328, 0.8438, 0.5703, 0.9453, 0.7109, 0.6641, 0.9062,
         0.6875, 0.3906, 0.4062, 0.9062]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<AmaxBackward0>)
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
       device='cuda:0', dtype=torch.bfloat16, grad_fn=<AminBackward0>)
tensor([[0.1079, 0.0747, 0.6094, 0.1748, 0.1484, 0.1562, 0.1099, 0.0796, 0.0835,
         0.0684, 0.0854, 0.1167, 0.0962, 0.0776, 0.0889, 0.1006, 0.0767, 0.0723,
         0.1138, 0.0547, 0.0515, 0.1348]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>)
tensor([[0.0688, 0.0569, 0.4688, 0.1367, 0.1143, 0.1240, 0.0889, 0.0618, 0.0659,
         0.0532, 0.0664, 0.0972, 0.0762, 0.0613, 0.0742, 0.0811, 0.0598, 0.0591,
         0.0859, 0.0422, 0.0400, 0.1055]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<StdBackward0>)
```
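For reference, numbers in that shape (global max/min/mean/std followed by per-position statistics) can be produced with a comparison along these lines; this is only a rough sketch, and the two tensors below are placeholders standing in for real hidden states from the two implementations:

```python
import torch

# Placeholder "hidden states" from a reference implementation and a port.
# In a real comparison these would be the per-layer outputs of the two models
# on the same input (shape: batch, sequence, hidden).
torch.manual_seed(0)
reference = torch.randn(1, 22, 4096, dtype=torch.bfloat16)
ported = reference + 0.01 * torch.randn_like(reference)

diff = (reference - ported).abs()
print(diff.max(), diff.min(), diff.mean(), diff.std())  # global statistics
print(diff.amax(dim=-1))                                # worst error per position
print(diff.amin(dim=-1))                                # best error per position
print(diff.mean(dim=-1), diff.std(dim=-1))              # mean/std per position
```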
The Qwen2 model is state of the art on the HF leaderboard, and compared with the Llama model the only architectural difference is an extra bias in the Q/K/V dense layers of multi-head attention. Therefore, only a few modifications are required to support this high-quality model.
Similarly, the Yi model is also a powerful Chinese LLM. Its performance is comparable to that of Qwen2, and it fully adopts the Llama architecture.
So in theory, making keras_nlp compatible with these two models should not take much time. I hope support for them can land in the future.
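For whoever picks this up, one quick way to verify the "only an extra Q/K/V bias" claim is to list the parameter names of an HF Qwen2 checkpoint. The model id below is just an example, and the expected output is an assumption based on the discussion above rather than something stated in this thread:

```python
from transformers import AutoModelForCausalLM

# Example checkpoint id; any Qwen2 size from https://huggingface.co/Qwen
# should show the same pattern.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

bias_keys = [name for name, _ in model.named_parameters() if name.endswith(".bias")]
print(bias_keys[:6])
# If the comments above are right, biases should only appear on the self_attn
# q_proj / k_proj / v_proj projections, which is the main delta from the Llama
# architecture (and what a weight converter would need to map).
```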
https://huggingface.co/Qwen
https://huggingface.co/01-ai