Error when using Qwen-14B #24
Comments
Will download the model and try and reproduce this, but I'm noticing that
Sorry, I have added the
The 14B model is downloading, but it will take a while. Until then, this is my output with your script with
For reference, I am using  But I'll try with QWen-14B too to see if I can reproduce the problem.
This is my output for QWen-14B:
It seems to work just fine for me. Perhaps you can 1) verify that you have the right
I see now that you're using Flash Attention. The current Attention Sinks implementation for QWen doesn't work with FA. I'll try to see if I can extend the implementation so it does work, but I'm still in the process of getting FA installed, so it's not easy to test.
I'll be testing here: https://github.com/tomaarsen/attention_sinks/tree/model/qwen_fa
Sadly, I can't reasonably test this without investing some more time into WSL or dualboot, as I'm on Windows. Colab also doesn't work:
and check if it works. It would be very helpful.
Thank you very much for your detailed answer. I will first try your method, and if it does not work I will stop using Flash Attention and test.
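For reference, a minimal sketch of what loading QWen without Flash Attention might look like. The use_flash_attn flag belongs to QWen's remote modeling code rather than to attention_sinks, so treat it as an assumption to verify against your local checkout:

import torch
from attention_sinks import AutoModelForCausalLM

# Assumption: QWen's remote code reads a `use_flash_attn` config flag, and
# from_pretrained forwards unrecognized kwargs to the config, so this should
# disable Flash Attention for the model.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-14B",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    use_flash_attn=False,
    attention_sink_size=4,
    attention_sink_window_size=252,
)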
Also, I want to know whether a chat model such as Qwen-14B-Chat can use attention_sinks, and how to use it with a chat model.
Awesome! I'm glad.
That's a shame. There must be a bug there somewhere. I made #25 to add an error when flash attention is used. Perhaps in the future I can try to fix the support for flash attention with QWen.
You can indeed; see for example this script: https://github.com/tomaarsen/attention_sinks/blob/main/demo/streaming.py (attention_sinks/demo/streaming.py, lines 37 to 45 at 1f17f70).
Note: The
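For concreteness, a rough sketch of how a chat model such as Qwen-14B-Chat could be run through attention_sinks with the tokenizer's chat template. This only loosely mirrors the linked demo; the window size, sampling settings, and the eos/pad handling are placeholder assumptions:

import torch
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM

model_id = "Qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=1020,  # placeholder window size
)
model.eval()

# Wrap the user turn with the chat template before tokenizing.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "保持身体健康有多种方式"}], tokenize=False
)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(generated[0], skip_special_tokens=True))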
OK, thank you very much, I will try it. I tried another method and it produces output. This is my code; is this method right or wrong?
model_id = "Qwen/Qwen-14B-Chat"
text = "保持身体健康的几种方式"
I'm afraid not. It's important to pass the old
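The linked streaming.py demo works by repeatedly calling the model and feeding the returned past_key_values back into the next step, which is presumably the "old" state the truncated sentence above refers to (an assumption). A much-simplified sketch of that loop, not the demo verbatim:

import torch

@torch.no_grad()
def greedy_stream(model, tokenizer, prompt, max_new_tokens=256):
    # Hypothetical helper: keep the attention-sink cache alive by passing
    # past_key_values from each forward call back into the next one.
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    past_key_values = None
    generated = []
    for _ in range(max_new_tokens):
        outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values  # re-use the (windowed) cache
        input_ids = outputs.logits[:, -1:].argmax(dim=-1)  # feed only the new token
        generated.append(input_ids.item())
        if tokenizer.eos_token_id is not None and generated[-1] == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated, skip_special_tokens=True)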
I see. I will try your method, thank you for the quick reply.
I use Qwen-14B-Chat and some of the script from the demo:

import torch

model_id = "Qwen/Qwen-14B-Chat"
prompt = "保持身体健康有多种方式"
prompt = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False)
max_new_tokens=256
print(output)
Hello,
When using attention_sinks with Qwen-14B, I get the following error: TypeError: 'NoneType' object is not subscriptable
My script is as follows:
import torch
from transformers import AutoTokenizer, TextStreamer, GenerationConfig
from attention_sinks import AutoModelForCausalLM
model_id = "Qwen/Qwen-14B"
model = AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
# for efficiency:
device_map="auto",
torch_dtype=torch.float16,
attention_sink_size=4,
attention_sink_window_size=252, # <- Low for the sake of faster generation
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
text = "保持身体健康有多种方式"
input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    streamer = TextStreamer(tokenizer)
    generation_config = GenerationConfig(
        use_cache=True,
        min_new_tokens=100_000,
        max_new_tokens=1_000_000,
        penalty_alpha=0.6,
        top_k=5,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    generated_tokens = model.generate(
        input_ids,
        generation_config,
        streamer=streamer,
    )

# Decode the final generated text
output_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
The error appears in model.generate(); I want to know why this happens.