
Error when using Qwen-14B #24

Open
sun1092469590 opened this issue Oct 23, 2023 · 16 comments

Comments

@sun1092469590

sun1092469590 commented Oct 23, 2023

Hello,

When using attention sink with Qwen-14B, I get the following error: TypeError: 'NoneType' object is not subscriptable

My script is as follows:

import torch
from transformers import AutoTokenizer, TextStreamer, GenerationConfig
from attention_sinks import AutoModelForCausalLM

model_id = "Qwen/Qwen-14B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    # for efficiency:
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=252,  # <- Low for the sake of faster generation
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

text = "保持身体健康有多种方式"
input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    streamer = TextStreamer(tokenizer)
    generation_config = GenerationConfig(
        use_cache=True,
        min_new_tokens=100_000,
        max_new_tokens=1_000_000,
        penalty_alpha=0.6,
        top_k=5,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    generated_tokens = model.generate(
        input_ids,
        generation_config,
        streamer=streamer,
    )
    # Decode the final generated text
    output_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)

The error appears in model.generate(). I want to know why this happens.

@tomaarsen
Owner

Will download the model and try and reproduce this, but I'm noticing that trust_remote_code=True is not added in the AutoModelForCausalLM.from_pretrained, which means that the model should not be loaded at all, as it's a model with remote code. So, perhaps not having trust_remote_code=True causes the issue?
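
For clarity, a loading call with trust_remote_code=True passed would look roughly like this (just a sketch reusing the arguments from the script above):

import torch
from attention_sinks import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-14B",
    trust_remote_code=True,  # required for models that ship custom modeling code
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=252,
)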

  • Tom Aarsen

@sun1092469590
Author

Will download the model and try and reproduce this, but I'm noticing that trust_remote_code=True is not added in the AutoModelForCausalLM.from_pretrained, which means that the model should not be loaded at all, as it's a model with remote code. So, perhaps not having trust_remote_code=True causes the issue?

  • Tom Aarsen

Sorry, I have added trust_remote_code=True in both AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained, but the error still happens.

@tomaarsen
Owner

The 14B model is downloading, but it will take a while. Until then, this is my output from your script with Qwen/Qwen-7B:

保持身体健康有多种方式,以下是一些建议:

1. 均衡饮食:饮食应包括五大类食物,即谷物、蔬菜、水果、蛋白质和脂肪。避免高糖、高盐、高脂肪和加工食品。

2. 锻炼身体:每周至少进行150分钟的中等强度有氧运动,如快走、跑步、游泳等。此外,还应进行力量训练,如举重、俯卧撑等。

3. 充足睡眠:每晚应保证7-8小时的睡眠时间,以帮助身体恢复和修复。

4. 减少压力:压力是导致许多健康问题的主要原因之一。可以通过冥想、瑜伽、深呼吸等方式来减轻压力。

5. 戒烟限酒:吸烟和过量饮酒都会对身体健康造成负面影响。应尽量避免吸烟和过量饮酒。

6. 定期体检:定期进行体检可以帮助发现潜在的健康问题,并及早进行治疗。

希望这些建议对您有所帮助。如果您有任何其他问题,请随时问我。

<more text>

For reference, I am using transformers==4.34.0, maybe that's the issue?

But I'll try with QWen-14B too to see if I can reproduce the problem.

@tomaarsen
Owner

This is my output for QWen-14B:

保持身体健康有多种方式,包括饮食、运动和睡眠。饮食方面,我们应该多吃水果、蔬菜和全谷类食品,少吃高热量、高脂肪和高糖分的食品。运动方面,每周至少进行150分钟的中等强度
有氧运动,如快走、跑步、游泳等。此外,还应该进行力量训练,以增强肌肉和骨骼。睡眠方面,每晚应该保证7-8 小时的睡眠时间,避免熬夜和过度使用电子设备。<more text>

It seems to work just fine for me. Perhaps you can 1) verify that you have the right transformers version and 2) post here the full Traceback, so I can see where it actually throws an error.
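
For example, you can print the installed version directly (a trivial check, nothing specific to attention_sinks):

import transformers

print(transformers.__version__)  # should print 4.34.0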

  • Tom Aarsen

@sun1092469590
Author

Thank you very much. My current transformers version is also 4.34.0, and I can run Qwen-14B normally when attention_sinks is not added.
[screenshots of the reproduction script and the full traceback]

@tomaarsen
Owner

I see now that you're using Flash Attention. The current Attention Sinks implementation for QWen doesn't work with FA. I'll try to see if I can extend the implementation so it does work, but I'm still in the process of getting FA installed, so it's not easy to test.

@tomaarsen
Owner

Sadly, I can't reasonably test this without investing some more time into WSL or dualboot, as I'm on Windows. Colab also doesn't work: RuntimeError: FlashAttention only supports Ampere GPUs or newer. Perhaps you can run

pip install git+https://github.com/tomaarsen/attention_sinks.git@model/qwen_fa

and check if it works. It would be very helpful.

@sun1092469590
Author

Thank you very much for your detailed answer. I will first try your method, and if it does not work, I will stop using Flash Attention and test again.

@sun1092469590
Author

sun1092469590 commented Oct 24, 2023

  1. I stopped using Flash Attention by adding the parameter use_flash_attn=False in AutoModelForCausalLM.from_pretrained(), and the result is normal, as you showed me:

     model = AutoModelForCausalLM.from_pretrained(
         model_id,
         device_map="auto",
         torch_dtype=torch.float16,
         attention_sink_size=4,
         attention_sink_window_size=252,
         use_flash_attn=False,
     )

  2. I downloaded the new branch version (https://github.com/tomaarsen/attention_sinks/tree/model/qwen_fa) and tested it. The code runs without any error, but the result is not very good; maybe flash_attn is not installed correctly. With text = "保持身体健康有多种方式", the result is as follows:

     [screenshot of the generated output]

@sun1092469590
Author

Also, I want to know whether a chat model such as Qwen-14B-Chat can use attention_sinks, and how to use it with a chat model.

@tomaarsen
Owner

  1. I stopped using Flash Attention by adding the parameter use_flash_attn=False in AutoModelForCausalLM.from_pretrained(), and the result is normal, as you showed me:

     model = AutoModelForCausalLM.from_pretrained(
         model_id,
         device_map="auto",
         torch_dtype=torch.float16,
         attention_sink_size=4,
         attention_sink_window_size=252,
         use_flash_attn=False,
     )

Awesome! I'm glad.

2. I downloaded the new branch version (https://github.com/tomaarsen/attention_sinks/tree/model/qwen_fa) and tested it. The code runs without any error, but the result is not very good; maybe flash_attn is not installed correctly. With text = "保持身体健康有多种方式", the result is as follows:
   ![image](https://user-images.githubusercontent.com/19388387/277531786-90fc7232-6953-4dc6-a1c2-1f3b72a733ff.png)

That's a shame. There must be a bug there somewhere. I made #25 to add an error when flash attention is used. Perhaps in the future I can try to fix the support for flash attention with QWen.

Also, I want to know whether a chat model such as Qwen-14B-Chat can use attention_sinks, and how to use it with a chat model.

You can indeed; see for example this script: https://github.com/tomaarsen/attention_sinks/blob/main/demo/streaming.py
In this file, the LLM is continuously given a prompt from a dataset of prompts. In practice, you could wait and receive these prompts from the user on the fly. Then you can generate tokens with this loop:

for _ in range(max_new_tokens):
    outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = outputs.past_key_values
    pred_token_idx = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(1)
    streamer.put(pred_token_idx)
    input_ids = pred_token_idx
    if pred_token_idx == tokenizer.eos_token_id:
        break

Note: The streamer just writes the text to a file and the terminal; that line is optional.
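
To make the chat use case concrete, here is a minimal multi-turn sketch built around the same loop; the helper name chat_turn and the use of tokenizer.apply_chat_template here are illustrative assumptions rather than attention_sinks API:

import torch

@torch.no_grad()
def chat_turn(model, tokenizer, user_message, past_key_values, max_new_tokens=256):
    # Format the new user message with the chat template and encode it.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_message}], tokenize=False
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    generated = []
    for _ in range(max_new_tokens):
        outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
        # Carry the (attention sink + sliding window) cache into the next forward call.
        past_key_values = outputs.past_key_values
        pred_token_idx = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(1)
        generated.append(pred_token_idx.item())
        input_ids = pred_token_idx
        if pred_token_idx == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated, skip_special_tokens=True), past_key_values

The key point is to keep passing the returned past_key_values into the next turn (starting from None), so the sink tokens and the sliding window are preserved across the whole conversation.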

@sun1092469590
Author

sun1092469590 commented Oct 24, 2023

OK, thank you very much, I will try it. I tried another method and it produces output; this is my code. Is this method right or wrong?
import torch
from transformers import AutoTokenizer, TextStreamer, GenerationConfig
from attention_sinks import AutoModelForCausalLM

model_id = "Qwen/Qwen-14B-Chat"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=252,
    use_flash_attn=False,
    trust_remote_code=True,
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

text = "保持身体健康的几种方式"
response, history = model.chat(tokenizer, text, history=None)
print(response)

@tomaarsen
Owner

I'm afraid not. It's important to pass the old past_key_values to every forward call, which isn't done with model.chat.

@sun1092469590
Author

I see. I will try your method; thank you for the quick reply.

@sun1092469590
Author

sun1092469590 commented Oct 25, 2023

I used Qwen-14B-Chat and some of the script in demo/streaming.py to get a result, but it very easily runs out of memory (OOM), even though max_new_tokens=256 is not very large and my GPU setup is 4x80GB. This is my script and the error log:

import torch
from transformers import AutoTokenizer, TextStreamer, GenerationConfig
from attention_sinks import AutoModelForCausalLM
from typing import Any, Dict, List

model_id = "Qwen/Qwen-14B-Chat"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=256,
    use_flash_attn=False,
    trust_remote_code=True,
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

prompt = "保持身体健康有多种方式"
past_key_values = None
new_line_tokens = tokenizer("\n\n", return_tensors="pt", add_special_tokens=False).input_ids

prompt = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda:3")

max_new_tokens = 256
output = ""
for _ in range(max_new_tokens):
    outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = outputs.past_key_values
    pred_token_idx = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(1)
    output += tokenizer.decode(pred_token_idx.cpu()[0], skip_special_tokens=True)
    input_ids = pred_token_idx
    if pred_token_idx == tokenizer.eos_token_id:
        break

print(output)

Here is some of the error log:

[screenshots of the CUDA out-of-memory traceback]
