Error when using Qwen-14B #24
Comments
Will download the model and try and reproduce this, but I'm noticing that
Sorry, I have added the
The 14B model is downloading, but it will take a while. Until then, this is my output with your script with
For reference, I am using  But I'll try with QWen-14B too to see if I can reproduce the problem.
This is my output for QWen-14B:
It seems to work just fine for me. Perhaps you can 1) verify that you have the right
I see now that you're using Flash Attention. The current Attention Sinks implementation for QWen doesn't work with FA. I'll try to see if I can extend the implementation so it does work, but I'm still in the process of getting FA installed, so it's not easy to test.
I'll be testing here: https://github.com/tomaarsen/attention_sinks/tree/model/qwen_fa
Sadly, I can't reasonably test this without investing some more time into WSL or dualboot, as I'm on Windows. Colab also doesn't work:
and check if it works. It would be very helpful.
Thank you very much for your detailed answer. I will first try your method, and if it does not work I will stop using Flash Attention and test.
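For reference, a minimal sketch of what loading QWen without Flash Attention might look like. The use_flash_attn flag belongs to QWen's remote modeling code rather than to attention_sinks, so treat it as an assumption to verify against your local checkout:

import torch
from attention_sinks import AutoModelForCausalLM

# Assumption: QWen's remote code reads a `use_flash_attn` config flag, and
# from_pretrained forwards unrecognized kwargs to the config, so this should
# disable Flash Attention for the model.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-14B",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    use_flash_attn=False,
    attention_sink_size=4,
    attention_sink_window_size=252,
)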
Also, I want to know whether a chat model such as Qwen-14B-Chat can use attention_sinks, and how to use it with a chat model.
Awesome! I'm glad.
That's a shame. There must be a bug there somewhere. I made #25 to add an error when flash attention is used. Perhaps in the future I can try to fix the support for flash attention with QWen.
You can indeed; see for example this script: https://github.com/tomaarsen/attention_sinks/blob/main/demo/streaming.py (attention_sinks/demo/streaming.py, lines 37 to 45 at 1f17f70).
Note: The
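For concreteness, a rough sketch of how a chat model such as Qwen-14B-Chat could be run through attention_sinks with the tokenizer's chat template. This only loosely mirrors the linked demo; the window size, sampling settings, and the eos/pad handling are placeholder assumptions:

import torch
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM

model_id = "Qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=1020,  # placeholder window size
)
model.eval()

# Wrap the user turn with the chat template before tokenizing.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "保持身体健康有多种方式"}], tokenize=False
)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(generated[0], skip_special_tokens=True))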
OK, thank you very much, I will try it. I tried another method and it produces output. This is my code; is this method right or wrong?
model_id = "Qwen/Qwen-14B-Chat"
text = "保持身体健康的几种方式"
I'm afraid not. It's important to pass the old
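The linked streaming.py demo works by repeatedly calling the model and feeding the returned past_key_values back into the next step, which is presumably the "old" state the truncated sentence above refers to (an assumption). A much-simplified sketch of that loop, not the demo verbatim:

import torch

@torch.no_grad()
def greedy_stream(model, tokenizer, prompt, max_new_tokens=256):
    # Hypothetical helper: keep the attention-sink cache alive by passing
    # past_key_values from each forward call back into the next one.
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    past_key_values = None
    generated = []
    for _ in range(max_new_tokens):
        outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values  # re-use the (windowed) cache
        input_ids = outputs.logits[:, -1:].argmax(dim=-1)  # feed only the new token
        generated.append(input_ids.item())
        if tokenizer.eos_token_id is not None and generated[-1] == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated, skip_special_tokens=True)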
I see. I will try your method, thank you for the quick reply.
I use Qwen-14B-Chat and some of the script from the demo:

import torch

model_id = "Qwen/Qwen-14B-Chat"
prompt = "保持身体健康有多种方式"
prompt = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False)
max_new_tokens=256
print(output)
Hello,
When using attention_sinks with Qwen-14B, I get the following error: TypeError: 'NoneType' object is not subscriptable
My script is as follows:
import torch
from transformers import AutoTokenizer, TextStreamer, GenerationConfig
from attention_sinks import AutoModelForCausalLM
model_id = "Qwen/Qwen-14B"
model = AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
# for efficiency:
device_map="auto",
torch_dtype=torch.float16,
attention_sink_size=4,
attention_sink_window_size=252, # <- Low for the sake of faster generation
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
text = "保持身体健康有多种方式"
input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    streamer = TextStreamer(tokenizer)
    generation_config = GenerationConfig(
        use_cache=True,
        min_new_tokens=100_000,
        max_new_tokens=1_000_000,
        penalty_alpha=0.6,
        top_k=5,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    generated_tokens = model.generate(
        input_ids,
        generation_config,
        streamer=streamer,
    )

# Decode the final generated text
output_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
The error appears in model.generate(); I want to know why this happens.