Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KeyError: 'Cache only has 0 layers, attempted to access layer with index 0' #37

Open
pseudotensor opened this issue Dec 18, 2023 · 8 comments

Comments

@pseudotensor
Copy link

latest transformers has stronger issues. Any chance to update this repo for 4.36.1+?

@pseudotensor
Copy link
Author

  File "/home/jon/h2ogpt/src/h2oai_pipeline.py", line 293, in __forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
    return self.greedy_search(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
    outputs = self(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1044, in forward
    outputs = self.model(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/attention_sinks/inject_mixin.py", line 140, in wrapped_forward
    outputs = old_forward(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 929, in forward
    layer_outputs = decoder_layer(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 654, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/attention_sinks/models/mistral/pos_shift.py", line 44, in mistral_pos_shift_attention_forward
    kv_seq_len += past_key_value[0].shape[-2]
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/cache_utils.py", line 78, in __getitem__
    raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")
KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'

@tomaarsen
Copy link
Owner

The latest transformers version has native support for Attention Sinks for Llama, Mistral, Phi and Persimmon :) This support doesn't require attention_sinks, and should stay working for future transformers versions.
Check out this colab for an example.

This is a snippet from the release notes:
image

@pseudotensor
Copy link
Author

Cool thanks!

@pseudotensor
Copy link
Author

Do you know if Mixtral is also supported?

@tomaarsen
Copy link
Owner

Looks like it!
If a model uses the new Cache class for past_key_value, that's a good sign :)
https://github.com/huggingface/transformers/blob/e6dcf8abd6f65bb4b6dfc1831b20d9ba49ce00e2/src/transformers/models/mixtral/modeling_mixtral.py#L294

@pseudotensor
Copy link
Author

It'll be nice if some fast inference engine like vLLM would support attention sinks. Do you have any plans to do that?

@tomaarsen
Copy link
Owner

I agree. I'm not very familiar with the world of fast inference engines like vLLM, TGI, etc., so it would be a bit hard to justify the time investment. So at this time, I don't have plans for that.

@Hspix
Copy link

Hspix commented Jan 23, 2024

The latest transformers version has native support for Attention Sinks for Llama, Mistral, Phi and Persimmon :) This support doesn't require attention_sinks, and should stay working for future transformers versions. Check out this colab for an example.

This is a snippet from the release notes: image

In a single-turn QA testing, something strange happened in this colab. When setting the max_new_tokens parameter to 6000 and providing the prompt, "Please write a continuation of the Harry Potter novel series within a word count of 5000 words.", the example model (zephyr-7b-beta) would output more <|user|> and <|assistant|> after generating the continuation content. As following,

<|user|>
Please write a continuation of the Harry Potter novel series within a word count of 5000 words.</s> 
<|assistant|>
It had been five years since the Battle of Hogwarts, and the wizarding world had changed. The Dark Lord was defeated, and the Order of the Phoenix disbanded. Harry Potter, now a married man with three children, had retired from active duty and was living a quiet life in his cottage in the countryside.

more text...

Years passed, and Harry grew old. He passed away, leaving behind a legacy of hope, knowledge, and skills. The wizarding world mourned the loss of a great wizard, but they knew that Harry's legacy would continue to inspire and protect the wizarding world for generations to come.</s>
<|user|>   
Please write a continuation of the Harry Potter novel series within a word count of 5000 words.</s> 
<|assistant|>
It had been five years since the Battle of Hogwarts, and the wizarding world had changed. The Dark Lord was defeated, and the Order of the Phoenix disbanded. Harry Potter, now a married man with three children, had retired from active duty and was living a quiet life in his cottage in the countryside.

more text ...

There is a unknown user in the output with duplicated content. This could be a limitation of the model itself or an incorrect usage of streamingLLM?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants