Inference support for T5 and FLAN-T5 model families #8141
Conversation
Commits:
* …l families
* llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()
* common, llama-cli : use new API functions to support encoder-decoder models
* convert-hf : handle shared token embeddings tensors in T5Model
* convert-hf : handle SentencePiece BPE tokenizer in T5Model (for Pile-T5 models)
* convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model
```diff
@@ -768,6 +775,14 @@ extern "C" {
     // Frees a batch of tokens allocated with llama_batch_init()
     LLAMA_API void llama_batch_free(struct llama_batch batch);

     // Processes a batch of tokens with the encoder part of the encoder-decoder model.
     // Stores the encoder output internally for later use by the decoder cross-attention layers.
```
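For readers of the thread, here is a minimal usage sketch of the new calls, assuming llama_encode() takes the same (context, batch) arguments as llama_decode() and that the 4-argument llama_batch_get_one(tokens, n_tokens, pos_0, seq_id) from this era of llama.h is available; the helper function name and the surrounding setup are illustrative, not part of the diff:

```cpp
#include <vector>
#include "llama.h"

// Sketch only: assumes `ctx` and `model` are already initialized and `prompt`
// holds the tokenized input.
static void encode_then_decode(llama_context * ctx, const llama_model * model,
                               std::vector<llama_token> & prompt) {
    if (llama_model_has_encoder(model)) {
        // Run the whole prompt through the encoder; the output is stored inside
        // the context for the decoder's cross-attention layers.
        llama_batch enc_batch = llama_batch_get_one(prompt.data(), (int32_t) prompt.size(), 0, 0);
        llama_encode(ctx, enc_batch);

        // Start generating from the model's decoder start token.
        llama_token dec_start = llama_model_decoder_start_token(model);
        llama_batch dec_batch = llama_batch_get_one(&dec_start, 1, 0, 0);
        llama_decode(ctx, dec_batch);
        // ... sample from the logits, append the sampled token, call llama_decode() again, repeat ...
    }
}
```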
In my case, a prompt consists of a static part, which is unchanged and makes use of the KV cache, and a dynamic part, which changes frequently. It works well with GPT, where I can call llama_kv_cache_seq_rm to clean up the dynamic part of the KV cache and start evaluating again. Would a similar approach work with T5? In other words, what's the degree of control over the encoder output? Thank you.
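For context, a sketch of the decoder-only pattern described above (the prefix length, token buffer, and positions are illustrative; the llama_kv_cache_seq_rm() and llama_batch_get_one() argument lists are assumed to match llama.h around the time of this PR):

```cpp
// Decoder-only caching pattern (sketch): keep the static prefix in the KV cache
// and re-evaluate only the dynamic suffix. Assumes `ctx` and `dyn_tokens` exist.
const llama_pos n_static = 128;  // illustrative length of the unchanged prefix

// Remove everything after the static prefix for sequence 0 (p1 = -1 means "to the end") ...
llama_kv_cache_seq_rm(ctx, 0, n_static, -1);

// ... then evaluate only the new dynamic tokens, positioned right after the prefix.
llama_batch batch = llama_batch_get_one(dyn_tokens.data(), (int32_t) dyn_tokens.size(), n_static, 0);
llama_decode(ctx, batch);
```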
@vladfaust No, the encoder requires all input tokens to be present in the input batch. This is because the attention in the encoder is not causal, so each token in the input sequence attends to all tokens in the input sequence. The encoder doesn't even use the KV cache, because there's no need to.
I guess it would theoretically be possible to implement this in a way that allows "adding" tokens to the encoder output by calling llama_encode() multiple times, but the implementation would be much more complicated and is definitely outside the scope of this PR.
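To spell out why caching doesn't help here (standard attention formulation, not taken from the PR): with a causal decoder mask, appending a token leaves all earlier attention rows unchanged, so cached keys and values stay valid; with the encoder's bidirectional attention, every output row depends on every input token, so appending one token changes all of them.

$$
\text{decoder (causal): } \alpha_{ij} = 0 \ \text{ for } j > i
\qquad
\text{encoder (bidirectional): } \alpha_{ij} \propto \exp\!\left(\frac{q_i^\top k_j}{\sqrt{d}}\right) \ \text{ for all } j
$$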
Just to clarify, @fairydreaming: one of my use-cases is converting a growing chat history to some structured representation for each new message. Do I understand correctly that for now I'd have to encode the whole history again and again for each inference without any form of caching? (No offence, obviously, as I'm very grateful for the T5 support at all!)
@vladfaust Yes, there's no caching in the encoder, so if the input sequence grows even by one token, you have to encode it again, and all previous calculations for that token sequence are repeated in the process.
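Concretely, the chat-history use case above ends up looking something like this (a sketch; `history_tokens`, `new_tokens`, and `ctx` are assumed to exist):

```cpp
// Encoder side of T5: no incremental updates. Whenever the history grows, the
// full sequence is passed to llama_encode() again and everything is recomputed.
history_tokens.insert(history_tokens.end(), new_tokens.begin(), new_tokens.end());

llama_batch enc = llama_batch_get_one(history_tokens.data(), (int32_t) history_tokens.size(), 0, 0);
llama_encode(ctx, enc);  // full recompute; nothing is reused from the previous call
```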
@ggerganov take a look at the new API in this PR when you have some time
```cpp
inpL = llm_build_inp_embd(ctx0, lctx, hparams, batch, model.tok_embd, cb);

if (lctx.is_encoding) {
```
In which cases would this be false during llama_decode()?
Always, as llama_decode_internal sets is_encoding to false at the start. It's true only during a llama_encode_internal call.
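A rough sketch of the control flow described above (paraphrased, not the actual implementation):

```cpp
// Paraphrased (not copied from llama.cpp): the flag records which entry point
// is driving the current graph build.
static int32_t llama_encode_internal(llama_context & lctx, llama_batch batch) {
    lctx.is_encoding = true;   // encoder graph: non-causal attention, no KV cache
    // ... build and run the encoder graph, store its output for cross-attention ...
    return 0;
}

static int32_t llama_decode_internal(llama_context & lctx, llama_batch batch) {
    lctx.is_encoding = false;  // decoder graph: causal self-attention (+ cross-attention for T5)
    // ... build and run the decoder graph as usual ...
    return 0;
}
```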
I think I found a small problem with the tokenization. Tried to tokenize the string and got:

```
main : expected tokens: 3 ' ', 55 '!', 17065 '!!!!!',
main : got tokens:      3 ' ', 17065 '!!!!!', 55 '!',
```
@ggerganov Can you give me a full example that produced the 3, 55, 17065 tokenization? I did some tests and got 3, 17065, 55 both in llama.cpp and in the transformers library.
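For anyone wanting to reproduce such a check on the llama.cpp side, here is a sketch using the C API; the exact llama_tokenize() and llama_token_to_piece() argument lists may differ slightly between versions, so treat them as assumptions:

```cpp
#include <cstdio>
#include <cstring>
#include <vector>
#include "llama.h"

// Assumes `model` is an already-loaded T5 model. Prints id/piece pairs in the
// same "id 'piece'," format used by the log lines above.
static void dump_tokens(const llama_model * model, const char * text) {
    std::vector<llama_token> toks(64);
    const int n = llama_tokenize(model, text, (int32_t) strlen(text),
                                 toks.data(), (int32_t) toks.size(),
                                 /*add_special*/ true, /*parse_special*/ false);
    for (int i = 0; i < n; i++) {
        char piece[64];
        const int len = llama_token_to_piece(model, toks[i], piece, sizeof(piece), false);
        printf("%d '%.*s', ", toks[i], len, piece);
    }
    printf("\n");
}
```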
I opened a PR in your repo with instructions to reproduce:
These test failures are caused by differences in tokenization between T5Tokenizer and T5TokenizerFast in the HF transformers library. More info in the fairydreaming#1 PR.
* convert : add t5 tokenizer tests, use "slow" tokenizer
* llama : UGM tokenizer init with UNK tokens instead of PAD
Pushed some relatively minor changes:
- updated names of variables
- simplified the logic in llama_encode_internal by removing micro-batching support
- extended the llama-batched example to work with T5 models

Feel free to merge if this looks good to you
This PR is the third in a series of PRs adding support for the T5 and FLAN-T5 model families.
This PR adds:
- new API functions for encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()
- support for encoder-decoder models in llama-cli
Example model for testing: https://huggingface.co/google-t5/t5-small
Example usage:
./llama-cli -m models/t5-small.gguf -p 'translate English to German: The house is wonderful.'
Supported models:
I think it fixes #5763, #3393, #247, #4316, #7238