Feature Request: Add vocabulary type for token-free models that work on raw bytes #7763
Comments
I don't have the time or bandwidth to go through this at the moment, but I am curious. Considering that raw bytes have been considered multiple times in the past, and that they increase the time complexity of both training and inference, how is this any different? Using raw UTF-8 sequences would be nice because it would technically be "vocabulary free". However, the rationale for using merged tokens (BPE) is typically that it's a compromise between using full words (vocabulary size) and raw characters (UTF-8). Character-based vs. sub-word tokenization isn't novel, and these issues are already well documented and known.
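A rough way to quantify the time-complexity concern: if English text averages roughly four bytes per BPE token (an assumed ballpark, not a measured figure), a byte-level model sees a sequence about 4x longer, and the quadratic self-attention cost grows by roughly 16x. A minimal sketch under that assumption:

```cpp
#include <cstdio>
#include <string>

// Back-of-the-envelope illustration with assumed numbers: if English text
// averages ~4 bytes per BPE token, a byte-level model sees a ~4x longer
// sequence, and the quadratic attention cost grows by roughly 4^2 = 16x.
int main() {
    const std::string text = "Feature Request: Add vocabulary type for token-free models";
    const double bytes_per_bpe_token = 4.0;                 // assumed average, not measured
    const double n_bytes  = static_cast<double>(text.size());
    const double n_tokens = n_bytes / bytes_per_bpe_token;  // rough BPE estimate

    std::printf("byte-level sequence length : %.0f\n", n_bytes);
    std::printf("BPE sequence length (est.) : %.0f\n", n_tokens);
    std::printf("attention cost ratio (est.): %.1fx\n",
                (n_bytes * n_bytes) / (n_tokens * n_tokens));
    return 0;
}
```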
Thanks for bringing that up. Those are all valid and true points. I'd still like to provide some more context. Concerning the extra computational effort, the disadvantage could be compensated for by advancements in speculative decoding or by using multiple decoding heads, which at least improves things when it comes to inference. IMO it feels like we're stuck in a local optimum with tokenization methods like BPE. It's the best we have at the moment, but it's still fundamentally flawed. Think of current LLMs failing at tasks such as reversing words or counting letters; that's mostly due to subword tokens. Brittleness in the face of typos is another issue that comes to mind. The ByT5 paper explicitly addresses how byte-level LLMs handle this much better.
I agree on the brittleness of current models, and these issues are well known. There's PR #7187 for token healing to handle cases where incomplete tokens cause issues. Even if all of these issues are solved, it doesn't address the larger issue of the embedding space for the vocabulary: the vocabulary still needs to be mapped to values, i.e. the encoder and decoder translate between the input/output and the numerical values that represent it. In this context, those numerical values would represent language(s).

I have skimmed the ByT5 and "Bytes Are All You Need" papers before, though I haven't dug into them as much as I'd like. I'm not sure Medusa is really the answer, although reducing MatMul operations might help. Ideally we would reduce the dependency on other models (such as augmentation or speculation); I'd prefer to simplify components instead of compounding them.

There's always value in exploring any of these avenues, so I don't say this to deter you; there's added value in removing uncertainty and improving situational awareness. I think it's worth mentioning discussion #7732 here as well, as it is relevant.
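To make the "mapped to values" point concrete, here is a toy sketch (assumed sizes, not llama.cpp code): whether the unit is a sub-word token or a raw byte, each id still indexes a row of a learned embedding table; a byte vocabulary only changes the number of rows and the resulting sequence length.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy illustration with made-up dimensions: whether the unit is a sub-word
// token or a raw byte, the model still maps each id to a learned embedding
// row. A byte vocabulary only changes how many rows the table has.
constexpr int N_EMBD          = 8;       // toy embedding width
constexpr int N_VOCAB_SUBWORD = 32000;   // typical sub-word vocabulary size (assumed)
constexpr int N_VOCAB_BYTES   = 2 + 256; // 2 assumed special tokens + all byte values

using Row = std::array<float, N_EMBD>;

// id -> learned vector; the lookup is identical in both cases.
static std::vector<Row> lookup(const std::vector<Row> & table,
                               const std::vector<int32_t> & ids) {
    std::vector<Row> out;
    out.reserve(ids.size());
    for (int32_t id : ids) {
        out.push_back(table.at(static_cast<size_t>(id)));
    }
    return out;
}

int main() {
    std::vector<Row> subword_table(N_VOCAB_SUBWORD); // zero-initialized placeholders
    std::vector<Row> byte_table(N_VOCAB_BYTES);

    // "Hi" as one hypothetical sub-word id vs. two byte ids ('H', 'i'):
    // same lookup, different table size and sequence length.
    auto via_subwords = lookup(subword_table, {12345});
    auto via_bytes    = lookup(byte_table, {2 + 'H', 2 + 'i'});

    std::printf("sub-word table rows: %d, sequence length: %zu\n",
                N_VOCAB_SUBWORD, via_subwords.size());
    std::printf("byte table rows:     %d, sequence length: %zu\n",
                N_VOCAB_BYTES, via_bytes.size());
    return 0;
}
```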
This issue was closed because it has been inactive for 14 days since being marked as stale.
Feature Description
I think it would be useful if llama.cpp supported a vocabulary type that doesn't really have tokens but only works on raw bytes. Something like LLAMA_VOCAB_TYPE_RAW_BYTES would be added to enum llama_vocab_type, but I don't know what kind of changes that would imply elsewhere. That kind of vocabulary would still require special tokens, of course.
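A minimal sketch of what such a vocabulary type could look like (not from the issue itself): the existing enum entries below are paraphrased and may not match the exact names or values in llama.h, and LLAMA_VOCAB_TYPE_RAW_BYTES, the special-token ids, and the helper functions are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch only -- not actual llama.cpp code. The existing enum
// values are paraphrased; their exact set and numbering in llama.h may differ.
// LLAMA_VOCAB_TYPE_RAW_BYTES is the proposed addition.
enum llama_vocab_type {
    LLAMA_VOCAB_TYPE_NONE      = 0, // models without a vocabulary
    LLAMA_VOCAB_TYPE_SPM       = 1, // SentencePiece-style BPE with byte fallback
    LLAMA_VOCAB_TYPE_BPE       = 2, // GPT-2-style byte-level BPE
    LLAMA_VOCAB_TYPE_WPM       = 3, // WordPiece
    LLAMA_VOCAB_TYPE_RAW_BYTES = 4, // proposed: ids map 1:1 to raw bytes, no merges
};

// A byte-level vocabulary needs no merge rules: special tokens take the first
// few ids, and each of the 256 possible byte values gets a fixed id after them.
constexpr int32_t TOKEN_BOS = 0;               // assumed special tokens
constexpr int32_t TOKEN_EOS = 1;
constexpr int32_t N_SPECIAL = 2;
constexpr int32_t N_VOCAB   = N_SPECIAL + 256; // entire vocabulary

static std::vector<int32_t> tokenize_raw_bytes(const std::string & text, bool add_bos) {
    std::vector<int32_t> ids;
    ids.reserve(text.size() + 1);
    if (add_bos) {
        ids.push_back(TOKEN_BOS);
    }
    for (unsigned char b : text) {             // one token per UTF-8 byte
        ids.push_back(N_SPECIAL + b);
    }
    return ids;
}

static std::string detokenize_raw_bytes(const std::vector<int32_t> & ids) {
    std::string text;
    for (int32_t id : ids) {
        if (id >= N_SPECIAL && id < N_VOCAB) { // drop special tokens
            text.push_back(static_cast<char>(id - N_SPECIAL));
        }
    }
    return text;
}
```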
Motivation
There's already some interesting research about making token-free LLMs work:
And I think this is going to become even more relevant in the future. To quote Andrej Karpathy: "I would love nothing more than to be able to feed raw byte sequences into language models".
Possible Implementation
No response