Feature Request: Add vocabulary type for token-free models that work on raw bytes #7763

Closed
uwu-420 opened this issue Jun 5, 2024 · 4 comments
Labels: enhancement (New feature or request), stale

Comments

@uwu-420

uwu-420 commented Jun 5, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

I think it would be useful if llama.cpp supported a vocabulary type that doesn't really have tokens but instead works directly on raw bytes. Something like LLAMA_VOCAB_TYPE_RAW_BYTES would be added to enum llama_vocab_type, but I don't know what kind of changes that would imply elsewhere. That kind of vocabulary would still require special tokens, of course.
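
For illustration, here's a rough sketch of what that could look like. The RAW_BYTES enumerator, its value, and the tokenize_raw_bytes helper are all hypothetical; only the existing enumerators mirror llama.h:

```cpp
#include <cstdint>
#include <string>
#include <vector>

typedef int32_t llama_token; // as in llama.h

enum llama_vocab_type {
    LLAMA_VOCAB_TYPE_NONE      = 0, // models without a vocab
    LLAMA_VOCAB_TYPE_SPM       = 1, // SentencePiece-style tokenizer with byte fallback
    LLAMA_VOCAB_TYPE_BPE       = 2, // GPT-2-style byte-level BPE
    LLAMA_VOCAB_TYPE_WPM       = 3, // BERT-style WordPiece
    LLAMA_VOCAB_TYPE_RAW_BYTES = 4, // hypothetical: one token per byte
};

// Hypothetical byte-level "tokenizer": each input byte maps directly to a
// token id in [0, 255]; special tokens (BOS, EOS, ...) would get ids >= 256.
static std::vector<llama_token> tokenize_raw_bytes(const std::string & text) {
    std::vector<llama_token> tokens;
    tokens.reserve(text.size());
    for (unsigned char byte : text) {
        tokens.push_back((llama_token) byte);
    }
    return tokens;
}
```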

Motivation

There's already some interesting research about making token-free LLMs work (e.g., the ByT5 and "Bytes Are All You Need" papers), and I think this is going to become even more relevant in the future. To quote Andrej Karpathy: "I would love nothing more than to be able to feed raw byte sequences into language models".

Possible Implementation

No response

uwu-420 added the enhancement label on Jun 5, 2024
@teleprint-me
Contributor

I don't have the time or bandwidth to go through this at the moment, but I am curious.

Considering that raw-byte approaches have been considered multiple times in the past, and that they increase the computational cost of both training and inference (byte sequences are several times longer than BPE token sequences, and attention cost grows quadratically with sequence length), how is this any different? Using raw UTF-8 sequences would be nice because it would technically be "vocabulary free".

However, the rationale for using merged tokens (BPE) is typically that it's a compromise between using full words (large vocab size) and raw characters (UTF-8). Character-based vs. sub-word tokenization isn't novel, and these trade-offs are already well documented and understood.

@uwu-420
Author

uwu-420 commented Jun 10, 2024

Thanks for bringing that up. Those are all valid and true points. I'd still like to provide some more context.

Concerning the extra computational effort: the disadvantage could be compensated for by advancements in speculative decoding or by using multiple decoding heads, which at least improves things on the inference side.

Imo it feels like we're stuck in a local optimum with tokenization methods like BPE. It's the best we have at the moment, but it's still fundamentally flawed: think of current LLMs failing at tasks such as reversing words or counting letters, which is mostly due to subword tokens. Brittleness in the face of typos is another issue that comes to mind; the ByT5 paper explicitly addresses how byte-level LLMs handle this much better.

@teleprint-me
Contributor

teleprint-me commented Jun 11, 2024

I agree on the brittleness of current models; these issues are well known. There's PR #7187 for token healing, which handles cases where incomplete tokens cause problems.

Even if all of these issues are solved, that doesn't address the larger issue of the embedding space for the vocabulary. The vocabulary still needs to be mapped to values: the encoder and decoder translate between the input/output text and the numerical values that represent it.

In this context, those numerical values would represent language(s).
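
To make that concrete, here's a minimal sketch (all names here are hypothetical, not llama.cpp API): even with a raw-byte vocabulary, the model still carries an embedding table mapping token ids to learned vectors; what changes is that the table becomes small and fixed-size (256 byte ids plus a few special tokens) instead of tens of thousands of merged tokens.

```cpp
#include <cstdint>
#include <vector>

constexpr int32_t N_BYTES   = 256; // one row per possible byte value
constexpr int32_t N_SPECIAL = 3;   // assumption: BOS, EOS, PAD
constexpr int32_t N_VOCAB   = N_BYTES + N_SPECIAL;

// Embedding lookup: the token id selects one row of the flattened
// (N_VOCAB x n_embd) table stored in tok_embd.
std::vector<float> embed(const std::vector<float> & tok_embd, int32_t n_embd, int32_t id) {
    const size_t off = (size_t) id * n_embd;
    return std::vector<float>(tok_embd.begin() + off, tok_embd.begin() + off + n_embd);
}
```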

I have skimmed the ByT5 and "Bytes Are All You Need" papers before, though I haven't dug into them as much as I'd like.

Not sure if Medusa is really the answer, although reducing MatMul operations might help.

Ideally, we'd reduce the dependency upon other models (such as for augmentation or speculation); I'd prefer to simplify components instead of compounding them.

There's always value in exploring any of these avenues, so I don't say this to deter you; there's added value simply in removing uncertainty and improving situational awareness.

I think discussion #7732 is worth mentioning here as well, since it's relevant.

github-actions bot added the stale label on Jul 12, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
