Tokenizer merging overhaul #334
Conversation
trust_remote_code=options.trust_remote_code,
add_tokens=tuple(token_cfg.keys()),
Is this where tokens are added from a prompt template that might have extras? Is the tricky part making sure these don't shift the tokenizer for other "cross-known" tokens?
# a non-negative permutation entry means that model's tokenizer has this token
has_token = [p[token_id] >= 0 for p in permutation_list]
num_present = sum(int(x) for x in has_token)
if num_present == 1:
    # exactly one input model knows this token, so use its embedding as the donor
    donor_model = models[has_token.index(True)]
🏥
So you get donated a token that may not have been in your understanding?
if num_present == 0:
    # the token is absent from every input model; assign a zero embedding and warn
    token_configs[token] = TokenEmbeddingConfig(source=ZeroEmbedding())
    logging.warning(f"Token {repr(token)} not found in any model")
In what world might one encounter this?
Default behavior for embedding discovery seems reasonable as stated!
This seems good to me! As a bonus for other reviewers, here's an algorithm description I asked GPT-4o to write based on the PR: it describes an embedding merging algorithm (given the input models' tokenizers and embedding matrices, produce a unified embedding matrix over the output model's vocabulary) and a tokenizer merging algorithm for constructing that output vocabulary.
@cg123 I forgot you had this one opened a while ago!
Rewrite the tokenizer merging logic to support all merge methods and allow more customization of behavior.
The previous implementation of tokenizer merging always used either linear or slerp to combine the embedding/LM head parameters. This was to avoid the complexity that would be required to make all merge methods support tensors with potentially invalid or masked-out values. It worked okay for some cases but wasn't a general solution.
In this implementation, instead of overriding the merge method for embed/lm_head, a preprocessing step remaps them to the vocabulary used by the output model. These (now appropriately sized and ordered) tensors are then merged normally.
The selection of embedding values for tokens not normally present in a model is where things get slightly tricky. By default, a set of heuristics that I think are sane is applied: for a given token that is missing from a model's original tokenizer, a donor value is chosen — for instance, if exactly one model in the merge has the token, its embedding is used, and if no model has it at all, a zero embedding is assigned and a warning is logged (as in the snippets discussed above). A minimal config relying on these defaults is sketched below.
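As a rough illustration only — the `tokenizer`/`source` keys here are assumptions based on my reading of the PR, not necessarily the final syntax:

```yaml
# Hypothetical sketch: build a union vocabulary from all input tokenizers and
# rely on the default heuristics for tokens a given model doesn't know.
tokenizer:
  source: union
```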
This can also be overridden on a per-token level. For example:
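A hypothetical sketch of what such an override might look like; the `tokens` mapping, the per-token `source` key, and the spelling of the zero-embedding option are my assumptions from reading the PR rather than confirmed syntax, and the model name is a placeholder:

```yaml
tokenizer:
  source: union
  tokens:
    # take the ChatML special-token embeddings from a specific input model
    # ("chatml_model" is a placeholder name)
    <|im_start|>:
      source: chatml_model
    <|im_end|>:
      source: chatml_model
    # give this token a zero-valued embedding in the merged model
    <|unused_token|>:
      source:
        kind: zero
```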
A practical example would be merging two Llama 3 models, one using the Llama 3 Instruct prompt format and one using ChatML, while trying to preserve the ability to use both formats:
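A sketch of roughly what that could look like; the model names are placeholders and, as above, the `tokenizer`/`tokens` keys are assumptions rather than confirmed syntax:

```yaml
# Hypothetical merge of a Llama 3 Instruct-format model with a ChatML-tuned one.
# "llama3_instruct_model" and "chatml_model" are placeholder names.
models:
  - model: llama3_instruct_model
    parameters:
      weight: 0.5
  - model: chatml_model
    parameters:
      weight: 0.5
merge_method: linear  # any merge method should now work for embed/lm_head too
dtype: bfloat16
tokenizer:
  source: union
  tokens:
    # ChatML special tokens come from the ChatML-tuned model
    <|im_start|>:
      source: chatml_model
    <|im_end|>:
      source: chatml_model
    # Llama 3 Instruct special tokens come from the instruct-format model
    <|start_header_id|>:
      source: llama3_instruct_model
    <|end_header_id|>:
      source: llama3_instruct_model
    <|eot_id|>:
      source: llama3_instruct_model
```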