-
Notifications
You must be signed in to change notification settings - Fork 27.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mamba2 conversion script for original models #32580
Conversation
src/transformers/models/mamba2/convert_mamba2_ssm_checkpoint_to_pytorch.py
Outdated
Show resolved
Hide resolved
src/transformers/models/mamba2/convert_mamba2_ssm_checkpoint_to_pytorch.py
Outdated
Show resolved
Hide resolved
src/transformers/models/mamba2/convert_mamba2_ssm_checkpoint_to_pytorch.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for working on this ! Made a couple comments and I think we need to open another PR for norm_before_gate
# Flag for codestral model | ||
is_not_codestral = "dim" not in config_ssm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually the converter would have a dictionary of model types and values, like
MAMBA2_MODELS_DICT = {
"codestral":
{
"hidden_size": config_ssm["dim"]
}
}
So that we can extend it if another format comes along!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be added now.
Still using is_not_codestral = "dim" not in config_ssm
to identify which model type is currently at hand. Might be changed later with a better strategy when more models come along 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has been further changed with 13889f1. Supporting codestral and the original mamba ssm models. So, it's user-dependent now what type of model is selected.
src/transformers/models/mamba2/convert_mamba2_ssm_checkpoint_to_pytorch.py
Outdated
Show resolved
Hide resolved
src/transformers/models/mamba2/convert_mamba2_ssm_checkpoint_to_pytorch.py
Outdated
Show resolved
Hide resolved
src/transformers/models/mamba2/convert_mamba2_ssm_checkpoint_to_pytorch.py
Outdated
Show resolved
Hide resolved
Tests seem unrelated to me. I'll rebase after resolving everything in the review. |
Yes there are issues with tests on main right now, on it! unrelated to your changes. To choose the model to convert I think it should be user-dependent - meaning, it could simply be a |
Oh yea for sure, I kinda just went ahead and tried to automate everything 🤦 Lemme change it in the evening when I have proper time! |
src/transformers/models/mamba2/convert_mamba2_ssm_checkpoint_to_pytorch.py
Outdated
Show resolved
Hide resolved
parser.add_argument( | ||
"-c", | ||
"--tokenizer_model_path", | ||
"-m", | ||
"--mamba2_model_type", | ||
type=str, | ||
default="mamba_ssm", | ||
const="mamba_ssm", | ||
required=True, | ||
choices=("codestral", "mamba_ssm"), | ||
help="The model type the conversion will be performed on. Can choose from either `codestral` or `mamba_ssm`.", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User-dependent choice on conversion. For now, defaults to mamba_ssm
as codestral
has a repo which can be used so I'd expect more people to use it locally for the paper models.
…e original mamba ssm
- add util mamba2 tokenizer (gptneox with left padding) - add models dict
32d8333
to
52ca549
Compare
With all the fixes/patches in previous PRs, this PR should be ready now. I think there are still two core things to consider:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks fo the PR, let's try to aim for simplicity wherever we can!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this. Overall this should still be up to the user, and the tokenizers should be set to left when converting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally wanted to use the gpt neox tokenizer of mamba1 and override the padding side but I wasn't aware how to do it. So when saving and reloading it is back to the non-overriden padding side. (see #32580 (comment))
That's how this separate tokenizer came to be. Is there an easy way to override a padding side? So loading the tokenizer, override the padding side, and on reload the padding side should persist 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this:
self.padding_side = kwargs.pop("padding_side", self.padding_side) |
padding_side
as input and a test making sure loading / saving does not overwrite it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow, it's so simple 😆
Removed the mamba2 tokenizer now and added padding_side
as kwarg to the gptneox tokenizer. I tested locally to see if the padding side persists and it indeed does so. There is no test file for the gptneox tokenizer, should I create one or is there somewhere else I should write this test in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the kwarg again from the tokenizer file as it's passed even without it. Testing still remains 👀
hf_config = Mamba2Config() | ||
|
||
# Switch to a different dict depending on model type | ||
config_dict = mamba2_model_dict | ||
|
||
# Set important values from config and recalculate other resulting entries | ||
hf_config.hidden_size = config_ssm[config_dict["hidden_size"]] | ||
hf_config.num_heads = (hf_config.hidden_size * hf_config.expand) // hf_config.head_dim | ||
hf_config.num_hidden_layers = config_ssm[config_dict["num_hidden_layers"]] | ||
hf_config.n_groups = config_ssm.get(config_dict["n_groups"], 1) | ||
hf_config.residual_in_fp32 = config_ssm[config_dict["residual_in_fp32"]] | ||
hf_config.tie_word_embeddings = config_ssm[config_dict["tie_word_embeddings"]] | ||
hf_config.bos_token_id = config_dict["bos_token_id"] | ||
hf_config.pad_token_id = config_dict["pad_token_id"] | ||
hf_config.eos_token_id = config_dict["eos_token_id"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could also just init with the correct values from the config_dict here. MOst of the names match so **config_dict might be simpler for things that match one to one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not so sure if I'd like that since it would get convoluted which attributes would be overriden, e.g. n_groups would be overriden in codestral but not the original paper models.
_MAMBA2_MODELS_DICT = { | ||
"codestral": { | ||
"hidden_size": "dim", | ||
"num_hidden_layers": "n_layers", | ||
"n_groups": "n_groups", | ||
"residual_in_fp32": "residual_in_fp32", | ||
"tie_word_embeddings": "tie_embeddings", | ||
"vocab_size": "vocab_size", | ||
"pad_vocab_size_multiple": "pad_vocab_size_multiple", | ||
"bos_token_id": 0, | ||
"pad_token_id": 1, | ||
"eos_token_id": 2, | ||
"config_name": "params.json", | ||
"load_state_dict": partial(load_state_dict_from_safetensors, ckpt_name="consolidated.safetensors"), | ||
"load_and_save_tokenizer": partial(load_and_save_tokenizer, "codestral"), | ||
}, | ||
"mamba_ssm": { | ||
"hidden_size": "d_model", | ||
"num_hidden_layers": "n_layer", | ||
"n_groups": "ngroups", | ||
"residual_in_fp32": "residual_in_fp32", | ||
"tie_word_embeddings": "tie_embeddings", | ||
"vocab_size": "vocab_size", | ||
"pad_vocab_size_multiple": "pad_vocab_size_multiple", | ||
"bos_token_id": 0, | ||
"pad_token_id": 0, | ||
"eos_token_id": 0, | ||
"config_name": "config.json", | ||
"load_state_dict": partial(load_state_dict_from_torch, ckpt_name="pytorch_model.bin"), | ||
"load_and_save_tokenizer": partial(load_and_save_tokenizer, "mamba_ssm"), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for any parameters that have similar names, IMO it's useless to have them in the mapping. (like bocab size, residual etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True that! I will change that later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplified it now, and removed residual_in_fp32
entirely as the default values are shared across all of them.
…n as kwarg to gptneox one and reuse it for the conversion script
Feel free to ping me for another review once ready! |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
@ArthurZucker Should be good for review 👀 just checked locally to make sure it works. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it's a bit breaking but good for me!
* first attempt at allowing both conversions from codestral and from the original mamba ssm * allow fp16, seems default for mamba2 * dtype fix * simplify codestral check, dont overwrite pad/eos/bos when codestral * change file -> directory * use path join to be safe * style * apply code review - add util mamba2 tokenizer (gptneox with left padding) - add models dict * fix copies * add tokenizer to docs * empty commit to check for weird err * make conversion user dependent on model type, defaults for original paper models * small comment nit * remove norm_before_gate in conversion * simplify model dict by using shared keys directly + remove unnecessary attributes * fix tokenization: remove separate mamba2 tokenizer, add padding option as kwarg to gptneox one and reuse it for the conversion script * simplify even further as we pass padding side via **kwargs already
* first attempt at allowing both conversions from codestral and from the original mamba ssm * allow fp16, seems default for mamba2 * dtype fix * simplify codestral check, dont overwrite pad/eos/bos when codestral * change file -> directory * use path join to be safe * style * apply code review - add util mamba2 tokenizer (gptneox with left padding) - add models dict * fix copies * add tokenizer to docs * empty commit to check for weird err * make conversion user dependent on model type, defaults for original paper models * small comment nit * remove norm_before_gate in conversion * simplify model dict by using shared keys directly + remove unnecessary attributes * fix tokenization: remove separate mamba2 tokenizer, add padding option as kwarg to gptneox one and reuse it for the conversion script * simplify even further as we pass padding side via **kwargs already
* first attempt at allowing both conversions from codestral and from the original mamba ssm * allow fp16, seems default for mamba2 * dtype fix * simplify codestral check, dont overwrite pad/eos/bos when codestral * change file -> directory * use path join to be safe * style * apply code review - add util mamba2 tokenizer (gptneox with left padding) - add models dict * fix copies * add tokenizer to docs * empty commit to check for weird err * make conversion user dependent on model type, defaults for original paper models * small comment nit * remove norm_before_gate in conversion * simplify model dict by using shared keys directly + remove unnecessary attributes * fix tokenization: remove separate mamba2 tokenizer, add padding option as kwarg to gptneox one and reuse it for the conversion script * simplify even further as we pass padding side via **kwargs already
* first attempt at allowing both conversions from codestral and from the original mamba ssm * allow fp16, seems default for mamba2 * dtype fix * simplify codestral check, dont overwrite pad/eos/bos when codestral * change file -> directory * use path join to be safe * style * apply code review - add util mamba2 tokenizer (gptneox with left padding) - add models dict * fix copies * add tokenizer to docs * empty commit to check for weird err * make conversion user dependent on model type, defaults for original paper models * small comment nit * remove norm_before_gate in conversion * simplify model dict by using shared keys directly + remove unnecessary attributes * fix tokenization: remove separate mamba2 tokenizer, add padding option as kwarg to gptneox one and reuse it for the conversion script * simplify even further as we pass padding side via **kwargs already
* first attempt at allowing both conversions from codestral and from the original mamba ssm * allow fp16, seems default for mamba2 * dtype fix * simplify codestral check, dont overwrite pad/eos/bos when codestral * change file -> directory * use path join to be safe * style * apply code review - add util mamba2 tokenizer (gptneox with left padding) - add models dict * fix copies * add tokenizer to docs * empty commit to check for weird err * make conversion user dependent on model type, defaults for original paper models * small comment nit * remove norm_before_gate in conversion * simplify model dict by using shared keys directly + remove unnecessary attributes * fix tokenization: remove separate mamba2 tokenizer, add padding option as kwarg to gptneox one and reuse it for the conversion script * simplify even further as we pass padding side via **kwargs already
What does this PR do?
Extends the Mamba2 conversion script to be compatible with the paper models and codestral. I need some help handling the tokenizer or more specifically how I can overwrite the padding side of a pretrained tokenizer and then save it with the new side.
Fixes #32496
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker @molbap