Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mamba2 conversion script for original models #32580

Merged
merged 17 commits into from
Aug 29, 2024

Conversation

vasqu
Copy link
Contributor

@vasqu vasqu commented Aug 10, 2024

What does this PR do?

Extends the Mamba2 conversion script to be compatible with the paper models and codestral. I need some help handling the tokenizer or more specifically how I can overwrite the padding side of a pretrained tokenizer and then save it with the new side.

Fixes #32496

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker @molbap

Copy link
Contributor

@molbap molbap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for working on this ! Made a couple comments and I think we need to open another PR for norm_before_gate

Comment on lines 33 to 34
# Flag for codestral model
is_not_codestral = "dim" not in config_ssm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually the converter would have a dictionary of model types and values, like

MAMBA2_MODELS_DICT = {
"codestral":
    {
    "hidden_size": config_ssm["dim"]
     }
}

So that we can extend it if another format comes along!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be added now.

Still using is_not_codestral = "dim" not in config_ssm to identify which model type is currently at hand. Might be changed later with a better strategy when more models come along 👀

Copy link
Contributor Author

@vasqu vasqu Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has been further changed with 13889f1. Supporting codestral and the original mamba ssm models. So, it's user-dependent now what type of model is selected.

@vasqu
Copy link
Contributor Author

vasqu commented Aug 12, 2024

Tests seem unrelated to me. I'll rebase after resolving everything in the review.

@molbap
Copy link
Contributor

molbap commented Aug 13, 2024

Yes there are issues with tests on main right now, on it! unrelated to your changes. To choose the model to convert I think it should be user-dependent - meaning, it could simply be a choice argument in the parser and that's it, right?

@vasqu
Copy link
Contributor Author

vasqu commented Aug 13, 2024

Oh yea for sure, I kinda just went ahead and tried to automate everything 🤦 Lemme change it in the evening when I have proper time!

Comment on lines 166 to +175
parser.add_argument(
"-c",
"--tokenizer_model_path",
"-m",
"--mamba2_model_type",
type=str,
default="mamba_ssm",
const="mamba_ssm",
required=True,
choices=("codestral", "mamba_ssm"),
help="The model type the conversion will be performed on. Can choose from either `codestral` or `mamba_ssm`.",
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User-dependent choice on conversion. For now, defaults to mamba_ssm as codestral has a repo which can be used so I'd expect more people to use it locally for the paper models.

@vasqu vasqu mentioned this pull request Aug 14, 2024
5 tasks
@vasqu vasqu requested a review from molbap August 14, 2024 15:30
@vasqu vasqu force-pushed the base-mamba2-conversion branch from 32d8333 to 52ca549 Compare August 20, 2024 20:05
@vasqu
Copy link
Contributor Author

vasqu commented Aug 20, 2024

With all the fixes/patches in previous PRs, this PR should be ready now.

I think there are still two core things to consider:

  • Is the addition of the Mamba2Tokenizer fine?
  • Is the model dictionary _MAMBA2_MODELS_DICT fine with the partial functions or is this overengineered? 👀

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks fo the PR, let's try to aim for simplicity wherever we can!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this. Overall this should still be up to the user, and the tokenizers should be set to left when converting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally wanted to use the gpt neox tokenizer of mamba1 and override the padding side but I wasn't aware how to do it. So when saving and reloading it is back to the non-overriden padding side. (see #32580 (comment))

That's how this separate tokenizer came to be. Is there an easy way to override a padding side? So loading the tokenizer, override the padding side, and on reload the padding side should persist 👀

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this:

self.padding_side = kwargs.pop("padding_side", self.padding_side)
I think we should update the GPTTokenizerFast to take padding_side as input and a test making sure loading / saving does not overwrite it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, it's so simple 😆

Removed the mamba2 tokenizer now and added padding_side as kwarg to the gptneox tokenizer. I tested locally to see if the padding side persists and it indeed does so. There is no test file for the gptneox tokenizer, should I create one or is there somewhere else I should write this test in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the kwarg again from the tokenizer file as it's passed even without it. Testing still remains 👀

Comment on lines 46 to 60
hf_config = Mamba2Config()

# Switch to a different dict depending on model type
config_dict = mamba2_model_dict

# Set important values from config and recalculate other resulting entries
hf_config.hidden_size = config_ssm[config_dict["hidden_size"]]
hf_config.num_heads = (hf_config.hidden_size * hf_config.expand) // hf_config.head_dim
hf_config.num_hidden_layers = config_ssm[config_dict["num_hidden_layers"]]
hf_config.n_groups = config_ssm.get(config_dict["n_groups"], 1)
hf_config.residual_in_fp32 = config_ssm[config_dict["residual_in_fp32"]]
hf_config.tie_word_embeddings = config_ssm[config_dict["tie_word_embeddings"]]
hf_config.bos_token_id = config_dict["bos_token_id"]
hf_config.pad_token_id = config_dict["pad_token_id"]
hf_config.eos_token_id = config_dict["eos_token_id"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could also just init with the correct values from the config_dict here. MOst of the names match so **config_dict might be simpler for things that match one to one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not so sure if I'd like that since it would get convoluted which attributes would be overriden, e.g. n_groups would be overriden in codestral but not the original paper models.

Comment on lines 91 to 121
_MAMBA2_MODELS_DICT = {
"codestral": {
"hidden_size": "dim",
"num_hidden_layers": "n_layers",
"n_groups": "n_groups",
"residual_in_fp32": "residual_in_fp32",
"tie_word_embeddings": "tie_embeddings",
"vocab_size": "vocab_size",
"pad_vocab_size_multiple": "pad_vocab_size_multiple",
"bos_token_id": 0,
"pad_token_id": 1,
"eos_token_id": 2,
"config_name": "params.json",
"load_state_dict": partial(load_state_dict_from_safetensors, ckpt_name="consolidated.safetensors"),
"load_and_save_tokenizer": partial(load_and_save_tokenizer, "codestral"),
},
"mamba_ssm": {
"hidden_size": "d_model",
"num_hidden_layers": "n_layer",
"n_groups": "ngroups",
"residual_in_fp32": "residual_in_fp32",
"tie_word_embeddings": "tie_embeddings",
"vocab_size": "vocab_size",
"pad_vocab_size_multiple": "pad_vocab_size_multiple",
"bos_token_id": 0,
"pad_token_id": 0,
"eos_token_id": 0,
"config_name": "config.json",
"load_state_dict": partial(load_state_dict_from_torch, ckpt_name="pytorch_model.bin"),
"load_and_save_tokenizer": partial(load_and_save_tokenizer, "mamba_ssm"),
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for any parameters that have similar names, IMO it's useless to have them in the mapping. (like bocab size, residual etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True that! I will change that later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified it now, and removed residual_in_fp32 entirely as the default values are shared across all of them.

@ArthurZucker
Copy link
Collaborator

Feel free to ping me for another review once ready!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu
Copy link
Contributor Author

vasqu commented Aug 28, 2024

@ArthurZucker Should be good for review 👀 just checked locally to make sure it works.

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it's a bit breaking but good for me!

@ArthurZucker ArthurZucker merged commit 92a75ff into huggingface:main Aug 29, 2024
8 checks passed
@vasqu vasqu deleted the base-mamba2-conversion branch August 29, 2024 09:49
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
* first attempt at allowing both conversions from codestral and from the original mamba ssm

* allow fp16, seems default for mamba2

* dtype fix

* simplify codestral check, dont overwrite pad/eos/bos when codestral

* change file -> directory

* use path join to be safe

* style

* apply code review
- add util mamba2 tokenizer (gptneox with left padding)
- add models dict

* fix copies

* add tokenizer to docs

* empty commit to check for weird err

* make conversion user dependent on model type, defaults for original paper models

* small comment nit

* remove norm_before_gate in conversion

* simplify model dict by using shared keys directly + remove unnecessary attributes

* fix tokenization: remove separate mamba2 tokenizer, add padding option as kwarg to gptneox one and reuse it for the conversion script

* simplify even further as we pass padding side via **kwargs already
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
* first attempt at allowing both conversions from codestral and from the original mamba ssm

* allow fp16, seems default for mamba2

* dtype fix

* simplify codestral check, dont overwrite pad/eos/bos when codestral

* change file -> directory

* use path join to be safe

* style

* apply code review
- add util mamba2 tokenizer (gptneox with left padding)
- add models dict

* fix copies

* add tokenizer to docs

* empty commit to check for weird err

* make conversion user dependent on model type, defaults for original paper models

* small comment nit

* remove norm_before_gate in conversion

* simplify model dict by using shared keys directly + remove unnecessary attributes

* fix tokenization: remove separate mamba2 tokenizer, add padding option as kwarg to gptneox one and reuse it for the conversion script

* simplify even further as we pass padding side via **kwargs already
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
* first attempt at allowing both conversions from codestral and from the original mamba ssm

* allow fp16, seems default for mamba2

* dtype fix

* simplify codestral check, dont overwrite pad/eos/bos when codestral

* change file -> directory

* use path join to be safe

* style

* apply code review
- add util mamba2 tokenizer (gptneox with left padding)
- add models dict

* fix copies

* add tokenizer to docs

* empty commit to check for weird err

* make conversion user dependent on model type, defaults for original paper models

* small comment nit

* remove norm_before_gate in conversion

* simplify model dict by using shared keys directly + remove unnecessary attributes

* fix tokenization: remove separate mamba2 tokenizer, add padding option as kwarg to gptneox one and reuse it for the conversion script

* simplify even further as we pass padding side via **kwargs already
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* first attempt at allowing both conversions from codestral and from the original mamba ssm

* allow fp16, seems default for mamba2

* dtype fix

* simplify codestral check, dont overwrite pad/eos/bos when codestral

* change file -> directory

* use path join to be safe

* style

* apply code review
- add util mamba2 tokenizer (gptneox with left padding)
- add models dict

* fix copies

* add tokenizer to docs

* empty commit to check for weird err

* make conversion user dependent on model type, defaults for original paper models

* small comment nit

* remove norm_before_gate in conversion

* simplify model dict by using shared keys directly + remove unnecessary attributes

* fix tokenization: remove separate mamba2 tokenizer, add padding option as kwarg to gptneox one and reuse it for the conversion script

* simplify even further as we pass padding side via **kwargs already
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024
* first attempt at allowing both conversions from codestral and from the original mamba ssm

* allow fp16, seems default for mamba2

* dtype fix

* simplify codestral check, dont overwrite pad/eos/bos when codestral

* change file -> directory

* use path join to be safe

* style

* apply code review
- add util mamba2 tokenizer (gptneox with left padding)
- add models dict

* fix copies

* add tokenizer to docs

* empty commit to check for weird err

* make conversion user dependent on model type, defaults for original paper models

* small comment nit

* remove norm_before_gate in conversion

* simplify model dict by using shared keys directly + remove unnecessary attributes

* fix tokenization: remove separate mamba2 tokenizer, add padding option as kwarg to gptneox one and reuse it for the conversion script

* simplify even further as we pass padding side via **kwargs already
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error when converting "state-spaces/mamba2-130m" weights to huggingface-compatible format
5 participants