
Add RemBERT model code to huggingface #10692

Merged: 808 commits merged into huggingface:master on Jul 24, 2021

Conversation

@Iwontbecreative (Contributor) commented Mar 12, 2021

Add RemBERT model to Huggingface ( https://arxiv.org/abs/2010.12821 ).

This adds code to support the RemBERT model in Huggingface.

In terms of implementation, this is roughly a scaled-up version of mBERT with ALBERT-like factorized embeddings and an ALBERT-style tokenizer.
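Roughly, the factorized input embedding looks like this (a simplified sketch, not the exact modeling_rembert.py code; the sizes are approximate RemBERT config values):

```python
from torch import nn

class FactorizedEmbeddingSketch(nn.Module):
    """Simplified: the real embeddings also add position/token-type embeddings,
    LayerNorm and dropout around the lookup."""

    def __init__(self, vocab_size=250300, input_embedding_size=256, hidden_size=1152):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, input_embedding_size)
        self.embedding_projection = nn.Linear(input_embedding_size, hidden_size)

    def forward(self, input_ids):
        # (batch, seq) -> (batch, seq, input_embedding_size) -> (batch, seq, hidden_size)
        return self.embedding_projection(self.word_embeddings(input_ids))
```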

Still needs to be done:

  • Check results validity
  • Upload model to model hub
  • FastTokenizer version
  • More testing
  • TF code

Fixes #9711

Before submitting

Who can review?

@LysandreJik seems appropriate here.

@Iwontbecreative (Contributor, Author) commented Mar 13, 2021

@LysandreJik I've been mostly following: https://huggingface.co/transformers/add_new_model.html so far.
I'm now trying to add the tokenizer. I think I am good for the slow one, but I'm not sure what to do for the fast one, in particular how to generate the tokenizer.json-style files (e.g. https://huggingface.co/t5-small/resolve/main/tokenizer.json ). Do you have any pointers for that?

I also see that the doc mentions there is no fast version for sentencepiece, which this model uses. Is that the case, given that T5 seems to have one?

Edit: I seem to have found a way to add a FastTokenizer version; the doc still seems out of sync.
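For anyone following along, the route I found looks roughly like this (a sketch; it assumes a converter for this tokenizer is registered in convert_slow_tokenizer.py, which this PR adds, and it uses AlbertTokenizer as a stand-in since the slow tokenizer is ALBERT-like):

```python
from transformers import AlbertTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

# Hypothetical local sentencepiece file name.
slow_tokenizer = AlbertTokenizer("sentencepiece.model")
fast_backend = convert_slow_tokenizer(slow_tokenizer)  # returns a tokenizers.Tokenizer
fast_backend.save("tokenizer.json")                    # the file hosted on the model hub
```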

@Iwontbecreative (Contributor, Author):

For the TF code, I'm struggling a bit to initialize an output embedding layer in TFRemBertLMPredictionHead() that interacts well with _resize_token_embeddings. Welcoming any suggestions as to what the right approach is there.
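Judging from other TF models in the repo, the pattern that resize_token_embeddings seems to expect is roughly this (a sketch with illustrative names, not the final implementation):

```python
import tensorflow as tf

class TFLMPredictionHeadSketch(tf.keras.layers.Layer):
    """Sketch of the bias bookkeeping that resize_token_embeddings relies on."""

    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)
        self.config = config

    def build(self, input_shape):
        # Decoupled output embedding: its own matrix, independent of the input embeddings.
        self.decoder = self.add_weight(
            name="decoder/kernel",
            shape=[self.config.output_embedding_size, self.config.vocab_size],
            initializer="zeros",
        )
        self.bias = self.add_weight(name="bias", shape=[self.config.vocab_size], initializer="zeros")
        super().build(input_shape)

    def get_bias(self):
        return {"bias": self.bias}

    def set_bias(self, value):
        # Called by resize_token_embeddings with the (possibly resized) bias.
        self.bias = value["bias"]
        self.config.vocab_size = value["bias"].shape[0]
```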

@Iwontbecreative changed the title [WIP] Add RemBERT model code to huggingface → Add RemBERT model code to huggingface on Mar 17, 2021
@Iwontbecreative marked this pull request as ready for review March 17, 2021 02:21
@Iwontbecreative (Contributor, Author) commented Mar 17, 2021

@LysandreJik: There are still some things to iron out, but I think this is ready for a first look.

What is missing:

  • Model hub upload
  • Minor discrepancy between model and original tf implementation

What I'd like some input on:

  • I'm having some issues with the TFRemBertLMPredictionHead implementation. I'd like to initialize a new projection from hidden_size to vocab_size (embeddings are decoupled), but I'm struggling to find how to make my implementation compatible with all the get_bias/set_bias details so that it's resize_embeddings-friendly. Any chance you could help here? This is the culprit for the failing tests, AFAICT.
  • Model hub upload: should this be done at top level (how?) or on the Google org model hub?
  • I'm finding a discrepancy between this implementation and the original tf one. Results are exactly equal up to the first hidden layer (so embeddings and up-projection). On the first layer it differs, but by small amounts (~0.002); the difference eventually increases up to 0.007. Any idea what the common culprits are here? This is just the standard BERT model and the differences are small, so maybe numerical stability?

@julien-c (Member) commented Mar 17, 2021

Model hub upload: should this be done at top level (how?) or on the Google org model hub?

Top-level models were for legacy (historical) integrations; we now namespace all models. If this work was conducted at Google, then yes, google is the right namespace! Do you want us to add you to the google org?

@LysandreJik (Member) left a comment

Hi @Iwontbecreative, thanks a lot for your contribution! In addition to what Julien said:

I'm having some issues with the TFRemBertLMPredictionHead implementation.

We'd be happy to help you on that front; we'll take a look ASAP.

I'm finding a discrepancy between this implementation and the original tf one.

A difference of ~1e-3 doesn't look too bad, but looking at the integration test you have provided, the difference seems noticeable. Is it possible that a bias is missing, or that it is something to do with attention masks?

If it proves impossible to get the two implementations closer to each other, then we'll rely on a fine-tuning test: if we can obtain similar results on the same dataset with the two implementations, then we'll be good to go.
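For reference, the kind of tolerance check we use in integration tests looks like this (a sketch with placeholder numbers; `model` and `input_ids` are assumed to come from the test setup):

```python
import torch

# Placeholder values only; a real test pastes numbers produced by the original TF model.
expected_slice = torch.tensor(
    [[[0.0254, -0.0164, 0.3981], [0.1313, 0.0315, 0.2687], [0.1111, 0.0062, 0.2443]]]
)
output = model(input_ids)[0]  # `model` and `input_ids` come from the test setup
assert torch.allclose(output[:, :3, :3], expected_slice, atol=1e-4)
```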

src/transformers/models/rembert/__init__.py
src/transformers/models/rembert/modeling_rembert.py
@Iwontbecreative (Contributor, Author):

A difference of ~1e-3 doesn't look too bad, but looking at the integration test you have provided, the difference seems noticeable. Is it possible that a bias is missing, or that it is something to do with attention masks?

Not impossible, but given that the transformer section is simply BERT, I doubt it. It also seems like the results would change more if that were the case.

If it proves impossible to get the two implementations closer to each other, then we'll rely on a fine-tuning test: if we can obtain similar results on the same dataset with the two implementations, then we'll be good to go.

I've tried to do that for a bit; unfortunately it is hard to fine-tune this model on XNLI in a Colab (training gets interrupted too early). I will try to see if I can get a better fine-tuning setup.

@Iwontbecreative (Contributor, Author):

Model hub upload: should this be done at top level (how?) or on the Google org model hub?

Top-level models were for legacy (historical) integrations; we now namespace all models. If this work was conducted at Google, then yes, google is the right namespace! Do you want us to add you to the google org?

That would be helpful, though I'm no longer affiliated with Google, so I'm not sure what the usual policy is there. If it is OK, that will be easier than having to send the checkpoints to @hwchung so he can upload them.

@julien-c (Member):

That would be helpful, though I'm no longer affiliated with Google, so I'm not sure what the usual policy is there.

Ultimately the org admins should decide, but for now I think it's perfectly acceptable if you're a part of the org. I added you manually.

@LysandreJik (Member):

@Iwontbecreative I opened a PR on your branch that should fix all the failing tests here: Iwontbecreative#1

I've separated each test suite (TF, style, docs) into three commits if you want to look at smaller portions at a time.

@Iwontbecreative (Contributor, Author):

Thanks Lysandre. I have not forgotten about this issue; I just need one more approval from Google to open-source the checkpoint, so I'm waiting on that.

@huggingface deleted a comment from the github-actions bot May 3, 2021
@LysandreJik (Member):

Sure @Iwontbecreative, let us know when all is good!

@github-actions (bot):

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions bot closed this Jun 6, 2021
@Iwontbecreative (Contributor, Author):

Reopen: this is still in progress; I just got the open-sourcing done from Google last Friday. I'm travelling for a few days and can tackle this afterwards. See: https://github.com/google-research/google-research/tree/master/rembert

@julien-c reopened this Jun 7, 2021
@LysandreJik added the WIP label Jun 7, 2021
@LysandreJik (Member):

That's great news @Iwontbecreative!

@Iwontbecreative (Contributor, Author):

(Finally) some updates on this now that the Google open sourcing is done.

  • Updated the PR for the latest transformers
  • Included @LysandreJik's change (manually instead of merging your PR since I messed up the order, sorry about that)
  • Uploaded the model to the model hub under iwontbecreative/rembert

The main issue that still exists is the discrepancy between my implementation and the tf one; ideas welcome on this one. The code should be easier for everyone to test now that the tf version is available and mine is uploaded to the model hub.
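A minimal smoke test against the uploaded checkpoint would look like this (a sketch; the class names are the ones this PR adds):

```python
import torch
from transformers import RemBertModel, RemBertTokenizer

tokenizer = RemBertTokenizer.from_pretrained("iwontbecreative/rembert")
model = RemBertModel.from_pretrained("iwontbecreative/rembert")
model.eval()

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```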

@LysandreJik, I think this is ready for another look. Thanks for the patience here!

@LysandreJik (Member):

Welcome back! We'll take a look shortly. Do you have an example of code to run to quickly test this version with the TF one?

@Iwontbecreative (Contributor, Author):

Sadly, the TensorFlow code to run the model is not available externally.
I spent some time this morning trying to replicate it, but it is hard because it relies on https://github.com/google-research/bert, which has dependencies that are no longer available on pip.

The main change to the modelling code is here:
https://github.com/google-research/bert/blob/master/modeling.py#L813-L823
which needs to be replaced with:
https://github.com/google-research/albert/blob/master/modeling.py#L1083-L1088
On the tokenization front, it is mainly a matter of replacing BertTokenizer with AlbertTokenizer.
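In pseudo-TF1 terms, the modelling change boils down to an extra dense projection after the embedding lookup (illustrative only, not the exact upstream code):

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()
batch, seq, vocab, embedding_size, hidden_size = 2, 8, 1000, 256, 1152
input_ids = tf.zeros([batch, seq], dtype=tf.int32)
embedding_table = tf.get_variable("embedding_table", [vocab, embedding_size])

# BERT keeps the table at [vocab, hidden_size]; ALBERT/RemBERT look up at a small
# embedding_size and then project up to hidden_size with one dense layer.
word_embeddings = tf.nn.embedding_lookup(embedding_table, input_ids)  # [batch, seq, embedding_size]
embedding_output = tf.layers.dense(word_embeddings, hidden_size)      # [batch, seq, hidden_size]
```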

I do, however, have example inputs and outputs run by my co-author:

Model outputs

Example modelling outputs at several layers for different input_ids:
https://pastebin.com/t9bPFmeM
This is the [batch, length, :3] section of the [batch, length, hidden] outputs.

Tokenization outputs

https://pastebin.com/j6D6YE1e
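To compare against those dumps, a per-layer loop along these lines should work (a sketch; the input ids below are placeholders, the real ones are in the pastebin):

```python
import torch
from transformers import RemBertModel

model = RemBertModel.from_pretrained("iwontbecreative/rembert", output_hidden_states=True)
model.eval()
input_ids = torch.tensor([[312, 56498, 313]])  # placeholder ids, not the ones from the dump

with torch.no_grad():
    hidden_states = model(input_ids).hidden_states  # embedding output + one entry per layer

for layer, h in enumerate(hidden_states):
    print(layer, h[:, :, :3])  # the [batch, length, :3] slice, to match the reference dump
```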

@LysandreJik (Member) left a comment

Thanks! In that case, we'll try to reproduce the fine-tuning results to ensure that the implementation is indeed correct.

There are a few # Copied from ... statements missing; could you add them? I've put some proposals where some of them should be, and there should be more in the PyTorch/TensorFlow implementations.
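For reference, the markers look like this (module path illustrative):

```python
from torch import nn

# utils/check_copies.py verifies that the body below stays in sync with the BERT
# original, with the Bert->RemBert renames applied.
# Copied from transformers.models.bert.modeling_bert.BertSelfOutput with Bert->RemBert
class RemBertSelfOutput(nn.Module):
    ...
```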

Thank you!

src/transformers/convert_slow_tokenizer.py
src/transformers/modeling_tf_utils.py
src/transformers/models/auto/modeling_auto.py
src/transformers/models/rembert/configuration_rembert.py
src/transformers/models/rembert/configuration_rembert.py
@Iwontbecreative (Contributor, Author):

Fine-tuning is done; here is what I was able to run, comparing performance on XNLI:

https://docs.google.com/spreadsheets/d/1gWWSLo7XxEZkXpX272tQoZBXTgs96IFvh-fwqVqihM0/edit#gid=0

Performance matches in English but does seem to be lower in other languages. We used more hyperparameter tuning at Google, but I do not think that explains the whole difference for those languages. I think there might be a subtle difference that is causing both the slightly different results and the worse fine-tuning outcomes. The model is still much better than random, so most of it should be there.

@LysandreJik (Member):

Performance does look pretty similar, and good enough for me to merge it.

There are a few # Copied from statements still missing though, as mentioned in my previous message, in both the PyTorch and TensorFlow implementations. Do you mind adding them? If you're lacking time, let me know and I'll do it myself.

ctheodoris and others added 4 commits July 15, 2021 13:14
…uggingface#11825)

get_length_grouped_indices() in LengthGroupedSampler and DistributedLengthGroupedSampler
is prohibitively slow for a large number of megabatches (in the test case it takes hours for ~270k
megabatches with 100 items each) due to slow list concatenation with sum(megabatches, []).

Resolves: huggingface#11795

Co-authored-by: ctheodoris <cvtheodo@ds.dfci.harvard.edu>
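(For context on that fix: sum(list_of_lists, []) re-copies the accumulated list at every step, so it is quadratic in the total size. A quick sketch of the linear alternatives:)

```python
import itertools

megabatches = [[i] * 100 for i in range(1000)]

flat_quadratic = sum(megabatches, [])                          # O(n^2): copies grow each step
flat_listcomp = [x for batch in megabatches for x in batch]    # O(n)
flat_chain = list(itertools.chain.from_iterable(megabatches))  # O(n)
assert flat_quadratic == flat_listcomp == flat_chain
```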
* fix_torch_device_generate_test

* remove @

* change pytorch import to flax import
@Iwontbecreative (Contributor, Author):

Hi @LysandreJik

  • Added the # Copied from statements
  • Merged with latest master
  • Uploaded the model to the google org

It seems mostly ready, though tests fail at the utils/check_copies.py stage of make quality.
I am actually not sure what the issue is in this instance; any chance you could help investigate/fix/merge after?

@Iwontbecreative (Contributor, Author) commented Jul 16, 2021

Actually, I managed to find the issue:

utils/check_copies.py crashes without a helpful error message if the # Copied from statement is placed before the decorator. I was just overeager with my Copied from statements.
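Concretely, the ordering difference is this (a sketch; the decorator and docstring stub are just for illustration):

```python
from transformers.file_utils import add_start_docstrings_to_model_forward

REMBERT_INPUTS_DOCSTRING = "..."  # stub so the snippet parses

class RemBertModelSketch:
    # Wrong (crashes utils/check_copies.py): marker placed above the decorator.
    #
    #   # Copied from transformers.models.bert.modeling_bert.BertModel.forward
    #   @add_start_docstrings_to_model_forward(REMBERT_INPUTS_DOCSTRING)
    #   def forward(self, input_ids=None): ...

    # Right: marker directly above the def, after any decorators.
    @add_start_docstrings_to_model_forward(REMBERT_INPUTS_DOCSTRING)
    # Copied from transformers.models.bert.modeling_bert.BertModel.forward
    def forward(self, input_ids=None):
        ...
```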

Also renamed rembert-large to rembert since this is the only version we are open-sourcing at this time.

Edit: Not sure why the model templates check is failing, but I think this should be ready for merge after one last review.

@LysandreJik (Member):

Fantastic, thanks a lot @Iwontbecreative! I'll take a final look, fix the model templates issue and ping another reviewer.

@LysandreJik (Member) left a comment

This looks great! Thanks for adding all the # Copied from statements. I just checked locally and I believe the model templates error is unrelated to your PR. I'll have to check, but running the model templates locally yields no error, so I think we can merge this PR without this test passing; we'll make sure it passes on master (or fix it if it doesn't).

Pinging other reviewers to give your PR a second look.

Thanks for the great work!

src/transformers/modeling_tf_utils.py
src/transformers/models/rembert/tokenization_rembert.py
```python
self.dropout = nn.Dropout(config.hidden_dropout_prob)

# position_ids (1, len position emb) is contiguous in memory and exported when serialized
self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
```
@patrickvonplaten (Contributor) commented Jul 19, 2021

(nit) This also applies to other models, but I'm not really a fan of registering the position ids here as a parameter. I think if they aren't passed they should be generated on the fly. Logically, it just doesn't make much sense to have them in the weight dictionary in PyTorch, IMO. In Flax, one would never register this input as a parameter, so at the moment these parameters are reported as missing when loading such a model in Flax. cc @LysandreJik @sgugger

@LysandreJik (Member):

For newer additions, this buffer could be registered as a non-persistent buffer, as was done in BERT (only for torch 1.6+):

```python
if version.parse(torch.__version__) > version.parse("1.6.0"):
    self.register_buffer(
        "token_type_ids",
        torch.zeros(self.position_ids.size(), dtype=torch.long, device=self.position_ids.device),
        persistent=False,
    )
```

Without this, the models cannot be traced with TorchScript and, if I recall correctly, cannot be used with other inference engines such as onnxruntime (cc @mfuntowicz).

@Iwontbecreative (Contributor, Author):

Just making sure: is @LysandreJik's pointer what I should do for the code (replacing it with torch.arange)?
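(For reference, the on-the-fly alternative would look roughly like this, as a sketch:)

```python
import torch

def forward(self, input_ids, position_ids=None):
    if position_ids is None:
        # Generate position ids on the fly instead of slicing a registered buffer.
        seq_length = input_ids.shape[1]
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
    ...
```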

```python
def __init__(self, config):
    super().__init__(config)

    config.num_labels = 2
```
Contributor:

I think it's better not to change the config here silently, but rather to just set self.num_labels = 2. A model that is loaded with this class and then saved again will have its config changed silently, no?

@Iwontbecreative (Contributor, Author):

This is a good point, and I have removed this line.

However, I think the issue comes from the cookiecutter template (see modeling_{{cookiecutter.lowercase_modelname}}.py, lines 1467 and 3010). As such, mbart, roformer, led, bart, big_bird, and big_bird_pegasus are all impacted by this.

```python
    ]
)

# Running on the original tf implementation gives slightly different results here.
```
Contributor:

Hmm, it would be nice to have an integration test with only a 1e-3 tolerance. IMO, it would be important to find out what the difference is before merging.

@Iwontbecreative (Contributor, Author):

Yes, I agree, and I have tried a few things to bridge the gap, including running things layer by layer.

We know that the discrepancy starts from the first layer onwards. Results are equal for the tf and hf versions after we up-project the embeddings, but differ after the first layer.

Unfortunately, I no longer have access to the tf codebase inside Google, and it has been hard to get sublayer outputs to understand which op exactly explains the difference. I also tried to run a home-baked version of the tf codebase, but had trouble installing an old enough version of tf and running it. Ultimately, I mostly looked at the fine-tuning results, which are good, though there is still a gap there. @hwchung might be able to help, but he has been pretty busy lately.

@patrickvonplaten (Contributor) left a comment

Looks very good to me already! Thanks a lot for your work @Iwontbecreative!
I'm a bit concerned about the integration test not giving identical results, and IMO it would be worth finding out exactly why before merging.
One other thing that might be cleaner would be to remove the position_embedding_type logic in all modules, at the cost of having to remove # Copied from statements (interested in hearing @LysandreJik's and @sgugger's opinions on this).

@sgugger (Collaborator) left a comment

Thanks a lot for adding those new models. It looks like the template you used was a bit outdated, left quite a few suggestions to fix everything and have it up to date :-)

Also, the checkpoint used everywhere was rembert when it should be google/rembert.

docs/source/index.rst
docs/source/model_doc/rembert.rst
src/transformers/models/auto/configuration_auto.py
src/transformers/models/rembert/modeling_rembert.py
src/transformers/models/rembert/modeling_rembert.py
src/transformers/models/rembert/tokenization_rembert.py
@Iwontbecreative (Contributor, Author):

@patrickvonplaten Thanks for the helpful feedback; I incorporated most of it. See the comment on possible changes needed to the cookiecutter templates to address one of your comments in the future.
For the discrepancy in results, see my answer above.

@sgugger Regarding the older template: yes, this PR ended up being delayed by the slow open-sourcing process at Google, so the templates were a bit out of date. Thanks for catching most of the mistakes.

@Iwontbecreative (Contributor, Author):

Hi @LysandreJik, are there any last remaining steps before this can be merged? I would like to get this in to avoid further rebases if possible.

@LysandreJik (Member):

I think this can be merged - thanks for your effort @Iwontbecreative, fantastic addition!

@LysandreJik merged commit 434022a into huggingface:master Jul 24, 2021
@LysandreJik (Member):

I'm not entirely sure why there were 88 authors involved or 250 commits squashed into a single one, but I did verify that only your changes were merged.

Could you let me know how you handled the merge/rebasing of this branch, so that I may understand what happened w.r.t. the number of commits included?

@Iwontbecreative (Contributor, Author):

I think I just merged master's changes into my branch to ensure it was up to date with upstream. Maybe I needed to rebase instead?

@aapot commented Jul 31, 2021

Hi @Iwontbecreative thanks for adding the RemBERT model! Do you have a list of the 110 languages used in the pretraining of the model?

@Iwontbecreative (Contributor, Author):

Sure, here's the list:

['af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'bs', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'hr', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Hans', 'zh-Hant', 'zh-Latn', 'zu']

cf. https://github.com/google-research/google-research/tree/master/rembert

Labels: WIP

Successfully merging this pull request may close these issues: Add support for RemBERT