Releases: huggingface/transformers

Bug fixes related to input shape in TensorFlow and tokenization messages

03 Dec 16:23

Input shapes

This patch fixes a bug related to input shapes in several TensorFlow models.

Tokenization message

A tokenization message was printed far too often, cluttering the output and hiding the relevant information. It has been removed.

ALBERT, CamemBERT, DistilRoberta, GPT-2 XL, and Encoder-Decoder architectures

26 Nov 19:26

New model architectures: ALBERT, CamemBERT, GPT-2 XL, DistilRoberta

Four new models have been added in v2.2.0:

  • ALBERT (Pytorch & TF) (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
  • CamemBERT (Pytorch) (from Facebook AI Research, INRIA, and La Sorbonne Université), the first large-scale Transformer language model for French. Released alongside the paper CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de la Clergerie, Djame Seddah, and Benoît Sagot. It was added by @louismartin with the help of @julien-c.
  • DistilRoberta (Pytorch & TF) from @VictorSanh as the third distilled model after DistilBERT and DistilGPT-2.
  • GPT-2 XL (Pytorch & TF) as the last GPT-2 checkpoint released by OpenAI

Encoder-Decoder architectures

It is now possible to create fully seq2seq models by incorporating Encoder-Decoder architectures, using a PreTrainedEncoderDecoder class that can be initialized from pre-trained models. The base BERT class has been modified so that it can behave as a decoder.

Furthermore, a Model2Model class has been added that simplifies the definition of an encoder-decoder when both the encoder and the decoder are based on the same model. @rlouf
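Below is a minimal sketch of how the new classes can be used, based on the v2.2.0 API; the exact keyword names (e.g. decoder_lm_labels) are assumptions and may differ from the released interface.

import torch
from transformers import BertTokenizer, Model2Model

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Model2Model ties together an encoder and a decoder that share the same architecture
model = Model2Model.from_pretrained('bert-base-uncased')

question = torch.tensor([tokenizer.encode("Who was Jim Henson?")])
answer = torch.tensor([tokenizer.encode("Jim Henson was a puppeteer")])

# Keyword arguments prefixed with decoder_ are routed to the decoder (assumed convention);
# the loss is computed with a language-modeling objective on the decoder inputs
outputs = model(question, answer, decoder_lm_labels=answer)
loss = outputs[0]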

Benchmarks and performance improvements

Work by @tlkh and @LysandreJik to benchmark the library's models across different setups: TensorFlow and Pytorch, mixed precision (AMP and FP-16), and model tracing (Torchscript and XLA). A new benchmarks section was created in the documentation, pointing to Google sheets with the results.

Breaking changes

Tokenizers now add special tokens by default. @LysandreJik
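A minimal sketch of the new default behaviour:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Special tokens ([CLS]/[SEP] for BERT) are now added by default
ids_with_special = tokenizer.encode("Hello world")

# The previous behaviour can be recovered explicitly
ids_plain = tokenizer.encode("Hello world", add_special_tokens=False)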

New model templates

Model templates to ease the addition of new models to the library have been added. @thomwolf

Inputs Embeddings

A new input has been added to all models' forward (for Pytorch) and call (for TensorFlow) methods: inputs_embeds, a directly embedded representation of the inputs. This is useful because it gives more control over how input_ids indices are converted into associated vectors than the model's internal embedding lookup matrix. @julien-c
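For example (a minimal sketch; here the embeddings are simply taken from the model's own lookup table, but any float tensor of shape (batch, sequence length, hidden size) can be passed):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

input_ids = torch.tensor([tokenizer.encode("Hello world")])

# Standard path: the model looks up input_ids in its internal embedding matrix
outputs = model(input_ids)

# New path: compute (or modify) the embedded representation yourself
inputs_embeds = model.get_input_embeddings()(input_ids)
outputs = model(inputs_embeds=inputs_embeds)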

Getters and setters for input and output embeddings

A new API for the input and output embeddings is available. These methods are model-independent and allow easy retrieval/modification of the models' embeddings. @thomwolf
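A minimal sketch, assuming the accessor names get_input_embeddings / set_input_embeddings / get_output_embeddings:

from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained('bert-base-uncased')

input_embeddings = model.get_input_embeddings()    # nn.Embedding holding the word embedding matrix
output_embeddings = model.get_output_embeddings()  # projection layer used by the LM head

# The same accessors work for every model; a custom embedding module can be plugged in with the setter
model.set_input_embeddings(input_embeddings)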

Additional architectures

New model architectures are available, namely DistilBertForTokenClassification and CamembertForTokenClassification. @stefan-it

Community additions/bug-fixes/improvements

  • The Fairseq RoBERTa model conversion script has been patched. @louismartin
  • einsum now runs in FP-16 in the library's examples @slayton58
  • In-depth work on the squad script for XLNet to reproduce the original paper's results @hlums
  • Additional improvements on the run_squad script by @WilliamTambellini, @orena1
  • The run_generation script has seen several improvements by @leo-du
  • The RoBERTa TensorFlow model has been patched for several use-cases: TPU and keras.fit @LysandreJik
  • The documentation is now versioned, links are available on the github readme @LysandreJik
  • The run_ner script has seen several improvements @mmaybeno, @oneraghavan, @manansanghi
  • The run_tf_glue script now works for all GLUE tasks @LysandreJik
  • The run_lm_finetuning script now correctly evaluates perplexity on MLM tasks @altsoph
  • An issue related to the XLM TensorFlow implementation's training has been fixed @tlkh
  • run_bertology has been updated to be closer to the run_glue example @adrianbg
  • Fixed added special tokens in decoded sequences @LysandreJik
  • Several performance improvements have been done to the tokenizers @iedmrc
  • A memory leak has been identified and patched in the library's schedulers @rlouf
  • Correct warning when encoding a sequence too long while specifying a maximum length @LysandreJik
  • Resizing the token embeddings now works as expected in the run_lm_finetuning script @iedmrc
  • The version difference between PyPI and source required to run the examples has been clarified @rlouf

CTRL, DistilGPT-2, Pytorch TPU, tokenizer enhancements, guideline requirements

11 Oct 14:50

New model architectures: CTRL, DistilGPT-2

Two new models have been added since release 2.0.

Distillation

Several updates have been made to the distillation script, including the possibility to distill GPT-2 and to distill on the SQuAD task. By @VictorSanh.

Pytorch TPU support

The run_glue.py example script can now run on a Pytorch TPU.

Updates to example scripts

Several example scripts have been improved and refactored to use the full potential of the new tokenizer functions.

QOL enhancements on the tokenizer

Enhancements have been made to the tokenizers. Two new methods have been added: get_special_tokens_mask and truncate_sequences.

The former returns a mask indicating which tokens in a token list are special tokens and which come from the initial sequences. The latter truncates sequences according to a given strategy.

Both of those methods are called by the encode_plus method, which itself is called by the encode method. encode_plus now returns a larger dictionary which holds information about the special tokens, as well as the overflowing tokens.
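A minimal sketch of encode_plus with the new options (the exact set of returned keys may vary with the arguments passed):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

encoded = tokenizer.encode_plus(
    "The first sequence.",
    "A second, much longer sequence that may need to be truncated.",
    add_special_tokens=True,
    max_length=16,
    truncation_strategy='longest_first',
)
# encoded is a dictionary containing, among others, the input ids and token type ids
print(encoded['input_ids'])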

Thanks to @julien-c, @thomwolf, and @LysandreJik for these additions.

New German BERT models

Breaking changes

  • The two methods add_special_tokens_single_sequence and add_special_tokens_sequence_pair have been removed. They have been replaced by the single method build_inputs_with_special_tokens, which has a more comprehensible name and handles both single sequences and sequence pairs (see the sketch after this list).

  • The boolean parameter truncate_first_sequence has been removed from the tokenizers' encode and encode_plus methods. It is replaced by a truncation strategy in the form of a string: 'longest_first', 'only_second', 'only_first' or 'do_not_truncate'.

  • When the encode or encode_plus methods are called with a specified max_length, sequences will now always be truncated, or an error will be thrown if they overflow.
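A minimal sketch of the replacement method:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

ids_a = tokenizer.encode("First sequence", add_special_tokens=False)
ids_b = tokenizer.encode("Second sequence", add_special_tokens=False)

# Replaces add_special_tokens_single_sequence and add_special_tokens_sequence_pair
single = tokenizer.build_inputs_with_special_tokens(ids_a)
pair = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)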

Guidelines and requirements

New contributing guidelines have been added, alongside library development requirements by @rlouf, the newest member of the HuggingFace team.

Community additions/bug-fixes/improvements

  • GLUE Processors have been refactored to handle inputs for all tasks coming from the tensorflow_datasets. This work has been done by @agrinh and @philipp-eisen.
  • The padding_idx is now correctly initialized to 1 in randomly initialized RoBERTa models. @ikuyamada
  • The documentation CSS has been adapted to work on older browsers. @TimYagan
  • An addition concerning the management of hidden states has been added to the README by @BramVanroy.
  • Integration of TF 2.0 models with other Keras modules @thomwolf
  • Past values can be opted-out @thomwolf

Superseded by v2.1.1

11 Oct 14:47
v2.1.0

Adds version 2.1.0 for PyPi

v2.0.0 - TF 2.0/PyTorch interoperability, improved tokenizers, improved torchscript support

26 Sep 11:48

Name change: welcome 🤗 Transformers

Following the extension to TensorFlow 2.0, pytorch-transformers => transformers

Install with pip install transformers

Also, note that PyTorch is no longer in the requirements so don't forget to install TensorFlow 2.0 and/or PyTorch to be able to use (and load) the models.

TensorFlow 2.0 - PyTorch

All the PyTorch nn.Module classes now have their counterpart in TensorFlow 2.0 as tf.keras.Model classes. TensorFlow 2.0 classes have the same name as their PyTorch counterparts prefixed with TF.

The interoperability between TensorFlow and PyTorch is actually a lot deeper than what is usually meant when talking about libraries with multiple backends:

  • each model (not just the static computation graph) can be seamlessly moved from one framework to the other during its lifetime for training/evaluation/usage (from_pretrained can load weights saved in either framework),
  • an example is given in the TF 2.0/PyTorch quick-tour in the readme, in which a model is trained using keras.fit before being loaded in PyTorch for quick debugging/inspection; a minimal sketch is shown after this list.
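A minimal sketch of moving the same weights between the two frameworks (assuming the from_tf/from_pt flags of from_pretrained):

from transformers import BertModel, TFBertModel

# Fine-tune in TensorFlow 2.0 with Keras, then save...
tf_model = TFBertModel.from_pretrained('bert-base-uncased')
tf_model.save_pretrained('./shared-checkpoint/')

# ...and reload the very same weights in PyTorch for quick debugging/inspection
pt_model = BertModel.from_pretrained('./shared-checkpoint/', from_tf=True)
# The reverse direction works as well, with TFBertModel.from_pretrained(..., from_pt=True)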

Remaining unsupported operations in TF 2.0 (to be added later):

  • resizing input embeddings to add new tokens
  • pruning model heads

TPU support

Training on TPU using the free TPUs provided in the TensorFlow Research Cloud (TFRC) program is possible but requires implementing a custom training loop (not possible with keras.fit at the moment).
We will add an example of such a custom training loop soon.

Improved tokenizers

Tokenizers have been improved to provide an extended encoding method, encode_plus, and additional arguments to encode. Please refer to the doc for detailed usage of the new options.

Breaking changes

Positional order of some models' keyword inputs changed (better TorchScript support)

To be able to better use Torchscript both on CPU and GPU (see #1010, #1204 and #1195), the specific order of some models' keyword inputs (attention_mask, token_type_ids...) has been changed.

If you used to call the models with keyword names for keyword arguments, e.g. model(inputs_ids, attention_mask=attention_mask, token_type_ids=token_type_ids), this should not cause any breaking change.

If you used to call the models with positional inputs for keyword arguments, e.g. model(inputs_ids, attention_mask, token_type_ids), you should double-check the exact order of input arguments.

Dependency requirements have changed

PyTorch is no longer in the requirements so don't forget to install TensorFlow 2.0 and/or PyTorch to be able to use (and load) the models.

Renamed method

The method add_special_tokens_sentence_pair has been renamed to the more appropriate name add_special_tokens_sequence_pair.
The same holds true for the method add_special_tokens_single_sentence which has been changed to add_special_tokens_single_sequence.

Community additions/bug-fixes/improvements

DistilBERT, GPT-2 Large, XLM multilingual models, torch.hub, bug fixes

04 Sep 12:18

New model architecture: DistilBERT

Hugging Face's new transformer architecture, DistilBERT, is described in the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT, by Victor Sanh, Lysandre Debut and Thomas Wolf.

This new model architecture comes with two pretrained checkpoints:

  • distilbert-base-uncased: the base DistilBert model
  • distilbert-base-uncased-distilled-squad: DistilBert model fine-tuned with distillation on SQuAD.

New GPT2 checkpoint: GPT-2 large (774M parameters)

The third OpenAI GPT-2 checkpoint is available in the library: 774M parameters, 36 layers, and 20 heads.

New XLM multilingual checkpoints: 17 & 100 languages

We have added two new XLM models in 17 and 100 languages which obtain better performance than multilingual BERT on the XNLI cross-lingual classification task.

Back on torch.hub with all the architectures

The Pytorch-Transformers torch.hub interface is based on Auto-Models, which are generic classes designed to be instantiated using from_pretrained() with a model architecture guessed from the pretrained checkpoint name (e.g. AutoModel.from_pretrained('bert-base-uncased') will instantiate a BertModel and load the 'bert-base-uncased' checkpoint into it). There are currently four classes of Auto-Models: AutoModel, AutoModelWithLMHead, AutoModelForSequenceClassification and AutoModelForQuestionAnswering.
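A minimal sketch of the torch.hub interface (the entry-point names are assumed to mirror the Auto-Model classes above):

import torch

# Load a model and its tokenizer directly through torch.hub
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')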

New dependency: sacremoses

Support for XLM is improved by carefully reproducing the original tokenization workflow (work by @shijie-wu in #1092). We now rely on sacremoses, a python port of Moses tokenizer, truecaser and normalizer by @alvations, for XLM word tokenization.

In a few languages (Thai, Japanese and Chinese), the XLM tokenizer requires additional dependencies. These dependencies are optional at the library level; using the XLM tokenizer for these languages without them will raise an error message with installation instructions. The additional optional dependencies are:

  • pythainlp: Thai tokenizer
  • kytea: Japanese tokenizer, a wrapper of KyTea (needs external C++ compilation), used by the newly released XLM-17 & XLM-100
  • jieba: Chinese tokenizer *

* XLM originally used the Stanford Segmenter for Chinese. However, the nltk.tokenize.stanford_segmenter wrapper is slow due to JVM overhead and will be deprecated. Jieba is much faster and pip-installable, but its segmentation does not exactly match the Stanford Segmenter's. A workaround could be an argument allowing users to segment sentences themselves and bypass the segmenter; nltk.tokenize.stanford_segmenter is also included in the PR as a reference.

Bug fixes and improvements to the library modules

  • Bertology script has seen major improvements (@tuvuumass )
  • Iterative tokenization is now faster and accepts arbitrary numbers of added tokens (@samvelyan)
  • Added RoBERTa to AutoModels and AutoTokenizers (@LysandreJik )
  • Added GPT-2 Large 774M model (@thomwolf )
  • Added language model fine-tuning with GPT/GPT-2 (CLM), BERT/RoBERTa (MLM) (@LysandreJik @thomwolf )
  • Multi-GPU training has been patched (@FeiWang96 )
  • Scripts are updated to reflect Pytorch 1.1.0 changes (scheduler, optimizer) (@Morizeyao, @adai183 )
  • Updated the in-depth BERT fine-tuning scripts to pytorch-transformers (@Morizeyao )
  • Models saved with pruned heads are now saved and reloaded correctly (implemented for GPT, GPT-2, BERT, RoBERTa, XLM) (@LysandreJik @thomwolf)
  • Add proxies and force_download options to from_pretrained() method to be able to use proxies and update cached models/tokenizers (@thomwolf)
  • Add shortcut to each special tokens with _id properties (e.g. tokenizer.cls_token_id for the id in the vocabulary of tokenizer.cls_token) (@thomwolf)
  • Fix GPT2 and RoBERTa tokenizers so that sentences to be tokenized always begin with at least one space (see note by fairseq authors) (@thomwolf)
  • Fix and clean up byte-level BPE tests (@thomwolf)
  • Update the test classes for OpenAI GPT and GPT-2 so that these models are tested against common tests. (@LysandreJik )
  • Fix a warning raised when the decode method is called for a model with no sep_token like GPT-2 (@LysandreJik )
  • Updated the tokenizers saving method (@boy2000-007man)
  • SpaCy tokenizers have been updated in the tokenizers (@GuillemGSubies )
  • Stable EnvironmentErrors have been added to utility files (@abhishekraok )
  • Fixed distributed barrier hang (@VictorSanh )
  • Encoding functions now return the input tokens instead of throwing an error when not implemented in child class (@LysandreJik )
  • Change layer norm code to PyTorch's native layer norm (@dhpollack)
  • Improved tokenization for XLM for multilingual inputs (@shijie-wu)
  • Add language input and access to language to id conversion in XLM tokenizer (@thomwolf)
  • Add pretrained configuration properties for tokenizers with serialization logic (saving/reloading tokenizer configuration) (@thomwolf)
  • Added new AutoModels: AutoModelWithLMHead, AutoModelForSequenceClassification, AutoModelForQuestionAnswering (@LysandreJik)
  • Torch.hub is now based on AutoModels (@LysandreJik @thomwolf)
  • Fix Transformer-XL attention mask dtype to be bool (@CrafterKolyan)
  • Adding DistilBert model architecture and checkpoints (@VictorSanh @LysandreJik @thomwolf)
  • Fixes to DistilBert configuration and training script (@stefan-it)
  • Fix XLNet attention mask for fp16 (@ziliwang)
  • Documentation auto-deploy (@LysandreJik)
  • Fix to add a tuple of tokens (@epwalsh)
  • Update fp16 apex implementation in scripts (@anhnt170489)
  • Fix XLNet bias resizing when adding/removing tokens (@LysandreJik)
  • Fix tokenizer reloading in example scripts (@rabeehk)
  • Fix byte-level decoding error when using added tokens (@thomwolf @LysandreJik)
  • Fix epsilon value in RoBERTa pretrained checkpoints (@julien-c)

New model: RoBERTa, tokenizer sequence pair handling for sequence classification models.

15 Aug 15:31

New model: RoBERTa

RoBERTa (from Facebook), a Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.

Thanks to Myle Ott from Facebook for his help.

Tokenizer sequence pair handling

Tokenizers get two new methods:

tokenizer.add_special_tokens_single_sentence(token_ids)

and

tokenizer.add_special_tokens_sentences_pair(token_ids_0, token_ids_1)

These methods add the model-specific special tokens to sequences. For a sentence pair, a single list of tokens is created, with the cls and sep tokens placed according to the way the model was trained.

Sequence pair examples:

For BERT:

[CLS] SEQUENCE_0 [SEP] SEQUENCE_1 [SEP]

For RoBERTa:

<s> SEQUENCE_0 </s></s> SEQUENCE_1 </s>

Tokenizer encoding function

The tokenizer encode function gets two new arguments:

tokenizer.encode(text, text_pair=None, add_special_tokens=False)

If text_pair is specified, encode will return a tuple of encoded sequences. If add_special_tokens is set to True, the sequences will be built with the model's respective special tokens using the previously described methods.
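A minimal sketch of the extended encode signature:

from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Single sequence, with the model-specific special tokens added
ids = tokenizer.encode("Sequence zero", add_special_tokens=True)

# Sequence pair, combined with the model-specific special tokens ([CLS]/[SEP] for BERT)
pair_ids = tokenizer.encode("Sequence zero", text_pair="Sequence one", add_special_tokens=True)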

AutoConfig, AutoModel and AutoTokenizer

There are three new classes with this release that instantiate one of the base model classes of the library from a pre-trained model configuration: AutoConfig, AutoModel, and AutoTokenizer.

Those classes take as input a pre-trained model name or path and instantiate one of the corresponding classes. The input string indicates to the class which architecture should be instantiated. If the string contains "bert", AutoConfig instantiates a BertConfig, AutoModel instantiates a BertModel and AutoTokenizer instantiates a BertTokenizer.

The same can be done for all the library's base models. The Auto classes check for the associated strings: "openai-gpt", "gpt2", "transfo-xl", "xlnet", "xlm" and "roberta". The documentation associated with this change can be found here.
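A short example of the new classes, following the behaviour described above:

from pytorch_transformers import AutoConfig, AutoModel, AutoTokenizer

# The architecture is inferred from the checkpoint name:
# "bert-base-uncased" contains "bert", so the BERT classes are instantiated
config = AutoConfig.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')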

Examples

Some examples have been refactored to better reflect the current library. Those are simple_lm_finetuning.py, finetune_on_pregenerated.py, and run_glue.py, which has been adapted to the RoBERTa model. The run_squad and run_glue.py examples now have better dataset processing with caching.

Bug fixes and improvements to the library modules

  • Fixed multi-gpu training when using FP16 (@zijunsun)
  • Re-added the possibility to import BertPretrainedModel (@thomwolf)
  • Improvements to tensorflow -> pytorch checkpoints (@dhpollack)
  • Fixed save_pretrained to save the correct added tokens (@joelgrus)
  • Fixed version issues in run_openai_gpt (@rabeehk)
  • Fixed issue with line return with Chinese BERT (@Yiqing-Zhou)
  • Added more flexibility regarding the PretrainedModel.from_pretrained (@xanlsh)
  • Fixed issues regarding backward compatibility to Pytorch 1.0.0 (@thomwolf)
  • Added the unknown token to GPT-2 (@thomwolf)

v1.0.0 - Name change, new models (XLNet, XLM), unified API for models and tokenizer, access to models internals, torchscript

16 Jul 14:50
b33a385

Name change: welcome PyTorch-Transformers 👾

pytorch-pretrained-bert => pytorch-transformers

Install with pip install pytorch-transformers

New models

New pretrained weights

We went from ten (in pytorch-pretrained-bert 0.6.2) to twenty-seven (in pytorch-transformers 1.0) pretrained model weights.

The newly added model weights are, in summary:

  • Two Whole-Word-Masking weights for Bert (cased and uncased)
  • Three Fine-tuned models for Bert (on SQuAD and MRPC)
  • One German model for Bert provided and trained by Deepset.ai (@tholor and @Timoeller) as detailed in their nice blogpost
  • One OpenAI GPT-2 model (medium size model)
  • Two models (base and large) for the newly added XLNet model
  • Eight models for the newly added XLM model

The documentation lists all the models with the shortcut names and we are currently adding full details of the associated pretraining/fine-tuning parameters.

New documentation

New documentation is currently being created at https://huggingface.co/pytorch-transformers/ and should be finalized over the coming days.

Standard API across models

See the readme for a quick tour of the API.

Main points:

  • All models now return tuples with various elements depending on the model and the configuration. The docstrings and documentation list all the expected outputs in order.
  • All models can now return the full list of hidden-states (embeddings output + the output hidden-states of each layer)
  • All models can now return the full list of attention weights (one tensor of attention weights for each layer)
import torch
from pytorch_transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Request the hidden states and attention weights in addition to the last layer's output
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True,
                                  output_attentions=True)
input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
all_hidden_states, all_attentions = model(input_ids)[-2:]

Standard API to add tokens to the vocabulary and the model

Using tokenizer.add_tokens() and tokenizer.add_special_tokens(), one can now easily add tokens to each model's vocabulary. The model's input embeddings can be resized accordingly to add associated word embeddings (to be trained) using model.resize_token_embeddings(len(tokenizer)).

tokenizer.add_tokens(['[SPECIAL_TOKEN_1]', '[SPECIAL_TOKEN_2]'])
model.resize_token_embeddings(len(tokenizer))

Serialization

The serialization methods have been standardized and you probably should switch to the new method save_pretrained(save_directory) if you were using any other serialization method before.

model.save_pretrained('./my_saved_model_directory/')
tokenizer.save_pretrained('./my_saved_model_directory/')

### Reload the model and the tokenizer
model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/')
tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')

Torchscript

All models are now compatible with Torchscript.

model = model_class.from_pretrained(pretrained_weights, torchscript=True)
traced_model = torch.jit.trace(model, (input_ids,))

Examples scripts

The example scripts have been refactored and gathered into three main examples (run_glue.py, run_squad.py and run_generation.py) which are common to several models and are designed to offer SOTA performance on the respective tasks while being a clean starting point for designing your own scripts.

Other example scripts (like run_bertology.py) will be added in the coming weeks.

Breaking-changes

The migration section of the readme lists the breaking changes when switching from pytorch-pretrained-bert to pytorch-transformers.

The main breaking change is that all models now return a tuple of results.

Better model/tokenizer serialization, relax network connection requirements, new scripts and bug fixes

25 Apr 19:47

General updates:

  • Better serialization for all models and tokenizers (BERT, GPT, GPT-2 and Transformer-XL), with best practices for saving/loading in the readme and examples.
  • Relaxed network connection requirements (fall back on the last downloaded model in the cache when AWS cannot be reached to check the ETag)

Breaking changes:

  • The warmup_linear method in OpenAIAdam and BertAdam is now replaced by flexible schedule classes for linear, cosine and multi-cycle schedules.

Bug fixes and improvements to the library modules:

  • add a flag in BertTokenizer to skip basic tokenization (@john-hewitt)
  • Allow tokenization of sequences > 512 (@CatalinVoss)
  • clean up and extend learning rate schedules in BertAdam and OpenAIAdam (@lukovnikov)
  • Update GPT/GPT-2 Loss computation (@CatalinVoss, @thomwolf)
  • Make the TensorFlow conversion tool more robust (@marpaia)
  • fixed BertForMultipleChoice model init and forward pass (@dhpollack)
  • Fix gradient overflow in GPT-2 FP16 training (@SudoSharma)
  • catch exception if pathlib not installed (@potatochip)
  • Use Dropout Layer in OpenAIGPTMultipleChoiceHead (@pglock)

New scripts and improvements to the examples scripts:

v0.6.1 - Small install tweak release

18 Feb 11:01
8f46cd1

Add regex to the requirements for OpenAI GPT-2 tokenizer.