
Releases: huggingface/transformers

Patch release: v4.33.3

27 Sep 15:09

A patch release was made for the following three commits:

  • DeepSpeed ZeRO-3 handling when resizing embedding layers (#26259)
  • [doc] Always call it Agents for consistency (#25958)
  • deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler (#25863)

Patch release: v4.33.2

15 Sep 20:24

A patch release was done for these two commits:

  • Fix pad to multiple of (#25732)
  • fix _resize_token_embeddings will set lm head size to 0 when enabled deepspeed zero3 (#26024)

Falcon, Code Llama, ViTDet, DINO v2, VITS

06 Sep 21:14


Falcon

Falcon is a class of causal decoder-only models built by TII. The largest Falcon checkpoints have been trained on >=1T tokens of text, with a particular emphasis on the RefinedWeb corpus. They are made available under the Apache 2.0 license.

Falcon’s architecture is modern and optimized for inference, with multi-query attention and support for efficient attention variants like FlashAttention. Both ‘base’ models, trained only as causal language models, and ‘instruct’ models, which have received further fine-tuning, are available.

Code Llama

Code Llama is a family of large language models for code based on Llama 2, providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction-following ability for programming tasks.


ViTDet

ViTDet reuses the ViT model architecture, adapted to object detection.


DINO v2

DINO v2 is the next iteration of the DINO model. It is added as a backbone class, allowing it to be re-used in downstream models.


VITS

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational autoencoder (VAE) comprised of a posterior encoder, decoder, and conditional prior.

Breaking changes:

  • 🚨🚨🚨 [Refactor] Move third-party related utility files into integrations/ folder 🚨🚨🚨 by @younesbelkada in #25599

Moves all utility files related to third-party libraries (outside the HF ecosystem) into integrations/ instead of keeping them directly in transformers.

To keep the previous usage working, change your import as follows:

- from transformers.deepspeed import HfDeepSpeedConfig
+ from transformers.integrations import HfDeepSpeedConfig
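
For code that has to run on releases both before and after this move, one option is to try the new location first and fall back to the old one. The following is a minimal sketch of that idea; the helper name import_first is hypothetical and is not part of Transformers.

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that provides it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # this module path does not exist in the installed version
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name} not found in any of: {', '.join(module_paths)}")


# Intended usage (requires transformers to be installed):
# HfDeepSpeedConfig = import_first(
#     "HfDeepSpeedConfig",
#     ["transformers.integrations", "transformers.deepspeed"],
# )
```

Listing the new path first means the old import is only attempted on older installations.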

Bugfixes and improvements


Patch release: v4.32.1

28 Aug 12:48

Patch release including several patches from v4.31.0, listed below:

  • Put IDEFICS in the right section of the doc (#25650)
  • removing unnecesssary extra parameter (#25643)
  • [SPM] Patch spm Llama and T5 (#25656)
  • Fix bloom add prefix space (#25652)
  • Generate: add missing logits processors docs (#25653)
  • [idefics] small fixes (#25764)

IDEFICS, GPTQ Quantization

22 Aug 13:11


IDEFICS

The IDEFICS model was proposed in OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh.

IDEFICS is the first open state-of-the-art visual language model at the 80B scale!

The model accepts arbitrary sequences of image and text and produces text, similarly to a multimodal ChatGPT.

Playground: HuggingFaceM4/idefics_playground



MPT

MPT has been added and is now officially supported within Transformers. The repositories from MosaicML have been updated to work best with the model integration within Transformers.

GPTQ Integration

GPTQ quantization is now supported in Transformers, through the optimum library. The backend relies on the auto_gptq library, from which we use the GPTQ and QuantLinear classes.

See below for an example of the API, quantizing a model using the new GPTQConfig configuration utility.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_name = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# 4-bit quantization, calibrated on the "c4" dataset
config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer, group_size=128, desc_act=False)
# works also with device_map (cpu offload works but not disk offload)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, quantization_config=config)

Most models under the TheBloke namespace with the suffix GPTQ should be supported. For example, to load the GPTQ-quantized model TheBloke/Llama-2-13B-chat-GPTQ, simply run (after installing the latest optimum and auto-gptq libraries):

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "TheBloke/Llama-2-13B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

For more information about this feature, we recommend taking a look at the following announcement blogpost:


Text-to-audio pipeline

A new pipeline, dedicated to text-to-audio and text-to-speech models, has been added to Transformers. It currently supports the three text-to-audio models integrated into transformers: SpeechT5ForTextToSpeech, MusicGen and Bark.

See below for an example:

from transformers import pipeline

# the task is inferred from the model; Bark is a text-to-audio model
synthesizer = pipeline(model="suno/bark")
output = synthesizer("Hey it's HuggingFace on the phone!")

audio = output["audio"]
sampling_rate = output["sampling_rate"]

Classifier-Free Guidance decoding

Classifier-Free Guidance decoding is a text generation technique developed by EleutherAI, announced in this paper. With this technique, you can increase prompt adherence in generation. You can also set it up with negative prompts, ensuring your generation doesn't go in specific directions. See its docs for usage instructions.
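
To make the idea concrete, here is a minimal sketch of how guided scores are commonly formed for one decoding step: the model is run once with the prompt (conditional) and once without it or with a negative prompt (unconditional), and the two log-probability vectors are extrapolated. The function names and the choice to combine in log-probability space are assumptions made for illustration; this is not the library's implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cfg_scores(cond_logits, uncond_logits, guidance_scale):
    """Combine conditional and unconditional next-token scores.

    guidance_scale = 1.0 reproduces the conditional distribution;
    larger values push the scores further away from the unconditional one.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


# Toy vocabulary of three tokens: the prompt favours token 0,
# the unconditional (or negative-prompt) pass favours token 2.
guided = cfg_scores([2.0, 1.0, 0.0], [0.0, 1.0, 2.0], guidance_scale=2.0)
best_token = guided.index(max(guided))
```

With a scale above 1, tokens that the prompt makes more likely than the unconditional pass gain score, and tokens favoured by the negative prompt lose score.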

Task guides

A new task guide covering Visual Question Answering has been added to Transformers.

Model deprecation

We continue the deprecation of models that was introduced in #24787.

By deprecating, we indicate that we will stop maintaining such models, but there is no intention of actually removing those models and breaking support for them (they might one day move into a separate repo/on the Hub, but we would still add the necessary imports to make sure backward compatibility stays). The main point is that we stop testing those models. The usage of the models drives this choice and aims to ease the burden on our CI so that it may be used to focus on more critical aspects of the library.

Translation Efforts

There are ongoing efforts to translate the Transformers documentation into other languages. These efforts are driven by groups independent of Hugging Face, and we greatly appreciate their work, which further lowers the barrier of entry to ML and Transformers.

If you'd like to kickstart such an effort or help out on an existing one, please feel free to reach out by opening an issue.

Explicit input data format for image processing

An input_data_format argument has been added to image transforms and ImageProcessor methods, allowing the user to explicitly set the data format of the images being processed. This enables processing of images with a non-standard number of channels (e.g. 4) and removes errors that occurred when the data format was inferred but the channel dimension was ambiguous.

import numpy as np
from transformers import ViTImageProcessor

img = np.random.randint(0, 256, (4, 6, 3))
image_processor = ViTImageProcessor()
inputs = image_processor(img, image_mean=0, image_std=1, input_data_format="channels_first")
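
To see why an explicit format helps, consider a simple shape-based heuristic for guessing the channel axis. The sketch below is illustrative only: the function name and the set of "plausible" channel counts are assumptions, and the inference logic actually used by Transformers may differ.

```python
def guess_channel_axis(shape, plausible_channels=(1, 3)):
    """Guess the channel axis of a 3-D image shape, or raise if it cannot be decided."""
    first_ok = shape[0] in plausible_channels
    last_ok = shape[-1] in plausible_channels
    if first_ok and last_ok:
        raise ValueError(f"ambiguous channel axis for shape {shape}")
    if first_ok:
        return "channels_first"
    if last_ok:
        return "channels_last"
    raise ValueError(f"no plausible channel axis for shape {shape}")


# A heuristic like this reads (4, 6, 3) as a 4x6 RGB image,
# even if the caller actually holds a 4-channel image of size 6x3.
guessed = guess_channel_axis((4, 6, 3))
```

Passing input_data_format removes the need for any such guess, which matters both for ambiguous shapes such as (3, 5, 3) and for channel counts a heuristic would not expect.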

Documentation clarification about efficient inference through torch.scaled_dot_product_attention & Flash Attention

Users may not be aware that it is possible to force torch.scaled_dot_product_attention to dispatch to the Flash Attention kernels. This leads to considerable speedups and memory savings, and is also compatible with quantized models. We decided to make this explicit to users in the documentation.

  • [Docs / BetterTransformer ] Added more details about flash attention + SDPA : #25265

In a nutshell, one can just run:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
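# Flash Attention kernels require half precision (fp16 or bf16)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16).to("cuda")

# convert the model to BetterTransformer
model.to_bettertransformer()

input_text = "Hello my dog is cute and"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# force scaled_dot_product_attention to dispatch to the Flash Attention kernel
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    outputs = model.generate(**inputs)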

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

to enable Flash Attention in their model. However, this feature does not support padding yet.

FSDP and DeepSpeed Changes

Users will no longer encounter CPU RAM OOM when using FSDP to train very large models in multi-GPU or multi-node multi-GPU settings.
Users no longer have to pass fsdp_transformer_layer_cls_to_wrap, as the code now uses _no_split_modules by default, which is available for most of the popular models. DeepSpeed ZeRO-3 init now works properly with Accelerate Launcher + Trainer.

Breaking changes

Default optimizer in the Trainer class

The defaul...


v4.31.0: Llama v2, MusicGen, Bark, MMS, EnCodec, InstructBLIP, Umt5, MRa, vIvIt

18 Jul 20:16

New models

Llama v2

Llama 2 was proposed in Llama 2: Open Foundation and Fine-Tuned Chat Models by Hugo Touvron et al. It builds upon the Llama architecture, adding Grouped Query Attention for efficient inference.


MusicGen

The MusicGen model was proposed in the paper Simple and Controllable Music Generation by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.

MusicGen is a single stage auto-regressive Transformer model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. The text descriptions are passed through a frozen text encoder model to obtain a sequence of hidden-state representations. MusicGen is then trained to predict discrete audio tokens, or audio codes, conditioned on these hidden-states. These audio tokens are then decoded using an audio compression model, such as EnCodec, to recover the audio waveform.

Through an efficient token interleaving pattern, MusicGen does not require a self-supervised semantic representation of the text/audio prompts, thus eliminating the need to cascade multiple models to predict a set of codebooks (e.g. hierarchically or upsampling). Instead, it is able to generate all the codebooks in a single forward pass.


Bark

Bark is a transformer-based text-to-speech model proposed by Suno AI in suno-ai/bark.


MMS

The MMS model was proposed in Scaling Speech Technology to 1,000+ Languages by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.


EnCodec

The EnCodec neural codec model was proposed in High Fidelity Neural Audio Compression by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi.


InstructBLIP

The InstructBLIP model was proposed in InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. InstructBLIP leverages the BLIP-2 architecture for visual instruction tuning.


UMT5

The UMT5 model was proposed in UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.


MRA

The MRA model was proposed in Multi Resolution Analysis (MRA) for Approximate Self-Attention by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, and Vikas Singh.


ViViT

The ViViT model was proposed in ViViT: A Video Vision Transformer by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. The paper proposes one of the first successful sets of pure-transformer-based models for video understanding.

Python 3.7

The last version to support Python 3.7 was 4.30.x. Python 3.7 reached end-of-life on June 27, 2023 and is no longer supported by the Python Software Foundation.

PyTorch 1.9

The last version to support PyTorch 1.9 was 4.30.x. As it has been more than 2 years, and we're looking forward to using features available in PyTorch 1.10 and up, we do not support PyTorch 1.9 for v4.31 and up.

RoPE scaling

This PR adds RoPE scaling to the LLaMa and GPTNeoX families of models. It allows us to extrapolate and go beyond the original maximum sequence length (e.g. 2048 tokens on LLaMA), without fine-tuning. It offers two strategies:

  • Linear scaling
  • Dynamic NTK scaling
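
The two strategies can be sketched in plain Python as follows. This illustrates the commonly used formulas rather than the code from the PR; the function names and the defaults (base 10000, 2048 original positions) are assumptions chosen for the example.

```python
def rope_inv_freq(dim, base=10000.0):
    """Inverse frequencies of rotary embeddings for a head dimension `dim`."""
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]


def linear_scaled_positions(seq_len, factor):
    """Linear scaling: squeeze positions so that seq_len fits the original range."""
    return [p / factor for p in range(seq_len)]


def dynamic_ntk_base(seq_len, dim, max_positions=2048, factor=2.0, base=10000.0):
    """Dynamic NTK scaling: enlarge the base only once the input exceeds the original length."""
    if seq_len <= max_positions:
        return base
    scale = (factor * seq_len / max_positions) - (factor - 1)
    return base * scale ** (dim / (dim - 2))


# Linear scaling changes the positions and keeps the frequencies;
# dynamic NTK keeps the positions and changes the frequencies.
positions = linear_scaled_positions(seq_len=4096, factor=2.0)
inv_freq = rope_inv_freq(dim=128, base=dynamic_ntk_base(seq_len=4096, dim=128))
```

In the library, the strategy is selected through the rope_scaling field of the model config, for example {"type": "linear", "factor": 2.0} or {"type": "dynamic", "factor": 2.0}.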


Agents

Tools now return a type that is specific to agents. This type can return a serialized version of itself (a string), that either points to a file on-disk or to the object's content. This should make interaction with text-based systems much simpler.

Tied weights load

Models with potentially tied weights dropped some keys from the state dict even when the weights were not tied. This has now been fixed and, more generally, the whole experience of loading a model with a state dict that doesn't match exactly should be improved in this release.

Whisper word-level timestamps

This PR adds a method of predicting timestamps at the word (or even token) level, by analyzing the cross-attentions and applying dynamic time warping.
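
The alignment step can be illustrated with a small dynamic time warping routine over a token-by-frame matrix. This is a toy sketch, not Whisper's implementation: the attention values are made up, and using 1 minus the attention weight as the cost is an assumption for the example.

```python
def dtw_path(cost):
    """Return the minimum-cost monotonic alignment path through a cost matrix."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # Backtrack from the bottom-right corner to recover the path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return path


# Rows are text tokens, columns are audio frames (made-up attention weights).
attention = [
    [0.9, 0.8, 0.1, 0.1],
    [0.1, 0.2, 0.9, 0.9],
]
cost = [[1.0 - a for a in row] for row in attention]
path = dtw_path(cost)

# The first frame aligned to each token marks where that token starts.
first_frame = {}
for token, frame in path:
    first_frame.setdefault(token, frame)
```

Multiplying each start frame by the duration of one frame then gives a timestamp per token, and token timestamps can be grouped into words.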

Auto model addition

A new auto model is added, AutoModelForTextEncoding. It is to be used when you want to extract the text encoder from an encoder-decoder architecture.

Model deprecation

Transformers is growing a lot, and to ease the maintenance burden on our side we have decided to deprecate models that are not used much. Those models will never actually disappear from the library, but we will stop testing them or accepting PRs modifying them.

The criterion used to identify models to deprecate was fewer than 1,000 unique downloads in the last 30 days for models that are at least one year old. The list of deprecated models is:

  • BORT
  • M-CTC-T
  • MMBT
  • RetriBERT
  • Trajectory Transformer
  • VAN

Breaking changes

Fixes an issue with stripped spaces for the T5 family tokenizers. If this negatively impacts inference or training with your models, please let us know by opening an issue.

Bugfixes and improvements


v4.30.2: Patch release

13 Jun 19:29

v4.30.1 Patch release

09 Jun 15:58

v4.30.0: 100k, Agents improvements, Safetensors core dependency, Swiftformer, Autoformer, MobileViTv2, timm-as-a-backbone

08 Jun 18:07


100k

Transformers has just reached 100k stars on GitHub. To celebrate, we wanted to highlight 100 projects in the vicinity of transformers, and we have created an awesome-transformers page to do just that.

We accept PRs to add projects to the list!

4-bit quantization and QLoRA

By leveraging the bitsandbytes library by @TimDettmers, we add 4-bit support to transformers models!


Agents

The Agents framework has been improved and continues to be stabilized. Alongside bug fixes, here are the important new features that were added:

  • Local agent capabilities, to load a generative model directly from transformers instead of relying on APIs.
  • Prompts are now hosted on the Hub, which means that anyone can fork the prompts and update them with theirs, to let other community contributors re-use them
  • We add an AzureOpenAiAgent class to support Azure OpenAI agents.


Safetensors

The safetensors library is a safe serialization framework for machine learning tensors. It has been audited and will become the default serialization framework for several organizations (Hugging Face, EleutherAI, Stability AI).

It has now become a core dependency of transformers.

New models


SwiftFormer

The SwiftFormer paper introduces a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations in the self-attention computation with linear element-wise multiplications. A series of models called ‘SwiftFormer’ is built based on this, which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Even their small variant achieves 78.5% top-1 ImageNet1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2× faster compared to MobileViT-v2.


Autoformer

This model augments the Transformer as a deep decomposition architecture, which can progressively decompose the trend and seasonal components during the forecasting process.


MobileViTV2

MobileViTV2 is the second version of MobileViT, constructed by replacing the multi-headed self-attention in MobileViT with separable self-attention.


PerSAM

PerSAM proposes a minimal modification to SAM to allow DreamBooth-like personalization, making it possible to segment concepts in new images using just one example.

Timm backbone

We add support for loading timm weights within the AutoBackbone API in transformers. timm models can be instantiated through the TimmBackbone class, and then used with any vision model that needs a backbone.

Image to text pipeline conditional support

We add conditional text generation to the image-to-text pipeline, allowing the model to continue generating from an initial text prompt according to an image.

  • [image-to-text pipeline] Add conditional text support + GIT by @NielsRogge in #23362

TensorFlow implementations

Accelerate Migration

A major rework of the internals of the Trainer is underway, leveraging accelerate instead of redefining its features in transformers. This should unify both frameworks and lead to increased interoperability and more efficient development.

Bugfixes and improvements


v4.29.2: Patch release

16 May 19:47

Fixes the package so non-Python files (like CUDA kernels) are properly included.