Releases: huggingface/optimum-neuron
v0.0.7: Stable diffusion, `transformers` pipeline and cache fix
Stable diffusion
Stable diffusion models can now be compiled with neuronx-cc for inference on inf2 / trn1 instances.
The components exported from StableDiffusionPipeline are:
- CLIP text encoder
- VAE decoder
- UNet
- VAE_post_quant_conv
The export can be done with optimum-cli as follows:
optimum-cli export neuron --model stabilityai/stable-diffusion-2-1-base --task stable-diffusion --batch_size 1 --num_channels 4 --height 64 --width 64 --sequence_length 32 sd_neuron/
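For scripted exports, the same command can be assembled from Python. The sketch below only builds and prints the argument list from the flags shown above; `build_export_command` is a hypothetical helper, not part of optimum-neuron, and actually running the command requires an inf2 / trn1 environment with optimum-neuron installed.

```python
import shlex


def build_export_command(model, output_dir, task=None, **static_shapes):
    """Assemble an `optimum-cli export neuron` argument list.

    Hypothetical helper for illustration: it only mirrors the flags shown
    in these release notes and does not validate them.
    """
    cmd = ["optimum-cli", "export", "neuron", "--model", model]
    if task is not None:
        cmd += ["--task", task]
    # Static input shapes are passed as --name value pairs, in the order given.
    for name, value in static_shapes.items():
        cmd += [f"--{name}", str(value)]
    cmd.append(output_dir)
    return cmd


cmd = build_export_command(
    "stabilityai/stable-diffusion-2-1-base",
    "sd_neuron/",
    task="stable-diffusion",
    batch_size=1,
    num_channels=4,
    height=64,
    width=64,
    sequence_length=32,
)
print(shlex.join(cmd))
# To actually run the export: subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a string avoids shell quoting issues if it is later passed to `subprocess.run`.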
Relevant PR: #101
Guide: Exporting stable diffusion to neuron
transformers pipeline support
Pipelines running on Inferentia instances are now supported.
They can be used with an online export as follows:
from optimum.neuron.pipelines import pipeline
clf = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", export=True)
clf("Amazon is a great company")
# [{'label': 'POSITIVE', 'score': 0.9998538494110107}]
clf = pipeline("question-answering")
clf({"context": "This is a sample context", "question": "What is the context here?"})
# {'score': 0.4972594678401947, 'start': 8, 'end': 16, 'answer': 'a sample'}
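The `start` and `end` fields in the question-answering output are character offsets into the context, which can be checked with plain slicing. A minimal sketch using the output shown above (`extract_answer` is a hypothetical helper, not a library function):

```python
def extract_answer(context, prediction):
    """Slice the answer span out of the context using character offsets."""
    return context[prediction["start"]:prediction["end"]]


context = "This is a sample context"
# Prediction copied from the example output above.
prediction = {"score": 0.4972594678401947, "start": 8, "end": 16, "answer": "a sample"}

span = extract_answer(context, prediction)
print(span)
```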
Or with precompiled models as follows:
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForQuestionAnswering, pipeline
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
# Loading the PyTorch checkpoint and converting to the neuron format by providing export=True
model = NeuronModelForQuestionAnswering.from_pretrained(
"deepset/roberta-base-squad2",
export=True
)
neuron_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = neuron_qa(question=question, context=context)
Relevant PR: #107
Cache repo fix
The cache repo system was broken starting from Neuron 2.11.
This release fixes that; the relevant PR is #119.
v0.0.6: Patch release
v0.0.5: NeuronModel classes and generation methods during training
NeuronModel classes
NeuronModel classes allow you to run inference on Inf1 and Inf2 instances while preserving the Python interface you are used to from the Transformers auto model classes.
Example:
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained(
"optimum/distilbert-base-uncased-finetuned-sst-2-english-neuronx"
)
model = NeuronModelForSequenceClassification.from_pretrained(
"optimum/distilbert-base-uncased-finetuned-sst-2-english-neuronx"
)
inputs = tokenizer("Hamilton is considered to be the best musical of human history.", return_tensors="pt")
outputs = model(**inputs)
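The returned outputs can then be post-processed as with a regular Transformers model. The sketch below assumes the output exposes a `logits` field of shape (batch, num_labels) and that the label mapping is {0: NEGATIVE, 1: POSITIVE}; both are assumptions to check against the model's config, and the logit values are made up for illustration. It uses plain Python so that it runs without a Neuron device.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label mapping; the real one should come from the model's config.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Made-up logits standing in for one row of the model's output logits.
example_logits = [-2.0, 3.0]

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": id2label[best], "score": probs[best]})
```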
Supported tasks are:
- Feature extraction
- Masked language modeling
- Text classification
- Token classification
- Question answering
- Multiple choice
Relevant PR: #45
Generation methods
Two generation methods are now supported:
This allows you to perform evaluation with generation while training decoder and seq2seq models.
Misc
The Optimum CLI now provides two new commands to help manage the cache:
v0.0.4: Patch release for Neuron installation
optimum-cli neuron cache command line
The optimum-cli now provides two commands to work with the Trainium cache:
- Cache creation: optimum-cli neuron cache create
- Cache setting: optimum-cli neuron cache set
Documentation
- New Trainium model cache documentation page
v0.0.3: Patch release for the `huggingface_hub` library version
Pins the version of the huggingface_hub library to be greater than or equal to 0.14.0.
This should fix the errors related to #41.
v0.0.2: Compilation caching system and inference with Inferentia
Compilation caching system
Compiling a model before training can be a real bottleneck (on small datasets, for example, compilation can take longer than training itself), so this release introduces a caching system directly connected to the Hugging Face Hub.
Before starting compilation, the TrainiumTrainer checks whether the needed compiled files are on the Hub and fetches them if so, which saves the user from having to compile or download them manually.
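The lookup-before-compile flow can be pictured with the following sketch. It is not the TrainiumTrainer implementation: the in-memory dict stands in for a cache repo on the Hub, and `cache_key`, `fake_compile`, `get_compiled`, and the version string are hypothetical.

```python
import hashlib
import json

# Stand-in for the remote cache repo on the Hub (hypothetical).
remote_cache = {}
compile_calls = 0


def cache_key(model_config, compiler_version):
    """Derive a deterministic key from what affects the compiled artifact."""
    payload = json.dumps(
        {"config": model_config, "compiler": compiler_version}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def fake_compile(model_config):
    """Placeholder for the slow compilation step."""
    global compile_calls
    compile_calls += 1
    return f"compiled-artifact-for-{model_config['name']}"


def get_compiled(model_config, compiler_version):
    """Fetch from the cache if present, otherwise compile and store."""
    key = cache_key(model_config, compiler_version)
    if key in remote_cache:
        return remote_cache[key]  # cache hit: skip compilation
    artifact = fake_compile(model_config)
    remote_cache[key] = artifact  # store so that later runs can reuse it
    return artifact


config = {"name": "bert-base-uncased", "batch_size": 16, "sequence_length": 128}
first = get_compiled(config, "example-compiler-version")
second = get_compiled(config, "example-compiler-version")
print(first == second, compile_calls)
```

Including the compiler version in the key reflects the general point that artifacts built by one compiler release are not assumed to be reusable with another.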
Custom cache repo
Since users might want their own cache repo, either to be able to push to it or to keep things private, a custom repo can be specified via the CUSTOM_CACHE_REPO environment variable:
CUSTOM_CACHE_REPO=michaelbenayoun/cache_test python train.py
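Inside a script, such a variable would typically be read with a fallback. A minimal sketch, in which `DEFAULT_CACHE_REPO` and `resolve_cache_repo` are hypothetical and do not reflect the names or defaults used by the library:

```python
import os

# Hypothetical default; the library's actual default repo is not shown here.
DEFAULT_CACHE_REPO = "example-org/default-neuron-cache"


def resolve_cache_repo(environ=os.environ):
    """Return the custom cache repo if set and non-empty, else the default."""
    return environ.get("CUSTOM_CACHE_REPO") or DEFAULT_CACHE_REPO


print(resolve_cache_repo({}))
print(resolve_cache_repo({"CUSTOM_CACHE_REPO": "michaelbenayoun/cache_test"}))
```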
Neuron export
Adds support for exporting PyTorch models to a serialized TorchScript module compiled by the Neuron compiler (neuron-cc or neuronx-cc), which can then be used on AWS INF2 or INF1.
Example: Export the BERT model with static shapes:
optimum-cli export neuron --help
optimum-cli export neuron --model bert-base-uncased --sequence_length 128 --batch_size 16 bert_neuron/
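Because the exported model is compiled for fixed input shapes, inputs generally have to be padded or truncated to those shapes at inference time. The sketch below illustrates the idea with plain lists; `pad_to_static_shape` is a hypothetical helper, the pad id of 0 is an assumption, and the token ids are illustrative. In practice a tokenizer called with padding="max_length", truncation=True, and max_length=128 handles the sequence dimension.

```python
def pad_to_static_shape(batch, batch_size, sequence_length, pad_id=0):
    """Pad or truncate token id lists to (batch_size, sequence_length)."""
    if len(batch) > batch_size:
        raise ValueError(
            f"got {len(batch)} sequences, model expects at most {batch_size}"
        )
    # Truncate long sequences, then right-pad short ones.
    rows = [
        seq[:sequence_length] + [pad_id] * max(0, sequence_length - len(seq))
        for seq in batch
    ]
    # Fill the remaining batch slots with all-padding rows.
    rows += [[pad_id] * sequence_length for _ in range(batch_size - len(rows))]
    return rows


# Illustrative token ids, not produced by a real tokenizer call.
token_ids = [[101, 7592, 102], [101, 2088, 999, 102]]
padded = pad_to_static_shape(token_ids, batch_size=16, sequence_length=128)
print(len(padded), len(padded[0]))
```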
By default, on INF2, matmul operations are cast from fp32 to bf16, and on INF1 all operations are cast to bf16. Use --auto_cast to configure which operations are auto-cast and --auto_cast_type to define the data type used for auto-casting.
Example: Auto-cast all operations to the fp16 data type (this option can potentially lower precision/accuracy):
optimum-cli export neuron --model bert-base-uncased --auto_cast all --auto_cast_type fp16 bert_neuron/
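The precision trade-off comes from the narrower mantissa of the reduced types. As a rough illustration, bf16 keeps the upper 16 bits of an fp32 value (1 sign bit, 8 exponent bits, 7 mantissa bits). The sketch below emulates that by truncation in plain Python; real hardware and compilers usually round rather than truncate, so treat the numbers as indicative only. `to_bf16` is a hypothetical helper, not part of any Neuron tooling.

```python
import struct


def to_bf16(value):
    """Emulate bf16 by keeping only the top 16 bits of the fp32 encoding."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    truncated = bits & 0xFFFF0000  # drop the low 16 mantissa bits
    (result,) = struct.unpack(">f", struct.pack(">I", truncated))
    return result


x = 3.14159
print(x, to_bf16(x), abs(x - to_bf16(x)))
```

The same reasoning applies to fp16, which uses a different bit layout (5 exponent bits, 10 mantissa bits) and therefore trades range for precision differently.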
v0.0.1: Training on AWS Trainium
The following architectures can be trained on AWS Trainium instances (trn1.2xlarge and trn1.32xlarge):
- ALBERT
- BERT
- DistilBERT
- RoBERTa
- XLM-RoBERTa
- CamemBERT
- Electra
- GPT-2
- GPT-Neo
- MarianMT
- T5
- BART
- ViT
Training examples for many tasks are provided here.