Release v0.6.1 · pytorch/ao

Highlights

We are excited to announce the 0.6.1 release of torchao! This release adds Auto-Round support, Float8 axiswise scaled training, a BitNet training recipe, an implementation of AWQ, and much more!

Auto-Round Support (#581)

Auto-Round is a new weight-only quantization algorithm that has achieved superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling in low-bit quantization (e.g., 2 and 3 bits). Auto-Round supports quantization from 2 to 8 bits, involves low tuning costs, and imposes no additional overhead during inference. Detailed results are available in our paper, GitHub repository, and Hugging Face low-bit quantization leaderboard.

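from torchao.prototype.autoround.core import prepare_model_for_applying_auto_round_
from torchao.prototype.autoround.core import apply_auto_round
from torchao.prototype.autoround.multi_tensor import MultiTensor
from torchao.quantization import quantize_

# model, is_target_module, dataloader, device, and model_device are supplied by the caller
prepare_model_for_applying_auto_round_(
    model,
    is_target_module=is_target_module,
    bits=4,
    group_size=128,
    iters=200,
    device=device,
)

# collect calibration inputs
input_ids_lst = []
for data in dataloader:
    input_ids_lst.append(data["input_ids"].to(model_device))

# run the calibration inputs through the prepared model
multi_t_input_ids = MultiTensor(input_ids_lst)
out = model(multi_t_input_ids)

# replace the target modules with their quantized versions
quantize_(model, apply_auto_round(), is_target_module)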
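
To make the `bits` and `group_size` arguments above more concrete, here is a minimal pure-Python sketch of group-wise round-to-nearest weight quantization, the kind of baseline that rounding-tuning methods such as Auto-Round aim to improve on. It is not torchao's implementation: the function names are hypothetical, asymmetric min/max scaling is an assumption, and a group size of 4 is used only to keep the example readable.

```python
def quantize_group(values, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = 2**bits - 1                      # highest integer code (15 for 4 bits)
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0         # fall back to 1.0 for a constant group
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + lo for c in codes]


def quantize_row(row, bits=4, group_size=4):
    """Split a weight row into groups; each group gets its own scale and offset."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


row = [0.10, -0.20, 0.05, 0.30, 4.00, -3.50, 2.25, 0.75]
groups = quantize_row(row, bits=4, group_size=4)
restored = [v for codes, scale, lo in groups for v in dequantize_group(codes, scale, lo)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Because each group has its own scale, the small-magnitude first group is not forced onto the coarse grid needed by the large-magnitude second group. The worst-case rounding error per value is half of that group's scale.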

Added float8 training axiswise scaling support with per-gemm-argument configuration (#940)

We added experimental support for rowwise scaled float8 gemm to torchao.float8, with per-gemm-input configurability to enable exploration of various recipes. Here is how a user can configure all-axiswise scaling:

import torchao.float8
from torchao.float8.config import Float8LinearRecipeName

# m is the torch.nn.Module to convert, supplied by the caller

# all-axiswise scaling
config = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.ALL_AXISWISE)
m = torchao.float8.convert_to_float8_training(m, config=config)

# or, a custom recipe by @lw where grad_weight is left in bfloat16
config = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.LW_AXISWISE_WITH_GW_HP)
m = torchao.float8.convert_to_float8_training(m, config=config)

Early performance benchmarks show that all-axiswise scaling achieves a 1.13x speedup vs bf16 on torchtitan / LLaMa 3 8B / 8 H100 GPUs (compared to 1.17x from all-tensorwise scaling in the same setup), with loss curves that match both bf16 and all-tensorwise scaling. Further performance and accuracy benchmarks will follow in future releases.
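
As a rough illustration of the difference between tensorwise and axiswise (rowwise) scaling, the sketch below computes the scale factors for a small matrix in pure Python. It is not torchao code: the helper names are hypothetical, and it assumes scales are derived as the float8 e4m3 maximum (448) divided by the absolute maximum of the tensor or row.

```python
E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3
EPS = 1e-12       # guards against division by zero for all-zero inputs


def tensorwise_scale(matrix):
    """One scale for the whole tensor, set by its single largest magnitude."""
    amax = max(abs(v) for row in matrix for v in row)
    return E4M3_MAX / max(amax, EPS)


def rowwise_scales(matrix):
    """One scale per row, each set by that row's largest magnitude."""
    return [E4M3_MAX / max(max(abs(v) for v in row), EPS) for row in matrix]


x = [
    [0.5, -0.25, 0.125],   # small-magnitude row
    [64.0, -32.0, 16.0],   # row with large values
]
t_scale = tensorwise_scale(x)   # shared scale, dominated by the second row
r_scales = rowwise_scales(x)    # the first row gets a much larger scale

# Scaled value of the smallest element under each scheme
small_tensorwise = x[0][2] * t_scale
small_rowwise = x[0][2] * r_scales[0]
```

With a single tensorwise scale, the small row is mapped to values below 4 and uses little of the float8 range. With rowwise scales, it is stretched to fill the range, which is the motivation for exploring axiswise recipes.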

Introduced BitNet b1.58 training recipe (#930)

Adds a recipe for BitNet b1.58 (https://arxiv.org/abs/2402.17764) ternary weights clamping.

from torchao.prototype.quantized_training import bitnet_training
from torchao import quantize_

model = ...
quantize_(model, bitnet_training())

Notably, our implementation uses INT8 Tensor Cores to make up for the speed lost to the extra quantization work during training. In fact, it is faster than BF16 training in most cases.
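
For readers unfamiliar with ternary weights, the following pure-Python sketch shows the absmean quantization described in the BitNet b1.58 paper: divide the weights by their mean absolute value, round, and clamp to {-1, 0, 1}. The function name is hypothetical and this is only an illustration of the formula, not the code path used by `bitnet_training()`.

```python
def ternarize(weights, eps=1e-5):
    """Absmean quantization: scale by mean |W|, round, then clamp to {-1, 0, 1}."""
    flat = [w for row in weights for w in row]
    gamma = sum(abs(w) for w in flat) / len(flat)   # tensor-wise mean absolute value
    scale = gamma + eps                             # eps avoids division by zero
    ternary = [
        [max(-1, min(1, round(w / scale))) for w in row]
        for row in weights
    ]
    return ternary, scale


weights = [[0.9, -0.05, 0.4], [-1.3, 0.02, -0.6]]
ternary, scale = ternarize(weights)
# ternary == [[1, 0, 1], [-1, 0, -1]]
```

Weights much smaller than the mean magnitude collapse to 0, while the rest keep only their sign.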

[Prototype] Implemented Activation Aware Weight Quantization AWQ (#743)

Perplexity and performance measured on A100 GPU:

Model	Quantization	Tokens/sec	Throughput (GB/sec)	Peak Mem (GB)	Model Size (GB)
Llama-2-7b-chat-hf	bfloat16	107.38	1418.93	13.88	13.21
Llama-2-7b-chat-hf	awq-hqq-int4	196.6	761.2	5.05	3.87
Llama-2-7b-chat-hf	awq-uint4	43.59	194.93	7.31	4.47
Llama-2-7b-chat-hf	int4wo-hqq	209.19	804.32	4.89	3.84
Llama-2-7b-chat-hf	int4wo-64	201.14	751.42	4.87	3.74

Usage:

import torch
from torchao.quantization import quantize_
from torchao.prototype.awq import insert_awq_observer_, awq_uintx, AWQObservedLinear

quant_dtype = torch.uint4
group_size = 64
calibration_limit = 10
calibration_seq_length = 1024

# model, device, and calibration_data are supplied by the caller
model = model.to(device)
insert_awq_observer_(model, calibration_limit, calibration_seq_length, quant_dtype=quant_dtype, group_size=group_size)

# run calibration batches through the model with the observers inserted
with torch.no_grad():
    for batch in calibration_data:
        model(batch.to(device))

# quantize only the linear layers that were observed
is_observed_linear = lambda m, fqn: isinstance(m, AWQObservedLinear)
quantize_(model, awq_uintx(quant_dtype=quant_dtype, group_size=group_size), is_observed_linear)
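
The idea AWQ builds on, as described in its paper, is that scaling each input channel of the weight by a factor and dividing the matching activation channel by the same factor leaves the layer output unchanged, while changing how much quantization error falls on each channel. The pure-Python sketch below only demonstrates that identity. The helper names and the scale values are hypothetical, and it does not show how the prototype chooses its scales.

```python
def matvec(w, x):
    """Plain matrix-vector product: y[i] = sum_j w[i][j] * x[j]."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def scale_columns(w, s):
    """Multiply each input channel (column) j of w by s[j]."""
    return [[wij * sj for wij, sj in zip(row, s)] for row in w]


w = [[0.5, -2.0], [1.5, 0.25]]
x = [4.0, 0.5]      # channel 0 carries the large activation
s = [2.0, 1.0]      # hypothetical per-input-channel scales

w_scaled = scale_columns(w, s)                  # weights scaled up by s
x_scaled = [xi / si for xi, si in zip(x, s)]    # activations scaled down by s

y_ref = matvec(w, x)
y_awq = matvec(w_scaled, x_scaled)              # same output as y_ref
```

In practice the scaled weights are what get quantized, so channels that see large activations can be given a larger share of the quantization grid.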

New Features

[Prototype] Added Float8 support for AQT tensor parallel (#1003)
Added composable QAT quantizer (#938)
Introduced torchchat quantizer (#897)
Added INT8 mixed-precision training (#748)
Implemented sparse marlin AQT layout (#621)
Added a PerTensor static quant api (#787)
Introduced uintx quant to generate and eval (#811)
Added Float8 Weight Only and FP8 weight + dynamic activation (#740)
Implemented Auto-Round support (#581)
Added 2, 3, 4, 5 bit custom ops (#828)
Introduced symmetric quantization with no clipping error in the tensor subclass based API (#845)
Added int4 weight-only embedding QAT (#947)
Added support for 1-bit and 6-bit quantization for Llama in torchchat (#910, #1007)
Added a linear_observer class for doing static activation calibration (#807)
Exposed hqq through uintx_weight_only API (#786)
Added RowWise scaling option for Float8 dynamic activation quantization (#819)
Added Float8 weight only to autoquant api (#866)

Improvements

Enhanced Auto-Round functionality (#870)
Improved FSDP support for low-bit optimizers (#538)
Added support for using AffineQuantizedTensor with weights_only=True for torch.load (#630)
Optimized 3-bit packing (#1029)
Added more evaluation metrics to llama/eval.sh (#934)
Improved eager numerics for dynamic scales in float8 (#904)

Bug fixes

Fixed inference_mode issues (#885)
Fixed failing FP6 benchmark (#931)
Resolved various issues with float8 support (#918, #923)
Fixed load state dict when device is different for low-bit optim (#1021)

Performance

Added SM75 (Turing) support for FP6 kernel (#942)
Implemented int8 dynamic quant + bsr support (#821)
Added workaround to recover the perf for quantized vit in torch.compile (#926)

INT8 Mixed-Precision Training

On NVIDIA GPUs, INT8 Tensor Cores are approximately 2x faster than their BF16/FP16 counterparts. In mixed-precision training, we can down-cast activations and weights dynamically to INT8 to leverage faster matmuls. However, since INT8 has a very limited range [-128, 127], we perform row-wise quantization, similar to how INT8 post-training quantization (PTQ) is done. The stored weights remain in their original precision.

from torchao.prototype.quantized_training import int8_mixed_precision_training, Int8MixedPrecisionTrainingConfig
from torchao.quantization import quantize_

model = ...

# apply INT8 matmul to all 3 matmuls
quantize_(model, int8_mixed_precision_training())

# customize which matmul is left in original precision.
config = Int8MixedPrecisionTrainingConfig(
    output=True,
    grad_input=True,
    grad_weight=False,
)
quantize_(model, int8_mixed_precision_training(config))
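
To show why row-wise scaling helps with INT8's limited range, here is a minimal pure-Python sketch of symmetric absmax quantization with one scale per row. The function name is hypothetical and the code is illustrative only. It is not the kernel path used by `int8_mixed_precision_training()`.

```python
def quantize_rowwise_int8(matrix):
    """Symmetric absmax quantization with a separate scale for every row."""
    q_rows, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / 127 if amax > 0 else 1.0   # map the row's max magnitude to 127
        q_rows.append([max(-128, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


acts = [
    [0.02, -0.012, 0.005],   # small-magnitude row
    [50.8, -20.0, 12.7],     # large-magnitude row
]
q, scales = quantize_rowwise_int8(acts)
# q == [[127, -76, 32], [127, -50, 32]]

# For comparison: quantizing the small row with the large row's scale
small_row_with_shared_scale = [round(v / scales[1]) for v in acts[0]]
# small_row_with_shared_scale == [0, 0, 0]
```

With per-row scales both rows use the full INT8 range. If the small row had to share the large row's scale, all of its values would round to zero.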

End-to-end speed benchmark using benchmarks/quantized_training/pretrain_llama2.py:

Model & GPU	bs x seq_len	Config	Tok/s	Peak mem (GB)
Llama2-7B, A100	8 x 2048	BF16 (baseline)	~4400	59.69
Llama2-7B, A100	8 x 2048	INT8 mixed-precision	~6100 (+39%)	58.28
Llama2-1B, 4090	16 x 2048	BF16 (baseline)	~17,900	18.23
Llama2-1B, 4090	16 x 2048	INT8 mixed-precision	~30,700 (+72%)	18.34

Docs

Updated README with more current float8 speedup information (#816)
Added tutorial for trainable tensor subclass (#908)
Improved documentation for float8 unification and inference (#895, #896)

Devs

Added compile tests to test suite (#906)
Improved CI setup and build processes (#887)
Added M1 wheel support (#822)
Added more benchmarking and profiling tools (#1017)
Renamed fpx to floatx (#877)
Removed torchao_nightly package (#661)
Added more lint fixes (#827)
Added better subclass testing support (#839)
Added CI to catch syntax errors (#861)
Added tutorial on composing quantized subclass w/ Dtensor based TP (#785)

Security

No significant security updates in this release.

Untopiced

Added basic SAM2 AutomaticMaskGeneration example server (#1039)

New Contributors

@iseeyuan made their first contribution in #805
@YihengBrianWu made their first contribution in #860
@kshitij12345 made their first contribution in #863
@ZainRizvi made their first contribution in #887
@alexsamardzic made their first contribution in #899
@vaishnavi17 made their first contribution in #911
@tobiasvanderwerff made their first contribution in #931
@kwen2501 made their first contribution in #937
@y-sq made their first contribution in #912
@jimexist made their first contribution in #969
@danielpatrickhug made their first contribution in #914
@ramreddymounica made their first contribution in #1007
@yushangdi made their first contribution in #1006
@ringohoffman made their first contribution in #1023

Full Changelog: v0.5.0...v0.6.1
