Releases · facebookresearch/xformers

Profiler: Fixed the computation of FLOPS for attention when using xFormers
Profiler: Fixed the MFU/HFU calculation when multiple dtypes are used
Profiler: Trace analysis to compute MFU & HFU is now much faster
fMHA/splitK: Fixed NaN in the output when using a torch.Tensor bias where many consecutive keys are masked with -inf
Updated Flash-Attention to v2.6.3 when building from source
With the most recent version of Flash-Attention, the flash and cutlass backends can no longer be mixed: it is no longer possible to use the cutlass Fw with the flash Bw.
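
The NaN fix for masked keys listed above points at a general hazard in split-K style softmax: when every key in a chunk is masked with -inf, the chunk maximum is -inf, and `exp(-inf - (-inf))` evaluates to NaN. The sketch below is plain Python, not the xFormers kernel; the function names are hypothetical and only illustrate one way such a chunk can be guarded.

```python
import math

NEG_INF = float("-inf")

def chunk_stats(scores):
    """Return (max, sum of exp(score - max)) for one chunk of scores."""
    m = max(scores)
    if m == NEG_INF:
        # Every key in this chunk is masked. exp(-inf - (-inf)) would be NaN,
        # so report an empty contribution instead.
        return NEG_INF, 0.0
    return m, sum(math.exp(s - m) for s in scores)

def chunked_logsumexp(scores, chunk_size):
    """Combine per-chunk statistics into the global log-sum-exp."""
    stats = [chunk_stats(scores[i:i + chunk_size])
             for i in range(0, len(scores), chunk_size)]
    global_max = max(m for m, _ in stats)
    if global_max == NEG_INF:
        return NEG_INF  # all keys are masked
    # Skip fully masked chunks so they never enter the arithmetic.
    total = sum(s * math.exp(m - global_max) for m, s in stats if m != NEG_INF)
    return global_max + math.log(total)
```

With the guard in place, chunks that contain only masked keys contribute nothing to the result instead of poisoning it.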

Removed

fMHA: Removed decoder and small_k backends
profiler: Removed DetectSlowOpsProfiler profiler
Removed compatibility with PyTorch < 2.4
Removed conda builds for python 3.9
Removed windows pip wheels for cuda 12.1 and 11.8

26 Jul 15:41

v0.0.27.post2

1fc661f

torch.compile support, bug fixes & more

Pre-built binary wheels require PyTorch 2.4.0

Added

fMHA: PagedBlockDiagonalGappyKeysMask
fMHA: heterogeneous queries in triton_splitk
fMHA: support for paged attention in flash
fMHA: Added backwards pass for merge_attentions
fMHA: Added torch.compile support for 3 biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitly (e.g. memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp)))
fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensors. Previously, it would move the bias to the right device.
fMHA: AttentionBias subclasses are now constructed by default on the cuda device if available - they used to be created on the CPU device
2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
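
The Straight Through Estimator entry above can be illustrated without xFormers. In the sketch below, the forward pass zeroes the masked-out values, while the backward pass lets the gradient flow to every element as if the masking were the identity, with a separate scale for kept and masked-out positions. The function and parameter names (`ste_backward`, `scale_kept`, `scale_masked_out`) are hypothetical and are not the `sparsify24_ste` signature.

```python
def masked_forward(values, mask):
    """Forward pass: keep values where mask is True, zero the rest."""
    return [v if keep else 0.0 for v, keep in zip(values, mask)]

def plain_backward(grad_out, mask):
    """Ordinary gradient of the masking op: masked-out values get no gradient."""
    return [g if keep else 0.0 for g, keep in zip(grad_out, mask)]

def ste_backward(grad_out, mask, scale_kept=1.0, scale_masked_out=1.0):
    """Straight-through gradient: pass the gradient to all elements,
    rescaling kept and masked-out positions independently."""
    return [g * (scale_kept if keep else scale_masked_out)
            for g, keep in zip(grad_out, mask)]
```

Setting `scale_masked_out=0.0` in this sketch recovers the ordinary masked gradient, which is one way to see the STE as a relaxation of it.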

Improved

fMHA: Fixed out-of-bounds reading for Split-K triton implementation
Profiler: Fixed a bug with modules that take a single tuple as argument
Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory

Removed

Removed support for PyTorch versions older than 2.2.0

25 Jul 11:59

v0.0.27.post1

b3831ea

torch.compile support, bug fixes & more

Pre-built binary wheels require PyTorch 2.4.0

Added

fMHA: PagedBlockDiagonalGappyKeysMask
fMHA: heterogeneous queries in triton_splitk
fMHA: support for paged attention in flash
fMHA: Added backwards pass for merge_attentions
fMHA: Added torch.compile support for 3 biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitly (e.g. memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp)))
fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensors. Previously, it would move the bias to the right device.
fMHA: AttentionBias subclasses are now constructed by default on the cuda device if available - they used to be created on the CPU device
2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values

Improved

fMHA: Fixed out-of-bounds reading for Split-K triton implementation
Profiler: Fixed a bug with modules that take a single tuple as argument
Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory

Removed

Removed support for PyTorch versions older than 2.2.0

09 Jul 16:35

danthe3rd

v0.0.27

184b280

[v0.0.27] torch.compile support, bug fixes & more

Added

fMHA: PagedBlockDiagonalGappyKeysMask
fMHA: heterogeneous queries in triton_splitk
fMHA: support for paged attention in flash
fMHA: Added backwards pass for merge_attentions
fMHA: Added torch.compile support for 3 biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitly (e.g. memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp)))
fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensors. Previously, it would move the bias to the right device.
fMHA: AttentionBias subclasses are now constructed by default on the cuda device if available - they used to be created on the CPU device
2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values

Improved

fMHA: Fixed out-of-bounds reading for Split-K triton implementation
Profiler: Fixed a bug with modules that take a single tuple as argument
Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory

Removed

Removed support for PyTorch versions older than 2.2.0

29 Apr 14:40

danthe3rd

v0.0.26.post1

fad50d4

2:4 sparsity & `torch.compile`-ing memory_efficient_attention

Pre-built binary wheels require PyTorch 2.3.0

Added

[2:4 sparsity] Added support for Straight-Through Estimator for sparsify24 gradient (GRADIENT_STE)
[2:4 sparsity] sparsify24_like now supports the cuSparseLt backend, and the STE gradient
Basic support for torch.compile for the memory_efficient_attention operator. Currently only supports Flash-Attention, and without any bias provided. We want to expand this coverage progressively.

Improved

merge_attentions no longer needs inputs to be stacked.
fMHA: triton_splitk now supports additive bias
fMHA: benchmark cleanup
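
For readers unfamiliar with what merging attentions involves, the idea is that attention computed separately over disjoint parts of the keys/values can be combined exactly, provided each part also reports the log-sum-exp (LSE) of its scores. The sketch below shows the arithmetic in plain Python for a single query; `attention` and `merge_partial_attentions` are hypothetical helper names, not the xFormers API.

```python
import math

def attention(scores, values):
    """Softmax-weighted average of value vectors, plus the LSE of the scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    out = [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]
    return out, m + math.log(total)

def merge_partial_attentions(outputs, lses):
    """Merge partial attention outputs computed on disjoint key/value chunks.

    Each partial output is weighted by exp(lse_i - lse_total)."""
    m = max(lses)
    weights = [math.exp(lse - m) for lse in lses]
    total = sum(weights)
    dim = len(outputs[0])
    merged = [sum(w * out[d] for w, out in zip(weights, outputs)) / total
              for d in range(dim)]
    return merged, m + math.log(total)
```

Because the partial results are passed as a plain list here, nothing requires them to be stacked into one tensor, which mirrors the spirit of the change noted above.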

29 Mar 14:05

danthe3rd

v0.0.25.post1

7fffd3d

`v0.0.25.post1`: Building binaries for PyTorch 2.2.2

Pre-built binary wheels require PyTorch 2.2.2

31 Jan 08:42

danthe3rd

v0.0.24

f7e46d5

2:4 sparsity, fused sequence parallel, torch compile & more

Pre-built binary wheels require PyTorch 2.2.0

Added

Added components for model/sequence parallelism, as near-drop-in replacements for FairScale/Megatron Column&RowParallelLinear modules. They support fusing communication and computation for sequence parallelism, thus making the communication effectively free.
Added kernels for training models with 2:4 sparsity. We introduced a very fast kernel for converting a matrix A into 2:4-sparse format, which can be used during training to dynamically sparsify weights, activations, etc. xFormers also provides an API that is compatible with torch.compile, see xformers.ops.sparsify24.
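
As a reminder of what the 2:4 pattern means, at most two values out of every four consecutive values are non-zero. The sketch below applies the simplest selection rule, keeping the two largest-magnitude values in each group of four along a row. It is plain Python for illustration only: `sparsify_2_of_4` is a hypothetical name, and the selection rules used by the xFormers kernels may differ.

```python
def sparsify_2_of_4(row):
    """Zero all but the two largest-magnitude values in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two entries with the largest absolute value.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out
```

The fixed "two of four" structure is what lets hardware store only the kept values plus a small index, rather than a general sparse format.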

Improved

Made selective activation checkpointing compatible with torch.compile.

Removed

Triton kernels now require a GPU with compute capability of at least 8.0 (A100 or newer). This is because newer versions of triton do not support older GPUs correctly.
Removed support for PyTorch versions older than 2.1.0

15 Dec 12:14

danthe3rd

v0.0.23.post1

042abc8

Binary builds for PyTorch 2.1.2

Binary wheels and conda binary builds for PyTorch 2.1.2.
Users who need a previous version of PyTorch can either:

Install a previous version of xFormers
Build from source

06 Dec 16:05

danthe3rd

v0.0.23

1254a16

Bugfixes/improvements in `memory_efficient_attention`

Pre-built binary wheels require PyTorch 2.1.1

Fixed

fMHA: Fixed a bug in the cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the backward pass. This would happen with MQA when one sequence has a query with length % 64 == 1
fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes, and now supports BlockDiagonalCausalWithOffsetPaddedKeysMask

Added

fMHA: Added LocalAttentionFromBottomRightMask (local)
fMHA: Added LowerTriangularFromBottomRightMask (causal)
fMHA: Added LowerTriangularFromBottomRightLocalAttentionMask (local + causal)
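
The "from bottom right" masks matter when the number of queries differs from the number of keys. A minimal sketch of the causal case, assuming the alignment is such that the last query can attend to the last key: the function names are hypothetical, and the masks are returned as nested lists of booleans rather than xFormers bias objects.

```python
def causal_from_top_left(num_queries, num_keys):
    """Query q may attend to key k when k <= q (diagonal starts top-left)."""
    return [[k <= q for k in range(num_keys)] for q in range(num_queries)]

def causal_from_bottom_right(num_queries, num_keys):
    """Same triangle, shifted so the last query sees the last key."""
    shift = num_keys - num_queries
    return [[k <= q + shift for k in range(num_keys)] for q in range(num_queries)]
```

When the numbers of queries and keys are equal, the shift is zero and the two functions produce the same mask.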

Removed

Removed xformers.triton.sum_strided

[0.0.28.post1] - 2024-09-13

`0.0.28.post1` - fixing upload for cuda 12.4 wheels

FAv3, profiler update & AMD
