Skip to content

Releases: facebookresearch/xformers

`0.0.28.post1` - fixing upload for cuda 12.4 wheels

13 Sep 15:52
Compare
Choose a tag to compare

[0.0.28.post1] - 2024-09-13

Properly upload wheels for cuda 12.4

FAv3, profiler update & AMD

12 Sep 15:49
Compare
Choose a tag to compare

Pre-built binary wheels require PyTorch 2.4.1

Added

  • Added wheels for cuda 12.4
  • Added conda builds for python 3.11
  • Added wheels for rocm 6.1

Improved

  • Profiler: Fix computation of FLOPS for the attention when using xFormers
  • Profiler: Fix MFU/HFU calculation when multiple dtypes are used
  • Profiler: Trace analysis to compute MFU & HFU is now much faster
  • fMHA/splitK: Fixed nan in the output when using a torch.Tensor bias where a lot of consecutive keys are masked with -inf
  • Update Flash-Attention version to v2.6.3 when building from scratch
  • When using the most recent version of Flash-Attention, it is no longer possible to mix it with the cutlass backend. In other words, it is no longer possible to use the cutlass Fw with the flash Bw.

Removed

  • fMHA: Removed decoder and small_k backends
  • profiler: Removed DetectSlowOpsProfiler profiler
  • Removed compatibility with PyTorch < 2.4
  • Removed conda builds for python 3.9
  • Removed windows pip wheels for cuda 12.1 and 11.8

torch.compile support, bug fixes & more

26 Jul 15:41
@lw lw
Compare
Choose a tag to compare

Pre-built binary wheels require PyTorch 2.4.0

Added

  • fMHA: PagedBlockDiagonalGappyKeysMask
  • fMHA: heterogeneous queries in triton_splitk
  • fMHA: support for paged attention in flash
  • fMHA: Added backwards pass for merge_attentions
  • fMHA: Added torch.compile support for 3 biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
  • fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitely (eg memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp)))
  • fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensor. Previously, it would convert the bias to the right device.
  • fMHA: AttentionBias subclasses are now constructed by default on the cuda device if available - they used to be created on the CPU device
  • 2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values

Improved

  • fMHA: Fixed out-of-bounds reading for Split-K triton implementation
  • Profiler: fix bug with modules that take a single tuple as argument
  • Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory

Removed

  • Removed support for PyTorch version older than 2.2.0

torch.compile support, bug fixes & more

25 Jul 11:59
@lw lw
Compare
Choose a tag to compare

Pre-built binary wheels require PyTorch 2.4.0

Added

  • fMHA: PagedBlockDiagonalGappyKeysMask
  • fMHA: heterogeneous queries in triton_splitk
  • fMHA: support for paged attention in flash
  • fMHA: Added backwards pass for merge_attentions
  • fMHA: Added torch.compile support for 3 biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
  • fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitely (eg memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp)))
  • fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensor. Previously, it would convert the bias to the right device.
  • fMHA: AttentionBias subclasses are now constructed by default on the cuda device if available - they used to be created on the CPU device
  • 2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values

Improved

  • fMHA: Fixed out-of-bounds reading for Split-K triton implementation
  • Profiler: fix bug with modules that take a single tuple as argument
  • Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory

Removed

  • Removed support for PyTorch version older than 2.2.0

[v0.0.27] torch.compile support, bug fixes & more

09 Jul 16:35
Compare
Choose a tag to compare

Added

  • fMHA: PagedBlockDiagonalGappyKeysMask
  • fMHA: heterogeneous queries in triton_splitk
  • fMHA: support for paged attention in flash
  • fMHA: Added backwards pass for merge_attentions
  • fMHA: Added torch.compile support for 3 biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
  • fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitely (eg memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp)))
  • fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensor. Previously, it would convert the bias to the right device.
  • fMHA: AttentionBias subclasses are now constructed by default on the cuda device if available - they used to be created on the CPU device
  • 2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values

Improved

  • fMHA: Fixed out-of-bounds reading for Split-K triton implementation
  • Profiler: fix bug with modules that take a single tuple as argument
  • Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory

Removed

  • Removed support for PyTorch version older than 2.2.0

2:4 sparsity & `torch.compile`-ing memory_efficient_attention

29 Apr 14:40
Compare
Choose a tag to compare

Pre-built binary wheels require PyTorch 2.3.0

Added

  • [2:4 sparsity] Added support for Straight-Through Estimator for sparsify24 gradient (GRADIENT_STE)
  • [2:4 sparsity] sparsify24_like now supports the cuSparseLt backend, and the STE gradient
  • Basic support for torch.compile for the memory_efficient_attention operator. Currently only supports Flash-Attention, and without any bias provided. We want to expand this coverage progressively.

Improved

  • merge_attentions no longer needs inputs to be stacked.
  • fMHA: triton_splitk now supports additive bias
  • fMHA: benchmark cleanup

`v0.0.25.post1`: Building binaries for PyTorch 2.2.2

29 Mar 14:05
7fffd3d
Compare
Choose a tag to compare

Pre-built binary wheels require PyTorch 2.2.2

2:4 sparsity, fused sequence parallel, torch compile & more

31 Jan 08:42
Compare
Choose a tag to compare

Pre-built binary wheels require PyTorch 2.2.0

Added

  • Added components for model/sequence parallelism, as near-drop-in replacements for FairScale/Megatron Column&RowParallelLinear modules. They support fusing communication and computation for sequence parallelism, thus making the communication effectively free.
  • Added kernels for training models with 2:4-sparsity. We introduced a very fast kernel for converting a matrix A into 24-sparse format, which can be used during training to sparsify weights dynamically, activations etc... xFormers also provides an API that is compatible with torch-compile, see xformers.ops.sparsify24.

Improved

  • Make selective activation checkpointing be compatible with torch.compile.

Removed

  • Triton kernels now require a GPU with compute capability 8.0 at least (A100 or newer). This is due to newer versions of triton not supporting older GPUs correctly
  • Removed support for PyTorch version older than 2.1.0

Binary builds for PyTorch 2.1.2

15 Dec 12:14
Compare
Choose a tag to compare

Binary wheels and conda binary builds for PyTorch 2.1.2.
For users who need to use a previous version of PyTorch, they can either:

  • Install a previous version of xFormers
  • Build from source

Bugfixes/improvements in `memory_efficient_attention`

06 Dec 16:05
Compare
Choose a tag to compare

Pre-built binary wheels require PyTorch 2.1.1

Fixed

  • fMHA: Fixed a bug in cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the BW pass. This would happen with MQA when one sequence has a query with length%64 == 1
  • fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes, and now supports BlockDiagonalCausalWithOffsetPaddedKeysMask

Added

  • fMHA: Added LocalAttentionFromBottomRightMask (local)
  • fMHA: Added LowerTriangularFromBottomRightMask (causal)
  • fMHA: Added LowerTriangularFromBottomRightLocalAttentionMask (local + causal)

Removed

  • Removed xformers.triton.sum_strided