v1.9

ptrendx released this 16 Aug 16:27

Release Notes – Release 1.9

Key Features and Enhancements

[PyTorch] Added support for sliding window attention in the cuDNN backend.
[PyTorch] Added an experimental torch.nn.Sequential-style API for automatic operation-based fusions.
[C/PyTorch] Added support for a bottom-right-aligned diagonal causal mask.
[C/PyTorch] Added support for grouped GEMM for MoE training.
[JAX] Added support for THD attention format.
[PaddlePaddle] Added support for CUDA graphs.
[PaddlePaddle] Added support for PaddlePaddle versions >= 2.6.1.
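
To illustrate the two attention-mask features listed above, here is a minimal pure-Python sketch of how a bottom-right-aligned causal mask differs from a top-left-aligned one, and how a sliding window narrows it. It does not call Transformer Engine. The names attention_mask, show, and window are hypothetical, and treating window as the number of keys kept before the diagonal is an assumption of this sketch rather than the library's definition.

```python
def attention_mask(seqlen_q, seqlen_k, bottom_right=True, window=None):
    """Return a seqlen_q x seqlen_k grid; True means the query may attend to the key."""
    # Offset of the causal diagonal: 0 for top-left alignment,
    # seqlen_k - seqlen_q for bottom-right alignment.
    shift = seqlen_k - seqlen_q if bottom_right else 0
    mask = []
    for i in range(seqlen_q):
        diag = i + shift  # last key position visible to query i
        row = []
        for j in range(seqlen_k):
            visible = j <= diag
            if window is not None:
                # Assumed convention: keep only `window` keys before the diagonal.
                visible = visible and j >= diag - window
            row.append(visible)
        mask.append(row)
    return mask


def show(mask):
    """Render a mask as text: 'x' = attend, '.' = masked out."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


tl_mask = attention_mask(2, 4, bottom_right=False)
br_mask = attention_mask(2, 4, bottom_right=True)
swa_mask = attention_mask(2, 4, bottom_right=True, window=1)

print(show(tl_mask))   # diagonal starts at the top-left corner
print(show(br_mask))   # diagonal ends at the bottom-right corner
print(show(swa_mask))  # bottom-right diagonal limited to a short window
```

The two alignments coincide when the query and key lengths are equal; they differ only when the lengths differ, as in the 2 x 4 case here.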

Fixed Issues

[PyTorch] Fixed incorrect outputs when handling non-contiguous input tensors.
[PyTorch] Fixed a hang in the initialize_ub function during multi-node runs and made miscellaneous improvements to communication-GEMM overlap with userbuffers.
[PyTorch] Fixed a convergence issue when using CPU offloading.
[PyTorch] Fixed a crash in MoE that occurred when an expert received 0 tokens.
[JAX] Fixed a crash with newer JAX versions, which restrict the output format of HLO lowering.
[PaddlePaddle] Fixed a crash when using the standalone column parallel linear API.
Fixed a numerical bug in the QGeLU activation.
Fixed a compilation bug in the core library with CUDA 12.1.
Fixed a bug in the selection of tuned RMSNorm kernels.
Reduced performance overhead by making fewer calls to the CUDA driver.
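
For context on the MoE fix above, the following pure-Python sketch shows what a grouped GEMM computes and why an expert that receives 0 tokens is an edge case. It is a reference for the semantics only, not the library's implementation: the names matmul, grouped_gemm, and tokens_per_expert are hypothetical, and an optimized grouped GEMM would process the groups together rather than in a Python loop.

```python
def matmul(a, b):
    """Plain (rows x k) @ (k x n) product on nested lists."""
    n = len(b[0])
    return [[sum(x * b[p][c] for p, x in enumerate(row)) for c in range(n)]
            for row in a]


def grouped_gemm(tokens, weights, tokens_per_expert):
    """Multiply the e-th contiguous slice of `tokens` by weights[e]."""
    outputs, start = [], 0
    for weight, count in zip(weights, tokens_per_expert):
        chunk = tokens[start:start + count]
        # An expert with 0 tokens yields an empty chunk and adds no output rows.
        outputs.extend(matmul(chunk, weight))
        start += count
    return outputs


tokens = [[1, 2], [3, 4], [5, 6]]
weights = [
    [[1, 0], [0, 1]],  # expert 0: identity
    [[2, 0], [0, 2]],  # expert 1: scales by 2, but receives no tokens
    [[0, 1], [1, 0]],  # expert 2: swaps the two features
]
tokens_per_expert = [2, 0, 1]

result = grouped_gemm(tokens, weights, tokens_per_expert)
print(result)
```

The output keeps one row per input token, so the zero-token expert changes neither the shape nor the ordering of the result.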

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.
