Flash Attention Windows Wheels (Python 3.10)

Pre-built Windows wheels for Flash-Attention 2, the state-of-the-art efficient attention implementation for NVIDIA GPUs.

Overview

This repository was created in response to the numerous challenges Windows users face when trying to build Flash-Attention 2. After experiencing these difficulties firsthand and seeing many similar issues in the official repository (#1340, #1339, #1292), it became clear that a solution was needed to make this cutting-edge attention implementation immediately accessible in Windows environments.

Building Flash-Attention on Windows requires a delicate setup process: specific versions of multiple dependencies, careful environment configuration, and a significant time investment. This repository removes those hurdles by providing ready-to-use pre-built wheels, so Flash-Attention 2 can be installed on Windows with a single pip command instead of a complex, time-consuming build.

Flash Attention 2, originally created by Tri Dao and Dan Fu, delivers breakthrough performance:

  • 20% faster than Flash Attention 1
  • Up to 10x faster than standard attention
  • Up to 20x memory reduction
  • Drop-in replacement for PyTorch's attention (see the sketch after this list)
  • Automatic CUDA kernel optimization
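
The "drop-in replacement" claim can be sanity-checked directly against PyTorch's built-in scaled_dot_product_attention. The sketch below is illustrative only (shapes are arbitrary, the wheel is assumed to be installed on an Ampere-or-newer GPU, and the comparison is limited by fp16 precision); it is not part of the official package:

import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func

# Illustrative shapes: batch=2, seqlen=128, heads=8, head_dim=64.
# flash_attn_func expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on CUDA.
q = torch.randn(2, 128, 8, 64, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out_flash = flash_attn_func(q, k, v)

# PyTorch's built-in attention uses (batch, nheads, seqlen, headdim), so transpose around the call.
out_ref = F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)

# The two outputs should agree to within fp16 rounding error.
print('max abs diff:', (out_flash - out_ref).abs().max().item())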

These wheels are tested and maintained to ensure stable deployment on Windows, saving hours of build configuration and potential compatibility issues. While not officially supported by the Flash-Attention team, they are built following the exact source specifications to maintain the original performance benefits.

Why These Wheels?

  1. Solves Common Build Issues: Addresses widespread Windows build failures documented in multiple GitHub issues
  2. Eliminates Build Complexity: Bypasses the need for precise configuration of Visual Studio, CUDA toolkit, and build environment
  3. Saves Hours of Development Time: Replaces a fragile, time-consuming build process with a simple pip installation for immediate integration into your PyTorch projects
  4. Verified Functionality: Tested against standard benchmarks to ensure performance
  5. Regular Updates: Maintained to keep pace with Flash Attention releases
  6. Community Support: Issues and improvements handled via GitHub collaboration

Note: These wheels are community-maintained and are not officially supported by the Flash-Attention team. They are provided to support the ML community's Windows developers.

Current Release

  • Flash Attention Version: 2.7.0.post2
  • Python Version: 3.10
  • Platform: Windows 10/11 (64-bit)
  • Build Date: November 2024

Requirements

  • Windows 10/11 (64-bit)
  • Python 3.10
  • CUDA Toolkit 11.7+
  • NVIDIA GPU with Compute Capability 8.0+. Compatible with Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100).
  • PyTorch 2.0.0+
  • Minimum 8GB GPU VRAM recommended
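
A quick way to confirm that an environment meets these requirements is the short check below (a minimal sketch using only standard PyTorch APIs; the thresholds mirror the list above):

import sys
import torch

print('Python:', sys.version.split()[0])       # should be 3.10.x for this wheel
print('PyTorch:', torch.__version__)           # 2.0.0 or newer
print('CUDA (torch build):', torch.version.cuda)

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print('GPU:', torch.cuda.get_device_name(0))
    print(f'Compute capability: {major}.{minor} (needs 8.0+)')
else:
    print('No CUDA device detected.')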

Quick Installation

# Simply download the wheel file and install with:
pip install flash_attn-2.7.0.post2-cp310-cp310-win_amd64.whl

Verification

try:
    import torch
    import flash_attn
    from flash_attn import flash_attn_func
    
    # Verify version
    print(f"Flash Attention version: {flash_attn.__version__}")
    
    # Basic functionality test
    if torch.cuda.is_available():
        # FlashAttention kernels require fp16 or bf16 inputs on a CUDA device.
        q = torch.randn(2, 8, 32, 64, device='cuda', dtype=torch.float16)
        k = torch.randn(2, 8, 32, 64, device='cuda', dtype=torch.float16)
        v = torch.randn(2, 8, 32, 64, device='cuda', dtype=torch.float16)
        
        output = flash_attn_func(q, k, v)
        print("Flash Attention test successful!")
    else:
        print("CUDA device not available!")
except ImportError as e:
    print(f"Import Error: {e}")
except RuntimeError as e:
    print(f"Runtime Error: {e}")

Known Issues

  • Wheels are currently available only for Python 3.10. Python 3.12 support is on the roadmap.

Building New Wheels

Prerequisites

  • Visual Studio 2019 with C++ build tools
  • CUDA Toolkit 12.4
  • Python 3.10 development environment
  • Administrator privileges

Build Steps

  1. Prepare Environment
# Install build dependencies
pip install ninja packaging

# Set environment variables (PowerShell)
$env:FLASH_ATTENTION_FORCE_BUILD="TRUE"
$env:CUDA_HOME="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4"
  2. Build Process
# Remove existing installation
pip uninstall flash-attn -y

# Install with build flag
pip install flash-attn==2.7.0.post2 --no-build-isolation

Build Configuration

Variable                      Description           Default
FLASH_ATTENTION_FORCE_BUILD   Forces wheel rebuild  FALSE
CUDA_HOME                     CUDA toolkit path     System CUDA path
MAX_JOBS*                     Build parallelism     CPU count
  • If your machine has less than 96GB of RAM and many CPU cores, ninja may launch too many parallel compilation jobs and exhaust available RAM. To limit parallelism, set the MAX_JOBS environment variable (try 4 to start with).

Troubleshooting

Common Issues

  1. Installation Failures
    • Verify CUDA installation
    • Check the active Python version (python --version)
    • Confirm VS2019 installation
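
For the first and third items, the small diagnostic below can help. It is a sketch only; it assumes CUDA_HOME has been set as in the build steps and that cl.exe is reachable (normally only inside a Visual Studio developer prompt):

import os
import shutil

# CUDA toolkit: the build expects CUDA_HOME to point at the installed toolkit.
print('CUDA_HOME:', os.environ.get('CUDA_HOME', '<not set>'))
print('nvcc on PATH:', shutil.which('nvcc') or '<not found>')

# Visual Studio C++ build tools: cl.exe is usually only on PATH in a VS developer prompt.
print('cl.exe on PATH:', shutil.which('cl') or '<not found>')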

Contributing

Contributions welcome in these areas:

  • Documentation improvements
  • Build process optimization
  • Wheels for other versions of Python and Flash Attention.

License

Distributed under the same license as Flash Attention. See Flash Attention License.

Support

  • Star this repository
  • Report issues
  • Submit pull requests (wheels for other Python versions or new versions of Flash Attention)
  • Share with others

Security

Verify downloaded wheel checksums:

# Generate checksum (PowerShell)
Get-FileHash flash_attn-2.7.0.post2-cp310-cp310-win_amd64.whl -Algorithm SHA256

Compare the output with the expected value:

15e0c4af6349b66c1003bf8541487636aca0a6ad81d6593d6711409983fd616c flash_attn-2.7.0.post2-cp310-cp310-win_amd64.whl
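
The same check can be done in Python if PowerShell is not convenient (a minimal hashlib sketch; the wheel path is wherever you downloaded the file, and the expected digest is the one published above):

import hashlib

wheel_path = 'flash_attn-2.7.0.post2-cp310-cp310-win_amd64.whl'  # local download path
expected = '15e0c4af6349b66c1003bf8541487636aca0a6ad81d6593d6711409983fd616c'

sha256 = hashlib.sha256()
with open(wheel_path, 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):
        sha256.update(chunk)

print('OK' if sha256.hexdigest() == expected else 'CHECKSUM MISMATCH - do not install')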
