Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats #3591

MatrixAssembler · 2025-01-20T22:47:50Z

Performance Issue with Current BF16 and mixed TileShape Configuration

The current FBGEMM bf16 kernel uses a TileShape configuration of 128x128x128,
while the optimal shape for dense bf16 tensor core on H100 is m64n256k16.
The current configuration leads to suboptimal performance for tensor cores and bandwidth usage,
as evidenced by PTX warnings about:
'wgmma.mma_async instruction serialization due to insufficient register resources'

Optimized TileShape (128x256x64) Implementation

Modification of the TileShape configuration from 128x128x128 to 128x256x64 for large GEMM
operations using a cooperative kernel, enabling optimal bandwidth and tensor cores utilization.
This configuration is notably used in Flash Attention V3 and identified by Colfax-intl
as the optimal configuration after empirical study for bf16 kernels.

Benchmark Results on H100 GPU

Benchmark configuration:

PyTorch 2.6
CUDA 12.4
CPU: AMD EPYC
GPU: NVIDIA H100
Benchmarks are configured with 30 kernel launch iterations
and averaged over 25 Benchmark calculations.

Benchmark

bf16bf16bf16_grouped

TileShape	TFlops
128-128-128	606
128-256- 64	790

bf16i4bf16_rowwise_batched

TileShape	TFlops bf16*	TFlops fp16*	TFlops float*
128-128-128	354	341	383
128-256- 64	704	727	763

bf16i4bf16_rowwise

TileShape	TFlops bf16*	TFlops fp16*	TFlops float*
128-128-128	349	351	381
128-256- 64	652	663	693

f8i4bf16_rowwise

TileShape	TFlops bf16*	TFlops fp16*	TFlops float*
128-128-128	407	542	606
128-256- 64	921	942	1088

*WEIGHT_SCALE_DTYPE

Technical Implementation

Modified TileShape from 128-128-128 to 128-256-64 for:

bf16bf16bf16_grouped
bf16i4bf16_rowwise_batched
bf16i4bf16_rowwise
f8i4bf16_rowwise

Added cooperative kernel by default for:

bf16i4bf16_rowwise_batched
bf16i4bf16_rowwise
f8i4bf16_rowwise

The modifications only affect large mode and Default kernels where N > 128.
These changes were made by modifying the minimum necessary code while respecting
existing coding practices in FBGEMM.

Test Coverage

Unit Tests Results

The unit tests in fbgemm_gpu/experimental/gen_ai/test/quantize
have been verified for the modified kernels.

- Change TileShape from 128x128x128 to 128x256x64 - Optimize bf16 and mixed format kernels - Add cooperative kernel by default for mixed kernels

netlify · 2025-01-20T22:48:11Z

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`dd31dda`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/678ed298fb101a0008e64b0d
😎 Deploy Preview	https://deploy-preview-3591--pytorch-fbgemm-docs.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

perf: optimize TileShape configuration for bf16 and mixed

dd31dda

- Change TileShape from 128x128x128 to 128x256x64 - Optimize bf16 and mixed format kernels - Add cooperative kernel by default for mixed kernels

facebook-github-bot added the cla signed label Jan 20, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats #3591

Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats #3591

MatrixAssembler commented Jan 20, 2025

netlify bot commented Jan 20, 2025 •

edited

Loading

Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats #3591

Are you sure you want to change the base?

Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats #3591

Conversation

MatrixAssembler commented Jan 20, 2025

Performance Issue with Current BF16 and mixed TileShape Configuration

Optimized TileShape (128x256x64) Implementation

Benchmark Results on H100 GPU

Benchmark configuration:

Benchmark

bf16bf16bf16_grouped

bf16i4bf16_rowwise_batched

bf16i4bf16_rowwise

f8i4bf16_rowwise

Technical Implementation

Test Coverage

Unit Tests Results

netlify bot commented Jan 20, 2025 • edited Loading

✅ Deploy Preview for pytorch-fbgemm-docs ready!

netlify bot commented Jan 20, 2025 •

edited

Loading