feat: JIT compilation #507

yzh119 · 2024-09-25T07:03:01Z

This PR implements JIT compilation (#170) for flashinfer. After this PR, flashinfer compiles kernels just-in-time for different input data types and shapes and caches the compiled kernels on disk, instead of pre-compiling a set of kernels into the wheel.

We also provide an AOT mode (installed from https://github.com/flashinfer-ai/flashinfer/tree/main/flashinfer-aot), which pre-compiles a set of flashinfer operators for production environments (see #510). In AOT mode, we use pre-compiled operators whenever possible and JIT-compile only the kernels that are not pre-compiled.
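The lookup order described above (pre-compiled operator first, JIT compilation only on a miss, then reuse of the cached result) can be sketched as follows. This is only an illustration and not flashinfer's actual implementation: `KernelKey`, `KernelRegistry`, and `Compile` are hypothetical names, the "kernel" is a stand-in `std::function`, and the cache lives in memory here, whereas the PR caches compiled kernels on disk.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical key: the properties a kernel is specialized on.
struct KernelKey {
  std::string dtype;
  uint32_t head_dim;
  bool operator<(const KernelKey& other) const {
    return std::tie(dtype, head_dim) < std::tie(other.dtype, other.head_dim);
  }
};

// Stand-in for a compiled operator (assumption: a real one would launch a GPU kernel).
using Kernel = std::function<int(int)>;

class KernelRegistry {
 public:
  explicit KernelRegistry(std::map<KernelKey, Kernel> aot) : aot_(std::move(aot)) {}

  // Prefer a pre-compiled (AOT) kernel; otherwise compile once and cache the result.
  const Kernel& Get(const KernelKey& key) {
    auto aot_it = aot_.find(key);
    if (aot_it != aot_.end()) return aot_it->second;
    auto jit_it = jit_cache_.find(key);
    if (jit_it == jit_cache_.end()) {
      ++num_compiles_;
      jit_it = jit_cache_.emplace(key, Compile(key)).first;
    }
    return jit_it->second;
  }

  int num_compiles() const { return num_compiles_; }

 private:
  // Placeholder for the expensive JIT step; it just bakes head_dim into a lambda.
  static Kernel Compile(const KernelKey& key) {
    assert(key.head_dim > 0);
    int head_dim = static_cast<int>(key.head_dim);
    return [head_dim](int x) { return x * head_dim; };
  }

  std::map<KernelKey, Kernel> aot_;
  std::map<KernelKey, Kernel> jit_cache_;
  int num_compiles_ = 0;
};
```

With a registry built from a map of pre-compiled entries, a request for a key in that map never triggers `Compile`, and repeated requests for any other key trigger it only once.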

Motivation

The pip wheel size is exploding as we add support for more data types, more head dimensions, more attention variants, and more kernel implementations. Pre-compiling everything is not sustainable and impedes development speed.

This PR refactors the codebase to use torch's JIT Compiling Extensions feature instead of pre-compiling kernels into the wheel.

Attention Variants

We learned from FlexAttention and describe every attention variant as a template class. Each instance of the struct can carry closure variables defined in local memory or shared memory. Below are two examples (logits soft cap and ALiBi attention; the programming interface is tentative and will be updated as we improve the programmability of the JIT template):

template <typename ParamsT>
struct LogitsSoftCap {
  using DTypeQ = typename ParamsT::DTypeQ;
  using DTypeKV = typename ParamsT::DTypeKV;
  using DTypeO = typename ParamsT::DTypeO;

  uint32_t qo_len, kv_len;
  uint32_t window_left;

  __device__ __host__ LogitsSoftCap(const ParamsT& params, uint32_t batch_idx, uint8_t* smem_ptr) {
    qo_len = params.get_qo_len(batch_idx);
    kv_len = params.get_kv_len(batch_idx);
    window_left = kv_len;
  }

  template <typename T>
  __device__ __forceinline__ T QueryTransform(const ParamsT& params, T q) {
    return float(q) * params.sm_scale * math::ptx_rcp(params.logits_soft_cap);
  }

  template <typename T>
  __device__ __forceinline__ T LogitsTransform(const ParamsT& params, T logits, uint32_t batch_idx,
                                               uint32_t qo_idx, uint32_t kv_idx,
                                               uint32_t qo_head_idx, uint32_t kv_head_idx) {
    return params.logits_soft_cap * math::log2e * float(math::tanh(logits));
  }

  __device__ __forceinline__ bool LogitsMask(const ParamsT& params, uint32_t batch_idx,
                                             uint32_t qo_idx, uint32_t kv_idx, uint32_t qo_head_idx,
                                             uint32_t kv_head_idx) {
    return true;
  }
};

template <typename ParamsT>
struct ALIBIAttention {
  using DTypeQ = typename ParamsT::DTypeQ;
  using DTypeKV = typename ParamsT::DTypeKV;
  using DTypeO = typename ParamsT::DTypeO;
  using IdType = typename ParamsT::IdType;

  uint32_t qo_len, kv_len;
  uint32_t window_left;

  __device__ __host__ ALIBIAttention(const ParamsT& params, uint32_t batch_idx, uint8_t* smem_ptr) {
    qo_len = params.get_qo_len(batch_idx);
    kv_len = params.get_kv_len(batch_idx);
    window_left = kv_len;
  }

  template <typename T>
  __device__ __forceinline__ T QueryTransform(const ParamsT& params, T q) {
    return float(q) * params.sm_scale * math::log2e;
  }

  template <typename T>
  __device__ __forceinline__ T LogitsTransform(const ParamsT& params, T logits, uint32_t batch_idx,
                                               uint32_t qo_idx, uint32_t kv_idx,
                                               uint32_t qo_head_idx, uint32_t kv_head_idx) {
    return logits + params.alibi_slopes[qo_head_idx] * float(int(kv_idx) - int(qo_idx));
  }

  __device__ __forceinline__ bool LogitsMask(const ParamsT& params, uint32_t batch_idx,
                                             uint32_t qo_idx, uint32_t kv_idx, uint32_t qo_head_idx,
                                             uint32_t kv_head_idx) {
    return true;
  }
};

Users can define their own ParamsT class and variant class to implement their own attention variants. We hope this refactor will make the codebase more concise and extensible.
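As a rough illustration of what a user-defined variant might look like, the sketch below defines a hypothetical `SlidingWindowParams` struct and a `SlidingWindowCausal` variant with the same three hooks as the examples above. Everything in it is an assumption made for illustration: the struct and field names are invented, the hooks are marked `__host__ __device__` (the examples above use `__device__` only) so that a plain host loop can call them, `LogitsMask` returning true is assumed to mean "keep this logit", and `AttentionWeights` is a single-row host reference, not the way flashinfer's kernels invoke a variant.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Let the sketch build with a plain host compiler as well as nvcc.
#ifndef __CUDACC__
#define __device__
#define __host__
#define __forceinline__ inline
#endif

// Hypothetical parameter struct; all field names are illustrative.
struct SlidingWindowParams {
  float sm_scale;
  uint32_t qo_len, kv_len;
  uint32_t window;  // how many keys (including the current one) a query may see
  __host__ __device__ uint32_t get_qo_len(uint32_t) const { return qo_len; }
  __host__ __device__ uint32_t get_kv_len(uint32_t) const { return kv_len; }
};

template <typename ParamsT>
struct SlidingWindowCausal {
  uint32_t qo_len, kv_len;

  __host__ __device__ SlidingWindowCausal(const ParamsT& params, uint32_t batch_idx,
                                          uint8_t* /*smem_ptr*/) {
    qo_len = params.get_qo_len(batch_idx);
    kv_len = params.get_kv_len(batch_idx);
  }

  template <typename T>
  __host__ __device__ __forceinline__ T QueryTransform(const ParamsT& params, T q) {
    return float(q) * params.sm_scale * 1.44269504f;  // scale by log2(e) for a base-2 softmax
  }

  template <typename T>
  __host__ __device__ __forceinline__ T LogitsTransform(const ParamsT&, T logits, uint32_t,
                                                        uint32_t, uint32_t, uint32_t, uint32_t) {
    return logits;  // no bias or capping in this variant
  }

  __host__ __device__ __forceinline__ bool LogitsMask(const ParamsT& params, uint32_t,
                                                      uint32_t qo_idx, uint32_t kv_idx, uint32_t,
                                                      uint32_t) {
    // Align the last query with the last key (assumes kv_len >= qo_len).
    uint32_t q_pos = qo_idx + kv_len - qo_len;
    return kv_idx <= q_pos && q_pos - kv_idx < params.window;
  }
};

// Host reference: softmax weights of one query row, computed through the variant hooks.
template <typename Variant, typename ParamsT>
std::vector<float> AttentionWeights(const ParamsT& params, const std::vector<float>& q,
                                    const std::vector<std::vector<float>>& k, uint32_t qo_idx) {
  assert(!k.empty());
  Variant variant(params, /*batch_idx=*/0, /*smem_ptr=*/nullptr);
  std::vector<float> w(k.size());
  float max_logit = -INFINITY, denom = 0.f;
  for (uint32_t j = 0; j < k.size(); ++j) {
    float s = 0.f;
    for (size_t d = 0; d < q.size(); ++d) s += variant.QueryTransform(params, q[d]) * k[j][d];
    s = variant.LogitsTransform(params, s, 0, qo_idx, j, 0, 0);
    w[j] = variant.LogitsMask(params, 0, qo_idx, j, 0, 0) ? s : -INFINITY;
    max_logit = std::max(max_logit, w[j]);
  }
  for (float& x : w) { x = std::exp2(x - max_logit); denom += x; }
  for (float& x : w) x /= denom;
  return w;
}

// One query over four keys with a window of two: only the last two keys are visible.
std::vector<float> Demo() {
  SlidingWindowParams params{/*sm_scale=*/1.0f, /*qo_len=*/1, /*kv_len=*/4, /*window=*/2};
  std::vector<std::vector<float>> k = {{3.f}, {2.f}, {1.f}, {1.f}};
  return AttentionWeights<SlidingWindowCausal<SlidingWindowParams>>(params, {1.f}, k, 0);
}
```

In `Demo`, the first two keys have the largest dot products but fall outside the window, so they receive zero weight, and the two visible keys have equal logits and split the weight evenly.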

Roadmap

After this PR, we will add support for:

PyPI wheels: Downloadable Package in PyPI #153
fp8 tensor cores attention: Does Flashinfer support 8-bit attention calculation? #502
different head dimensions: [Feature Request] Versatile head dimension #142; [Tentative] Adding 192 head dim (step_size = 12) #454; failed to dispatch head_dim 96 #455
flashattention3: Feature: Flash Attention 3 #369
multi-head latent attention: Support MLA (Multi-Head Latent Attention) in DeepSeek-v2 #237
Generate ParamsT and attention variant descriptions from a Python DSL

The development of these features has been blocked by the wheel size limit (a binary size >= 2GB triggers linking issues). I hope this PR will make development easier in the future.

yzh119 force-pushed the jit branch from 95471d6 to d857705 Compare September 25, 2024 07:11

This was referenced Sep 25, 2024

pytorch 2.5 support #504

Closed

[WIP][AMDGPU] try rocm POC #491

Draft

Does Flashinfer support 8-bit attention calculation? #502

Closed

yzh119 added 26 commits September 25, 2024 08:56

upd

6d333ba

wip

c0797cd

wip

96f34b0

upd

6b721e6

upd

f3bd765

upd

41e810e

fix decode

585a720

upd

319f7f5

bugfix

8254243

upd

2663698

wip

54dfcce

upd

04288c2

fix

b70e3e0

bugfix in prefill

330075a

upd

ab64084

bugfix

b4a2eaf

upd

65c142c

remove unused code

ebfeee0

formatter

029ecbe

rename handler to scheduler

2940033

remove decode/prefill decl

143515a

simplify setup.py

316d423

upd

e890b40

upd

0c4ea17

bugfix

bbdb49b

fix sparse

66296e4

yzh119 added 5 commits September 25, 2024 09:47

upd

645bf18

formatter

2138d57

upd

752047f

bugfix

172950c

formatter

6e5a0a6

danieldk mentioned this pull request Sep 25, 2024

Will AOT compilation still be supported after JIT compilation is added? #510

Closed

yzh119 added 6 commits September 25, 2024 19:13

fix initialization of params

60316f0

rename DTypeOut to DTypeO

f555390

formatter

e2cfa89

remove unused include

b5c36d0

upd

2f5f71e

upd

7eac0ba

yzh119 mentioned this pull request Sep 30, 2024

Feature/non contiguous kv cache #513

Merged

yzh119 added 11 commits October 1, 2024 06:49

upd

a53a3f8

upd fix

upd

78b9678

upd

72a7a1f

upd

768fa2b

upd

922c4aa

load aot ops if existed

47476a7

upd

409f461

upd

502826d

upd

64f1918

tests passed

8bb68de

trailing empty lines

3f42c03

yzh119 merged commit 3613a5b into main Oct 7, 2024

github-actions bot mentioned this pull request Oct 7, 2024

chore(main): release 0.2.0 #476

Merged

yzh119 deleted the jit branch October 11, 2024 23:17

zhyncs mentioned this pull request Oct 31, 2024

Have any plans to optimize the decode kernel for NV-Hopper #576

Open

github-actions bot mentioned this pull request Dec 25, 2024

chore(main): release 0.3.0 #698

Open
