
Comparing changes

base repository: flashinfer-ai/flashinfer
base: v0.0.8
head repository: flashinfer-ai/flashinfer
compare: v0.0.9
  • 11 commits
  • 49 files changed
  • 5 contributors

Commits on Jul 3, 2024

  1. Fix doc typo (#357)

    Ying1123 authored Jul 3, 2024
    2e64a65

Commits on Jul 4, 2024

  1. perf: accelerate gqa performance (#356)

    Changes:
    1. Prefetch page indices (this optimization was already applied to the
    decode kernels, but not to the append/prefill kernels used in GQA).
    2. Unlock the 1x4 warp layout introduced in
    #322; it was previously disabled
    because the binary size was too large, and some unnecessary template
    arguments should be further reduced.
    3. Optimize `threadblock_sync_mdo_states` for efficiently merging the
    attention states of multiple warps in a threadblock. The previous
    implementation assumed a small shared memory size and interleaved shared
    memory reads/writes with computation, which is less efficient than bulk
    shared memory access.
    
    After this PR, the GQA kernel execution time (on H100) with
    `batch_size=128, seq_len=1024, num_qo_heads=32, num_kv_heads=4,
    head_dim=128` improved from 133us to 103us.
    yzh119 authored Jul 4, 2024
    e56ddad
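The per-warp reduction that `threadblock_sync_mdo_states` performs can be illustrated with a small sketch. Each partial attention state holds an output `o`, a running max `m`, and a softmax denominator `d`; merging two such states over disjoint KV chunks reproduces attention over their union exactly. The function and variable names below are illustrative, not flashinfer's actual API:

```python
import math

def partial_attention(q, ks, vs):
    """Attention over one KV chunk, returning (o, m, d):
    o = normalized weighted sum of values, m = max score, d = softmax denominator."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    d = sum(weights)
    dim = len(vs[0])
    o = [sum(w * v[j] for w, v in zip(weights, vs)) / d for j in range(dim)]
    return o, m, d

def merge_states(o1, m1, d1, o2, m2, d2):
    """Merge two partial attention states into one -- the same reduction
    performed when syncing per-warp m/d/o through shared memory."""
    m = max(m1, m2)
    w1 = d1 * math.exp(m1 - m)
    w2 = d2 * math.exp(m2 - m)
    d = w1 + w2
    o = [(a * w1 + b * w2) / d for a, b in zip(o1, o2)]
    return o, m, d
```

Because the merge is exact, warps can process KV chunks independently and combine results afterwards; the optimization in this PR only changes how the `m`/`d`/`o` values move through shared memory, not the math.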
  2. 3536198

Commits on Jul 6, 2024

  1. bugfix: check gpu id in PyTorch APIs and use input tensor's gpu default stream (#361)
    
    This PR fixes #349 by using the default stream of the input tensors'
    device instead of the default stream of the default device (which might
    differ from the input tensors' device). This PR also adds a sanity check
    on input tensor device ids (all input tensors must be on the same GPU).
    yzh119 authored Jul 6, 2024
    1b84fab
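The sanity check described above amounts to verifying that every input shares one device before picking that device's stream. A toy sketch of the idea (the `FakeTensor` stand-in and function name are hypothetical; the actual fix lives in flashinfer's C++ bindings):

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    """Stand-in for torch.Tensor, carrying only a device string."""
    device: str

def check_same_gpu(*tensors):
    """Raise unless every input tensor lives on the same GPU;
    return that device so the caller can select its stream."""
    devices = {t.device for t in tensors}
    if len(devices) != 1:
        raise ValueError(
            f"all input tensors must be on the same GPU, got {sorted(devices)}")
    return devices.pop()
```

In the real fix, the kernel is then launched on the default stream of the returned device (something like `at::cuda::getCurrentCUDAStream(device_index)` on the libtorch side) rather than the stream of whatever device happens to be current.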

Commits on Jul 10, 2024

  1. bugfix: fix decode kernels output for empty kv cache (#363)

    When a request has an empty kv cache, the output of the decode kernels
    did not match that of the prefill kernels. This PR fixes the issue.
    
    Thanks @MasterJH5574 for reporting this bug.
    yzh119 authored Jul 10, 2024
    ac72b1c
  2. refactor: slight refactor of prefill kernels (#364)

    - add `__launch_bounds__`
    - add unroll hint for prefetching page indices
    - change loop structure of `threadblock_sync_mdo_states`
    yzh119 authored Jul 10, 2024
    264082e
  3. perf: Optimize tensor conversions in C++ code to avoid unnecessary copies (#366)
    
    Small tweak to avoid unnecessary copying by combining `to` calls.
    Discovered during profiling.
    Yard1 authored Jul 10, 2024
    1116237
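The pattern being fixed is chaining `Tensor::to` calls, which can materialize an intermediate tensor; passing dtype and device in a single `to` call performs at most one copy. A toy model of the copy counts (the `Buf` class is illustrative, not libtorch):

```python
class Buf:
    """Toy tensor that counts how many copies .to() has made."""
    def __init__(self, dtype, device, copies=0):
        self.dtype, self.device, self.copies = dtype, device, copies

    def to(self, dtype=None, device=None):
        dtype = dtype if dtype is not None else self.dtype
        device = device if device is not None else self.device
        if (dtype, device) == (self.dtype, self.device):
            return self  # no-op when nothing changes, as in torch
        return Buf(dtype, device, self.copies + 1)

x = Buf("float32", "cpu")
chained = x.to(dtype="float16").to(device="cuda:0")   # two copies
combined = x.to(dtype="float16", device="cuda:0")     # one copy
```

The same reasoning applies to `torch.Tensor.to` in Python: `x.to(dtype).to(device)` may copy twice, while `x.to(device, dtype)` copies once.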
  4. perf: accelerate alibi (#365)

    Alibi experienced a performance degradation after #262 because of an
    increased number of integer divisions.
    This PR fixes the issue.
    yzh119 authored Jul 10, 2024
    4f0a9f9
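For context, ALiBi adds a per-head linear bias, slope × (key position − query position), to each attention score; the per-element arithmetic is tiny, so extra integer divisions in the inner loop (e.g. recovering head and position indices from a flattened index) show up directly in kernel time. A sketch of the bias itself (helper names are illustrative):

```python
def alibi_slopes(num_heads):
    """Geometric slope schedule from the ALiBi paper: 2^(-8*i/num_heads)
    for head i = 1..num_heads (exact for power-of-two head counts)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_score(raw_score, head, num_heads, q_pos, kv_pos):
    """Attention score with the ALiBi bias applied."""
    return raw_score + alibi_slopes(num_heads)[head] * (kv_pos - q_pos)
```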

Commits on Jul 11, 2024

  1. bugfix: fix the decode kernel segfault in cudagraph mode (#368)

    The `begin_forward` function in decode attention wrappers sometimes
    triggered a segfault; this PR fixes the issue.
    yzh119 authored Jul 11, 2024
    c69cfab

Commits on Jul 12, 2024

  1. refactor: reduce binary size by making kv_layout an argument instead of template parameter (#370)
    
    This PR reduces the binary size by half by moving `kv_layout` from a
    template parameter to an input argument.
    
    This PR also adds `stride_n` and `stride_h` fields to `tensor_info_t`
    and `paged_kv_t`, making it possible to support non-contiguous
    inputs (#311); however, that is left for another PR.
    yzh119 authored Jul 12, 2024
    024a79f
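With `stride_n` and `stride_h` carried at runtime, both KV-cache layouts (and, eventually, non-contiguous inputs) reduce to one addressing formula, so one compiled kernel replaces one instantiation per layout. A sketch of the idea (the stride math is a sketch, not flashinfer's exact structs):

```python
def make_strides(layout, num_heads, page_size, head_dim):
    """Element strides for one page of paged KV cache.
    NHD stores [page_size, num_heads, head_dim];
    HND stores [num_heads, page_size, head_dim]."""
    if layout == "NHD":
        return num_heads * head_dim, head_dim          # (stride_n, stride_h)
    if layout == "HND":
        return head_dim, page_size * head_dim
    raise ValueError(f"unknown layout: {layout}")

def elem_offset(entry, head, dim, stride_n, stride_h):
    """One indexing formula serves every layout once strides are runtime values."""
    return entry * stride_n + head * stride_h + dim
```

When the layout was a template parameter, each branch of `make_strides` corresponded to a separate kernel instantiation; making it a runtime value is what halves the binary size.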
  2. chore(main): release 0.0.9 (#359)

    🤖 I have created a release *beep* *boop*
    ---
    
    
    ## [0.0.9](v0.0.8...v0.0.9) (2024-07-12)
    
    ### Bugfix
    
    * fix the decode kernel segfault in cudagraph mode
    ([#368](https://github.com/flashinfer-ai/flashinfer/pull/368)) ([c69cfa](https://github.com/flashinfer-ai/flashinfer/commit/c69cfabc540e4a7edd991713df10d575ff3b0c21))
    * fix decode kernels output for empty kv cache
    ([#363](https://github.com/flashinfer-ai/flashinfer/pull/363)) ([ac72b1](https://github.com/flashinfer-ai/flashinfer/commit/ac72b1cc14a6474d601f371c8d69e2600ac28d2f))
    * check gpu id in PyTorch APIs and use input tensor's gpu default stream
    ([#361](https://github.com/flashinfer-ai/flashinfer/pull/361)) ([1b84fa](https://github.com/flashinfer-ai/flashinfer/commit/1b84fab3e4f53fb4fa26952fdb46fa8018634057))
    
    ### Performance Improvements
    
    * accelerate alibi
    ([#365](#365))
    ([4f0a9f9](4f0a9f9))
    * accelerate gqa performance
    ([#356](#356))
    ([e56ddad](e56ddad))
    * Optimize tensor conversions in C++ code to avoid unnecessary copies
    ([#366](#366))
    ([1116237](1116237))
    
    ### Acknowledgement
    
    We thank [@Yard1](https://github.com/Yard1),
    [@Ying1123](https://github.com/Ying1123) and
    [@zhyncs](https://github.com/zhyncs) for their contributions.
    
    ---
    This PR was generated with [Release
    Please](https://github.com/googleapis/release-please). See
    [documentation](https://github.com/googleapis/release-please#release-please).
    
    ---------
    
    Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
    Co-authored-by: Zihao Ye <expye@outlook.com>
    github-actions[bot] and yzh119 authored Jul 12, 2024
    17a5f1b