
Comparing changes

base repository: flashinfer-ai/flashinfer
base: v0.0.8
head repository: flashinfer-ai/flashinfer
compare: v0.0.9
  • 11 commits
  • 49 files changed
  • 5 contributors

Commits on Jul 3, 2024

  1. Fix doc typo (#357)

    Ying1123 authored Jul 3, 2024
    2e64a65

Commits on Jul 4, 2024

  1. perf: accelerate gqa performance (#356)

    Changes:
    1. Prefetch page indices (this optimization was already applied to the
    decode kernels, but not to the append/prefill kernels used in GQA).
    2. Unlock the 1x4 warp layout introduced in
    #322; it was previously disabled
    because the binary size was too large, and some unnecessary template
    arguments should be further reduced.
    3. Optimize `threadblock_sync_mdo_states` for efficiently merging the
    attention states of multiple warps in a threadblock. The previous
    implementation assumed a small shared memory size and interleaved shared
    memory reads/writes with computation, which is less efficient than bulk
    shared memory access.
    
    After this PR, the GQA kernel execution time (on H100) with
    `batch_size=128, seq_len=1024, num_qo_heads=32, num_kv_heads=4,
    head_dim=128` improved from 133us to 103us.
    yzh119 authored Jul 4, 2024
    e56ddad
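The per-warp reduction that `threadblock_sync_mdo_states` performs can be illustrated with a small sketch. Each partial attention state holds an output `o`, a running max `m`, and a softmax denominator `d`; merging two such states over disjoint KV chunks reproduces attention over their union exactly. The function and variable names below are illustrative, not flashinfer's actual API:

```python
import math

def partial_attention(q, ks, vs):
    """Attention over one KV chunk, returning (o, m, d):
    o = normalized weighted sum of values, m = max score, d = softmax denominator."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    d = sum(weights)
    dim = len(vs[0])
    o = [sum(w * v[j] for w, v in zip(weights, vs)) / d for j in range(dim)]
    return o, m, d

def merge_states(o1, m1, d1, o2, m2, d2):
    """Merge two partial attention states into one -- the same reduction
    performed when syncing per-warp m/d/o through shared memory."""
    m = max(m1, m2)
    w1 = d1 * math.exp(m1 - m)
    w2 = d2 * math.exp(m2 - m)
    d = w1 + w2
    o = [(a * w1 + b * w2) / d for a, b in zip(o1, o2)]
    return o, m, d
```

Because the merge is exact, warps can process KV chunks independently and combine results afterwards; the optimization in this PR only changes how the `m`/`d`/`o` values move through shared memory, not the math.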
  2. 3536198

Commits on Jul 6, 2024

  1. bugfix: check gpu id in PyTorch APIs and use input tensor's gpu default stream (#361)
    
    This PR fixes #349 by using the default stream of the input tensors'
    device instead of the default stream of the default device (which might
    differ from the input tensors' device). This PR also adds a sanity check
    on input tensor device ids (all input tensors must be on the same GPU).
    yzh119 authored Jul 6, 2024
    1b84fab
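The sanity check described above amounts to verifying that every input shares one device before picking that device's stream. A toy sketch of the idea (the `FakeTensor` stand-in and function name are hypothetical; the actual fix lives in flashinfer's C++ bindings):

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    """Stand-in for torch.Tensor, carrying only a device string."""
    device: str

def check_same_gpu(*tensors):
    """Raise unless every input tensor lives on the same GPU;
    return that device so the caller can select its stream."""
    devices = {t.device for t in tensors}
    if len(devices) != 1:
        raise ValueError(
            f"all input tensors must be on the same GPU, got {sorted(devices)}")
    return devices.pop()
```

In the real fix, the kernel is then launched on the default stream of the returned device (something like `at::cuda::getCurrentCUDAStream(device_index)` on the libtorch side) rather than the stream of whatever device happens to be current.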

Commits on Jul 10, 2024

  1. bugfix: fix decode kernels output for empty kv cache (#363)

    When a request has an empty kv cache, the output of the decode kernels
    did not match that of the prefill kernels. This PR fixes the issue.
    
    Thanks @MasterJH5574 for reporting this bug.
    yzh119 authored Jul 10, 2024
    ac72b1c
  2. refactor: slight refactor of prefill kernels (#364)

    - add `__launch_bounds__`
    - add unroll hint for prefetching page indices
    - change loop structure of `threadblock_sync_mdo_states`
    yzh119 authored Jul 10, 2024
    264082e
  3. perf: Optimize tensor conversions in C++ code to avoid unnecessary copies (#366)
    
    Small tweak to avoid unnecessary copying by combining `to` calls.
    Discovered during profiling.
    Yard1 authored Jul 10, 2024
    1116237
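The pattern being fixed is chaining `Tensor::to` calls, which can materialize an intermediate tensor; passing dtype and device in a single `to` call performs at most one copy. A toy model of the copy counts (the `Buf` class is illustrative, not libtorch):

```python
class Buf:
    """Toy tensor that counts how many copies .to() has made."""
    def __init__(self, dtype, device, copies=0):
        self.dtype, self.device, self.copies = dtype, device, copies

    def to(self, dtype=None, device=None):
        dtype = dtype if dtype is not None else self.dtype
        device = device if device is not None else self.device
        if (dtype, device) == (self.dtype, self.device):
            return self  # no-op when nothing changes, as in torch
        return Buf(dtype, device, self.copies + 1)

x = Buf("float32", "cpu")
chained = x.to(dtype="float16").to(device="cuda:0")   # two copies
combined = x.to(dtype="float16", device="cuda:0")     # one copy
```

The same reasoning applies to `torch.Tensor.to` in Python: `x.to(dtype).to(device)` may copy twice, while `x.to(device, dtype)` copies once.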
  4. perf: accelerate alibi (#365)

    Alibi experienced a performance degradation after #262 because of an
    increased number of integer divisions.
    This PR fixes the issue.
    yzh119 authored Jul 10, 2024
    4f0a9f9
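For context, ALiBi adds a per-head linear bias, slope × (key position − query position), to each attention score; the per-element arithmetic is tiny, so extra integer divisions in the inner loop (e.g. recovering head and position indices from a flattened index) show up directly in kernel time. A sketch of the bias itself (helper names are illustrative):

```python
def alibi_slopes(num_heads):
    """Geometric slope schedule from the ALiBi paper: 2^(-8*i/num_heads)
    for head i = 1..num_heads (exact for power-of-two head counts)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_score(raw_score, head, num_heads, q_pos, kv_pos):
    """Attention score with the ALiBi bias applied."""
    return raw_score + alibi_slopes(num_heads)[head] * (kv_pos - q_pos)
```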

Commits on Jul 11, 2024

  1. bugfix: fix the decode kernel segfault in cudagraph mode (#368)

    The `begin_forward` function in decode attention wrappers sometimes
    triggered a segfault; this PR fixes the issue.
    yzh119 authored Jul 11, 2024
    c69cfab

Commits on Jul 12, 2024

  1. refactor: reduce binary size by making kv_layout an argument instead of template parameter (#370)
    
    This PR reduces the binary size by half by moving `kv_layout` from a
    template parameter to an input argument.
    
    This PR also adds `stride_n` and `stride_h` fields to `tensor_info_t`
    and `paged_kv_t`, making it possible to support non-contiguous
    inputs (#311); however, that is left for another PR.
    yzh119 authored Jul 12, 2024
    024a79f
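With `stride_n` and `stride_h` carried at runtime, both KV-cache layouts (and, eventually, non-contiguous inputs) reduce to one addressing formula, so one compiled kernel replaces one instantiation per layout. A sketch of the idea (the stride math is a sketch, not flashinfer's exact structs):

```python
def make_strides(layout, num_heads, page_size, head_dim):
    """Element strides for one page of paged KV cache.
    NHD stores [page_size, num_heads, head_dim];
    HND stores [num_heads, page_size, head_dim]."""
    if layout == "NHD":
        return num_heads * head_dim, head_dim          # (stride_n, stride_h)
    if layout == "HND":
        return head_dim, page_size * head_dim
    raise ValueError(f"unknown layout: {layout}")

def elem_offset(entry, head, dim, stride_n, stride_h):
    """One indexing formula serves every layout once strides are runtime values."""
    return entry * stride_n + head * stride_h + dim
```

When the layout was a template parameter, each branch of `make_strides` corresponded to a separate kernel instantiation; making it a runtime value is what halves the binary size.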
  2. chore(main): release 0.0.9 (#359)

    🤖 I have created a release *beep* *boop*
    ---
    
    
    ## [0.0.9](v0.0.8...v0.0.9) (2024-07-12)
    
    ### Bugfix
    
    * fix the decode kernel segfault in cudagraph mode
    ([#368](https://github.com/flashinfer-ai/flashinfer/pull/368)) ([c69cfa](https://github.com/flashinfer-ai/flashinfer/commit/c69cfabc540e4a7edd991713df10d575ff3b0c21))
    * fix decode kernels output for empty kv cache
    ([#363](https://github.com/flashinfer-ai/flashinfer/pull/363)) ([ac72b1](https://github.com/flashinfer-ai/flashinfer/commit/ac72b1cc14a6474d601f371c8d69e2600ac28d2f))
    * check gpu id in PyTorch APIs and use input tensor's gpu default stream
    ([#361](https://github.com/flashinfer-ai/flashinfer/pull/361)) ([1b84fa](https://github.com/flashinfer-ai/flashinfer/commit/1b84fab3e4f53fb4fa26952fdb46fa8018634057))
    
    ### Performance Improvements
    
    * accelerate alibi
    ([#365](#365))
    ([4f0a9f9](4f0a9f9))
    * accelerate gqa performance
    ([#356](#356))
    ([e56ddad](e56ddad))
    * Optimize tensor conversions in C++ code to avoid unnecessary copies
    ([#366](#366))
    ([1116237](1116237))
    
    ### Acknowledgement
    
    We thank [@Yard1](https://github.com/Yard1),
    [@Ying1123](https://github.com/Ying1123) and
    [@zhyncs](https://github.com/zhyncs) for their contributions.
    
    ---
    This PR was generated with [Release
    Please](https://github.com/googleapis/release-please). See
    [documentation](https://github.com/googleapis/release-please#release-please).
    
    ---------
    
    Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
    Co-authored-by: Zihao Ye <expye@outlook.com>
    github-actions[bot] and yzh119 authored Jul 12, 2024
    17a5f1b