[Roadmap] FlashInfer v0.2 to v0.3 #675

yzh119 (Collaborator) opened this issue Dec 17, 2024

Milestones

Our tentative roadmap includes the following milestones:

  • SageAttention-2 in FlashAttention-3: Implement SageAttention-2 in the FlashAttention-3 template.
  • Flex-Attention Compatible Interface: Standardize the JIT interface (see the first sketch after this list). @shadowpa0327
  • SM89 Kernel Optimization: Leverage Ada's FP8 Tensor Cores for better performance on the RTX 6000 Ada and RTX 4090.
  • Template Refactoring: Refactor the FA-2 and MLA templates using CuTe.
  • MLA Acceleration: Optimize Multi-head Latent Attention (MLA) with Tensor Core support; follow-up of feat: support MLA decode #551.
  • Triton Porting: Migrate elementwise, normalization, and other kernels that are off the critical path to Triton.
  • API Standardization: Simplify and standardize the attention APIs for better usability.
  • POD-Attention Integration: Implement POD-Attention for more efficient chunked prefill.
  • Nanoflow Parallelism: Expose Python-level APIs for running GEMM and attention on a subset of SMs, which Nanoflow-style parallelism requires; see #591.
  • Fused Tree Speculative Sampling: Follow-up of sampling: fused speculative sampling kernels #259. We should support tree speculative sampling as well; we will port the fused tree-speculative-sampling implementation written by @spectrometerHBH from https://github.com/mlc-ai/mlc-llm to accelerate EAGLE, Medusa, etc. (see the second sketch after this list).
  • Improvements to Existing Top-P/Top-K Sampling Operators: Change the algorithm to guarantee that every sample succeeds within 32 rounds (see the third sketch after this list).
  • PyPI wheels: upload wheels to PyPI (pending issue: PEP 541 Request: flashinfer pypi/support#5355)
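
For context on the Flex-Attention item: the milestone targets the programming model popularized by PyTorch's FlexAttention, where an attention variant is expressed as a small `score_mod` callable that gets compiled into the kernel. Below is a minimal sketch using PyTorch's own API (torch >= 2.5) to show the shape of that interface; the constant ALiBi slope is illustrative only, and none of this is FlashInfer's final interface:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def alibi_bias(score, b, h, q_idx, kv_idx):
    # Fold an ALiBi-style relative-position bias into the raw attention
    # score; real ALiBi uses per-head slopes rather than a constant 0.5.
    return score - 0.5 * (q_idx - kv_idx)

# [batch, heads, seq_len, head_dim]
q, k, v = (torch.randn(1, 8, 128, 64, device="cuda") for _ in range(3))
out = flex_attention(q, k, v, score_mod=alibi_bias)
```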
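On fused tree speculative sampling: a tree verifier generalizes the accept/residual rule of chain speculative sampling to every root-to-leaf path of a draft tree. Here is a minimal NumPy sketch of the chain primitive being generalized; the function name and argument layout are illustrative, not the kernel's API:

```python
import numpy as np

def verify_chain(draft_tokens, q, p, rng=None):
    """Accept/reject a chain of draft tokens.
    draft_tokens: proposed token ids; q[i] / p[i]: draft / target
    distributions at step i (1-D, normalized). Returns the accepted
    prefix, plus one corrected token on the first rejection."""
    rng = rng or np.random.default_rng()
    out = []
    for i, x in enumerate(draft_tokens):
        # Accept draft token x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[i][x] / q[i][x]):
            out.append(x)
            continue
        # Rejected: sample from the residual max(p - q, 0), renormalized,
        # then stop; this keeps the output distributed exactly as p.
        r = np.maximum(p[i] - q[i], 0.0)
        out.append(int(rng.choice(len(r), p=r / r.sum())))
        break
    # (A full implementation also samples one bonus token from the target
    # model when every draft token is accepted.)
    return out
```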
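On the top-p/top-k sampling item: the goal is that the operator never reports failure, i.e. rejection rounds are bounded and an exact path takes over afterwards. A NumPy sketch of that contract follows; the actual GPU kernel organizes the pivot-based rejection differently and avoids the full sort, so everything here is illustrative:

```python
import numpy as np

def top_p_sample_bounded(probs, top_p, max_rounds=32, rng=None):
    """Top-p sampling guaranteed to succeed: pivot-based rejection
    rounds first, an exact sorted fallback afterwards."""
    rng = rng or np.random.default_rng()
    q = probs.copy()
    for _ in range(max_rounds):
        s = int(rng.choice(len(q), p=q / q.sum()))
        # s is inside the nucleus iff the probability mass strictly
        # above it has not already reached top_p.
        if probs[probs > probs[s]].sum() < top_p:
            return s
        # Rejected: raise the pivot by zeroing every token no more
        # probable than s, shrinking the support for the next round.
        q[probs <= probs[s]] = 0.0
    # Exact fallback so the call can never fail: build the nucleus by a
    # full sort and sample from the renormalized top-p mass.
    order = np.argsort(-probs)
    k = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    nucleus = order[:k]
    return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))
```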

We welcome your feedback and suggestions!
Let us know what features you'd like to see in FlashInfer.

@yzh119 yzh119 added the roadmap label Dec 17, 2024
@yzh119 yzh119 pinned this issue Dec 17, 2024