- SM89 Kernel Optimization: Leverage Ada FP8 Tensor Cores for better performance on Ada6000 & 4090 (a short FP8 format sketch follows this list).
- Template Refactoring: Refactor the FA-2 and MLA templates using CuTe.
- MLA Acceleration: Optimize Multi-head Latent Attention (MLA) with Tensor Core support, as a follow-up to "feat: support MLA decode" (#551).
- Triton Porting: Migrate elementwise, normalization, and other kernels that are not on the critical path to Triton (see the Triton sketch after this list).
- API Standardization: Simplify and standardize the attention APIs for better usability.
- POD-Attention Integration: Implement POD-Attention to improve the efficiency of chunked prefill.
- Nanoflow Parallelism: Expose Python-level APIs for running GEMM and attention on a subset of SMs, which is required for nanoflow-style parallelism; see #591.
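For the SM89 item, here is a minimal illustration (not FlashInfer code) of the FP8 e4m3 format that Ada tensor cores consume: quantize a half-precision tensor with a per-tensor scale and dequantize it back. The actual SM89 kernels live in CUDA/CUTLASS; this sketch only shows the storage format and scaling convention, and assumes PyTorch >= 2.1 with a CUDA device.

```python
import torch

# Illustrative only: Ada (SM89) tensor cores support FP8 e4m3 / e5m2 inputs.
# 448 is the largest normal value representable in e4m3, so a per-tensor
# scale of max(|x|) / 448 maps the tensor into the representable range.
x = torch.randn(4, 128, device="cuda", dtype=torch.float16)
scale = x.abs().amax() / 448.0
x_fp8 = (x / scale).to(torch.float8_e4m3fn)   # quantize to FP8 storage
x_back = x_fp8.to(torch.float16) * scale      # dequantize for comparison
print("max abs error:", (x - x_back).abs().max().item())
```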
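For the Triton porting item, a minimal sketch (not taken from FlashInfer) of what an off-critical-path elementwise kernel looks like in Triton: a fused add + ReLU with a masked tail block. The kernel and wrapper names here are illustrative, not the kernels being migrated.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                   # one program per block
    add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(8192, device="cuda", dtype=torch.float16)
y = torch.randn_like(x)
torch.testing.assert_close(add_relu(x, y), torch.relu(x + y))
```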
## Milestones
Our tentative roadmap includes the following milestones:
We welcome your feedback and suggestions!
Let us know what features you'd like to see in FlashInfer.