🤖 I have created a release beep boop
0.1.6 (2024-08-27)
SM75 Support
Starting from 0.1.6, our pre-built wheels include experimental support for sm75 (Turing-architecture GPUs such as Tesla T4, Quadro RTX 6000, and RTX 2080).
API Changes

plan/run

Since 0.1.6, the begin_forward/forward/end_forward APIs are replaced with the new plan/run APIs. forward is renamed to run, which is more precise and consistent with the naming convention of cutlass's Python API. begin_forward is renamed to plan, which is consistent with the naming convention of the nvmath API. end_forward is deprecated and has no effect after this release.

There is a slight difference between the old forward and the new run API: causal and logits_soft_cap are now provided in the plan (previously begin_forward) API and cached until the next plan call, so only the query and KV-Cache tensors need to be provided to run.

The old begin_forward/forward/end_forward APIs are still functional, but we will gradually deprecate them in future releases. Check #466 for more details.
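The calling convention above can be sketched schematically. Note this is not flashinfer's implementation, just an illustration of how configuration (causal, logits_soft_cap) moves into plan() and is cached until the next plan() call, while run() takes only tensors; the class and dictionary layout here are hypothetical:

```python
# Schematic sketch of the plan/run convention -- NOT flashinfer code.
class AttentionWrapperSketch:
    def __init__(self):
        self._cfg = None  # configuration cached by plan(), reused by run()

    def plan(self, causal=False, logits_soft_cap=0.0):
        # In 0.1.6+, options like causal/logits_soft_cap are fixed here,
        # rather than passed on every call as in the old forward() API.
        self._cfg = {"causal": causal, "logits_soft_cap": logits_soft_cap}

    def run(self, q, kv):
        # Only the query and KV-Cache tensors are needed per call; the
        # cached plan configuration applies until the next plan().
        if self._cfg is None:
            raise RuntimeError("call plan() before run()")
        return {"q": q, "kv": kv, **self._cfg}

wrapper = AttentionWrapperSketch()
wrapper.plan(causal=True, logits_soft_cap=30.0)
out1 = wrapper.run("q0", "kv0")  # uses the cached causal=True
out2 = wrapper.run("q1", "kv1")  # same cached config, no re-plan needed
```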
MultiLevelCascadeAttentionWrapper

Since 0.1.6, we introduce a new MultiLevelCascadeAttentionWrapper API for cascade inference, which supports multi-level cascade inference where all levels' KV-Cache can be managed in a unified paged KV-Cache. See the documentation and tutorial for API usage and a layout explanation.
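The state-merge idea behind cascade inference can be sketched in numpy: attention over separate KV segments (e.g. a shared-prefix level and a per-request level) can be computed independently and then combined using each segment's log-sum-exp, matching attention over the concatenated KV. This is an illustrative sketch of the general technique, not flashinfer's kernel; the function names are hypothetical:

```python
# Minimal numpy sketch of attention-state merging (single query row).
import numpy as np

def attn_state(q, k, v):
    """Attention output and log-sum-exp (LSE) over one KV segment."""
    s = k @ q                  # attention scores, shape (n,)
    m = s.max()
    p = np.exp(s - m)          # shifted for numerical stability
    lse = m + np.log(p.sum())  # log-sum-exp of the scores
    out = (p @ v) / p.sum()    # normalized attention output
    return out, lse

def merge(out1, lse1, out2, lse2):
    """Combine two partial attention states into one."""
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)
    return (w1 * out1 + w2 * out2) / (w1 + w2)

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
k1, v1 = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
k2, v2 = rng.standard_normal((5, 4)), rng.standard_normal((5, 4))

o1, l1 = attn_state(q, k1, v1)           # e.g. shared-prefix level
o2, l2 = attn_state(q, k2, v2)           # e.g. per-request level
merged = merge(o1, l1, o2, l2)

full, _ = attn_state(q, np.vstack([k1, k2]), np.vstack([v1, v2]))
assert np.allclose(merged, full)  # merging levels == attending to all KV
```

Because the merge is associative, the same reduction extends to any number of cascade levels.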
The old BatchDecodeWithSharedPrefixPagedKVCacheWrapper and BatchPrefillWithSharedPrefixPagedKVCacheWrapper will be deprecated in future releases.

Features
- MultiLevelCascadeAttentionWrapper API (#462) (1e37989)
- Refactor begin_forward/forward/end_forward with plan/run (#466)

Misc
Performance Improvements
Acknowledgement
We thank @LiuXiaoxuanPKU for enhancing the speculative sampling operator, @merrymercy for API change suggestions, and @zhyncs for integrating the fp8 BMM cublas implementation.
This PR was generated with Release Please. See documentation.