chore(main): release 0.1.6 #447

Merged
merged 3 commits into from
Aug 27, 2024

github-actions[bot]
Contributor

@github-actions github-actions bot commented Aug 14, 2024

🤖 I have created a release beep boop

0.1.6 (2024-08-27)

SM75 Support

Starting from 0.1.6, our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000 and RTX 2080).

API Changes

plan/run

Starting from 0.1.6, the begin_forward/forward/end_forward APIs are replaced by the new plan/run API.

  • forward is renamed to run, which is more precise and consistent with the naming convention of CUTLASS's Python API.
  • begin_forward is renamed to plan, which is consistent with the naming convention of the nvmath API.
  • end_forward is deprecated and has no effect after this PR.

There is one slight difference between the old forward and the new run API:

  • All extra arguments, such as causal and logits_soft_cap, are now passed to plan (previously begin_forward) and cached until the next plan call, so run only needs the query and KV-Cache tensors.

The old begin_forward/forward/end_forward APIs are still functional, but we will gradually deprecate them in future releases.

Check #466 for more details.
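
The plan/run split described above can be illustrated with a small pure-Python sketch. This is not FlashInfer's implementation or its real signatures: `ToyAttentionWrapper`, its argument lists, and the single-head list-based attention are all hypothetical stand-ins chosen only to show the calling pattern (configure once in plan, then pass just the tensors to run). See #466 for the actual API.

```python
import math


class ToyAttentionWrapper:
    """Hypothetical stand-in that mimics the plan/run calling pattern only."""

    def __init__(self):
        self._plan = None

    def plan(self, causal=False, logits_soft_cap=0.0):
        # Extra arguments are cached here until the next plan() call.
        self._plan = {"causal": causal, "logits_soft_cap": logits_soft_cap}

    def run(self, q, k, v):
        # Only the query and key/value data are passed at run time.
        # Assumes len(k) >= len(q) so every query sees at least one key.
        if self._plan is None:
            raise RuntimeError("call plan() before run()")
        causal = self._plan["causal"]
        cap = self._plan["logits_soft_cap"]
        scale = 1.0 / math.sqrt(len(q[0]))
        offset = len(k) - len(q)  # align the last query with the last key
        out = []
        for i, qi in enumerate(q):
            logits = []
            for j, kj in enumerate(k):
                if causal and j > i + offset:
                    continue  # masked out by the cached causal flag
                s = scale * sum(a * b for a, b in zip(qi, kj))
                if cap > 0.0:
                    s = cap * math.tanh(s / cap)  # soft-cap the logit
                logits.append((j, s))
            m = max(s for _, s in logits)
            weights = [(j, math.exp(s - m)) for j, s in logits]
            denom = sum(w for _, w in weights)
            out.append(
                [sum(w * v[j][d] for j, w in weights) / denom
                 for d in range(len(v[0]))]
            )
        return out

    # Legacy names kept as thin aliases, mirroring the transition period.
    def begin_forward(self, *args, **kwargs):
        return self.plan(*args, **kwargs)

    def forward(self, q, k, v):
        return self.run(q, k, v)

    def end_forward(self):
        pass  # no effect


wrapper = ToyAttentionWrapper()
wrapper.plan(causal=True)  # configure once
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = wrapper.run(q, k, v)  # reuse the cached plan on every call
```

Because the configuration lives in the plan, repeated run calls (for example, one per layer) do not need to repeat it.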

MultiLevelCascadeAttentionWrapper

Starting from 0.1.6, we introduce a new MultiLevelCascadeAttentionWrapper API for cascade inference, which supports multi-level cascade inference where the KV-Cache of all levels can be managed in a unified Paged KV-Cache.

See the documentation and tutorial for API usage and an explanation of the layout.

The old BatchDecodeWithSharedPrefixPagedKVCacheWrapper and BatchPrefillWithSharedPrefixPagedKVCacheWrapper will be deprecated in future releases.
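
For background, cascade inference generally computes attention over each level of the KV-Cache separately (for example, a shared prefix and a per-request suffix) and then merges the partial results. The sketch below shows that merge step in plain Python under stated assumptions: `attention_state` and `merge_states` are hypothetical helper names, the log-sum-exp uses the natural logarithm, and none of this reflects the wrapper's actual signature or kernels.

```python
import math


def attention_state(q, keys, values):
    """Return (output, lse) for one query over one chunk of keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(s - m) for s in logits]
    denom = sum(weights)
    out = [
        sum(w * v[d] for w, v in zip(weights, values)) / denom
        for d in range(len(values[0]))
    ]
    # lse is the log of the softmax denominator (natural log assumed here).
    return out, m + math.log(denom)


def merge_states(state_a, state_b):
    """Combine two partial attention states into the state over both chunks."""
    out_a, lse_a = state_a
    out_b, lse_b = state_b
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    total = w_a + w_b
    merged = [(w_a * a + w_b * b) / total for a, b in zip(out_a, out_b)]
    return merged, m + math.log(total)


q = [0.5, -1.0]
shared_k = [[1.0, 0.0], [0.0, 1.0]]   # level 0: shared prefix
shared_v = [[1.0, 2.0], [3.0, 4.0]]
unique_k = [[1.0, 1.0]]               # level 1: per-request suffix
unique_v = [[5.0, 6.0]]

merged, merged_lse = merge_states(
    attention_state(q, shared_k, shared_v),
    attention_state(q, unique_k, unique_v),
)
full, full_lse = attention_state(q, shared_k + unique_k, shared_v + unique_v)
```

Merging the two per-level states gives the same result as attending over the concatenated keys and values, which is why the levels can be processed independently.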

Features

Refactor

  • refactor: replace begin_forward/forward/end_forward with plan/run #466

Misc

  • misc: improve error handling of sampling kernels (#456) (0dce178)

Performance Improvements

  • slight optimization on f16->f8 fragment layout swizzling (#453) (0d61871)
  • slight optimization on fragment layout swizzle (#458) (7c397cb)
  • use persistent kernel for merging attention states (#459) (be6bf5b)

Acknowledgement

We thank @LiuXiaoxuanPKU for enhancing the speculative sampling operator, @merrymercy for suggesting the API change, and @zhyncs for integrating the fp8 BMM cuBLAS implementation.


This PR was generated with Release Please. See documentation.

@github-actions github-actions bot force-pushed the release-please--branches--main branch from 0e67173 to 1d40f4c Compare August 16, 2024 00:56
@github-actions github-actions bot changed the title chore(main): release 0.1.6 chore(main): release 0.2.0 Aug 17, 2024
@github-actions github-actions bot force-pushed the release-please--branches--main branch 10 times, most recently from d77d9a3 to f9d875e Compare August 24, 2024 03:18
@github-actions github-actions bot force-pushed the release-please--branches--main branch 3 times, most recently from e83fee8 to ca9f399 Compare August 26, 2024 19:32
@github-actions github-actions bot force-pushed the release-please--branches--main branch from ca9f399 to d214c64 Compare August 27, 2024 00:27
@yzh119 yzh119 changed the title chore(main): release 0.2.0 chore(main): release 0.1.6 Aug 27, 2024
@yzh119 yzh119 merged commit a23979b into main Aug 27, 2024
Contributor Author

🤖 Created releases:

@yzh119 yzh119 deleted the release-please--branches--main branch August 27, 2024 04:52