🤖 I have created a release beep boop
0.1.6 (2024-08-27)
SM75 Support
Starting from 0.1.6, our pre-built wheels include experimental support for sm75 (Turing-architecture GPUs such as Tesla T4, Quadro RTX 6000, and RTX 2080).
API Changes

plan/run

Since 0.1.6, the begin_forward/forward/end_forward APIs are replaced with the new plan/run APIs. forward is renamed to run, which is more precise and consistent with the naming convention of cutlass's Python API. begin_forward is renamed to plan, which is consistent with the naming convention of the nvmath API. end_forward is deprecated and has no effect after this release.

There is a slight difference between the old forward and the new run API: causal and logits_soft_cap are now provided in the plan (previously begin_forward) API and cached until the next plan call, so only the query and KV-Cache tensors need to be provided to run.

The old begin_forward/forward/end_forward APIs are still functional, but we will gradually deprecate them in future releases. Check #466 for more details.
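The calling convention above can be sketched schematically. Note this is not flashinfer's implementation, just an illustration of how configuration (causal, logits_soft_cap) moves into plan() and is cached until the next plan() call, while run() takes only tensors; the class and dictionary layout here are hypothetical:

```python
# Schematic sketch of the plan/run convention -- NOT flashinfer code.
class AttentionWrapperSketch:
    def __init__(self):
        self._cfg = None  # configuration cached by plan(), reused by run()

    def plan(self, causal=False, logits_soft_cap=0.0):
        # In 0.1.6+, options like causal/logits_soft_cap are fixed here,
        # rather than passed on every call as in the old forward() API.
        self._cfg = {"causal": causal, "logits_soft_cap": logits_soft_cap}

    def run(self, q, kv):
        # Only the query and KV-Cache tensors are needed per call; the
        # cached plan configuration applies until the next plan().
        if self._cfg is None:
            raise RuntimeError("call plan() before run()")
        return {"q": q, "kv": kv, **self._cfg}

wrapper = AttentionWrapperSketch()
wrapper.plan(causal=True, logits_soft_cap=30.0)
out1 = wrapper.run("q0", "kv0")  # uses the cached causal=True
out2 = wrapper.run("q1", "kv1")  # same cached config, no re-plan needed
```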
MultiLevelCascadeAttentionWrapper

Since 0.1.6, we introduce a new MultiLevelCascadeAttentionWrapper API for cascade inference, which supports multi-level cascade inference where all levels' KV-Cache can be managed in a unified paged KV-Cache. See the documentation and tutorial for API usage and a layout explanation.
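The state-merge idea behind cascade inference can be sketched in numpy: attention over separate KV segments (e.g. a shared-prefix level and a per-request level) can be computed independently and then combined using each segment's log-sum-exp, matching attention over the concatenated KV. This is an illustrative sketch of the general technique, not flashinfer's kernel; the function names are hypothetical:

```python
# Minimal numpy sketch of attention-state merging (single query row).
import numpy as np

def attn_state(q, k, v):
    """Attention output and log-sum-exp (LSE) over one KV segment."""
    s = k @ q                  # attention scores, shape (n,)
    m = s.max()
    p = np.exp(s - m)          # shifted for numerical stability
    lse = m + np.log(p.sum())  # log-sum-exp of the scores
    out = (p @ v) / p.sum()    # normalized attention output
    return out, lse

def merge(out1, lse1, out2, lse2):
    """Combine two partial attention states into one."""
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)
    return (w1 * out1 + w2 * out2) / (w1 + w2)

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
k1, v1 = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
k2, v2 = rng.standard_normal((5, 4)), rng.standard_normal((5, 4))

o1, l1 = attn_state(q, k1, v1)           # e.g. shared-prefix level
o2, l2 = attn_state(q, k2, v2)           # e.g. per-request level
merged = merge(o1, l1, o2, l2)

full, _ = attn_state(q, np.vstack([k1, k2]), np.vstack([v1, v2]))
assert np.allclose(merged, full)  # merging levels == attending to all KV
```

Because the merge is associative, the same reduction extends to any number of cascade levels.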
The old BatchDecodeWithSharedPrefixPagedKVCacheWrapper and BatchPrefillWithSharedPrefixPagedKVCacheWrapper will be deprecated in future releases.

Features
- MultiLevelCascadeAttentionWrapper API (#462) (1e37989)
- Refactor begin_forward/forward/end_forward with plan/run (#466)

Misc
Performance Improvements
Acknowledgement
We thank @LiuXiaoxuanPKU for enhancing the speculative sampling operator, @merrymercy for API change suggestions, and @zhyncs for integrating the fp8 BMM cublas implementation.
This PR was generated with Release Please. See documentation.