
[RUNTIME] Enabling streaming llm for Runtime #501

Merged: 4 commits into main from lzw/add_n_discard on Oct 19, 2023

Conversation

@zhenwei-intel (Contributor) commented Oct 19, 2023

Type of Change

feature

Description

Support continuous text generation when the context length exceeds ctx_size.

  • n_keep: the number of tokens to keep from the initial prompt
  • n_discard: the number of tokens to discard once the context is full; the remaining tokens are shifted to the start of the current context (see the sketch below)
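
As an illustration, here is a minimal sketch of this eviction policy applied to a plain token list. The evict helper is hypothetical and only mirrors the behavior described above, not the runtime's actual KV-cache code:

```python
def evict(tokens: list, ctx_size: int, n_keep: int, n_discard: int) -> list:
    # Hypothetical illustration of the n_keep/n_discard policy.
    # Once the context is full, keep the first n_keep tokens, drop the
    # next n_discard tokens, and shift the remaining tokens forward.
    if len(tokens) < ctx_size:
        return tokens
    return tokens[:n_keep] + tokens[n_keep + n_discard:]

# evict(list(range(10)), ctx_size=10, n_keep=4, n_discard=1)
# -> [0, 1, 2, 3, 5, 6, 7, 8, 9]
```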

Expected Behavior & Potential Risk

```python
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300, ctx_size=100, n_keep=4, n_discard=1)
```
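
For context, a hedged end-to-end sketch of driving this call. The model id, import paths, and loading details are assumptions about the Python API of the time, not taken from this PR:

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer)  # prints tokens as they are generated
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
# ctx_size caps the KV cache; when it fills up, n_keep initial-prompt tokens
# are preserved and n_discard tokens are evicted so generation can continue.
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300,
                         ctx_size=100, n_keep=4, n_discard=1)
```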

How has this PR been tested?

how to reproduce the test (including hardware information)

Dependency Change?

N/A

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
@a32543254 changed the title from "add n_keep and n_discard for streaming llm" to "[RUNTIME] Enabling streaming llm for Runtime" on Oct 19, 2023
@a32543254 (Contributor) left a comment:

LGTM

@a32543254 (Contributor) commented:

For streaming LLM, based on the paper, we recommend n_keep = 4 and n_discard = -1 to keep relatively good accuracy and performance for effectively infinite inference; see the snippet below.
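
A hedged illustration of that recommendation. The ctx_size and max_new_tokens values are placeholders, and n_discard = -1 is read here as delegating the discard size to the runtime (commonly half of the evictable tokens), which is an assumption rather than something stated in this PR:

```python
# Recommended StreamingLLM settings from the comment above; other
# arguments are illustrative placeholders.
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=2000,
                         ctx_size=512, n_keep=4, n_discard=-1)
```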

@airMeng (Contributor) commented Oct 19, 2023:

I wonder whether this will impact continuous batching, since both n_keep and n_discard determine the size of the KV cache.

@airMeng requested a review from zhentaoyu on October 19, 2023 09:07
@hshen14 (Contributor) commented Oct 19, 2023:

> I wonder whether this will impact continuous batching, since both n_keep and n_discard determine the size of the KV cache.

I asked Zhentao the same question. The answer is no.

@hshen14 merged commit 66238a5 into main on Oct 19, 2023
11 checks passed
@hshen14 deleted the lzw/add_n_discard branch on October 19, 2023 09:17
zhenwei-intel added a commit that referenced this pull request Oct 23, 2023
* Support StreamingLLM on CPU

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
VincyZhang added a commit that referenced this pull request Oct 23, 2023
* [CPP Graph] Opt qbits dequant (#465)

* use INC 2.3.1

Signed-off-by: Wenxin Zhang <wenxin.zhang@intel.com>

* use INC 2.3.1 (#500)

Signed-off-by: Wenxin Zhang <wenxin.zhang@intel.com>

* [RUNTIME] Enabling streaming llm for Runtime (#501)

* Support StreamingLLM on CPU

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* Reduce the UT evaluation time (#498)

Signed-off-by: changwangss <chang1.wang@intel.com>
Signed-off-by: Wenxin Zhang <wenxin.zhang@intel.com>
Signed-off-by: Wang, Chang <chang1.wang@intel.com>
Co-authored-by: Wenxin Zhang <wenxin.zhang@intel.com>

* Minor fix (#507)

* Fix ChatGLM2 model loading issue (#510)

* Fix ChatGLM2 model loading issue

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Remove OneDNN env setting for BF16 inference (#509)

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: VincyZhang <wenxin.zhang@intel.com>

* support Avx2 (#493)

* support Memcpy2D

* support gelu fusion

---------

Co-authored-by: luoyu-intel <yu.luo@intel.com>

* add neuralchat ut for audio util (#466)

* reduce ut time consumption (#499)

Signed-off-by: Xin He <xin3.he@intel.com>

* update python api readme (#504)

* Add docker setup session for neuralchat finetuning sample (#496)

* Update README.md to new added docker setup session

Signed-off-by: Louie Tsai <louie.tsai@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md for fast token issue (#515)

Signed-off-by: Louie Tsai <louie.tsai@intel.com>

* Fix typo in README.md (#516)

convertion -> conversion

Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* improve Avx2  (#511)

* Revert "update python api readme (#504)"

This reverts commit 5f4175a.

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md (#519)

Signed-off-by: ayushrakesh <115995339+ayushrakesh@users.noreply.github.com>

* docs: fix typos in question answering of pytorch (#520)

Signed-off-by: Surav Shrestha <suravshresth@gmail.com>

* fixed typos (#522)

* Updated README.md (#517)

Signed-off-by: Aditya Aryaman Das <128703909+alienishi@users.noreply.github.com>

* update python api readme

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* fix readme

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* Update README.md

Signed-off-by: Dong, Bo <bo1.dong@intel.com>

* Update README.md

Signed-off-by: Dong, Bo <bo1.dong@intel.com>

* Update README.md

Signed-off-by: Dong, Bo <bo1.dong@intel.com>

* Update README.md

Signed-off-by: Dong, Bo <bo1.dong@intel.com>

* Add Data type description
Align Doc and help info

Signed-off-by: Hengyu Meng <hengyu.meng@intel.com>

* align

Signed-off-by: Hengyu Meng <hengyu.meng@intel.com>

* fix eos token id

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

---------

Signed-off-by: Wenxin Zhang <wenxin.zhang@intel.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: changwangss <chang1.wang@intel.com>
Signed-off-by: Wang, Chang <chang1.wang@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: Haihao Shen <haihao.shen@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Louie Tsai <louie.tsai@intel.com>
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
Signed-off-by: ayushrakesh <115995339+ayushrakesh@users.noreply.github.com>
Signed-off-by: Surav Shrestha <suravshresth@gmail.com>
Signed-off-by: Aditya Aryaman Das <128703909+alienishi@users.noreply.github.com>
Signed-off-by: Dong, Bo <bo1.dong@intel.com>
Signed-off-by: Hengyu Meng <hengyu.meng@intel.com>
Co-authored-by: Wang, Zhe <zhe1.wang@intel.com>
Co-authored-by: Wenxin Zhang <wenxin.zhang@intel.com>
Co-authored-by: Wang, Chang <chang1.wang@intel.com>
Co-authored-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: Haihao Shen <haihao.shen@intel.com>
Co-authored-by: yuchengliu1 <yucheng.liu@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Liangyx2 <106130696+Liangyx2@users.noreply.github.com>
Co-authored-by: xinhe <xin3.he@intel.com>
Co-authored-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
Co-authored-by: ayushrakesh <115995339+ayushrakesh@users.noreply.github.com>
Co-authored-by: Surav Shrestha <148626286+shresthasurav@users.noreply.github.com>
Co-authored-by: Smoothieewastaken <86610201+Smoothieewastaken@users.noreply.github.com>
Co-authored-by: Aditya Aryaman Das <128703909+alienishi@users.noreply.github.com>
Co-authored-by: Dong, Bo <bo1.dong@intel.com>
Co-authored-by: Hengyu Meng <hengyu.meng@intel.com>
VincyZhang pushed a commit that referenced this pull request Oct 23, 2023
* Support StreamingLLM on CPU

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>