This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

support Avx2 #493

Merged
merged 11 commits into from
Oct 20, 2023

Conversation

yuchengliu1
Contributor

@yuchengliu1 yuchengliu1 commented Oct 18, 2023

Type of Change

feature
No API changed

Run LLMs on client CPUs with AVX2 (without AVX512)

Detailed description
JIRA ticket: xxx

Expected Behavior & Potential Risk

CPU: i7-9850H@2.6GHz
memory: single channel 32GB@2666MHz (memory bandwidth 21.3GB/s)
compute_type is FP32
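The quoted 21.3 GB/s figure can be sanity-checked from the DDR4 numbers above (assuming a single 64-bit channel):

```python
# Single-channel DDR4-2666: 2666 MT/s over a 64-bit (8-byte) bus.
transfers_per_s = 2666e6
bytes_per_transfer = 8
bandwidth_gb_s = transfers_per_s * bytes_per_transfer / 1e9
print(f"{bandwidth_gb_s:.1f} GB/s")  # matches the ~21.3 GB/s quoted above
```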

 

| model | first token | next token |
| --- | --- | --- |
| gptj-6B_q4j_b128 | 382.57 ms (63.76 ms per token) | 220.18 ms per token |
| llama-7B_q4j_b128 | 515.35 ms (73.62 ms per token) | 250.71 ms per token |
| llama2-7B_q4j_b128 | 518.14 ms (74.02 ms per token) | 253.23 ms per token |

 

How has this PR been tested?

how to reproduce the test (including hardware information)

Dependency Change?

any library dependency introduced or removed

@yuchengliu1 yuchengliu1 requested a review from airMeng as a code owner October 18, 2023 09:02
@yuchengliu1 yuchengliu1 mentioned this pull request Oct 18, 2023
@airMeng
Contributor

airMeng commented Oct 19, 2023

I think you already removed all the warnings in this PR?

@airMeng
Contributor

airMeng commented Oct 19, 2023

Tested on i7-9850H@2.6GHz:

 

| model | first token | next token |
| --- | --- | --- |
| gptj-6B_q4j_b128 | 382.57 ms (63.76 ms per token) | 220.18 ms per token |
| llama-7B_q4j_b128 | 515.35 ms (73.62 ms per token) | 250.71 ms per token |
| llama2-7B_q4j_b128 | 518.14 ms (74.02 ms per token) | 253.23 ms per token |

Can you provide a memory bandwidth comparison between client CPUs and SPR? It's useful for judging whether the gap is meaningful.
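A rough roofline-style sketch of why bandwidth matters here (assumptions: next-token decode is bound by streaming the quantized weights once per token; the ~4 GB weight size and the SPR-class bandwidth figure are illustrative, not measurements):

```python
def min_next_token_ms(model_bytes: float, bandwidth_gb_s: float) -> float:
    """Lower bound on next-token latency if decode must stream all weights
    from memory once per generated token (roofline-style estimate)."""
    return model_bytes / (bandwidth_gb_s * 1e9) * 1e3

WEIGHTS = 4e9  # ~4 GB for a 7B model at ~4.5 bits/weight (illustrative)
print(f"client (21.3 GB/s): {min_next_token_ms(WEIGHTS, 21.3):.0f} ms floor")
print(f"SPR-class (300 GB/s, assumed): {min_next_token_ms(WEIGHTS, 300):.1f} ms floor")
```

On the single-channel client this gives a floor of roughly 188 ms per token, the same ballpark as the ~250 ms measured above, which suggests the next-token numbers are largely bandwidth-bound.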

@@ -61,7 +61,7 @@ cd build
cmake ..
cmake --build . -j
```

Note: add compile args ```-DNE_AVX512=OFF -DNE_AVX512_VBMI=OFF -DNE_AVX512_VNNI=OFF``` to ```cmake``` when compiling it on a CPU without AVX512
Contributor

@a32543254 a32543254 Oct 19, 2023


Could we auto-detect the machine's ISA and add the following compile args without manual steps? Some consumers may not be sure about their machine's ISA.

Contributor Author


Fixed in PR #511.

Contributor


I don't think so. If I remember correctly, there was a similar discussion for the deprecated executor, and the conclusion was that the machine compiling the executable cannot know, at compile time, the hardware the binary will actually run on.
cc @luoyu-intel
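Runtime ISA detection itself is straightforward; the limitation discussed above is that the code paths the compiler emitted are already fixed when the binary is built. A minimal detection sketch (Linux-only assumption: it reads `/proc/cpuinfo`; this is not the project's actual mechanism):

```python
import os

def cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature flags advertised by the kernel (Linux only)."""
    if not os.path.exists(path):
        return set()  # non-Linux: no cpuinfo to read
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("avx2 supported:", "avx2" in flags)
print("avx512f supported:", "avx512f" in flags)
```

Dispatching on the result at run time still requires the binary to contain both code paths, which is why the compile-time flags above are needed.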

@yuchengliu1
Contributor Author

I think you already removed all the warnings in this PR?

Yes, there are no warnings now.

Contributor

@a32543254 a32543254 left a comment


LGTM

@a32543254
Contributor

a32543254 commented Oct 19, 2023

It would be better if we also compared the same models' performance between ITREX AVX2 and GGML AVX2 to see our benefit or gap.

@kevinintel kevinintel merged commit ea69f9a into main Oct 20, 2023
11 checks passed
@kevinintel kevinintel deleted the avx2 branch October 20, 2023 03:14
@yuchengliu1
Contributor Author

yuchengliu1 commented Oct 20, 2023

It would be better if we also compared the same models' performance between ITREX AVX2 and GGML AVX2 to see our benefit or gap.

| llama2-7B | first token | next token |
| --- | --- | --- |
| llama.cpp | 494.15 ms (61.77 ms per token) | 141.69 ms per token |
| ITREX q4_0 | 743.89 ms (106.27 ms per token) | 136.37 ms per token |
| ITREX q4_j | 518.14 ms (74.02 ms per token) | 253.23 ms per token |
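Simple arithmetic on those figures gives the relative next-token latency versus llama.cpp:

```python
# Next-token ms-per-token figures from the llama2-7B table above.
llama_cpp = 141.69
for name, ms in {"ITREX q4_0": 136.37, "ITREX q4_j": 253.23}.items():
    print(f"{name}: {ms / llama_cpp:.2f}x llama.cpp")
# ITREX q4_0 is slightly faster (0.96x); q4_j is ~1.79x slower.
```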

The profiling of ITREX AVX2 (profiled on q4_j, without FFN fusion):

```
perf_total_per_op_us[           ADD] = 0.184 ms
perf_total_per_op_us[           MUL] = 0.327 ms
perf_total_per_op_us[          SILU] = 3.488 ms
perf_total_per_op_us[      RMS_NORM] = 0.396 ms
perf_total_per_op_us[       MUL_MAT] = 2.497 ms
perf_total_per_op_us[         SCALE] = 0.032 ms
perf_total_per_op_us[           CPY] = 1.177 ms
perf_total_per_op_us[       RESHAPE] = 0.037 ms
perf_total_per_op_us[          VIEW] = 0.066 ms
perf_total_per_op_us[       PERMUTE] = 0.029 ms
perf_total_per_op_us[     TRANSPOSE] = 0.006 ms
perf_total_per_op_us[      GET_ROWS] = 0.008 ms
perf_total_per_op_us[ DIAG_MASK_INF] = 0.025 ms
perf_total_per_op_us[      SOFT_MAX] = 0.166 ms
perf_total_per_op_us[          ROPE] = 2.940 ms
perf_total_per_op_us[ INNER PRODUCT] = 254.974 ms
```

The profiling of GGML AVX2 (profiled on q4_0):

```
perf_total_per_op_us[           ADD] = 0.119 ms
perf_total_per_op_us[           MUL] = 0.244 ms
perf_total_per_op_us[          SILU] = 3.395 ms
perf_total_per_op_us[      RMS_NORM] = 0.345 ms
perf_total_per_op_us[       MUL_MAT] = 2.281 ms
perf_total_per_op_us[         SCALE] = 0.024 ms
perf_total_per_op_us[           CPY] = 0.976 ms
perf_total_per_op_us[       RESHAPE] = 0.032 ms
perf_total_per_op_us[          VIEW] = 0.063 ms
perf_total_per_op_us[       PERMUTE] = 0.023 ms
perf_total_per_op_us[     TRANSPOSE] = 0.017 ms
perf_total_per_op_us[      GET_ROWS] = 0.004 ms
perf_total_per_op_us[ DIAG_MASK_INF] = 0.015 ms
perf_total_per_op_us[      SOFT_MAX] = 0.117 ms
perf_total_per_op_us[          ROPE] = 0.271 ms
perf_total_per_op_us[ INNER PRODUCT] = 123.976 ms
```
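To make the comparison concrete, the dumps can be diffed per op. A small parsing sketch over three representative lines copied from the profiles above (nearly the entire next-token gap sits in INNER PRODUCT, about 2x, and ROPE, about 10x):

```python
import re

# Representative lines copied from the two profiles above.
itrex = """perf_total_per_op_us[ SILU] = 3.488 ms
perf_total_per_op_us[ ROPE] = 2.940 ms
perf_total_per_op_us[ INNER PRODUCT] = 254.974 ms"""
ggml = """perf_total_per_op_us[ SILU] = 3.395 ms
perf_total_per_op_us[ ROPE] = 0.271 ms
perf_total_per_op_us[ INNER PRODUCT] = 123.976 ms"""

def parse(dump):
    """Map op name -> total time in ms from a perf_total_per_op_us dump."""
    return {op.strip(): float(ms)
            for op, ms in re.findall(r"\[\s*([A-Z_ ]+?)\s*\] = ([\d.]+) ms", dump)}

a, b = parse(itrex), parse(ggml)
for op in a:
    print(f"{op}: ITREX {a[op]:.3f} ms vs GGML {b[op]:.3f} ms ({a[op] / b[op]:.2f}x)")
```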

zhenwei-intel pushed a commit that referenced this pull request Oct 23, 2023
* support Memcpy2D

* support gelu fusion

---------

Co-authored-by: luoyu-intel <yu.luo@intel.com>
VincyZhang added a commit that referenced this pull request Oct 23, 2023
* [CPP Graph] Opt qbits dequant (#465)

* use INC 2.3.1

Signed-off-by: Wenxin Zhang <wenxin.zhang@intel.com>

* use INC 2.3.1 (#500)

Signed-off-by: Wenxin Zhang <wenxin.zhang@intel.com>

* [RUNTIME] Enabing streaming llm for Runtime (#501)

* Support StreamingLLM on CPU

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* Reduce the UT evaluation time (#498)

Signed-off-by: changwangss <chang1.wang@intel.com>
Signed-off-by: Wenxin Zhang <wenxin.zhang@intel.com>
Signed-off-by: Wang, Chang <chang1.wang@intel.com>
Co-authored-by: Wenxin Zhang <wenxin.zhang@intel.com>

* Minor fix (#507)

* Fix ChatGLM2 model loading issue (#510)

* Fix ChatGLM2 model loading issue

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Remove OneDNN env setint for BF16 inference (#509)

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: VincyZhang <wenxin.zhang@intel.com>

* support Avx2 (#493)

* support Memcpy2D

* support gelu fusion

---------

Co-authored-by: luoyu-intel <yu.luo@intel.com>

* add neuralchat ut for audio util (#466)

* reduce ut time consumption (#499)

Signed-off-by: Xin He <xin3.he@intel.com>

* update python api readme (#504)

* Add docker setup session for neuralchat finetuning sample (#496)

* Update README.md to new added docker setup session

Signed-off-by: Louie Tsai <louie.tsai@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md for fast token issue (#515)

Signed-off-by: Louie Tsai <louie.tsai@intel.com>

* Fix typo in README.md (#516)

convertion -> conversion

Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* improve Avx2  (#511)

* Revert "update python api readme (#504)"

This reverts commit 5f4175a.

* Update README.md

Signed-off-by: Haihao Shen <haihao.shen@intel.com>

* Update README.md (#519)

Signed-off-by: ayushrakesh <115995339+ayushrakesh@users.noreply.github.com>

* docs: fix typos in question answering of pytorch (#520)

Signed-off-by: Surav Shrestha <suravshresth@gmail.com>

* fixed typos (#522)

* Updated README.md (#517)

Signed-off-by: Aditya Aryaman Das <128703909+alienishi@users.noreply.github.com>

* update python api readme

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* fix readme

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* Update README.md

Signed-off-by: Dong, Bo <bo1.dong@intel.com>

* Update README.md

Signed-off-by: Dong, Bo <bo1.dong@intel.com>

* Update README.md

Signed-off-by: Dong, Bo <bo1.dong@intel.com>

* Update README.md

Signed-off-by: Dong, Bo <bo1.dong@intel.com>

* Add Data type description
Align Doc and help info

Signed-off-by: Hengyu Meng <hengyu.meng@intel.com>

* align

Signed-off-by: Hengyu Meng <hengyu.meng@intel.com>

* fix eos token id

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

---------

Signed-off-by: Wenxin Zhang <wenxin.zhang@intel.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: changwangss <chang1.wang@intel.com>
Signed-off-by: Wang, Chang <chang1.wang@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: Haihao Shen <haihao.shen@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Louie Tsai <louie.tsai@intel.com>
Signed-off-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
Signed-off-by: ayushrakesh <115995339+ayushrakesh@users.noreply.github.com>
Signed-off-by: Surav Shrestha <suravshresth@gmail.com>
Signed-off-by: Aditya Aryaman Das <128703909+alienishi@users.noreply.github.com>
Signed-off-by: Dong, Bo <bo1.dong@intel.com>
Signed-off-by: Hengyu Meng <hengyu.meng@intel.com>
Co-authored-by: Wang, Zhe <zhe1.wang@intel.com>
Co-authored-by: Wenxin Zhang <wenxin.zhang@intel.com>
Co-authored-by: Wang, Chang <chang1.wang@intel.com>
Co-authored-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: Haihao Shen <haihao.shen@intel.com>
Co-authored-by: yuchengliu1 <yucheng.liu@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Liangyx2 <106130696+Liangyx2@users.noreply.github.com>
Co-authored-by: xinhe <xin3.he@intel.com>
Co-authored-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
Co-authored-by: ayushrakesh <115995339+ayushrakesh@users.noreply.github.com>
Co-authored-by: Surav Shrestha <148626286+shresthasurav@users.noreply.github.com>
Co-authored-by: Smoothieewastaken <86610201+Smoothieewastaken@users.noreply.github.com>
Co-authored-by: Aditya Aryaman Das <128703909+alienishi@users.noreply.github.com>
Co-authored-by: Dong, Bo <bo1.dong@intel.com>
Co-authored-by: Hengyu Meng <hengyu.meng@intel.com>
VincyZhang pushed a commit that referenced this pull request Oct 23, 2023
* support Memcpy2D

* support gelu fusion

---------

Co-authored-by: luoyu-intel <yu.luo@intel.com>