
[LLM Runtime] integrate AVX_VNNI #565

Merged
merged 5 commits into main from integrate_AVX_VNNI on Nov 9, 2023

Conversation

yuchengliu1
Contributor

Type of Change

No API changed.

integrate AVX_VNNI

Detailed description
JIRA ticket: xxx

Expected Behavior & Potential Risk

the expected behavior triggered by this PR

How has this PR been tested?

how to reproduce the test (including hardware information)

Dependency Change?

any library dependencies introduced or removed
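For context, AVX-VNNI exposes the VNNI int8 dot-product instructions on 256-bit vectors for client CPUs that lack AVX512 (e.g. Alder Lake). Below is a minimal sketch of the kind of u8×s8 dot product these instructions accelerate; this is an illustration, not the kernel added by this PR, and `dot_u8s8_avx_vnni` is a hypothetical name (build with `-mavxvnni`):

```cpp
#include <immintrin.h>
#include <cstdint>

// Dot product of n unsigned-8-bit activations with n signed-8-bit weights,
// n a multiple of 32. _mm256_dpbusd_avx_epi32 multiplies u8 lanes of `va`
// with s8 lanes of `vb` and accumulates each group of four products into a
// 32-bit lane: 32 multiply-accumulates per instruction.
int32_t dot_u8s8_avx_vnni(const uint8_t* a, const int8_t* b, int n) {
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < n; i += 32) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        acc = _mm256_dpbusd_avx_epi32(acc, va, vb);
    }
    // Horizontal sum of the eight 32-bit partial sums.
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    return _mm_cvtsi128_si32(sum);
}
```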

@yuchengliu1 yuchengliu1 requested a review from airMeng as a code owner October 27, 2023 03:17
@airMeng
Contributor

airMeng commented Oct 27, 2023

@VincyZhang can you deploy extension tests on one of our client machines? @yuchengliu1 can help
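For reference, a minimal sketch of how a runtime can detect AVX-VNNI on such client machines (an illustration, not this PR's actual dispatch code): the feature is reported by CPUID leaf 7, sub-leaf 1, EAX bit 4.

```cpp
#include <cpuid.h>  // GCC/Clang helper for the CPUID instruction

static bool has_avx_vnni() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid_count returns 0 if leaf 7 is not supported at all.
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx)) return false;
    return (eax >> 4) & 1;  // CPUID.(EAX=07H, ECX=1):EAX[4] = AVX-VNNI
}
```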

@yuchengliu1 yuchengliu1 changed the title integrate AVX_VNNI [CPP Graph]integrate AVX_VNNI Oct 31, 2023
@yuchengliu1 yuchengliu1 changed the title [CPP Graph]integrate AVX_VNNI [LLM Runtime]integrate AVX_VNNI Oct 31, 2023
@airMeng airMeng changed the title [LLM Runtime]integrate AVX_VNNI [LLM Runtime] integrate AVX_VNNI Nov 1, 2023
@airMeng airMeng force-pushed the integrate_AVX_VNNI branch from 8298285 to 14bcb1f Compare November 3, 2023 07:39
@yuchengliu1
Contributor Author

yuchengliu1 commented Nov 6, 2023

CPU: 12900
memory: DDR5 dual channel 32GB@4800MHz (memory bandwidth 76.8GB/s)
compute_type: int32

| Model | First token | Next token |
| --- | --- | --- |
| llama-7B_q4j_perN | 17046.92 ms (16.65 ms per token) | 169.06 ms per token |
| llama2-7B_q4j_perN | 16952.96 ms (16.56 ms per token) | 166.27 ms per token |
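As a back-of-envelope sanity check (an estimate, assuming roughly 4.5 effective bits per weight for 4-bit quantization once scales are included): next-token decoding has to stream the full weight set, so the bandwidth floor on this machine is about 7e9 × 4.5 / 8 ≈ 3.9 GB per token, and 3.9 GB ÷ 76.8 GB/s ≈ 51 ms per token; the measured ~169 ms per token is about 3.3× that floor.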

@kevinintel
Contributor

I remember the latency of AVX2 being 141.69 ms.
Is 166 ms good enough?

@airMeng
Contributor

airMeng commented Nov 6, 2023

> I remember the latency of AVX2 being 141.69 ms. Is 166 ms good enough?

Compared with #493's 161 ms.

BTW it would be better to list the memory bandwidth of your machine @yuchengliu1

@yuchengliu1
Contributor Author

> I remember the latency of AVX2 being 141.69 ms. Is 166 ms good enough?

141.69 ms is the latency of llama.cpp, not ITREX, and it was measured on a different machine. Given the different CPU and memory, a direct comparison would not be appropriate. The machine from #493 seems to have better performance: the next-token latency of ITREX q4_0 is 179.03 ms (this PR) vs. 136.37 ms (#493). @VincyZhang and I will run a complete CI on this machine.

@VincyZhang
Contributor

VincyZhang commented Nov 9, 2023

https://inteltf-jenk.sh.intel.com/job/nlp_toolkit_cpp_graph_test/1445/

| Model | AVX VNNI First token (ms) | AVX VNNI Next token (ms) | AVX2 First token (ms) | AVX2 Next token (ms) |
| --- | --- | --- | --- | --- |
| LLAMA2-7B-Chat | 17046.92 | 169.06 | 25337.58 | 250.01 |
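Working the ratios out from this run: AVX VNNI is about 25337.58 / 17046.92 ≈ 1.49× faster than AVX2 on the first token and 250.01 / 169.06 ≈ 1.48× faster on next tokens.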

@airMeng airMeng force-pushed the integrate_AVX_VNNI branch from 14bcb1f to f3f7cfa Compare November 9, 2023 03:17
@VincyZhang VincyZhang merged commit c9e2ef3 into main Nov 9, 2023
11 checks passed
@VincyZhang VincyZhang deleted the integrate_AVX_VNNI branch November 9, 2023 03:37
@yuchengliu1
Contributor Author

> https://inteltf-jenk.sh.intel.com/job/nlp_toolkit_cpp_graph_test/1445/
>
> | Model | AVX VNNI First token (ms) | AVX VNNI Next token (ms) | AVX2 First token (ms) | AVX2 Next token (ms) |
> | --- | --- | --- | --- | --- |
> | LLAMA2-7B-Chat | 17046.92 | 169.06 | 25337.58 | 250.01 |

llama.cpp (thread count set to 16 manually) performance on the same machine:
first token: 39511.79 ms / 1024 tokens (38.59 ms per token, 25.92 tokens per second)
next token: 5337.70 ms / 31 runs (172.18 ms per token, 5.81 tokens per second)
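For reproducibility, the invocation would have looked roughly like the following; the model file and prompt are placeholders, and `-t`/`-n` are llama.cpp `main` options of that era for thread count and tokens to generate:

```sh
# Hypothetical invocation; model path and prompt are placeholders.
./main -m llama-2-7b-chat.Q4_0.gguf -t 16 -n 32 -p "<1024-token prompt>"
```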
