
[LLM Runtime] integrate AVX_VNNI #565

Merged
merged 5 commits into main from integrate_AVX_VNNI on Nov 9, 2023

Conversation

yuchengliu1
Contributor

Type of Change

No API changed.

integrate AVX_VNNI

Detailed description
JIRA ticket: xxx

Expected Behavior & Potential Risk

the expected behavior triggered by this PR

How has this PR been tested?

how to reproduce the test (including hardware information)

Dependency Change?

any library dependencies introduced or removed
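For context, AVX-VNNI exposes the VNNI int8 dot-product instructions on 256-bit vectors for client CPUs that lack AVX512 (e.g. Alder Lake). Below is a minimal sketch of the kind of u8×s8 dot product these instructions accelerate; this is an illustration, not the kernel added by this PR, and `dot_u8s8_avx_vnni` is a hypothetical name (build with `-mavxvnni`):

```cpp
#include <immintrin.h>
#include <cstdint>

// Dot product of n unsigned-8-bit activations with n signed-8-bit weights,
// n a multiple of 32. _mm256_dpbusd_avx_epi32 multiplies u8 lanes of `va`
// with s8 lanes of `vb` and accumulates each group of four products into a
// 32-bit lane: 32 multiply-accumulates per instruction.
int32_t dot_u8s8_avx_vnni(const uint8_t* a, const int8_t* b, int n) {
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < n; i += 32) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        acc = _mm256_dpbusd_avx_epi32(acc, va, vb);
    }
    // Horizontal sum of the eight 32-bit partial sums.
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    return _mm_cvtsi128_si32(sum);
}
```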

@yuchengliu1 yuchengliu1 requested a review from airMeng as a code owner October 27, 2023 03:17
@airMeng
Contributor

airMeng commented Oct 27, 2023

@VincyZhang can you deploy extension tests on one of our client machines? @yuchengliu1 can help
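For reference, a minimal sketch of how a runtime can detect AVX-VNNI on such client machines (an illustration, not this PR's actual dispatch code): the feature is reported by CPUID leaf 7, sub-leaf 1, EAX bit 4.

```cpp
#include <cpuid.h>  // GCC/Clang helper for the CPUID instruction

static bool has_avx_vnni() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid_count returns 0 if leaf 7 is not supported at all.
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx)) return false;
    return (eax >> 4) & 1;  // CPUID.(EAX=07H, ECX=1):EAX[4] = AVX-VNNI
}
```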

@yuchengliu1 yuchengliu1 changed the title integrate AVX_VNNI [CPP Graph]integrate AVX_VNNI Oct 31, 2023
@yuchengliu1 yuchengliu1 changed the title [CPP Graph]integrate AVX_VNNI [LLM Runtime]integrate AVX_VNNI Oct 31, 2023
@airMeng airMeng changed the title [LLM Runtime]integrate AVX_VNNI [LLM Runtime] integrate AVX_VNNI Nov 1, 2023
@airMeng airMeng force-pushed the integrate_AVX_VNNI branch from 8298285 to 14bcb1f Compare November 3, 2023 07:39
@yuchengliu1
Contributor Author

yuchengliu1 commented Nov 6, 2023

CPU: 12900
memory: DDR5 dual channel 32GB@4800MHz (memory bandwidth 76.8GB/s)
compute_type: int32

| Model | First token | Next token |
| --- | --- | --- |
| llama-7B_q4j_perN | 17046.92 ms (16.65 ms per token) | 169.06 ms per token |
| llama2-7B_q4j_perN | 16952.96 ms (16.56 ms per token) | 166.27 ms per token |
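As a back-of-envelope sanity check (an estimate, assuming roughly 4.5 effective bits per weight for 4-bit quantization once scales are included): next-token decoding has to stream the full weight set, so the bandwidth floor on this machine is about 7e9 × 4.5 / 8 ≈ 3.9 GB per token, and 3.9 GB ÷ 76.8 GB/s ≈ 51 ms per token; the measured ~169 ms per token is about 3.3× that floor.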

@kevinintel
Contributor

I remember the latency of AVX2 being 141.69 ms.
Is 166 ms good enough?

@airMeng
Contributor

airMeng commented Nov 6, 2023

> I remember the latency of AVX2 being 141.69 ms. Is 166 ms good enough?

Compared with #493's 161 ms.

BTW it would be better to list the memory bandwidth of your machine @yuchengliu1

@yuchengliu1
Contributor Author

> I remember the latency of AVX2 being 141.69 ms. Is 166 ms good enough?

141.69 ms is the latency of llama.cpp, not ITREX, and it was measured on a different machine. Given the different CPU and memory, a direct comparison would not be appropriate. The machine from #493 seems to have better performance: the next-token latency of ITREX q4_0 is 179.03 ms (this PR) vs. 136.37 ms (#493). @VincyZhang and I will run a complete CI on this machine.

@VincyZhang
Contributor

VincyZhang commented Nov 9, 2023

https://inteltf-jenk.sh.intel.com/job/nlp_toolkit_cpp_graph_test/1445/

| Model | AVX VNNI First token (ms) | AVX VNNI Next token (ms) | AVX2 First token (ms) | AVX2 Next token (ms) |
| --- | --- | --- | --- | --- |
| LLAMA2-7B-Chat | 17046.92 | 169.06 | 25337.58 | 250.01 |
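Working the ratios out from this run: AVX VNNI is about 25337.58 / 17046.92 ≈ 1.49× faster than AVX2 on the first token and 250.01 / 169.06 ≈ 1.48× faster on next tokens.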

@airMeng airMeng force-pushed the integrate_AVX_VNNI branch from 14bcb1f to f3f7cfa Compare November 9, 2023 03:17
@VincyZhang VincyZhang merged commit c9e2ef3 into main Nov 9, 2023
11 checks passed
@VincyZhang VincyZhang deleted the integrate_AVX_VNNI branch November 9, 2023 03:37
@yuchengliu1
Contributor Author

> https://inteltf-jenk.sh.intel.com/job/nlp_toolkit_cpp_graph_test/1445/
>
> | Model | AVX VNNI First token (ms) | AVX VNNI Next token (ms) | AVX2 First token (ms) | AVX2 Next token (ms) |
> | --- | --- | --- | --- | --- |
> | LLAMA2-7B-Chat | 17046.92 | 169.06 | 25337.58 | 250.01 |

llama.cpp (thread count set to 16 manually) performance on the same machine:
first token: 39511.79 ms / 1024 tokens (38.59 ms per token, 25.92 tokens per second)
next token: 5337.70 ms / 31 runs (172.18 ms per token, 5.81 tokens per second)
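For reproducibility, the invocation would have looked roughly like the following; the model file and prompt are placeholders, and `-t`/`-n` are llama.cpp `main` options of that era for thread count and tokens to generate:

```sh
# Hypothetical invocation; model path and prompt are placeholders.
./main -m llama-2-7b-chat.Q4_0.gguf -t 16 -n 32 -p "<1024-token prompt>"
```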
