quick_infer

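Introduction

Accelerates llama2 inference on x86 and ARM architectures using a KV cache, SIMD, multithreading, loop unrolling, and related techniques.

Implemented in pure C++ for high execution efficiency.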
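To make the KV cache idea concrete, here is a minimal single-head sketch. It is not taken from this repository: `KVCache`, `append`, and `attend` are hypothetical names, and the layout (one flat `float` buffer per cache) is an assumption chosen for brevity. The point it illustrates is that each decoding step appends one key/value pair and reuses all earlier ones instead of recomputing them.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical single-head KV cache; names and layout are illustrative only.
struct KVCache {
    std::size_t head_dim;
    std::vector<float> keys;    // flattened [num_tokens x head_dim]
    std::vector<float> values;  // flattened [num_tokens x head_dim]

    explicit KVCache(std::size_t dim) : head_dim(dim) { assert(dim > 0); }

    // Number of token positions currently cached.
    std::size_t size() const { return keys.size() / head_dim; }

    // Store the newest token's key/value so later steps can reuse them.
    void append(const std::vector<float>& k, const std::vector<float>& v) {
        assert(k.size() == head_dim && v.size() == head_dim);
        keys.insert(keys.end(), k.begin(), k.end());
        values.insert(values.end(), v.begin(), v.end());
    }
};

// One decoding step: attend from query q over every cached position.
std::vector<float> attend(const KVCache& cache, const std::vector<float>& q) {
    assert(q.size() == cache.head_dim && cache.size() > 0);
    const std::size_t n = cache.size();
    const std::size_t d = cache.head_dim;
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));

    // Scaled dot-product scores against the cached keys.
    std::vector<float> scores(n);
    float max_score = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < n; ++i) {
        float s = 0.0f;
        for (std::size_t j = 0; j < d; ++j) s += q[j] * cache.keys[i * d + j];
        scores[i] = s * scale;
        if (scores[i] > max_score) max_score = scores[i];
    }

    // Numerically stable softmax over the scores.
    float denom = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        scores[i] = std::exp(scores[i] - max_score);
        denom += scores[i];
    }

    // Weighted sum of the cached values.
    std::vector<float> out(d, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        const float w = scores[i] / denom;
        for (std::size_t j = 0; j < d; ++j) out[j] += w * cache.values[i * d + j];
    }
    return out;
}
```

In a decoding loop under these assumptions, each new token calls `append` once and `attend` once, so the per-step cost grows with the number of cached tokens rather than requiring the keys and values of the whole prefix to be recomputed.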

Dependencies

xmake

Usage

Install xmake.

Clone the repository and enter the directory:

git clone https://github.com/zhaosiyuan1098/yuangine.git
cd yuangine

Download the required model:

Download with curl

cd ./model

x86:

curl -L -o LLaMA_7B_2_chat.zip "https://www.dropbox.com/scl/fi/vu7wnes1c7gkcegg854ys/LLaMA_7B_2_chat.zip?rlkey=q61o8fpc954g1ke6g2eaot7cf&dl=1"

ARM:

curl -L -o LLaMA_7B_2_chat.zip "https://www.dropbox.com/scl/fi/1trpw92vmh4czvl28hkv0/LLaMA_7B_2_chat.zip?rlkey=dy1pdek0147gnuxdzpodi6pkt&dl=1"

Unzip the archive

unzip LLaMA_7B_2_chat.zip

Download other models with Python (optional)

conda create -n yuangine python=3.10
conda activate yuangine
pip install -r requirements.txt
cd ./model

python download_model.py --model <model_name> --QM <target_architecture>

Build the project:
```
cd ..
xmake
```
Run the project:
```
xmake run
```

Structure

Implemented following the original llama2 structure.

Detailed code architecture

Results

Speedup comparison across methods
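As one small illustration of what following the llama2 structure involves, llama2 normalizes activations with RMSNorm rather than LayerNorm. The sketch below is not taken from this repository: `rms_norm` is a hypothetical name, and the default `eps` of `1e-5` is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical RMSNorm helper: y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
// The default eps is an assumed value, not read from this project's parameters.
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& weight,
                            float eps = 1e-5f) {
    assert(x.size() == weight.size() && !x.empty());

    // Mean of squares over the hidden dimension.
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    const float inv_rms =
        1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);

    // Scale each element by the inverse RMS and the learned per-channel weight.
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * inv_rms * weight[i];
    return y;
}
```

Unlike LayerNorm, this subtracts no mean and adds no bias, so it needs only one reduction over the vector.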

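| Method | x86 speedup | ARM speedup | Notes |
| --- | --- | --- | --- |
| SIMD + multithreading + loop unrolling | 16.16x | 18.3x | Uses caching for acceleration |
| SIMD | 8.83x | 10.24x | Single instruction, multiple data |
| Multithreading | 2.99x | 3.17x | Parallel computation |
| Loop unrolling | 1.04x | 1.06x | Reduces loop overhead |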
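The loop unrolling row can be illustrated with a dot product, the inner kernel of the matrix-vector products that dominate inference. This is a minimal sketch, not the project's kernel: `dot_naive` and `dot_unrolled4` are hypothetical names, and the unroll factor of 4 is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline: one multiply-add and one loop-condition check per element.
float dot_naive(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Unrolled by 4 (illustrative factor): four independent accumulators per
// iteration, so the loop condition is checked once per four elements.
float dot_unrolled4(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    // Handle the leftover elements when n is not a multiple of 4.
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

Because the unrolled version adds terms in a different order, its result can differ from the baseline in the last bits for general floating-point input. An optimizing compiler may also unroll the baseline loop on its own, so hand unrolling may gain little beyond that.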

Run results

[Screenshots of run output: SIMD + multithreading + loop unrolling; SIMD; multithreading; loop unrolling]
