quick_infer

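Introduction

Accelerates llama2 inference on x86 and ARM using a KV cache, SIMD, multithreading, and loop unrolling.

Implemented in pure C++ for high execution efficiency.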
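The README names a KV cache without describing it. The sketch below is only an illustration of the general idea, not the repository's code: the keys and values of tokens that were already processed are stored, so each decoding step computes attention for the new query against the cached entries instead of recomputing them. The names `KVCache`, `append`, and `attend` are hypothetical, and the single-head, unbatched layout is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-head KV cache (illustrative only, not the repo's types).
// Keys/values of past tokens are kept so they are computed once per token.
struct KVCache {
    std::size_t head_dim;
    std::vector<float> keys;    // flattened [num_tokens * head_dim]
    std::vector<float> values;  // flattened [num_tokens * head_dim]

    explicit KVCache(std::size_t dim) : head_dim(dim) { assert(dim > 0); }

    // Number of tokens currently cached.
    std::size_t size() const { return keys.size() / head_dim; }

    // Store the key and value vectors of the newest token.
    void append(const std::vector<float>& k, const std::vector<float>& v) {
        assert(k.size() == head_dim && v.size() == head_dim);
        keys.insert(keys.end(), k.begin(), k.end());
        values.insert(values.end(), v.begin(), v.end());
    }

    // Attention output for query q over all cached tokens:
    // softmax(q . K^T / sqrt(head_dim)) . V
    std::vector<float> attend(const std::vector<float>& q) const {
        assert(q.size() == head_dim && size() > 0);
        const std::size_t n = size();
        const float scale = std::sqrt(static_cast<float>(head_dim));
        std::vector<float> scores(n);
        float max_score = -INFINITY;
        for (std::size_t t = 0; t < n; ++t) {
            float dot = 0.0f;
            for (std::size_t i = 0; i < head_dim; ++i)
                dot += q[i] * keys[t * head_dim + i];
            scores[t] = dot / scale;
            if (scores[t] > max_score) max_score = scores[t];
        }
        // Numerically stable softmax: subtract the maximum before exp.
        float sum = 0.0f;
        for (std::size_t t = 0; t < n; ++t) {
            scores[t] = std::exp(scores[t] - max_score);
            sum += scores[t];
        }
        // Weighted sum of the cached value vectors.
        std::vector<float> out(head_dim, 0.0f);
        for (std::size_t t = 0; t < n; ++t)
            for (std::size_t i = 0; i < head_dim; ++i)
                out[i] += (scores[t] / sum) * values[t * head_dim + i];
        return out;
    }
};
```

In a decoding loop under these assumptions, each new token would call `append` once with its key and value, then `attend` with its query, so the per-step cost grows with the number of cached tokens rather than requiring the whole prefix to be reprocessed.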

Dependencies

xmake

Usage

Install xmake

Clone the repository and enter the directory:

git clone https://github.com/zhaosiyuan1098/yuangine.git
cd yuangine

Download the required model:

Download with curl

cd ./model

x86:

curl -L -o LLaMA_7B_2_chat.zip "https://www.dropbox.com/scl/fi/vu7wnes1c7gkcegg854ys/LLaMA_7B_2_chat.zip?rlkey=q61o8fpc954g1ke6g2eaot7cf&dl=1"

ARM:

curl -L -o LLaMA_7B_2_chat.zip "https://www.dropbox.com/scl/fi/1trpw92vmh4czvl28hkv0/LLaMA_7B_2_chat.zip?rlkey=dy1pdek0147gnuxdzpodi6pkt&dl=1"

Unzip the archive

unzip LLaMA_7B_2_chat.zip

Download other models with Python (optional)

conda create -n yuangine python=3.10
conda activate yuangine
pip install -r requirenments.txt
cd ./model

python download_model.py --model <name of the model to download> --QM <matching architecture>

Build the project:
```
cd ..
xmake
```
Run the project:
```
xmake run
```

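Structure

Implemented following the original llama2 architecture.

Detailed code architecture

Results

Speedup comparison across acceleration methods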

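| Method | x86 speedup | ARM speedup | Notes |
| --- | --- | --- | --- |
| SIMD + multithreading + loop unrolling | 16.16x | 18.3x | With cache acceleration |
| SIMD | 8.83x | 10.24x | Single instruction, multiple data |
| Multithreading | 2.99x | 3.17x | Parallel computation |
| Loop unrolling | 1.04x | 1.06x | Reduces loop overhead |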
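To make two of the techniques in the table concrete, here is a minimal portable sketch of loop unrolling and multithreading applied to a matrix-vector product, the core operation in transformer inference. It is not taken from the repository: `dot_unrolled` and `matvec_threaded` are hypothetical names, the unroll factor of 4 and the row-wise split are assumptions, and explicit SIMD intrinsics are left out because they differ between x86 and ARM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Dot product with the loop unrolled by 4: fewer loop-condition checks per
// element, and four independent accumulators that a compiler may vectorize.
float dot_unrolled(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) sum += a[i] * b[i];  // leftover elements when n % 4 != 0
    return sum;
}

// y = W * x, with W stored row-major as rows x cols.
// Rows are split into contiguous chunks, one chunk per worker thread.
std::vector<float> matvec_threaded(const std::vector<float>& w,
                                   const std::vector<float>& x,
                                   std::size_t rows,
                                   unsigned num_threads) {
    assert(num_threads > 0 && !x.empty() && w.size() == rows * x.size());
    const std::size_t cols = x.size();
    std::vector<float> y(rows, 0.0f);
    std::vector<std::thread> workers;
    const std::size_t chunk = (rows + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(rows, begin + chunk);
        if (begin >= end) break;  // more threads than rows
        // Each thread writes a disjoint range of y, so no locking is needed.
        workers.emplace_back([&, begin, end] {
            for (std::size_t r = begin; r < end; ++r)
                y[r] = dot_unrolled(&w[r * cols], x.data(), cols);
        });
    }
    for (auto& th : workers) th.join();
    return y;
}
```

Splitting by rows keeps each thread's writes independent, which is one common way to parallelize this operation. The speedups reported in the table come from the project's own implementation and are not reproduced by this sketch.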

Run results

Screenshots for each configuration: SIMD + multithreading + loop unrolling; SIMD; multithreading; loop unrolling
