We provide code to profile the quantized model on an Android phone with a Snapdragon 8 Gen 3 NPU. Please install the necessary packages, e.g. AIMET and the QNN SDK, following `INSTALL.md`.
If you're playing with LLaMA 1.1B, please ignore this part as it is completely unnecessary: for models larger than 2B, e.g. Gemma 2B, we need a `gcc` that uses 64-bit memory addresses. Our solution is to compile our own `gcc` 😢.
- get the `gcc`

```bash
git clone https://github.com/fwtan/gcc.git
cd gcc
git checkout gcc-12/qnn
```
- compile the `gcc`

```bash
./contrib/download_prerequisites
cd ..
mkdir objdir
cd objdir
$PWD/../gcc/configure --prefix=$HOME/gcc-qnn --enable-languages=c,c++,fortran,go
make
make install
```
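Once the install finishes, you can make the freshly built toolchain visible to the export scripts; a minimal sketch, assuming the `--prefix=$HOME/gcc-qnn` used above:

```bash
# Put the newly built toolchain first on PATH and check which gcc is picked up.
export PATH=$HOME/gcc-qnn/bin:$PATH
which gcc      # should report $HOME/gcc-qnn/bin/gcc
gcc --version  # should report GCC 12.x (the gcc-12/qnn branch)
```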
Please do let us know if you have any simpler solutions 🙂!
Please `adb connect` to an Android phone with a Snapdragon 8 Gen 3 HTP before running the profiling.
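For example, to connect over Wi-Fi and confirm that the phone is visible (the IP address and port below are placeholders for your own device):

```bash
# Connect to the phone over Wi-Fi; replace the address with your device's IP.
adb connect 192.168.1.42:5555
# The phone should now be listed as a connected device.
adb devices
```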
For the W4A8 model:

- get the HF model

```bash
git clone https://huggingface.co/fwtan/llama-1.1b-mobilequant-w4a8-s1024-e60-sym-hf
export HF_PATH=llama-1.1b-mobilequant-w4a8-s1024-e60-sym-hf
export HF_NAME=llama-1.1b-mobilequant-w4a8-s1024-e60-sym-hf
```
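The checkpoint's weight files are tracked with Git LFS; if the clone only pulls down small pointer files, install and enable Git LFS and re-pull (a minimal sketch, Debian/Ubuntu package name assumed):

```bash
# Install git-lfs, enable it for this user, and fetch the actual weight files.
sudo apt-get install git-lfs
git lfs install
cd llama-1.1b-mobilequant-w4a8-s1024-e60-sym-hf && git lfs pull && cd ..
```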
- convert the HF model to a simpler format (`sim` model) that can be parsed by AIMET. This step is only necessary if you'd like to use your own pretrained model.

```bash
# Not necessary if you're using the provided model
# CUDA_VISIBLE_DEVICES=0 python device/convert_sim.py --hf_path ${HF_PATH}
```
- export the ONNX files and the quantization encodings. Note that `${HF_PATH}/act_dict.json` will be used to override the quantization encodings from the AIMET calibration.

```bash
CUDA_VISIBLE_DEVICES=0 python device/calibrate.py --hf_path ${HF_PATH} --per_channel \
    --use_conv --weight_bitwidth 4 --act_dict_path ${HF_PATH}/act_dict.json
```
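If you want to sanity-check the activation ranges that will override the AIMET encodings, you can pretty-print the JSON before running the export (a quick inspection sketch; the exact schema is whatever `device/calibrate.py` expects):

```bash
# Pretty-print the first entries of the activation-range dictionary.
python -m json.tool ${HF_PATH}/act_dict.json | head -n 20
```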
- export the on-device model and run the profiling

```bash
CUDA_VISIBLE_DEVICES=0 python device/export.py --hf_path ${HF_PATH} \
    --kv_cache --per_channel --use_conv --quant_config 4 8 32 \
    --quant_encoding results/sim_${HF_NAME}_calibration/gen/model_gen_transfered.encodings \
    --kv_encoding results/sim_${HF_NAME}_calibration/gen/model_gen_kv_cache.encodings
```
The script will report the average latency (ms) and save the profiling data `qnn_profiling.csv` and the model `qnn_model.bin` in `sim_${HF_NAME}_qnn_with_kv/device`. The `qnn_model.bin` file can then be used for the provided demo, as discussed in `capp/README.md`.
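To take a quick look at the outputs from the command line (assuming the output directory above and that `column` from util-linux is available):

```bash
cd sim_${HF_NAME}_qnn_with_kv/device
ls -lh qnn_model.bin                              # the model binary used by the demo app
column -s, -t < qnn_profiling.csv | head -n 20    # first rows of the profiling data
```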
For the W8A8 model:

- get the HF model

```bash
git clone https://huggingface.co/fwtan/llama-1.1b-mobilequant-w8a8-s1024-e60-hf
export HF_PATH=llama-1.1b-mobilequant-w8a8-s1024-e60-hf
export HF_NAME=llama-1.1b-mobilequant-w8a8-s1024-e60-hf
```
- convert the HF model if necessary

```bash
# Not necessary if you're using the provided model
# CUDA_VISIBLE_DEVICES=0 python device/convert_sim.py --hf_path ${HF_PATH}
```
The other commands are very similar to W4A8, except that we don't use the `--per_channel` and `--use_conv` flags.

- export the ONNX files and the quantization encodings
```bash
CUDA_VISIBLE_DEVICES=0 python device/calibrate.py --hf_path ${HF_PATH} \
    --weight_bitwidth 8 --act_dict_path ${HF_PATH}/act_dict.json
```
- export the on-device model and run the profiling

```bash
CUDA_VISIBLE_DEVICES=0 python device/export.py --hf_path ${HF_PATH} \
    --kv_cache --quant_config 8 8 32 \
    --quant_encoding results/sim_${HF_NAME}_calibration/gen/model_gen_transfered.encodings \
    --kv_encoding results/sim_${HF_NAME}_calibration/gen/model_gen_kv_cache.encodings
```
For larger models, e.g. Gemma 2B, the commands are the same as for LLaMA 1.1B, as long as `${HF_PATH}` and `${HF_NAME}` are set properly (and the 64-bit `gcc` above has been built).
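For instance, pointing the same scripts at a different checkpoint only requires resetting the two variables (the path below is a placeholder, not an actual released checkpoint):

```bash
# Hypothetical checkpoint path -- substitute the actual Gemma 2B MobileQuant repo.
export HF_PATH=<path-to-gemma-2b-mobilequant-hf>
export HF_NAME=<path-to-gemma-2b-mobilequant-hf>
# Then re-run device/calibrate.py and device/export.py as above.
```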