IPEX-LLM

ipex-llm is a library for running LLMs (large language models) on Intel XPU (from Laptop to GPU to Cloud) using INT4/FP4/INT8/FP8 with very low latency[^1] (for any PyTorch model).

It is built on the excellent work of llama.cpp, bitsandbytes, qlora, gptq, AutoGPTQ, awq, AutoAWQ, vLLM, llama-cpp-python, gptq_for_llama, chatglm.cpp, redpajama.cpp, gptneox.cpp, bloomz.cpp, etc.

Latest update πŸ”₯

  • [2024/03] LangChain added support for ipex-llm; see the details here.
  • [2024/02] ipex-llm now supports directly loading models from ModelScope (魔搭).
  • [2024/02] ipex-llm added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large-size LLMs (e.g., Mixtral-8x7B) on an Intel GPU with 16GB VRAM.
  • [2024/02] Users can now use ipex-llm through Text-Generation-WebUI GUI.
  • [2024/02] ipex-llm now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 inference on Intel GPU and BF16 inference on Intel CPU, respectively.
  • [2024/02] ipex-llm now supports a comprehensive set of LLM finetuning techniques on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
  • [2024/01] Using ipex-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
  • [2024/01] πŸ””πŸ””πŸ”” The default ipex-llm GPU Linux installation has switched from PyTorch 2.0 to PyTorch 2.1, which requires new oneAPI and GPU driver versions. (See the GPU installation guide for more details.)
  • [2023/12] ipex-llm now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
  • [2023/12] ipex-llm now supports Mixtral-8x7B on both Intel GPU and CPU.
  • [2023/12] ipex-llm now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
  • [2023/12] ipex-llm now supports FP8 and FP4 inference on Intel GPU.
  • [2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into ipex-llm is available.
  • [2023/11] ipex-llm now supports vLLM continuous batching on both Intel GPU and CPU.
  • [2023/10] ipex-llm now supports QLoRA finetuning on both Intel GPU and CPU.
  • [2023/10] ipex-llm now supports FastChat serving on both Intel CPU and GPU.
  • [2023/09] ipex-llm now supports Intel GPU (including iGPU, Arc, Flex and MAX).
  • [2023/09] ipex-llm tutorial is released.
  • [2023/09] Over 40 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, ChatGLM2/ChatGLM3, Mistral, Falcon, MPT, LLaVA, WizardCoder, Dolly, Whisper, Baichuan/Baichuan2, InternLM, Skywork, QWen/Qwen-VL, Aquila, MOSS, and more; see the complete list here.

ipex-llm Demos

See the optimized performance of chatglm2-6b and llama-2-13b-chat models on 12th Gen Intel Core CPU and Intel Arc GPU below.

[Demo animations: chatglm2-6b and llama-2-13b-chat on 12th Gen Intel Core CPU; chatglm2-6b and llama-2-13b-chat on Intel Arc GPU]

ipex-llm quickstart

CPU INT4

Install

You may install ipex-llm on Intel CPU as follows:

Note: See the CPU installation guide for more details.

pip install --pre --upgrade ipex-llm[all]

Note: ipex-llm has been tested on Python 3.9, 3.10 and 3.11.
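
To sanity-check the installation, you can try the import used throughout the examples below (a minimal sketch; the printed message is just illustrative):

# verify that ipex-llm is importable
from ipex_llm.transformers import AutoModelForCausalLM
print("ipex-llm is installed")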

Run Model

You may apply INT4 optimizations to any Hugging Face Transformers model as follows.

# Load a Hugging Face Transformers model with INT4 optimizations
from ipex_llm.transformers import AutoModelForCausalLM
model_path = '/path/to/model/'
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

# Run the optimized model on CPU
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)

See the complete examples here.
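
For a fully self-contained variant of the snippet above, the sketch below fills in the elided arguments with ordinary Hugging Face Transformers choices; the prompt, return_tensors="pt" and max_new_tokens=32 are illustrative placeholders rather than ipex-llm requirements:

from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = '/path/to/model/'   # local directory or Hugging Face model id
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = "Once upon a time, there existed a little girl"   # illustrative prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])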

GPU INT4

Install

You may install ipex-llm on Intel GPU as follows:

Note: See the GPU installation guide for more details.

# the command below will install intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade ipex-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

Note: ipex-llm has been tested on Python 3.9, 3.10 and 3.11.
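
As a quick sanity check after the GPU install, the minimal sketch below assumes the intel_extension_for_pytorch XPU build installed by the command above and a correctly configured oneAPI/GPU driver environment; torch.xpu.is_available() is the IPEX way to query the device:

import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device used in the examples below

# should print True when the Intel GPU is visible to PyTorch
print(torch.xpu.is_available())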

Run Model

You may apply INT4 optimizations to any Hugging Face Transformers model as follows.

# Load a Hugging Face Transformers model with INT4 optimizations
from ipex_llm.transformers import AutoModelForCausalLM
model_path = '/path/to/model/'
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

# Move the optimized model to Intel GPU
model = model.to('xpu')

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids.cpu())

See the complete examples here.

More Low-Bit Support

Save and load

After the model is optimized using ipex-llm, you may save and load the model as follows:

# save the optimized low-bit weights to disk
model.save_low_bit(model_path)
# load the saved low-bit model directly, without optimizing again
new_model = AutoModelForCausalLM.load_low_bit(model_path)

See the complete example here.
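
Putting the two calls together, a minimal sketch of the save-then-reload flow could look like this (paths are placeholders; saving the tokenizer alongside the low-bit weights is just a convenience, not an ipex-llm requirement):

from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = '/path/to/model/'          # original Hugging Face model
save_path = '/path/to/low-bit-model/'   # where the low-bit weights will be stored

# optimize once, then persist the low-bit weights
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
model.save_low_bit(save_path)
AutoTokenizer.from_pretrained(model_path).save_pretrained(save_path)

# later: reload the low-bit weights directly instead of re-optimizing
new_model = AutoModelForCausalLM.load_low_bit(save_path)
new_tokenizer = AutoTokenizer.from_pretrained(save_path)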

Additional data types

In addition to INT4, you may apply other low-bit optimizations (such as INT8, INT5, NF4, etc.) as follows:

model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")

See the complete example here.
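
The same pattern applies to the other precisions; in the sketch below, "sym_int8" comes from the example above, while "nf4" is assumed to follow the same naming scheme (check the linked example for the exact strings your ipex-llm version supports):

from ipex_llm.transformers import AutoModelForCausalLM

model_path = '/path/to/model/'
# 8-bit symmetric quantization (string name from the example above)
model_int8 = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit="sym_int8")
# NF4 quantization (string name assumed; see the linked example)
model_nf4 = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit="nf4")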

Verified Models

Over 40 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, ChatGLM/ChatGLM2, Mistral, Falcon, MPT, Baichuan/Baichuan2, InternLM, QWen and more; see the example list below.

Model CPU Example GPU Example
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) link1, link2 link
LLaMA 2 link1, link2 link1, link2-low GPU memory example
ChatGLM link
ChatGLM2 link link
ChatGLM3 link link
Mistral link link
Mixtral link link
Falcon link link
MPT link link
Dolly-v1 link link
Dolly-v2 link link
Replit Code link link
RedPajama link1, link2
Phoenix link1, link2
StarCoder link1, link2 link
Baichuan link link
Baichuan2 link link
InternLM link link
Qwen link link
Qwen1.5 link link
Qwen-VL link link
Aquila link link
Aquila2 link link
MOSS link
Whisper link link
Phi-1_5 link link
Flan-t5 link link
LLaVA link link
CodeLlama link link
Skywork link
InternLM-XComposer link
WizardCoder-Python link
CodeShell link
Fuyu link
Distil-Whisper link link
Yi link link
BlueLM link link
Mamba link link
SOLAR link link
Phixtral link link
InternLM2 link link
RWKV4 link
RWKV5 link
Bark link link
SpeechT5 link
DeepSeek-MoE link
Ziya-Coding-34B-v1.0 link
Phi-2 link link
Yuan2 link link
Gemma link link
DeciLM-7B link link
Deepseek link link

For more details, please refer to the ipex-llm Document, Readme, Tutorial and API Doc.
