This repository has been archived by the owner on Oct 25, 2024. It is now read-only.
Intel® Extension for Transformers v1.3 Release
Highlights
Publications
Features
Examples
Bug Fixing
Incompatible Changes
Validated Configurations
Highlights
- LLM Workflow/Neural Chat
  - Achieved the Top-1 7B LLM on the Hugging Face Open LLM Leaderboard in Nov'23
  - Released the DPO dataset to Hugging Face Space for fine-tuning
  - Published the blog and fine-tuning code on Gaudi2
  - Supported fine-tuning and inference on Gaudi2 and Xeon
  - Updated notebooks for chatbot development and deployment
  - Provided customizable RAG-based chatbot applications
  - Published the INT4 chatbot on Hugging Face Space
- Transformer Extension for Low-bit Inference and Fine-tuning
  - Supported INT4/NF4/FP4/FP8 LLM inference
  - Improved StreamingLLM for efficient endless text generation
  - Demonstrated up to 40x better performance than llama.cpp on Intel Xeon Scalable Processors
  - Supported QLoRA fine-tuning on CPU
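The low-bit inference highlights above center on weight-only quantization: weights are stored in 4 bits with a floating-point scale shared by each small group of weights, then dequantized on the fly during the matmul. The following is a minimal NumPy sketch of group-wise symmetric INT4 quantization to illustrate the idea; the function names are illustrative and are not the ITREX API, which uses optimized C++ kernels.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=32):
    """Symmetric per-group INT4 quantization: each group of `group_size`
    weights shares one FP32 scale; values are stored as integers in [-8, 7]."""
    flat = w.reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)           # avoid divide-by-zero
    q = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4_groupwise(q, scales, shape):
    """Recover an approximate FP32 tensor from the 4-bit codes and scales."""
    return (q.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(q, s, w.shape)
err = np.abs(w - w_hat).max()
print(f"max abs quantization error: {err:.4f}")
```

Smaller group sizes give lower error at the cost of more scale storage; NF4 and the FP4/FP8 formats replace the uniform integer grid with non-uniform code points but follow the same group-wise structure.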
Publications
- NeurIPS 2023 Workshop on Efficient Natural Language and Speech Processing: Efficient LLM Inference on CPUs
- NeurIPS 2023 Workshop on Diffusion Models: Effective Quantization for Diffusion Models on CPUs
- arXiv: TEQ: Trainable Equivalent Transformation for Quantization of LLMs
- arXiv: Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
Features
- LLM Workflow/Neural Chat
  - Support Gaudi model parallelism serving (7f0090)
  - Add PEFT model support in DeepSpeed sharded mode (370ca3)
  - Support returning error codes (ea173a)
  - Enhance NeuralChat security (ab43c7, 43e8b9, 6e0386)
  - Support assisted generation for NeuralChat (5ba797)
  - Add code-generation RESTful API in NeuralChat (0c77b1)
  - Support multi-card streaming inference on Gaudi (9ad75c)
  - Support multi-CPU RESTful API serving (fec4bb4)
  - Support IPEX INT8 models (e13363)
  - Enable retrieval with URLs as inputs (9d90e1d)
  - Add NER plugin to NeuralChat (aa5d8a)
  - Integrate PhotoAI backend into NeuralChat (da138c, d7a1d8)
  - Support image-to-image plugin as a service (12ad4c)
  - Support optimized SadTalker video plugin in NeuralChat (7f24c79)
  - Add askdoc retrieval API & example (89cf76)
  - Add side-by-side UI (dbbcc2b)
- Transformer Extension for Low-bit Inference and Fine-tuning
  - Support load_in_nbit in LLM Runtime (4423f7)
  - Extend LangChain embedding API (80a779)
  - Support QLoRA on CPU (adb109)
  - Support PPO RL training (936c2d2, 8543e2f)
  - Support multi-model training (ecb448)
  - Support GPTQ models in Transformers Extension for Low-bit Inference Runtime (8145e6)
  - Enable beam-search post-processing (958d04, ae95a2, 224656, 6ea825)
  - Add MX formats (FP8_E5M2, FP8_E4M3, FP4_E2M1, NF4) (f49f2d, 9f96ae)
  - Refactor Transformers Extension for Low-bit Inference Runtime based on the latest Jblas (43e30b)
  - Support attention-block tensor parallelism and add a Jblas split-weight interface (2c31dc, 22ceda4)
  - Enable StreamingLLM for Runtime (ffc73bb5)
  - Support StarCoder MHA fusion (841b29a)
  - Support recipes in SmoothQuantConfig (1e0d7e)
  - Add SetFit API in ITREX (ffb7fd8)
  - Support full-parameter fine-tuning (2b541)
  - Support SmoothQuant auto-tuning (2fde68c)
  - Use Python logging instead of print (60942e)
  - Support Falcon and unify the INT8 API (0fb2da8)
  - Support the ipex.optimize_transformers feature (d2bd4d, ee855, 3f9ee42)
  - Optimize the dropout operator (7d276c)
  - Add a script for PPL evaluation (df40d5)
  - Refine the Python API (91511d, 6e32ca6)
  - Allow CompileBF16 on GCC 11 (d9e95d)
  - Support multi-round chat with ChatGLM2 (db35a3)
  - Enable Shift-RoPE-based StreamingLLM (68ca20, 61f19f9)
  - Enable MHA fusion for LLMs (81dde2, 7b73b1, 6599bd, 692fde3)
  - Support AVX_VNNI and AVX2 (c9e2ef3, 00baa42, a05ff4b)
  - Optimize the QBits backend (e1f9e2b3, 45e03b9)
  - Add GELU support (4f5de0)
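Several Runtime features above (StreamingLLM, Shift-RoPE-based StreamingLLM) build on the same idea: keep a handful of initial "attention sink" tokens plus a sliding window of recent tokens in the KV cache, so generation can continue indefinitely in bounded memory. Below is a toy, framework-free sketch of that eviction policy; the real Runtime implements it in fused C++ kernels, and all names here are illustrative.

```python
from collections import deque

class SinkCache:
    """Toy StreamingLLM-style KV-cache policy: always retain the first
    `n_sink` positions ("attention sinks") plus the most recent `window`
    positions; evict everything in between."""
    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sinks = []                     # first n_sink entries, kept forever
        self.recent = deque(maxlen=window)  # sliding window of recent entries

    def append(self, kv):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)          # deque drops the oldest automatically

    def view(self):
        return self.sinks + list(self.recent)

# Feed 100 token positions through the cache; only 12 entries survive.
cache = SinkCache(n_sink=4, window=8)
for pos in range(100):
    cache.append(pos)
kept = cache.view()
print(kept)  # [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
```

The Shift-RoPE variant additionally re-rotates cached keys so their positional encodings stay contiguous after eviction, which keeps attention scores consistent with a dense cache.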
Examples
- LLM Workflow/Neural Chat
- Transformer Extension for Low-bit Inference and Fine-tuning
  - Add ChatGLM and Code Llama examples (130b59)
  - Add weight-only quantization (WOQ) to the code-generation example (65a645f)
  - Add ChatGLM2/ChatGLM3 support to the text-generation example (4525b)
  - Support Qwen in the text-generation example (8f41d4)
  - Add INT4 ONNX Whisper example (c7f8173c, e9fc4c2)
  - Support DPO on Habana Gaudi (98d3ce3)
  - Enable fine-tuning for Qwen-7B-Chat on CPU (6bc938)
  - Enable the Whisper C++ API (74e92a)
  - Apply the STS task to BAAI/BGE models (0c4c5ed, c399e38)
  - Enable the Qwen graph (381331c)
  - Add instruction-tuning Stable Diffusion examples (17f01c6)
  - Enable Mistral-7B (7d1495)
  - Support Falcon-180B (900ebf4)
  - Add Baichuan/Baichuan2 examples (98e5f9)
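Several of the fine-tuning examples above rely on QLoRA, which keeps the quantized base weight frozen and trains only a small low-rank delta. A minimal NumPy sketch of the underlying LoRA math (all dimensions and names here are illustrative; real fine-tuning goes through PyTorch/PEFT):

```python
import numpy as np

rng = np.random.default_rng(42)
d_in, d_out, rank, alpha = 16, 16, 4, 8.0

W = rng.standard_normal((d_out, d_in))        # frozen base weight (quantized in real QLoRA)
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))                   # B starts at zero -> adapter is a no-op

def lora_forward(x):
    # Effective weight is W + (alpha / rank) * B @ A; only A and B are trained.
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d_in))
y_before = lora_forward(x)     # identical to the base model's output
B += 0.5                       # stand-in for a gradient update on B
y_after = lora_forward(x)      # now differs, while W was never touched

# Trainable parameters: rank * (d_in + d_out) = 128, vs d_in * d_out = 256 for full fine-tuning.
```

Because only A and B carry gradients, the frozen weight can stay in 4-bit storage throughout training, which is what makes QLoRA feasible on CPU.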
Bug Fixing
- LLM Workflow/Neural Chat
  - Enhance SafetyChecker to resolve the missing stopword.txt issue (5ba797)
  - Enhance multilingual ASR (62d002)
  - Remove the haystack dependency (16ff4fb)
  - Fix StarCoder issues for IPEX INT8 and weight-only INT4 (e88c7b)
  - Remove the oneDNN env setting for BF16 inference (59ab03)
  - Fix the ChatGLM2 model loading issue (4f2169)
  - Fix the init issue of LangChain Chroma (fdefe27)
- Transformer Extension for Low-bit Inference and Fine-tuning
Incompatible Changes
- [Neural Chat] Optimize the structure of the NeuralChat example directories (1447e6f)
- [Transformers Extension for Low-bit Inference] Update the Baichuan/Baichuan2 API (98e5f9)
Validated Configurations
- Python 3.9, 3.10, 3.11
- CentOS 8.4 & Ubuntu 20.04 & Windows 10
- Intel® Extension for TensorFlow 2.13.0, 2.14.0
- PyTorch 2.1.0+cpu, 2.0.0+cpu
- Intel® Extension for PyTorch 2.1.0+cpu, 2.0.0+cpu