m-LoRA (short for Multi-LoRA) is an open-source LLMOps framework developed by the IDs Lab at Sichuan University. It is designed for high-throughput fine-tuning, evaluation, and inference of Large Language Models (LLMs) using techniques such as LoRA, DoRA, MixLoRA, and others. Key features of m-LoRA include:
- Concurrent fine-tuning of multiple adapters with a shared pre-trained model.
- Support for multiple PEFT algorithms and various pre-trained models.
- Mo-LoRA (Mixture of LoRAs) optimization, mainly for MixLoRA.
You can try m-LoRA with Google Colab before local installation.
This repository has been transferred to https://github.com/TUDB-Labs/MoE-PEFT. Please use MoE-PEFT instead of this software.
This is an actively developed fork of the official m-LoRA repository, focusing on PEFT algorithms and related improvements. It is maintained by the authors of m-LoRA. Currently, this fork does not support pipeline parallelism and can only use a single compute device, such as a GPU or NPU, for each m-LoRA process. Please note that the functions, interfaces, and performance of this fork differ from those of the original m-LoRA, and compatibility is not guaranteed. For production use, please prefer the original m-LoRA.
OS | Backend | Model Precision | Quantization | Flash Attention |
---|---|---|---|---|
Linux | CUDA | FP32, FP16, TF32, BF16 | 8bit and 4bit | ✓ |
Windows | CUDA | FP32, FP16, TF32, BF16 | 8bit and 4bit | - |
macOS | MPS | FP32, FP16, BF16 | ✗ | ✗ |
All | CPU | FP32, FP16, BF16 | ✗ | ✗ |
You can use the `MLORA_BACKEND_TYPE` environment variable to force m-LoRA to use a specific backend. For example, if you want m-LoRA to run only on CPU, you can set `MLORA_BACKEND_TYPE=CPU` before importing `mlora`.
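As a minimal sketch of the ordering requirement above, the variable can be set from Python itself as long as this happens before the first import of `mlora`. The `try`/`except` guard and the `MLORA_AVAILABLE` name are illustrative assumptions so the snippet also runs where m-LoRA is not installed.

```python
import os

# Set the override first: per the note above, it has to be in place
# before `mlora` is imported.
os.environ["MLORA_BACKEND_TYPE"] = "CPU"

try:
    import mlora  # noqa: F401  (only present where m-LoRA is installed)
    MLORA_AVAILABLE = True
except ImportError:
    # Hypothetical fallback so this sketch still runs without m-LoRA.
    MLORA_AVAILABLE = False

print("backend override:", os.environ["MLORA_BACKEND_TYPE"])
```

Exporting the variable in the shell before starting Python should have the same effect.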
Supported | Model | Model Size |
---|---|---|
✓ | LLaMA 1/2 | 7B/13B/70B |
✓ | LLaMA 3/3.1 | 8B/70B |
✓ | Yi 1/1.5 | 6B/9B/34B |
✓ | TinyLLaMA | 1.1B |
✓ | Qwen 1.5/2 | 0.5B ~ 72B |
✓ | Gemma | 2B/7B |
✓ | Gemma 2 | 9B/27B |
✓ | Mistral | 7B |
✓ | Phi 1.5/2 | 2.7B |
✓ | Phi 3 | 3.8B/7B/14B |
✓ | ChatGLM 1/2/3 | 6B |
✓ | GLM 4 | 6B |
Supported | PEFT Methods | Arguments* |
---|---|---|
✓ | QLoRA | See Quantize Methods |
✓ | LoRA+ | "loraplus_lr_ratio": 20.0 |
✓ | DoRA | "use_dora": true |
✓ | rsLoRA | "use_rslora": true |
✓ | MoLA | "routing_strategy": "mola", "num_experts": 8 |
✓ | LoRAMoE | "routing_strategy": "loramoe", "num_experts": 8 |
✓ | MixLoRA | "routing_strategy": "mixlora", "num_experts": 8 |
✓ | MixLoRA-Dynamic | "routing_strategy": "mixlora-dynamic", "num_experts": 8 |
✓ | MixLoRA-Switch | "routing_strategy": "mixlora-switch", "num_experts": 8 |
*: Arguments of configuration file
- m-LoRA supports specific optimized operators for these PEFT methods, which can effectively improve the computing performance during training, evaluation and inference. However, these operators may cause a certain degree of accuracy loss (less than 5%). You can disable the optimized operators by defining the `MLORA_EVALUATE_MODE` environment variable in advance.
- Auxiliary Loss is not currently supported for Mo-LoRA (Mixture of LoRAs) methods other than MixLoRA.
- You can check detailed arguments of MixLoRA in TUDB-Labs/MixLoRA.
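The arguments in the PEFT table can be collected into a configuration fragment programmatically. The sketch below copies the argument names and values from that table; the helper name `peft_fragment`, the lookup keys, and the idea of merging several methods into one fragment are assumptions for illustration, and whether a given combination is accepted by m-LoRA is not confirmed here.

```python
import json

# Argument names and values are taken from the PEFT table in this README.
# The dictionary keys used for lookup are hypothetical labels.
PEFT_ARGUMENTS = {
    "lora+": {"loraplus_lr_ratio": 20.0},
    "dora": {"use_dora": True},
    "rslora": {"use_rslora": True},
    "mola": {"routing_strategy": "mola", "num_experts": 8},
    "loramoe": {"routing_strategy": "loramoe", "num_experts": 8},
    "mixlora": {"routing_strategy": "mixlora", "num_experts": 8},
    "mixlora-dynamic": {"routing_strategy": "mixlora-dynamic", "num_experts": 8},
    "mixlora-switch": {"routing_strategy": "mixlora-switch", "num_experts": 8},
}


def peft_fragment(*methods):
    """Merge the arguments of the requested methods into one dict.

    Raises ValueError when two methods set the same key to different
    values (for example, two different routing strategies).
    """
    fragment = {}
    for name in methods:
        for key, value in PEFT_ARGUMENTS[name].items():
            if key in fragment and fragment[key] != value:
                raise ValueError(f"conflicting values for {key!r}")
            fragment[key] = value
    return fragment


# json.dumps renders Python True as JSON true, matching the table.
print(json.dumps(peft_fragment("dora", "rslora"), indent=2))
```

The resulting fragment is meant to be copied into an adapter entry of the configuration file; the surrounding file layout is best taken from the templates folder rather than from this sketch.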
Supported | Attention Methods | Name | Arguments* |
---|---|---|---|
✓ | Scaled Dot Product | "eager" | --attn_impl eager |
✓ | Flash Attention 2 | "flash_attn" | --attn_impl flash_attn |
✓ | Sliding Window Attention | - | --sliding_window |
*: Arguments of mlora.py
m-LoRA only supports scaled-dot product attention (eager) by default. Additional requirements are necessary for flash attention.
For flash attention, manual installation of the following dependencies is required:
pip3 install ninja
pip3 install flash-attn==2.5.8 --no-build-isolation
If no attention method is specified, flash attention is used if available.
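The fallback rule for the attention method can be sketched as follows. This is not m-LoRA's actual selection code: the function name `resolve_attn_impl` is hypothetical, and checking for the `flash_attn` package with `importlib` is an assumption about what "available" means.

```python
import importlib.util


def resolve_attn_impl(requested=None):
    """Pick "eager" or "flash_attn" following the rule described above."""
    # Assumption: flash attention counts as available when the
    # `flash_attn` package can be located in the current environment.
    flash_available = importlib.util.find_spec("flash_attn") is not None

    if requested is None:
        # Nothing specified: prefer flash attention, else fall back to eager.
        return "flash_attn" if flash_available else "eager"
    if requested not in ("eager", "flash_attn"):
        raise ValueError(f"unknown attention implementation: {requested!r}")
    if requested == "flash_attn" and not flash_available:
        raise RuntimeError("flash-attn is not installed")
    return requested


print(resolve_attn_impl())
```

Passing `"eager"` explicitly mirrors `--attn_impl eager` and bypasses the availability check.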
Supported | Quantize Methods | Arguments* |
---|---|---|
✓ | Full Precision (FP32) | by default |
✓ | Tensor Float 32 | --tf32 |
✓ | Half Precision (FP16) | --fp16 |
✓ | Brain Float 16 | --bf16 |
✓ | 8bit Quantize | --load_8bit |
✓ | 4bit Quantize | --load_4bit |
*: Arguments of mlora.py
m-LoRA supports several model precisions and quantization methods. By default, m-LoRA uses full precision (Float32), but users can opt for half precision (Float16) using `--fp16` or BrainFloat16 using `--bf16`. Enabling half precision reduces the model size by half, and for further reduction, quantization methods can be employed.
Quantization can be activated using `--load_4bit` for 4-bit quantization or `--load_8bit` for 8-bit quantization. However, when only quantization is enabled, m-LoRA utilizes Float32 for calculations. To achieve memory savings during training, users can combine quantization and half-precision modes.
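To make the size reduction concrete, the following back-of-the-envelope sketch estimates the memory taken by the base model weights alone at each setting. It assumes 4, 2, 1, and 0.5 bytes per parameter and ignores quantization metadata, activations, optimizer state, and the LoRA weights themselves, so real usage will be higher. The function name is hypothetical.

```python
# Approximate storage cost per parameter for each mode (assumption:
# quantization constants and other overheads are ignored).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "8bit": 1.0,
    "4bit": 0.5,
}


def weight_memory_gb(num_params, mode):
    """Rough weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[mode] / 1e9


# Example: a 7B-parameter model such as LLaMA 2 7B.
for mode in ("fp32", "fp16", "8bit", "4bit"):
    print(f"{mode:>5}: {weight_memory_gb(7e9, mode):.1f} GB")
```

Under these assumptions a 7B model needs roughly 28 GB in FP32, 14 GB in FP16 or BF16, 7 GB in 8-bit, and 3.5 GB in 4-bit for its weights.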
To enable quantization support, please manually install `bitsandbytes`:
pip3 install bitsandbytes==0.43.1
Note that regardless of these settings, LoRA weights are always calculated and stored at full precision. To maintain accuracy, m-LoRA also uses full precision for any calculation where accuracy is critical.
For users with NVIDIA Ampere or newer GPU architectures, the `--tf32` option can be utilized to enable full-precision calculation acceleration.
m-LoRA relies on HuggingFace Hub to download necessary models, datasets, etc. If you cannot access the Internet or need to deploy m-LoRA in an offline environment, please refer to the following guide.
1. Use `git-lfs` to manually download models and datasets from HuggingFace Hub.
2. Set `--data_path` to the local path to the datasets when executing `launch.py gen`.
3. Clone the evaluate code repository locally.
4. Set the environment variable `MLORA_METRIC_PATH` to the local path of the `metrics` folder of the evaluate code repository.
5. Set `--base_model` to the local path to the models when executing `launch.py run`.
Example of (4): export MLORA_METRIC_PATH=/path-to-your-git-repo/evaluate/metrics
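Before launching in an offline environment, it can help to verify that every local path from the guide exists. The helper below is a hypothetical convenience, not part of m-LoRA: it checks the three directories, exports `MLORA_METRIC_PATH`, and returns the values to pass to `--data_path` and `--base_model`.

```python
import os
from pathlib import Path


def prepare_offline_env(metrics_dir, model_dir, data_path):
    """Validate local paths and export MLORA_METRIC_PATH.

    Returns a dict mapping the launch.py flags named in the guide
    to the local paths that should be passed to them.
    """
    paths = {
        "metrics_dir": Path(metrics_dir),
        "model_dir": Path(model_dir),
        "data_path": Path(data_path),
    }
    missing = [name for name, path in paths.items() if not path.exists()]
    if missing:
        raise FileNotFoundError(f"missing local paths: {', '.join(missing)}")

    # Equivalent to the `export MLORA_METRIC_PATH=...` example.
    os.environ["MLORA_METRIC_PATH"] = str(paths["metrics_dir"])
    return {
        "--data_path": str(paths["data_path"]),
        "--base_model": str(paths["model_dir"]),
    }
```

Calling `prepare_offline_env("/path-to-your-git-repo/evaluate/metrics", ...)` with the local model and dataset directories raises `FileNotFoundError` early instead of failing partway through a run.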
- Quantization with Qwen2 has no effect (same as with transformers).
- Applying quantization with DoRA will result in higher memory and computation cost (same as with PEFT).
- Sliding window attention with generate cache may produce abnormal output.
- Lack of Long RoPE support.
Please refer to m-LoRA Install Guide.
You can conveniently utilize m-LoRA via `launch.py`. The following example demonstrates a streamlined approach to training a dummy model with m-LoRA.
# Generating configuration
python launch.py gen --template lora --tasks ./tests/dummy_data.json
# Running the training task
python launch.py run --base_model TinyLlama/TinyLlama_v1.1
# Try with gradio web ui
python inference.py \
--base_model TinyLlama/TinyLlama_v1.1 \
--template alpaca \
--lora_weights ./casual_0
For further detailed usage information, please refer to the `help` command:
python launch.py help
The `mlora.py` code is a starting point for finetuning on various datasets.
Basic command for finetuning a baseline model on the Alpaca Cleaned dataset:
# Generating configuration
python launch.py gen \
--template lora \
--tasks yahma/alpaca-cleaned
python mlora.py \
--base_model meta-llama/Llama-2-7b-hf \
--config mlora.json \
--bf16
You can check the template finetune configurations in the templates folder.
For further detailed usage information, please use the `--help` option:
python mlora.py --help
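When the precision, quantization, and attention options are chosen programmatically, the `mlora.py` command line can be assembled from Python. The sketch below only uses flags that appear in this README; the function name `build_mlora_command` and its keyword names are hypothetical, and it does not check which flag combinations `mlora.py` actually accepts.

```python
import shlex

# Flags as listed in the tables of this README.
PRECISION_FLAGS = {"tf32": "--tf32", "fp16": "--fp16", "bf16": "--bf16"}
QUANT_FLAGS = {"8bit": "--load_8bit", "4bit": "--load_4bit"}


def build_mlora_command(base_model, config, precision=None,
                        quantization=None, attn_impl=None):
    """Return the argv list for a mlora.py finetuning run."""
    argv = ["python", "mlora.py", "--base_model", base_model, "--config", config]
    if precision is not None:
        argv.append(PRECISION_FLAGS[precision])
    if quantization is not None:
        argv.append(QUANT_FLAGS[quantization])
    if attn_impl is not None:
        argv += ["--attn_impl", attn_impl]
    return argv


# Reproduces the Alpaca Cleaned finetuning command shown earlier.
command = build_mlora_command(
    "meta-llama/Llama-2-7b-hf", "mlora.json", precision="bf16"
)
print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with local model paths.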
Firstly, ensure that you have installed Docker Engine and NVIDIA Container Toolkit correctly.
After that, you can launch the container using the following typical command:
docker run --gpus all -it --rm mikecovlee/mlora
You can check all available tags from: mikecovlee/mlora/tags
Please note that this container only provides a suitable environment for running m-LoRA. The m-LoRA source code is not included.
Copyright © 2023-2024 IDs Lab, Sichuan University
This project is licensed under the Apache 2.0 License.