Q-Sparse-LLM is an implementation of a sparse transformer architecture for efficient language modeling. It adds activation sparsity and quantization to the standard transformer architecture to reduce computational cost and memory footprint while maintaining model performance.
- Top-K Sparsity: Implements a sparse activation mechanism that retains only the top K% of activations (by magnitude) in each layer.
- Quantized Top-K Sparsity: Extends the sparsity mechanism with 8-bit quantization for further efficiency.
- ReLU²GLU Activation: Uses a squared ReLU Gated Linear Unit for improved sparsity in feed-forward layers.
- Compatibility with 1-bit LLMs: Designed to be compatible with extremely quantized models like BitNet b1.58.
The Q-Sparse architecture is based on the Transformer architecture with modifications to enable sparsity in the activations:
- Top-K Sparsity:
  - Applies a mask to keep only the top K% of activations (by magnitude).
  - Rescales the output by its L2 norm.
- Quantized Top-K Sparsity:
  - Quantizes the input to an 8-bit representation before applying Top-K sparsity.
- Squared ReLU (ReLU²GLU):
  - Implements ReLU²GLU for feed-forward layers:
    ReLU²GLU(X) = (X · W_up^T) ⊙ ReLU²(X · W_gate^T)
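The Top-K and quantized Top-K steps in the list above can be sketched in plain NumPy. This is a minimal illustration, not the repository's code: the function names (`topk_sparsify`, `quantize_absmax_int8`, `quantized_topk_sparsify`) are hypothetical, "rescales by its L2 norm" is read here as dividing each row by its norm, and the per-row absmax scaling used for the 8-bit step is an assumption.

```python
import numpy as np


def topk_sparsify(x: np.ndarray, k_ratio: float, eps: float = 1e-6) -> np.ndarray:
    """Keep the top k_ratio fraction of each row by magnitude, then L2-normalize."""
    d = x.shape[-1]
    k = max(1, int(round(k_ratio * d)))
    # Indices of the k largest-magnitude entries in each row.
    idx = np.argpartition(np.abs(x), d - k, axis=-1)[..., d - k:]
    mask = np.zeros(x.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    y = np.where(mask, x, 0.0)
    # Assumption: "rescale by its L2 norm" means dividing by the row norm.
    norm = np.linalg.norm(y, axis=-1, keepdims=True)
    return y / (norm + eps)


def quantize_absmax_int8(x: np.ndarray, eps: float = 1e-6):
    """Assumed scheme: per-row absmax scaling into the int8 range."""
    gamma = np.max(np.abs(x), axis=-1, keepdims=True)
    scale = 127.0 / (gamma + eps)
    q = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return q, scale


def quantized_topk_sparsify(x: np.ndarray, k_ratio: float) -> np.ndarray:
    """Quantize to 8 bits first, then apply Top-K sparsity to the dequantized values."""
    q, scale = quantize_absmax_int8(x)
    x_deq = q.astype(np.float32) / scale
    return topk_sparsify(x_deq, k_ratio)


if __name__ == "__main__":
    x = np.array([[0.1, -2.0, 0.5, 3.0]])
    print(topk_sparsify(x, k_ratio=0.5))            # two nonzero entries remain
    print(quantized_topk_sparsify(x, k_ratio=0.5))  # same positions survive
```

With `k_ratio=0.5` and four inputs, only the two largest-magnitude entries survive in each row, and the surviving row has approximately unit L2 norm.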
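The ReLU²GLU formula can be written out directly in NumPy. The sketch below is only an illustration of the formula as stated: the function name `relu2glu` and the weight shapes (`w_up` and `w_gate` of shape `(d_ff, d_model)`) are assumptions, and any projection that follows in the actual feed-forward layer is left out.

```python
import numpy as np


def relu2glu(x: np.ndarray, w_up: np.ndarray, w_gate: np.ndarray) -> np.ndarray:
    """ReLU²GLU(X) = (X · W_up^T) ⊙ ReLU²(X · W_gate^T)."""
    up = x @ w_up.T                             # (n, d_ff)
    gate = np.maximum(x @ w_gate.T, 0.0) ** 2   # squared ReLU, (n, d_ff)
    return up * gate                            # element-wise product


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 8))        # 2 tokens, d_model = 8 (toy sizes)
    w_up = rng.standard_normal((16, 8))    # d_ff = 16
    w_gate = rng.standard_normal((16, 8))
    out = relu2glu(x, w_up, w_gate)
    # Entries whose gate pre-activation is non-positive are exactly zero.
    print(out.shape, float(np.mean(out == 0.0)))
```

Every position where the gate pre-activation is zero or negative produces an exact zero in the output, which is where the activation's sparsity comes from.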
git clone https://github.com/nanowell/Q-Sparse-LLM.git
cd Q-Sparse-LLM
Here's a basic example of how to use the Q-Sparse-LLM model:
from q_sparse import QSparseModel
# Initialize the model
model = QSparseModel(
    vocab_size=30000,
    d_model=768,
    nhead=12,
    num_layers=12,
    dim_feedforward=3072,
    k_ratio=0.5,
    quantized=True,
)
# Use the model for inference or training
# (Add specific usage instructions based on your implementation)
Contributions to Q-Sparse-LLM are welcome!
This project is licensed under the MIT License.
If you use Q-Sparse-LLM in your research, please cite:
@software{Q-Sparse-LLM,
author = {nanowell},
title = {Q-Sparse-LLM: Quantized Sparse Language Model},
year = {2024},
url = {https://github.com/nanowell/Q-Sparse-LLM}
}
This project builds upon the Q-Sparse paper:
@misc{wang2024qsparselargelanguagemodels,
title={Q-Sparse: All Large Language Models can be Fully Sparsely-Activated},
author={Hongyu Wang and Shuming Ma and Ruiping Wang and Furu Wei},
year={2024},
eprint={2407.10969},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.10969},
}
For questions and feedback, please open an issue in the GitHub repository or contact zarugeos@gmail.com.