Awesome Efficient LLM

Survey

  • Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey, arXiv, 2403.14608, arxiv, pdf, cication: 18

    Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, Sai Qian Zhang

  • A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, arXiv, 2405.13019, arxiv, pdf, cication: -1

    Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha

  • Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models, arXiv, 2404.14897, arxiv, pdf, cication: -1

    Chen Zhang, Zhuorui Liu, Dawei Song

  • A Survey on Efficient Inference for Large Language Models, arXiv, 2404.14294, arxiv, pdf, cication: -1

    Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li

  • A Survey on Knowledge Distillation of Large Language Models, arXiv, 2402.13116, arxiv, pdf, cication: -1

    Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Tianyi Zhou · (Awesome-Knowledge-Distillation-of-LLMs - Tebmer) Star · (jiqizhixin)

  • Efficient Exploration for LLMs, arXiv, 2402.00396, arxiv, pdf, cication: -1

    Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy

  • A Comprehensive Survey of Compression Algorithms for Language Models, arXiv, 2401.15347, arxiv, pdf, cication: -1

    Seungcheol Park, Jaehyeon Choi, Sojin Lee, U Kang

  • A Survey of Resource-efficient LLM and Multimodal Foundation Models, arXiv, 2401.08092, arxiv, pdf, cication: -1

    Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang

    · (efficient_foundation_model_survey - ubiquitouslearning) Star

  • Understanding LLMs: A Comprehensive Overview from Training to Inference, arXiv, 2401.02038, arxiv, pdf, cication: -1

    Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong

  • Efficient Large Language Models: A Survey, arXiv, 2312.03863, arxiv, pdf, cication: -1

    Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury · (Efficient-LLMs-Survey - AIoT-MLSys-Lab) Star

  • Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, arXiv, 2312.15234, arxiv, pdf, cication: -1

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia · (mp.weixin.qq)

  • The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, arXiv, 2312.00678, arxiv, pdf, cication: -1

    Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang

  • Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning, arXiv, 2303.15647, arxiv, pdf, cication: -1

    Vladislav Lialin, Vijeta Deshpande, Anna Rumshisky

Efficient finetuning

  • Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models, arXiv, 2407.01906, arxiv, pdf, cication: -1

    Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, Y. Wu

  • OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models, arXiv, 2406.01775, arxiv, pdf, cication: -1

    Kerim Büyükakyüz

  • VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections, arXiv, 2405.17991, arxiv, pdf, cication: -1

    Roy Miles, Pradyumna Reddy, Ismail Elezi, Jiankang Deng

  • Trans-LoRA: towards data-free Transferable Parameter Efficient Finetuning, arXiv, 2405.17258, arxiv, pdf, cication: -1

    Runqian Wang, Soumya Ghosh, David Cox, Diego Antognini, Aude Oliva, Rogerio Feris, Leonid Karlinsky

  • Towards Modular LLMs by Building and Reusing a Library of LoRAs, arXiv, 2405.11157, arxiv, pdf, cication: -1

    Oleksiy Ostapenko, Zhan Su, Edoardo Maria Ponti, Laurent Charlin, Nicolas Le Roux, Matheus Pereira, Lucas Caccia, Alessandro Sordoni

  • MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning, arXiv, 2405.12130, arxiv, pdf, cication: -1

    Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang

  • LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report, arXiv, 2405.00732, arxiv, pdf, cication: -1

    Justin Zhao, Timothy Wang, Wael Abid, Geoffrey Angus, Arnav Garg, Jeffery Kinnison, Alex Sherstinsky, Piero Molino, Travis Addair, Devvret Rishi

  • PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models, arXiv, 2404.02948, arxiv, pdf, cication: -1

    Fanxu Meng, Zhaohui Wang, Muhan Zhang · (PiSSA - GraphPKU) Star

  • ReFT: Representation Finetuning for Language Models, arXiv, 2404.03592, arxiv, pdf, cication: -1

    Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts · (pyreft - stanfordnlp) Star

    • by manipulating a small fraction of model representations it is possible to effectively steer model behavior to achieve better downstream performance at inference time; also proposes LoReFT as a drop-in replacement for PEFTs that is 10-50x more parameter efficient.
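
A minimal sketch of the representation-intervention idea described above (hedged: class and symbol names are illustrative, not the pyreft API). It learns a rank-r edit applied to a frozen model's hidden states at chosen positions:

```python
import torch
import torch.nn as nn

class LowRankIntervention(nn.Module):
    """Sketch of a LoReFT-style edit: h' = h + R^T (W h + b - R h)."""
    def __init__(self, hidden_size: int, rank: int = 4):
        super().__init__()
        self.R = nn.Parameter(torch.empty(rank, hidden_size))  # low-rank subspace
        nn.init.orthogonal_(self.R)
        self.W = nn.Linear(hidden_size, rank)  # learned projected source (includes bias b)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Edit h only inside the subspace spanned by R; the base model stays frozen.
        return h + (self.W(h) - h @ self.R.T) @ self.R
```
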
  • Model Stock: All we need is just a few fine-tuned models, arXiv, 2403.19522, arxiv, pdf, cication: -1

    Dong-Hwan Jang, Sangdoo Yun, Dongyoon Han

    • uses just two models for layer-wise weight averaging.
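
A hedged sketch of the layer-wise averaging idea above; Model Stock additionally interpolates each layer toward the pretrained weights with a ratio derived from the angle between the two fine-tuned models, which this simplification omits:

```python
import torch

def layerwise_average(state_a: dict, state_b: dict) -> dict:
    """Average two fine-tuned checkpoints parameter-by-parameter (simplified sketch)."""
    return {name: (state_a[name] + state_b[name]) / 2 for name in state_a}

# merged = layerwise_average(model_a.state_dict(), model_b.state_dict())
# base_model.load_state_dict(merged)
```
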
  • DiJiang: Efficient Large Language Models through Compact Kernelization, arXiv, 2403.19928, arxiv, pdf, cication: -1

    Hanting Chen, Zhicheng Liu, Xutao Wang, Yuchuan Tian, Yunhe Wang · (DiJiang - YuchuanTian) Star

  • LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning, arXiv, 2403.17919, arxiv, pdf, cication: -1

    Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, Tong Zhang

    · (jiqizhixin) · (LMFlow - OptimalScale) Star

    • randomly freezing middle layers during training based on importance sampling, which is efficient and can outperform both LoRA and full LLM finetuning by a noticeable margin in terms of model performance.
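
A minimal sketch of the layer-freezing step described above, assuming a list of transformer blocks; the function name and sampling scheme are illustrative, not the LMFlow API:

```python
import random
import torch.nn as nn

def lisa_resample(layers: nn.ModuleList, n_active: int = 2) -> None:
    """Freeze all intermediate layers, then unfreeze a random subset (LISA-style).

    Embeddings and the LM head (not shown) would stay trainable throughout;
    call this every K optimizer steps to rotate which layers receive gradients.
    """
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad = False
    for idx in random.sample(range(len(layers)), n_active):
        for p in layers[idx].parameters():
            p.requires_grad = True
```
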
  • Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models, arXiv, 2403.03432, arxiv, pdf, cication: -1

    Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Yu Han, Hao Wang

  • GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection, arXiv, 2403.03507, arxiv, pdf, cication: -1

    Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian · (galore - jiaweizzhao) Star · (huggingface)

    • Gradient Low-Rank Projection (GaLore) is a new training strategy that significantly reduces memory usage by up to 65.5% for optimizer states during the training of LLMs, without sacrificing performance.

    · (mp.weixin.qq)
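
A rough sketch of the gradient low-rank projection idea, assuming a 2-D weight gradient; the actual implementation refreshes the projection periodically and integrates with the optimizer (see the galore repo linked above):

```python
import torch

def galore_project(grad: torch.Tensor, rank: int = 128):
    """Project an (m, n) gradient onto its top-r left singular vectors.

    Optimizer states (e.g. Adam moments) are kept on the (r, n) projected
    gradient instead of the full (m, n) one, which is where the memory saving comes from.
    """
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    r = min(rank, U.shape[1])
    P = U[:, :r]                 # (m, r) projection matrix, refreshed every few hundred steps
    return P, P.T @ grad         # projected gradient of shape (r, n)

def galore_project_back(P: torch.Tensor, low_rank_update: torch.Tensor) -> torch.Tensor:
    # Map the optimizer's low-rank update back to the full weight shape (m, n).
    return P @ low_rank_update
```
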

  • LoRA+: Efficient Low Rank Adaptation of Large Models, arXiv, 2402.12354, arxiv, pdf, cication: -1

    Soufiane Hayou, Nikhil Ghosh, Bin Yu

  • DoRA: Weight-Decomposed Low-Rank Adaptation, arXiv, 2402.09353, arxiv, pdf, cication: -1

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Min-Hung Chen · (dora - catid) Star

    · (magazine.sebastianraschka)

  • Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models, arXiv, 2401.00788, arxiv, pdf, cication: -1

    Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu, Niklas Muennighoff

  • Parameter Efficient Tuning Allows Scalable Personalization of LLMs for Text Entry: A Case Study on Abbreviation Expansion, arXiv, 2312.14327, arxiv, pdf, cication: -1

    Katrin Tomanek, Shanqing Cai, Subhashini Venugopalan

  • Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)

    · (jiqizhixin)

  • MultiLoRA: Democratizing LoRA for Better Multi-Task Learning, arXiv, 2311.11501, arxiv, pdf, cication: -1

    Yiming Wang, Yu Lin, Xiaodong Zeng, Guannan Zhang

  • Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying, arXiv, 2311.09578, arxiv, pdf, cication: -1

    Adithya Renduchintala, Tugrul Konuk, Oleksii Kuchaiev

  • SiRA: Sparse Mixture of Low Rank Adaptation, arXiv, 2311.09179, arxiv, pdf, cication: -1

    Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, Han Lu, Canoee Liu, Liangchen Luo, Jindong Chen

  • Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization, arXiv, 2311.06243, arxiv, pdf, cication: -1

    Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng · (boft.wyliu)

  • Punica: Multi-Tenant LoRA Serving, arXiv, 2310.18547, arxiv, pdf, cication: -1

    Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy · (punica - punica-ai) Star

  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters, arXiv, 2311.03285, arxiv, pdf, cication: -1

    Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer · (s-lora - s-lora) Star

  • LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery, arXiv, 2310.18356, arxiv, pdf, cication: -1

    Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, Luming Liang

  • VeRA: Vector-based Random Matrix Adaptation, arXiv, 2310.11454, arxiv, pdf, cication: -1

    Dawid Jan Kopiczko, Tijmen Blankevoort, Yuki Markus Asano · (mp.weixin.qq)

  • LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models, arXiv, 2310.08659, arxiv, pdf, cication: 1

    Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao

    · (peft - huggingface) Star

  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models, arXiv, 2309.14717, arxiv, pdf, cication: -1

    Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian · (qa-lora - yuhuixu1993) Star

  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, arXiv, 2309.12307, arxiv, pdf, cication: 5

    Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia · (LongLoRA - dvlab-research) Star

  • LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition, arXiv, 2307.13269, arxiv, pdf, cication: 6

    Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, Min Lin

  • Stack More Layers Differently: High-Rank Training Through Low-Rank Updates, arXiv, 2307.05695, arxiv, pdf, cication: 2

    Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky · (peft_pretraining - guitaricet) Star

  • LLaMA-Efficient-Tuning - hiyouga Star

    Fine-tuning LLaMA with PEFT (PT+SFT+RLHF with QLoRA)

  • InRank: Incremental Low-Rank Learning, arXiv, 2306.11250, arxiv, pdf, cication: 2

    Jiawei Zhao, Yifei Zhang, Beidi Chen, Florian Schäfer, Anima Anandkumar · (inrank - jiaweizzhao) Star

  • One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning, arXiv, 2306.07967, arxiv, pdf, cication: -1

    Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, Zhiqiang Shen · (ViT-Slim - Arnav0400) Star

  • Full Parameter Fine-tuning for Large Language Models with Limited Resources, arXiv, 2306.09782, arxiv, pdf, cication: -1

    Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, Xipeng Qiu · (LOMO - OpenLMLab) Star

  • PockEngine: Sparse and Efficient Fine-tuning in a Pocket, arXiv, 2310.17752, arxiv, pdf, cication: -1

    Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han

  • AI and Memory Wall | by Amir Gholami | riselab | Medium

Adapter

  • Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning, arXiv, 2311.11077, arxiv, pdf, cication: -1

    Clifton Poth, Hannah Sterz, Indraneil Paul, Sukannya Purkayastha, Leon Engländer, Timo Imhof, Ivan Vulić, Sebastian Ruder, Iryna Gurevych, Jonas Pfeiffer · (adapterhub)

  • Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, arXiv, 2305.15023, arxiv, pdf, cication: 18

    Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji · (LaVIN - luogen1996) Star · (mp.weixin.qq)

  • LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, arXiv, 2304.15010, arxiv, pdf, cication: 82

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue · (mp.weixin.qq)

Other

Quantization

Survey

  • A Performance Evaluation of a Quantized Large Language Model on Various Smartphones, arXiv, 2312.12472, arxiv, pdf, cication: -1

    Tolga Çöplü, Marc Loedi, Arto Bendiken, Mykhailo Makohin, Joshua J. Bouw, Stephen Cobb

  • A Survey on Model Compression for Large Language Models, arXiv, 2308.07633, arxiv, pdf, cication: -1

    Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang · (jiqizhixin)

Papers

  • 4-bit Shampoo for Memory-Efficient Network Training, arXiv, 2405.18144, arxiv, pdf, cication: -1

    Sike Wang, Jia Li, Pan Zhou, Hua Huang

  • QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, arXiv, 2405.04532, arxiv, pdf, cication: -1

    Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han · (qserve - mit-han-lab) Star

  • QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs, arXiv, 2404.00456, arxiv, pdf, cication: -1

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman · (QuaRot - spcl) Star

  • EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs, arXiv, 2403.02775, arxiv, pdf, cication: -1

    Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang

  • The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits, arXiv, 2402.17764, arxiv, pdf, cication: -1

    Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei

    · (unilm - microsoft) Star · (unilm - microsoft) Star

  • GPTVQ: The Blessing of Dimensionality for LLM Quantization, arXiv, 2402.15319, arxiv, pdf, cication: -1

    Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough

  • OneBit: Towards Extremely Low-bit Large Language Models, arXiv, 2402.11295, arxiv, pdf, cication: -1

    Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che

  • Extreme Compression of Large Language Models via Additive Quantization, arXiv, 2401.06118, arxiv, pdf, cication: -1

    Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh · (aqlm - vahe1994) Star

  • KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache, arXiv, 2402.02750, arxiv, pdf, cication: -1

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu · (kivi - jy-yuan) Star

  • Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers, arXiv, 2402.08958, arxiv, pdf, cication: -1

    Junhan Kim, Kyungphil Park, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon

  • TP-Aware Dequantization, arXiv, 2402.04925, arxiv, pdf, cication: -1

    Adnan Hoque, Mudhakar Srivatsa, Chih-Chieh Yang, Raghu Ganti

  • BiLLM: Pushing the Limit of Post-Training Quantization for LLMs, arXiv, 2402.04291, arxiv, pdf, cication: -1

    Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi

  • FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design, arXiv, 2401.14112, arxiv, pdf, cication: -1

    Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou

  • ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks, arXiv, 2312.08583, arxiv, pdf, cication: -1

    Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Yuxiong He, Olatunji Ruwase, Leon Song

  • LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning, arXiv, 2311.12023, arxiv, pdf, cication: -1

    Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim · (lq-lora - hanguo97) Star

  • QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models, arXiv, 2310.09259, arxiv, pdf, cication: -1

    Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh · (quik - ist-daslab) Star

  • FP8-LM: Training FP8 Large Language Models, arXiv, 2310.18313, arxiv, pdf, cication: -1

    Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu

  • LLM-FP4: 4-Bit Floating-Point Quantized Transformers, arXiv, 2310.16836, arxiv, pdf, cication: -1

    Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng

  • QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models, arXiv, 2310.16795, arxiv, pdf, cication: -1

    Elias Frantar, Dan Alistarh · (mp.weixin.qq)

  • BitNet: Scaling 1-bit Transformers for Large Language Models, arXiv, 2310.11453, arxiv, pdf, cication: -1

    Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei

  • Atom: Low-bit Quantization for Efficient and Accurate LLM Serving, arXiv, 2310.19102, arxiv, pdf, cication: -1

    Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci · (atom - efeslab) Star

  • TEQ: Trainable Equivalent Transformation for Quantization of LLMs, arXiv, 2310.10944, arxiv, pdf, cication: 1

    Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen

  • Efficient Post-training Quantization with FP8 Formats, arXiv, 2309.14592, arxiv, pdf, cication: -1

    Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang · (neural-compressor - intel) Star

  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models, arXiv, 2309.14717, arxiv, pdf, cication: -1

    Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian

  • Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs, arXiv, 2309.05516, arxiv, pdf, cication: -1

    Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv

  • Memory Efficient Optimizers with 4-bit States, arXiv, 2309.01507, arxiv, pdf, cication: 1

    Bingrui Li, Jianfei Chen, Jun Zhu · (jiqizhixin)

  • OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models, arXiv, 2308.13137, arxiv, pdf, cication: 2

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo · (OmniQuant - OpenGVLab) Star

  • FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search, arXiv, 2308.03290, arxiv, pdf, cication: -1

    Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi

  • QuIP: 2-Bit Quantization of Large Language Models With Guarantees, arXiv, 2307.13304, arxiv, pdf, cication: -1

    Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa · (quip - jerry-chee) Star

  • Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing, arXiv, 2306.12929, arxiv, pdf, cication: -1

    Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort

  • Training Transformers with 4-bit Integers, arXiv, 2306.11987, arxiv, pdf, cication: -1

    Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu · (jiqizhixin)

  • SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression, arXiv, 2306.03078, arxiv, pdf, cication: -1

    Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh · (jiqizhixin)

  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, arXiv, 2306.00978, arxiv, pdf, cication: -1

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Chuang Gan, Song Han

  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, arXiv, 2210.17323, arxiv, pdf, cication: -1

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh · (gptq - IST-DASLab) Star

  • [2208.07339] LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    · (bitsandbytes - timdettmers) Star

Projects

  • lmstudio-community (LM Studio Community)

  • BitMat - astramind-ai Star

    An efficient implementation of the method proposed in "The Era of 1-bit LLMs"

  • 1bitLLM (1bitLLM)

  • QLLM - wejoncy Star

    A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime.

  • hqq - mobiusml Star

    Official implementation of Half-Quadratic Quantization (HQQ) · (mobiusml.github)

  • exllamav2 - turboderp Star

    A fast inference library for running LLMs locally on modern consumer-class GPUs · (mp.weixin.qq)

  • PB-LLM - hahnyuan Star

    PB-LLM: Partially Binarized Large Language Models

  • AttentionIsOFFByOne - kyegomez Star

    Implementation of "Attention Is Off By One" by Evan Miller · (evanmiller) · (jiqizhixin)

  • llama.cpp - ggerganov Star

    Port of Facebook's LLaMA model in C/C++ · (finbarr)

  • llama2-webui - liltom-eth Star

    Run Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Supporting Llama-2-7B/13B/70B with 8-bit, 4-bit. Supporting GPU inference (6 GB VRAM) and CPU inference.

  • neural-compressor - intel Star

    Provide unified APIs for SOTA model compression techniques, such as low precision (INT8/INT4/FP4/NF4) quantization, sparsity, pruning, and knowledge distillation on mainstream AI frameworks such as TensorFlow, PyTorch, and ONNX Runtime. · (neural-compressor - intel) Star · (mp.weixin.qq)

  • exllama - turboderp Star

    A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

  • squeezellm - squeezeailab Star

    SqueezeLLM: Dense-and-Sparse Quantization

Other

Distillation

  • PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs, arXiv, 2406.02886, arxiv, pdf, cication: -1

    Rongzhi Zhang, Jiaming Shen, Tianqi Liu, Haorui Wang, Zhen Qin, Feng Han, Jialu Liu, Simon Baumgartner, Michael Bendersky, Chao Zhang

  • Divide-or-Conquer? Which Part Should You Distill Your LLM?, arXiv, 2402.15000, arxiv, pdf, cication: -1

    Zhuofeng Wu, He Bai, Aonan Zhang, Jiatao Gu, VG Vinod Vydiswaran, Navdeep Jaitly, Yizhe Zhang

  • DistiLLM: Towards Streamlined Distillation for Large Language Models, arXiv, 2402.03898, arxiv, pdf, cication: -1

    Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun · (distillm - jongwooko) Star

  • Scavenging Hyena: Distilling Transformers into Long Convolution Models, arXiv, 2401.17574, arxiv, pdf, cication: -1

    Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Mohammad Sami Nur Islam, Wassim Jabbour, Laurence Liang

  • Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning - ACL Anthology

    · (twitter)

  • Initializing Models with Larger Ones, arXiv, 2311.18823, arxiv, pdf, cication: -1

    Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu · (weight-selection - oscarxzq) Star

  • Tailoring Self-Rationalizers with Multi-Reward Distillation, arXiv, 2311.02805, arxiv, pdf, cication: -1

    Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren

  • Co-training and Co-distillation for Quality Improvement and Compression of Language Models, arXiv, 2311.02849, arxiv, pdf, cication: -1

    Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Ju Hwang, Alexander Min

  • TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise, arXiv, 2310.19019, arxiv, pdf, cication: -1

    Nan He, Hanyu Lai, Chenyang Zhao, Zirui Cheng, Junting Pan, Ruoyu Qin, Ruofan Lu, Rui Lu, Yunchen Zhang, Gangming Zhao

  • Farzi Data: Autoregressive Data Distillation, arXiv, 2310.09983, arxiv, pdf, cication: -1

    Noveen Sachdeva, Zexue He, Wang-Cheng Kang, Jianmo Ni, Derek Zhiyuan Cheng, Julian McAuley

  • Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes, arXiv, 2305.02301, arxiv, pdf, cication: 48

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister

  • Composable Function-preserving Expansions for Transformer Architectures, arXiv, 2308.06103, arxiv, pdf, cication: 1

    Andrea Gesmundo, Kaitlin Maile

  • UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition, arXiv, 2308.03279, arxiv, pdf, cication: 2

    Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, Hoifung Poon

  • Generalized Knowledge Distillation for Auto-regressive Language Models, arXiv, 2306.13649, arxiv, pdf, cication: -1

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem

  • Knowledge Distillation of Large Language Models, arXiv, 2306.08543, arxiv, pdf, cication: -1

    Yuxian Gu, Li Dong, Furu Wei, Minlie Huang

Pruning

  • ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization, arXiv, 2406.05981, arxiv, pdf, cication: -1

    Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Lin · (ShiftAddLLM - GATECH-EIC) Star

  • Scalable MatMul-free Language Modeling, arXiv, 2406.02528, arxiv, pdf, cication: -1

    Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian · (matmulfreellm - ridgerchu) Star

  • Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models, arXiv, 2405.20541, arxiv, pdf, cication: -1

    Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul

  • PruneGPT - nyunAI Star

    · (huggingface)

  • The Unreasonable Ineffectiveness of the Deeper Layers, arXiv, 2403.17887, arxiv, pdf, cication: -1

    Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts

    • selectively pruning up to half the layers of pretrained LLMs, followed by strategic finetuning with quantization and QLoRA, minimally impacts performance on question-answering tasks.
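
A hedged sketch of the layer-dropping step described above, assuming a Hugging Face LLaMA-style model that exposes `model.model.layers`; the subsequent QLoRA "healing" finetune is not shown:

```python
import torch.nn as nn

def drop_layer_block(model, start: int, end: int):
    """Remove the contiguous decoder layers [start, end) from a LLaMA-style model."""
    kept = [layer for i, layer in enumerate(model.model.layers) if not (start <= i < end)]
    model.model.layers = nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)
    return model
```
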
  • ShortGPT: Layers in Large Language Models are More Redundant Than You Expect, arXiv, 2403.03853, arxiv, pdf, cication: -1

    Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, Weipeng Chen

    • This study introduces the Block Influence (BI) metric to assess each layer's importance in LLMs and proposes ShortGPT, a pruning approach that removes redundant layers based on BI scores.
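
A minimal sketch of the Block Influence score: one minus the mean cosine similarity between a layer's input and output hidden states on a calibration set (symbol names are illustrative):

```python
import torch
import torch.nn.functional as F

def block_influence(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """BI for one layer: layers that barely change their input score low and are pruned first."""
    cos = F.cosine_similarity(hidden_in, hidden_out, dim=-1)  # per-token similarity
    return (1.0 - cos).mean().item()
```
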
  • Shortened LLaMA: A Simple Depth Pruning for Large Language Models, arXiv, 2402.02834, arxiv, pdf, cication: -1

    Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song

  • SliceGPT: Compress Large Language Models by Deleting Rows and Columns, arXiv, 2401.15024, arxiv, pdf, cication: -1

    Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman

  • Fast Llama 2 on CPUs With Sparse Fine-Tuning and DeepSparse - Neural Magic

  • The LLM Surgeon, arXiv, 2312.17244, arxiv, pdf, cication: -1

    Tycho F. A. van der Ouderaa, Markus Nagel, Mart van Baalen, Yuki M. Asano, Tijmen Blankevoort

  • Mini-GPTs: Efficient Large Language Models through Contextual Pruning, arXiv, 2312.12682, arxiv, pdf, cication: -1

    Tim Valicenti, Justice Vidal, Ritik Patnaik

  • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, arXiv, 2310.06694, arxiv, pdf, cication: 2

    Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen · (qbitai) · (xiamengzhou.github) · (llm-shearing - princeton-nlp) Star

  • wanda - locuslab Star

    A simple and effective LLM pruning approach.

  • ResiDual: Transformer with Dual Residual Connections, arXiv, 2304.14802, arxiv, pdf, cication: 7

    Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany Hassan Awadalla, Arul Menezes, Tao Qin, Rui Yan · (ResiDual - microsoft) Star

Efficient Inference

  • MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, arXiv, 2407.02490, arxiv, pdf, cication: -1

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin

    · (MInference - microsoft) Star · (aka)

  • Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters, arXiv, 2406.16758, arxiv, pdf, cication: -1

    Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun · (Multilingual-SpecBench - Kthyeon) Star

  • EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees, arXiv, 2406.16858, arxiv, pdf, cication: -1

    Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

  • Optimizing AI Inference at Character.AI

  • A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression, arXiv, 2406.11430, arxiv, pdf, cication: -1

    Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini

  • PowerInfer-2: Fast Large Language Model Inference on a Smartphone, arXiv, 2406.06282, arxiv, pdf, cication: -1

    Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen

  • PowerInfer-2: Fast Large Language Model Inference on a Smartphone | PowerInfer

  • Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference, proceedings of the chi conference on human factors in computing systems, 2024, arxiv, pdf, cication: 1

    Fred Hohman, Chaoqun Wang, Jinmook Lee, Jochen Görtler, Dominik Moritz, Jeffrey P Bigham, Zhile Ren, Cecile Foret, Qi Shan, Xiaoyi Zhang · (machinelearning.apple)

  • Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters, arXiv, 2406.05955, arxiv, pdf, cication: -1

    Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, Haibo Chen · (huggingface)

  • Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, arXiv, 2405.19888, arxiv, pdf, cication: 1

    Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu

  • Nearest Neighbor Speculative Decoding for LLM Generation and Attribution, arXiv, 2405.19325, arxiv, pdf, cication: -1

    Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin

  • Hamel’s Blog - Optimizing latency

  • Distributed Speculative Inference of Large Language Models, arXiv, 2405.14105, arxiv, pdf, cication: -1

    Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel

  • Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, arXiv, 2405.12981, arxiv, pdf, cication: -1

    William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly

  • PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference, arXiv, 2405.12532, arxiv, pdf, cication: -1

    Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao

  • Layer-Condensed KV Cache for Efficient Inference of Large Language Models, arXiv, 2405.10637, arxiv, pdf, cication: -1

    Haoyi Wu, Kewei Tu · (LCKV - whyNLP) Star

  • vidur - microsoft Star

    A large-scale simulation framework for LLM inference

  • Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers, arXiv, 2405.05219, arxiv, pdf, cication: -1

    Jiuxiang Gu, Yingyu Liang, Heshan Liu, Zhenmei Shi, Zhao Song, Junze Yin

  • ThunderKittens - HazyResearch Star

    Tile primitives for speedy kernels · (hazyresearch.stanford)

  • vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, arXiv, 2405.04437, arxiv, pdf, cication: -1

    Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar

  • Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, arXiv, 2404.14618, arxiv, pdf, cication: -1

    Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah

  • Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge, arXiv, 2405.00263, arxiv, pdf, cication: -1

    Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui

  • Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, arXiv, 2404.18911, arxiv, pdf, cication: -1

    Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang · (Kangaroo - Equationliu) Star

  • Better & Faster Large Language Models via Multi-token Prediction, arXiv, 2404.19737, arxiv, pdf, cication: -1

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve · (qbitai)

  • Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, arXiv, 2404.16710, arxiv, pdf, cication: -1

    Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman

  • BASS: Batched Attention-optimized Speculative Sampling, arXiv, 2404.15778, arxiv, pdf, cication: -1

    Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras

  • XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference, arXiv, 2404.15420, arxiv, pdf, cication: -1

    João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian

  • TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, arXiv, 2404.11912, arxiv, pdf, cication: -1

    Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen · (TriForce - Infini-AI-Lab) Star

  • Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, arXiv, 2404.09529, arxiv, pdf, cication: -1

    Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover · (prepacking - siyan-zhao) Star

  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving, arXiv, 2401.09670, arxiv, pdf, cication: -1

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang · (hao-ai-lab.github)

  • Recurrent Drafter for Fast Speculative Decoding in Large Language Models, arXiv, 2403.09919, arxiv, pdf, cication: -1

    Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng

  • Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, arXiv, 2403.09636, arxiv, pdf, cication: -1

    Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti

  • CLLMs: Consistency Large Language Models, arXiv, 2403.00835, arxiv, pdf, cication: -1

    Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, Hao Zhang · (Consistency_LLM - hao-ai-lab) Star · (hao-ai-lab.github)

  • Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference, arXiv, 2403.09054, arxiv, pdf, cication: -1

    Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath

  • Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding, arXiv, 2402.12374, arxiv, pdf, cication: -1

    Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen · (Sequoia - Infini-AI-Lab) Star

  • An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models, arXiv, 2403.06764, arxiv, pdf, cication: -1

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang

    · (FastV - pkunlp-icler) Star

  • GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM, arXiv, 2403.05527, arxiv, pdf, cication: -1

    Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao · (GEAR - HaoKang-Timmy) Star

  • DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving, arXiv, 2403.01876, arxiv, pdf, cication: -1

    Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic

  • Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding, arXiv, 2402.16844, arxiv, pdf, cication: -1

    Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi

  • Ouroboros: Speculative Decoding with Large Model Enhanced Drafting, arXiv, 2402.13720, arxiv, pdf, cication: -1

    Weilin Zhao, Yuxiang Huang, Xu Han, Chaojun Xiao, Zhiyuan Liu, Maosong Sun · (Ouroboros - thunlp) Star

  • Speculative Streaming: Fast LLM Inference without Auxiliary Models, arXiv, 2402.11131, arxiv, pdf, cication: -1

    Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi

  • Tandem Transformers for Inference Efficient LLMs, arXiv, 2402.08644, arxiv, pdf, cication: -1

    Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli

  • Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models, arXiv, 2402.07033, arxiv, pdf, cication: -1

    Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci · (fiddler - efeslab) Star

  • SubGen: Token Generation in Sublinear Time and Memory, arXiv, 2402.06082, arxiv, pdf, cication: -1

    Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi

  • Hydragen: High-Throughput LLM Inference with Shared Prefixes, arXiv, 2402.05099, arxiv, pdf, cication: -1

    Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini

  • flashinfer - flashinfer-ai Star

    FlashInfer: Kernel Library for LLM Serving

  • EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, arXiv, 2401.15077, arxiv, pdf, cication: -1

    Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

  • BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models, arXiv, 2401.12522, arxiv, pdf, cication: -1

    Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao

  • Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, arXiv, 2401.10774, arxiv, pdf, cication: -1

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao

    · (medusa - fasterdecoding) Star

  • DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference, arXiv, 2401.08671, arxiv, pdf, cication: -1

    Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko

  • Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, arXiv, 2401.08294, arxiv, pdf, cication: -1

    Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li · (inferflow - inferflow) Star

  • PainlessInferenceAcceleration - alipay Star

  • Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, arXiv, 2401.07851, arxiv, pdf, cication: -1

    Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

  • Efficient LLM inference solution on Intel GPU, arXiv, 2401.05391, arxiv, pdf, cication: -1

    Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu

  • SwiftInfer - hpcaitech Star

    Efficient AI Inference & Serving · (qbitai)

  • Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache, arXiv, 2401.02669, arxiv, pdf, cication: -1

    Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li

  • nitro - janhq Star

    A fast, lightweight, embeddable inference engine to supercharge your apps with local AI. OpenAI-compatible API

  • jan - janhq Star

    Jan is an open source alternative to ChatGPT that runs 100% offline on your computer

  • Fairness in Serving Large Language Models, arXiv, 2401.00588, arxiv, pdf, cication: -1

    Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica · (s-lora - s-lora) Star

  • tricksy - austinsilveria Star

    Fast approximate inference on a single GPU with sparsity aware offloading

  • mixtral-offloading - dvmazur Star

    Run Mixtral-8x7B models in Colab or consumer desktops

  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory, arXiv, 2312.11514, arxiv, pdf, cication: -1

    Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar

  • Efficiently Programming Large Language Models using SGLang, arXiv, 2312.07104, arxiv, pdf, cication: -1

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez · (sglang - sgl-project) Star · (lmsys)

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, arXiv, 2312.12456, arxiv, pdf, cication: -1

    Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen · (PowerInfer - SJTU-IPADS) Star

  • Cascade Speculative Drafting for Even Faster LLM Inference, arXiv, 2312.11462, arxiv, pdf, cication: -1

    Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, Kevin Chen-Chuan Chang

  • LLMLingua - microsoft Star

    Speeds up LLM inference and enhances the model's perception of key information by compressing the prompt and KV cache, achieving up to 20x compression with minimal performance loss.

  • SparQ Attention: Bandwidth-Efficient LLM Inference, arXiv, 2312.04985, arxiv, pdf, cication: -1

    Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr

  • Yao Fu's Notion post on full-stack transformer inference optimization

    · (yaofu.notion)

  • Optimum-NVIDIA: Unlocking blazingly fast LLM inference in just 1 line of code

  • PaSS: Parallel Speculative Sampling, arXiv, 2311.13581, arxiv, pdf, cication: -1

    Giovanni Monea, Armand Joulin, Edouard Grave

  • Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | LMSYS Org

    · (LookaheadDecoding - hao-ai-lab) Star

  • Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models, arXiv, 2311.03687, arxiv, pdf, cication: -1

    Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi

    · (jiqizhixin)

  • FlashDecoding++: Faster Large Language Model Inference on GPUs, arXiv, 2311.01282, arxiv, pdf, cication: -1

    Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang

  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time, ICML, 2023, arxiv, pdf, cication: 16

    Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re

  • TensorRT-LLM - NVIDIA Star

    TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs)

  • Approximating Two-Layer Feedforward Networks for Efficient Transformers, arXiv, 2310.10837, arxiv, pdf, cication: -1

    Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber

  • deepsparse - neuralmagic Star

    Inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application · (huggingface)

  • attention_sinks - tomaarsen Star

    Extend existing LLMs way beyond the original training length with constant memory usage, and without retraining

  • Efficient Streaming Language Models with Attention Sinks, arXiv, 2309.17453, arxiv, pdf, cication: 3

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis

    · (streaming-llm - mit-han-lab) Star

    · (mp.weixin.qq)
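
A hedged sketch of the KV-cache eviction policy behind attention sinks: keep the first few "sink" tokens plus a rolling window of recent tokens, so the cache stays constant-size for arbitrarily long streams (function name and defaults are illustrative, not the streaming-llm API):

```python
def streaming_kv_keep_indices(seq_len: int, n_sink: int = 4, window: int = 1020) -> list[int]:
    """Indices of KV entries to retain under an attention-sink + sliding-window policy."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))
```
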

  • Efficient Memory Management for Large Language Model Serving with PagedAttention, proceedings of the 29th symposium on operating systems principles, 2023, arxiv, pdf, cication: 21

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica · (jiqizhixin)

  • llama2.mojo - tairov Star

    Inference Llama 2 in one file of pure 🔥 · (qbitai)

  • fastllm - ztxz16 Star

    A pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile phones.

  • flexflow - flexflow Star

    A distributed deep learning framework.

  • Accelerating LLM Inference with Staged Speculative Decoding, arXiv, 2308.04623, arxiv, pdf, cication: 3

    Benjamin Spector, Chris Re

  • CTranslate2 - OpenNMT Star

    Fast inference engine for Transformer models

  • Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding, arXiv, 2307.15337, arxiv, pdf, cication: 4

    Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, Yu Wang

  • SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, arXiv, 2307.02628, arxiv, pdf, cication: -1

    Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee

  • An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs, arXiv, 2306.16601, arxiv, pdf, cication: -1

    Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang

  • NeuralFuse: Learning to Improve the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes, arXiv, 2306.16869, arxiv, pdf, cication: -1

    Hao-Lun Sun, Lei Hsiung, Nandhini Chandramoorthy, Pin-Yu Chen, Tsung-Yi Ho

  • H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, arXiv, 2306.14048, arxiv, pdf, cication: -1

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett · (H2O - FMInference) Star

    · (mp.weixin.qq)

  • DeepSpeed ZeRO++: A leap in speed for LLM and chat model training with 4X less communication - Microsoft Research

    · (zhuanlan.zhihu)

  • vllm - vllm-project Star

    A high-throughput and memory-efficient inference and serving engine for LLMs · (mp.weixin.qq) · (jiqizhixin)

  • SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification, arXiv, 2305.09781, arxiv, pdf, cication: -1

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi · (FlexFlow - flexflow) Star · (mp.weixin.qq)

  • llama.cpp - ggerganov Star

    Port of Facebook's LLaMA model in C/C++ · (ggml) · (llama.cpp - ggerganov) Star

Other

Mobile

  • MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, arXiv, 2402.14905, arxiv, pdf, cication: 15

    Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi

    · (MobileLLM - facebookresearch) Star

  • HARE: HumAn pRiors, a key to small language model Efficiency, arXiv, 2406.11410, arxiv, pdf, cication: -1

    Lingyun Zhang, Bin jin, Gaojian Ge, Lunhui Liu, Xuewen Shen, Mingyong Wu, Houqian Zhang, Yongneng Jiang, Shiqi Chen, Shi Pu

  • Octopus v2: On-device language model for super agent, arXiv, 2404.01744, arxiv, pdf, cication: -1

    Wei Chen, Zhiyuan Li

  • Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, arXiv, 2403.20041, arxiv, pdf, cication: -1

    Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie

  • MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, arXiv, 2402.16840, arxiv, pdf, cication: -1

    Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan

    · (MobiLlama - mbzuai-oryx) Star

  • MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices, arXiv, 2312.16886, arxiv, pdf, cication: -1

    Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei · (MobileVLM - Meituan-AutoML) Star

  • mlc-llm - mlc-ai Star

    Enable everyone to develop, optimize and deploy AI models natively on everyone's devices. · (jiqizhixin) · (jiqizhixin)

Toolkits

  • transformer-heads - center-for-humans-and-machines Star

    Toolkit for attaching, training, saving and loading of new heads for transformer models

  • quanto - huggingface Star

    A pytorch Quantization Toolkit · (huggingface)

  • fsdp_qlora - AnswerDotAI Star

    Training LLMs with QLoRA + FSDP · (answer) · (mp.weixin.qq)

  • GPTFast - MDK8888 Star

    Accelerate your Hugging Face Transformers 6-7x. Native to Hugging Face and PyTorch.

  • vllm - vllm-project Star

  • lorax - predibase Star

    Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

  • Winners 🏆 | NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day

  • gigaGPT - Cerebras Star

    a small code base for training large models · (cerebras)

  • EAGLE - SafeAILab Star

    EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation · (sites.google)

    · (jiqizhixin)

  • optimum-nvidia - huggingface Star

  • unsloth - unslothai Star

    5X faster, 50% less memory LLM finetuning

  • lit-gpt - Lightning-AI Star

    Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.

  • gpt-fast - pytorch-labs Star

    Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

  • MS-AMP - Azure Star

    Microsoft Automatic Mixed Precision Library

  • DeepSpeed - microsoft Star

    DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Efficient transformer

  • FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | Tri Dao

    · (flash-attention - Dao-AILab) Star

  • MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression, arXiv, 2406.14909, arxiv, pdf, cication: -1

    Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan

  • Flash Attention (Fast and Memory-Efficient Exact Attention with IO-Awareness)

  • Block Transformer: Global-to-Local Language Modeling for Fast Inference, arXiv, 2406.02657, arxiv, pdf, cication: -1

    Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun

    · (block-transformer - itsnamgyu) Star

  • LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models, arXiv, 2405.18377, arxiv, pdf, cication: -1

    Anthony Sarah, Sharath Nittur Sridhar, Maciej Szankin, Sairam Sundaresan

  • SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization, arXiv, 2405.11582, arxiv, pdf, cication: -1

    Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang · (SLAB - xinghaochen) Star

  • You Only Cache Once: Decoder-Decoder Architectures for Language Models, arXiv, 2405.05254, arxiv, pdf, cication: -1

    Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei · (aka) · (unilm - microsoft) Star

  • Is Flash Attention Stable?, arXiv, 2405.02803, arxiv, pdf, cication: -1

    Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks

  • RecurrentGemma: Moving Past Transformers for Efficient Open Language Models, arXiv, 2404.07839, arxiv, pdf, cication: -1

    Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi · (recurrentgemma - google-deepmind) Star

  • 8bit HippoAttention: Up to 3X Faster Compared to FlashAttentionV2 | by HippoML Blog | Medium

  • LASP - OpenNLPLab Star

    Linear Attention Sequence Parallelism (LASP)

  • Simple linear attention language models balance the recall-throughput tradeoff, arXiv, 2402.18668, arxiv, pdf, cication: -1

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré

    · (based - hazyresearch) Star

  • Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models, arXiv, 2402.19427, arxiv, pdf, cication: -1

    Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan

    · (jiqizhixin)

    · (huggingface) · (twitter)

  • ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition, arXiv, 2402.15220, arxiv, pdf, cication: -1

    Lu Ye, Ze Tao, Yong Huang, Yang Li

  • Linear Transformers are Versatile In-Context Learners, arXiv, 2402.14180, arxiv, pdf, cication: -1

    Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge

  • The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry, arXiv, 2402.04347, arxiv, pdf, cication: -1

    Michael Zhang, Kush Bhatia, Hermann Kumbong, Christopher Ré

  • FireAttention — Serving Open Source Models 4x faster than vLLM by quantizing with ~no tradeoffs | by Fireworks.ai | Jan, 2024 | Medium

  • flash-linear-attention - sustcsonglin Star

    Fast implementations of causal linear attention for autoregressive language modeling (PyTorch)

  • PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation, arXiv, 2312.17276, arxiv, pdf, cication: -1

    Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie, Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang

  • Agent Attention: On the Integration of Softmax and Linear Attention, arXiv, 2312.08874, arxiv, pdf, cication: -1

    Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang · (agent-attention - leaplabthu) Star

  • Weight subcloning: direct initialization of transformers using larger pretrained ones, arXiv, 2312.09299, arxiv, pdf, cication: -1

    Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari

  • Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models, arXiv, 2312.07046, arxiv, pdf, cication: -1

    Arnav Chavan, Nahush Lele, Deepak Gupta

  • Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers

  • Efficient Monotonic Multihead Attention, arXiv, 2312.04515, arxiv, pdf, cication: -1

    Xutai Ma, Anna Sun, Siqi Ouyang, Hirofumi Inaguma, Paden Tomasello

  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces, arXiv, 2312.00752, arxiv, pdf, cication: -1

    Albert Gu, Tri Dao · (qbitai)

  • Simplifying Transformer Blocks, arXiv, 2311.01906, arxiv, pdf, cication: -1

    Bobby He, Thomas Hofmann · (jiqizhixin)

  • Exponentially Faster Language Modelling, arXiv, 2311.10770, arxiv, pdf, cication: -1

    Peter Belcak, Roger Wattenhofer

  • Alternating Updates for Efficient Transformers, arXiv, 2301.13310, arxiv, pdf, cication: -1

    Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang

  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS, 2022, arxiv, pdf, cication: 278

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

  • Fast Transformer Decoding: One Write-Head is All You Need, arXiv, 1911.02150, arxiv, pdf, cication: 61

    Noam Shazeer

    · (zhuanlan.zhihu)

Hardware

  • Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU, arXiv, 2403.06504, arxiv, pdf, cication: -1

    Changyue Liao, Mo Sun, Zihan Yang, Kaiqi Chen, Binhang Yuan, Fei Wu, Zeke Wang

  • Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers, arXiv, 2402.04744, arxiv, pdf, cication: -1

    Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna

  • FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, arXiv, 2401.03868, arxiv, pdf, cication: -1

    Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang

    · (jiqizhixin)

  • The fastest large-model inference chip changes hands overnight: 500 tokens per second, crushing GPUs, built by a team of former Google TPU engineers

Other

Courses

EfficientML

Extra Reference

  • Awesome-Knowledge-Distillation-of-LLMs - Tebmer Star

    This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & Vertical Distillation of LLMs.

  • Awesome-LLM-Compression - HuangOwen Star

    Awesome LLM compression research papers and tools.