-
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey,
arXiv, 2403.14608
, arxiv, pdf, cication: 18Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, Sai Qian Zhang
-
A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models,
arXiv, 2405.13019
, arxiv, pdf, cication: -1Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha
-
Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models,
arXiv, 2404.14897
, arxiv, pdf, cication: -1Chen Zhang, Zhuorui Liu, Dawei Song
-
A Survey on Efficient Inference for Large Language Models,
arXiv, 2404.14294
, arxiv, pdf, cication: -1Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li
-
A Survey on Knowledge Distillation of Large Language Models,
arXiv, 2402.13116
, arxiv, pdf, cication: -1Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Tianyi Zhou · (Awesome-Knowledge-Distillation-of-LLMs - Tebmer)
· (jiqizhixin)
-
Efficient Exploration for LLMs,
arXiv, 2402.00396
, arxiv, pdf, cication: -1Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy
-
A Comprehensive Survey of Compression Algorithms for Language Models,
arXiv, 2401.15347
, arxiv, pdf, cication: -1Seungcheol Park, Jaehyeon Choi, Sojin Lee, U Kang
-
A Survey of Resource-efficient LLM and Multimodal Foundation Models,
arXiv, 2401.08092
, arxiv, pdf, cication: -1Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang
· (efficient_foundation_model_survey - ubiquitouslearning)
-
Understanding LLMs: A Comprehensive Overview from Training to Inference,
arXiv, 2401.02038
, arxiv, pdf, cication: -1Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong
-
Efficient Large Language Models: A Survey,
arXiv, 2312.03863
, arxiv, pdf, cication: -1Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury · (Efficient-LLMs-Survey - AIoT-MLSys-Lab)
-
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems,
arXiv, 2312.15234
, arxiv, pdf, cication: -1Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia · (mp.weixin.qq)
-
The Efficiency Spectrum of Large Language Models: An Algorithmic Survey,
arXiv, 2312.00678
, arxiv, pdf, cication: -1Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang
-
Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning,
arXiv, 2303.15647
, arxiv, pdf, cication: -1Vladislav Lialin, Vijeta Deshpande, Anna Rumshisky
-
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models,
arXiv, 2407.01906
, arxiv, pdf, cication: -1Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, Y. Wu
-
OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models,
arXiv, 2406.01775
, arxiv, pdf, cication: -1Kerim Büyükakyüz
-
VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections,
arXiv, 2405.17991
, arxiv, pdf, cication: -1Roy Miles, Pradyumna Reddy, Ismail Elezi, Jiankang Deng
-
Trans-LoRA: towards data-free Transferable Parameter Efficient Finetuning,
arXiv, 2405.17258
, arxiv, pdf, cication: -1Runqian Wang, Soumya Ghosh, David Cox, Diego Antognini, Aude Oliva, Rogerio Feris, Leonid Karlinsky
-
Towards Modular LLMs by Building and Reusing a Library of LoRAs,
arXiv, 2405.11157
, arxiv, pdf, cication: -1Oleksiy Ostapenko, Zhan Su, Edoardo Maria Ponti, Laurent Charlin, Nicolas Le Roux, Matheus Pereira, Lucas Caccia, Alessandro Sordoni
-
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning,
arXiv, 2405.12130
, arxiv, pdf, cication: -1Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang
-
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report,
arXiv, 2405.00732
, arxiv, pdf, cication: -1Justin Zhao, Timothy Wang, Wael Abid, Geoffrey Angus, Arnav Garg, Jeffery Kinnison, Alex Sherstinsky, Piero Molino, Travis Addair, Devvret Rishi
-
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models,
arXiv, 2404.02948
, arxiv, pdf, cication: -1Fanxu Meng, Zhaohui Wang, Muhan Zhang · (PiSSA - GraphPKU)
-
ReFT: Representation Finetuning for Language Models,
arXiv, 2404.03592
, arxiv, pdf, cication: -1Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts · (pyreft - stanfordnlp)
By manipulating a small fraction of model representations, it is possible to effectively steer model behavior and improve downstream performance at inference time; the paper also proposes LoReFT, a drop-in replacement for weight-based PEFTs that is 10x-50x more parameter-efficient (sketched below).
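A minimal sketch of a LoReFT-style intervention, assuming the paper's edit rule h + R^T(Wh + b - Rh) with a low-rank R whose rows are orthonormal; the module and its attachment point are illustrative, not the pyreft API:

```python
import torch
import torch.nn as nn

class LoReFTIntervention(nn.Module):
    """Edit hidden states inside a rank-r subspace: h + R^T(W h + b - R h)."""
    def __init__(self, hidden_size: int, rank: int):
        super().__init__()
        self.R = nn.Parameter(torch.empty(rank, hidden_size))
        nn.init.orthogonal_(self.R)            # rows span the edited subspace
        self.W = nn.Linear(hidden_size, rank)  # learned target coordinates

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Move h's subspace coordinates (R h) toward the learned target (W h + b).
        return h + (self.W(h) - h @ self.R.T) @ self.R

# The base model stays frozen; only the intervention, attached to one layer's
# output (e.g. via a forward hook), is trained.
h = torch.randn(2, 10, 768)
print(LoReFTIntervention(768, rank=4)(h).shape)  # torch.Size([2, 10, 768])
```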
-
Model Stock: All we need is just a few fine-tuned models,
arXiv, 2403.19522
, arxiv, pdf, cication: -1Dong-Hwan Jang, Sangdoo Yun, Dongyoon Han
Uses just two fine-tuned models, plus the pretrained model as an anchor, for layer-wise weight averaging (sketched below).
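A minimal sketch of the merge as I read the paper, assuming the layer-wise ratio t = 2cosθ/(1 + cosθ), where θ is the angle between the two fine-tuning deltas measured from the pretrained anchor; plain state dicts stand in for real checkpoints:

```python
import torch
import torch.nn.functional as F

def model_stock_merge(pretrained: dict, ft_a: dict, ft_b: dict) -> dict:
    """Interpolate, per layer, between the two-model average and the anchor."""
    merged = {}
    for name, w0 in pretrained.items():
        d1, d2 = ft_a[name] - w0, ft_b[name] - w0            # fine-tuning deltas
        cos = F.cosine_similarity(d1.flatten(), d2.flatten(), dim=0)
        t = 2 * cos / (1 + cos)                              # layer-wise ratio
        merged[name] = t * (ft_a[name] + ft_b[name]) / 2 + (1 - t) * w0
    return merged
```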
-
DiJiang: Efficient Large Language Models through Compact Kernelization,
arXiv, 2403.19928
, arxiv, pdf, cication: -1Hanting Chen, Zhicheng Liu, Xutao Wang, Yuchuan Tian, Yunhe Wang · (DiJiang - YuchuanTian)
-
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning,
arXiv, 2403.17919
, arxiv, pdf, cication: -1Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, Tong Zhang
· (jiqizhixin) · (LMFlow - OptimalScale)
Randomly freezes middle layers during training based on importance sampling; this is memory-efficient and can outperform both LoRA and full-parameter finetuning by a noticeable margin (see the sketch below).
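A minimal sketch of the sampling step, assuming uniform sampling of a few transformer blocks per period; the paper always keeps embeddings and the LM head trainable, and since those attribute names vary by model class, the caller should re-enable them:

```python
import random
import torch.nn as nn

def lisa_resample(model: nn.Module, blocks: list[nn.Module], gamma: int = 2):
    """Freeze all parameters, then unfreeze `gamma` randomly chosen blocks.

    Call every K optimizer steps; embeddings and the LM head should be
    re-enabled separately by the caller.
    """
    for p in model.parameters():
        p.requires_grad_(False)
    for block in random.sample(blocks, k=gamma):
        for p in block.parameters():
            p.requires_grad_(True)
```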
-
Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models,
arXiv, 2403.03432
, arxiv, pdf, cication: -1Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Yu Han, Hao Wang
-
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection,
arXiv, 2403.03507
, arxiv, pdf, cication: -1Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian · (galore - jiaweizzhao)
· (huggingface)
Gradient Low-Rank Projection (GaLore) is a training strategy that cuts optimizer-state memory by up to 65.5% during LLM training without sacrificing performance (sketched below).
· (mp.weixin.qq)
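A minimal sketch of the gradient projection at GaLore's core, assuming an SVD-based projector refreshed every few hundred steps; the released galore package wraps this into drop-in optimizers:

```python
import torch

class GaLoreProjector:
    """Let the optimizer keep its states for a rank-r view of each gradient."""
    def __init__(self, rank: int, update_every: int = 200):
        self.rank, self.update_every = rank, update_every
        self.step, self.P = 0, None

    def project(self, grad: torch.Tensor) -> torch.Tensor:
        if self.P is None or self.step % self.update_every == 0:
            U, _, _ = torch.linalg.svd(grad.float(), full_matrices=False)
            self.P = U[:, : self.rank].to(grad.dtype)  # (m, r) orthonormal basis
        self.step += 1
        return self.P.T @ grad                         # (r, n) gradient for Adam

    def project_back(self, update: torch.Tensor) -> torch.Tensor:
        return self.P @ update                         # (m, n) weight update
```

The memory saving comes from Adam's moment tensors living at shape (r, n) instead of (m, n).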
-
LoRA+: Efficient Low Rank Adaptation of Large Models,
arXiv, 2402.12354
, arxiv, pdf, cication: -1Soufiane Hayou, Nikhil Ghosh, Bin Yu
-
DoRA: Weight-Decomposed Low-Rank Adaptation,
arXiv, 2402.09353
, arxiv, pdf, cication: -1Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Min-Hung Chen · (dora - catid)
-
Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models,
arXiv, 2401.00788
, arxiv, pdf, cication: -1Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu, Niklas Muennighoff
-
Parameter Efficient Tuning Allows Scalable Personalization of LLMs for Text Entry: A Case Study on Abbreviation Expansion,
arXiv, 2312.14327
, arxiv, pdf, cication: -1Katrin Tomanek, Shanqing Cai, Subhashini Venugopalan
-
Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)
· (jiqizhixin)
-
MultiLoRA: Democratizing LoRA for Better Multi-Task Learning,
arXiv, 2311.11501
, arxiv, pdf, cication: -1Yiming Wang, Yu Lin, Xiaodong Zeng, Guannan Zhang
-
Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying,
arXiv, 2311.09578
, arxiv, pdf, cication: -1Adithya Renduchintala, Tugrul Konuk, Oleksii Kuchaiev
-
SiRA: Sparse Mixture of Low Rank Adaptation,
arXiv, 2311.09179
, arxiv, pdf, cication: -1Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, Han Lu, Canoee Liu, Liangchen Luo, Jindong Chen
-
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization,
arXiv, 2311.06243
, arxiv, pdf, cication: -1Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng · (boft.wyliu)
-
Punica: Multi-Tenant LoRA Serving,
arXiv, 2310.18547
, arxiv, pdf, cication: -1Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy · (punica - punica-ai)
-
S-LoRA: Serving Thousands of Concurrent LoRA Adapters,
arXiv, 2311.03285
, arxiv, pdf, cication: -1Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer · (s-lora - s-lora)
-
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery,
arXiv, 2310.18356
, arxiv, pdf, cication: -1Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, Luming Liang
-
VeRA: Vector-based Random Matrix Adaptation,
arXiv, 2310.11454
, arxiv, pdf, cication: -1Dawid Jan Kopiczko, Tijmen Blankevoort, Yuki Markus Asano · (mp.weixin.qq)
-
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models,
arXiv, 2310.08659
, arxiv, pdf, cication: 1Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao
· (peft - huggingface)
-
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models,
arXiv, 2309.14717
, arxiv, pdf, cication: -1Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian · (qa-lora - yuhuixu1993)
-
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models,
arXiv, 2309.12307
, arxiv, pdf, cication: 5Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia · (LongLoRA - dvlab-research)
-
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition,
arXiv, 2307.13269
, arxiv, pdf, cication: 6Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, Min Lin
-
Stack More Layers Differently: High-Rank Training Through Low-Rank Updates,
arXiv, 2307.05695
, arxiv, pdf, cication: 2Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky · (peft_pretraining - guitaricet)
-
LLaMA-Efficient-Tuning - hiyouga
Fine-tuning LLaMA with PEFT (PT+SFT+RLHF with QLoRA)
-
InRank: Incremental Low-Rank Learning,
arXiv, 2306.11250
, arxiv, pdf, cication: 2Jiawei Zhao, Yifei Zhang, Beidi Chen, Florian Schäfer, Anima Anandkumar · (inrank - jiaweizzhao)
-
One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning,
arXiv, 2306.07967
, arxiv, pdf, cication: -1Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, Zhiqiang Shen · (ViT-Slim - Arnav0400)
-
Full Parameter Fine-tuning for Large Language Models with Limited Resources,
arXiv, 2306.09782
, arxiv, pdf, cication: -1Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, Xipeng Qiu · (LOMO - OpenLMLab)
-
PockEngine: Sparse and Efficient Fine-tuning in a Pocket,
arXiv, 2310.17752
, arxiv, pdf, cication: -1Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han
-
AI and Memory Wall | by Amir Gholami | riselab | Medium
-
Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning,
arXiv, 2311.11077
, arxiv, pdf, cication: -1Clifton Poth, Hannah Sterz, Indraneil Paul, Sukannya Purkayastha, Leon Engländer, Timo Imhof, Ivan Vulić, Sebastian Ruder, Iryna Gurevych, Jonas Pfeiffer · (adapterhub)
-
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models,
arXiv, 2305.15023
, arxiv, pdf, cication: 18Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji · (LaVIN - luogen1996)
· (mp.weixin.qq)
-
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model,
arXiv, 2304.15010
, arxiv, pdf, cication: 82Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue · (mp.weixin.qq)
- Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments - Lightning AI
- instruction tuning experiments with LoRA/QLoRA
- An overview of LoRA and its variants: LoRA, DoRA, AdaLoRA, Delta-LoRA
-
A Performance Evaluation of a Quantized Large Language Model on Various Smartphones,
arXiv, 2312.12472
, arxiv, pdf, cication: -1Tolga Çöplü, Marc Loedi, Arto Bendiken, Mykhailo Makohin, Joshua J. Bouw, Stephen Cobb
-
A Survey on Model Compression for Large Language Models,
arXiv, 2308.07633
, arxiv, pdf, cication: -1Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang · (jiqizhixin)
-
4-bit Shampoo for Memory-Efficient Network Training,
arXiv, 2405.18144
, arxiv, pdf, cication: -1Sike Wang, Jia Li, Pan Zhou, Hua Huang
-
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving,
arXiv, 2405.04532
, arxiv, pdf, cication: -1Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han · (qserve - mit-han-lab)
-
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs,
arXiv, 2404.00456
, arxiv, pdf, cication: -1Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman · (QuaRot - spcl)
-
EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs,
arXiv, 2403.02775
, arxiv, pdf, cication: -1Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang
-
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits,
arXiv, 2402.17764
, arxiv, pdf, cication: -1Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei
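A minimal sketch of the paper's absmean quantizer, which rounds weights to the ternary set {-1, 0, +1}; the straight-through estimator used to train through the rounding is omitted:

```python
import torch

def absmean_ternary(w: torch.Tensor):
    """Scale by the mean absolute value, then round to {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=1e-5)
    w_q = (w / scale).round().clamp_(-1, 1)
    return w_q, scale  # the forward pass uses (w_q * scale)
```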
-
GPTVQ: The Blessing of Dimensionality for LLM Quantization,
arXiv, 2402.15319
, arxiv, pdf, cication: -1Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough
-
OneBit: Towards Extremely Low-bit Large Language Models,
arXiv, 2402.11295
, arxiv, pdf, cication: -1Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che
-
Extreme Compression of Large Language Models via Additive Quantization,
arXiv, 2401.06118
, arxiv, pdf, cication: -1Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh · (aqlm - vahe1994)
-
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache,
arXiv, 2402.02750
, arxiv, pdf, cication: -1Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu · (kivi - jy-yuan)
-
Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers,
arXiv, 2402.08958
, arxiv, pdf, cication: -1Junhan Kim, Kyungphil Park, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon
-
TP-Aware Dequantization,
arXiv, 2402.04925
, arxiv, pdf, cication: -1Adnan Hoque, Mudhakar Srivatsa, Chih-Chieh Yang, Raghu Ganti
-
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs,
arXiv, 2402.04291
, arxiv, pdf, cication: -1Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi
-
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design,
arXiv, 2401.14112
, arxiv, pdf, cication: -1Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou
-
ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks,
arXiv, 2312.08583
, arxiv, pdf, cication: -1Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Yuxiong He, Olatunji Ruwase, Leon Song
-
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning,
arXiv, 2311.12023
, arxiv, pdf, cication: -1Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim · (lq-lora - hanguo97)
-
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models,
arXiv, 2310.09259
, arxiv, pdf, cication: -1Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh · (quik - ist-daslab)
-
FP8-LM: Training FP8 Large Language Models,
arXiv, 2310.18313
, arxiv, pdf, cication: -1Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu
-
LLM-FP4: 4-Bit Floating-Point Quantized Transformers,
arXiv, 2310.16836
, arxiv, pdf, cication: -1Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng
-
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models,
arXiv, 2310.16795
, arxiv, pdf, cication: -1Elias Frantar, Dan Alistarh · (mp.weixin.qq)
-
BitNet: Scaling 1-bit Transformers for Large Language Models,
arXiv, 2310.11453
, arxiv, pdf, cication: -1Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
-
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving,
arXiv, 2310.19102
, arxiv, pdf, cication: -1Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci · (atom - efeslab)
-
TEQ: Trainable Equivalent Transformation for Quantization of LLMs,
arXiv, 2310.10944
, arxiv, pdf, cication: 1Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen
-
Efficient Post-training Quantization with FP8 Formats,
arXiv, 2309.14592
, arxiv, pdf, cication: -1Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang · (neural-compressor - intel)
-
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models,
arXiv, 2309.14717
, arxiv, pdf, cication: -1Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian
-
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs,
arXiv, 2309.05516
, arxiv, pdf, cication: -1Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv
-
Memory Efficient Optimizers with 4-bit States,
arXiv, 2309.01507
, arxiv, pdf, cication: 1Bingrui Li, Jianfei Chen, Jun Zhu · (jiqizhixin)
-
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models,
arXiv, 2308.13137
, arxiv, pdf, cication: 2Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo · (OmniQuant - OpenGVLab)
-
FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search,
arXiv, 2308.03290
, arxiv, pdf, cication: -1Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi
-
QuIP: 2-Bit Quantization of Large Language Models With Guarantees,
arXiv, 2307.13304
, arxiv, pdf, cication: -1Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa · (quip - jerry-chee)
-
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing,
arXiv, 2306.12929
, arxiv, pdf, cication: -1Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort
-
Training Transformers with 4-bit Integers,
arXiv, 2306.11987
, arxiv, pdf, cication: -1Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu · (jiqizhixin)
-
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression,
arXiv, 2306.03078
, arxiv, pdf, cication: -1Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh · (jiqizhixin)
-
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,
arXiv, 2306.00978
, arxiv, pdf, cication: -1Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Chuang Gan, Song Han
-
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers,
arXiv, 2210.17323
, arxiv, pdf, cication: -1Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh · (gptq - IST-DASLab)
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale,
arXiv, 2208.07339
· (bitsandbytes - timdettmers)
-
BitMat - astramind-ai
An efficient implementation of the method proposed in "The Era of 1-bit LLMs"
-
QLLM - wejoncy
A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime.
-
hqq - mobiusml
Official implementation of Half-Quadratic Quantization (HQQ) · (mobiusml.github)
-
exllamav2 - turboderp
A fast inference library for running LLMs locally on modern consumer-class GPUs · (mp.weixin.qq)
-
PB-LLM - hahnyuan
PB-LLM: Partially Binarized Large Language Models
-
AttentionIsOFFByOne - kyegomez
Implementation of "Attention Is Off By One" by Evan Miller · (evanmiller) · (jiqizhixin)
-
llama.cpp - ggerganov
Port of Facebook's LLaMA model in C/C++ · (finbarr)
-
llama2-webui - liltom-eth
Run Llama 2 locally with a Gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Supports Llama-2-7B/13B/70B with 8-bit and 4-bit quantization, GPU inference (6 GB VRAM), and CPU inference.
-
neural-compressor - intel
Provides unified APIs for SOTA model compression techniques, such as low-precision (INT8/INT4/FP4/NF4) quantization, sparsity, pruning, and knowledge distillation on mainstream AI frameworks such as TensorFlow, PyTorch, and ONNX Runtime. · (neural-compressor - intel)
· (mp.weixin.qq)
-
exllama - turboderp
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
-
squeezellm - squeezeailab
SqueezeLLM: Dense-and-Sparse Quantization
- Overview of natively supported quantization schemes in 🤗 Transformers
- Making LLMs lighter with AutoGPTQ and transformers
- TheBloke (Tom Jobbins)
- Quantization
-
PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs,
arXiv, 2406.02886
, arxiv, pdf, cication: -1Rongzhi Zhang, Jiaming Shen, Tianqi Liu, Haorui Wang, Zhen Qin, Feng Han, Jialu Liu, Simon Baumgartner, Michael Bendersky, Chao Zhang
-
Divide-or-Conquer? Which Part Should You Distill Your LLM?,
arXiv, 2402.15000
, arxiv, pdf, cication: -1Zhuofeng Wu, He Bai, Aonan Zhang, Jiatao Gu, VG Vinod Vydiswaran, Navdeep Jaitly, Yizhe Zhang
-
DistiLLM: Towards Streamlined Distillation for Large Language Models,
arXiv, 2402.03898
, arxiv, pdf, cication: -1Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun · (distillm - jongwooko)
-
Scavenging Hyena: Distilling Transformers into Long Convolution Models,
arXiv, 2401.17574
, arxiv, pdf, cication: -1Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Mohammad Sami Nur Islam, Wassim Jabbour, Laurence Liang
-
· (twitter)
-
Initializing Models with Larger Ones,
arXiv, 2311.18823
, arxiv, pdf, cication: -1Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu · (weight-selection - oscarxzq)
-
Tailoring Self-Rationalizers with Multi-Reward Distillation,
arXiv, 2311.02805
, arxiv, pdf, cication: -1Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren
-
Co-training and Co-distillation for Quality Improvement and Compression of Language Models,
arXiv, 2311.02849
, arxiv, pdf, cication: -1Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Ju Hwang, Alexander Min
-
TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise,
arXiv, 2310.19019
, arxiv, pdf, cication: -1Nan He, Hanyu Lai, Chenyang Zhao, Zirui Cheng, Junting Pan, Ruoyu Qin, Ruofan Lu, Rui Lu, Yunchen Zhang, Gangming Zhao
-
Farzi Data: Autoregressive Data Distillation,
arXiv, 2310.09983
, arxiv, pdf, cication: -1Noveen Sachdeva, Zexue He, Wang-Cheng Kang, Jianmo Ni, Derek Zhiyuan Cheng, Julian McAuley
-
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes,
arXiv, 2305.02301
, arxiv, pdf, cication: 48Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
-
Composable Function-preserving Expansions for Transformer Architectures,
arXiv, 2308.06103
, arxiv, pdf, cication: 1Andrea Gesmundo, Kaitlin Maile
-
UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition,
arXiv, 2308.03279
, arxiv, pdf, cication: 2Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, Hoifung Poon
-
Generalized Knowledge Distillation for Auto-regressive Language Models,
arXiv, 2306.13649
, arxiv, pdf, cication: -1Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem
-
Knowledge Distillation of Large Language Models,
arXiv, 2306.08543
, arxiv, pdf, cication: -1Yuxian Gu, Li Dong, Furu Wei, Minlie Huang
-
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization,
arXiv, 2406.05981
, arxiv, pdf, cication: -1Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan, Lin · (ShiftAddLLM - GATECH-EIC)
-
Scalable MatMul-free Language Modeling,
arXiv, 2406.02528
, arxiv, pdf, cication: -1Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian · (matmulfreellm - ridgerchu)
-
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models,
arXiv, 2405.20541
, arxiv, pdf, cication: -1Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul
-
PruneGPT - nyunAI
· (huggingface)
-
The Unreasonable Ineffectiveness of the Deeper Layers,
arXiv, 2403.17887
, arxiv, pdf, cication: -1Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts
Selectively pruning up to half the layers of pretrained LLMs, followed by light finetuning with quantization and QLoRA, minimally impacts performance on question-answering tasks (see the sketch below).
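A minimal sketch of the pruning step, assuming a LLaMA-style Hugging Face model whose blocks live in model.model.layers; the paper chooses the block to drop via angular distance between layer activations, whereas here the span is given explicitly:

```python
import torch.nn as nn

def drop_layers(model, start: int, end: int):
    """Remove transformer blocks with index in [start, end)."""
    kept = nn.ModuleList(
        layer for i, layer in enumerate(model.model.layers)
        if not (start <= i < end))
    model.model.layers = kept
    model.config.num_hidden_layers = len(kept)
    return model  # heal with QLoRA finetuning afterwards, per the paper
```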
-
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect,
arXiv, 2403.03853
, arxiv, pdf, cication: -1Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, Weipeng Chen
- This study introduces the Block Influence (BI) metric to assess each layer's importance in LLMs and proposes ShortGPT, a pruning approach that removes redundant layers based on BI scores.
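A minimal sketch of the score, assuming BI is one minus the mean cosine similarity between a block's input and output hidden states over calibration tokens:

```python
import torch
import torch.nn.functional as F

def block_influence(h_in: torch.Tensor, h_out: torch.Tensor) -> float:
    """h_in, h_out: (tokens, hidden) activations entering/leaving one block."""
    cos = F.cosine_similarity(h_in, h_out, dim=-1)
    return (1.0 - cos).mean().item()  # near-zero BI: block ~ identity, prune first
```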
-
Shortened LLaMA: A Simple Depth Pruning for Large Language Models,
arXiv, 2402.02834
, arxiv, pdf, cication: -1Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song
-
SliceGPT: Compress Large Language Models by Deleting Rows and Columns,
arXiv, 2401.15024
, arxiv, pdf, cication: -1Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman
-
Fast Llama 2 on CPUs With Sparse Fine-Tuning and DeepSparse - Neural Magic
-
The LLM Surgeon,
arXiv, 2312.17244
, arxiv, pdf, cication: -1Tycho F. A. van der Ouderaa, Markus Nagel, Mart van Baalen, Yuki M. Asano, Tijmen Blankevoort
-
Mini-GPTs: Efficient Large Language Models through Contextual Pruning,
arXiv, 2312.12682
, arxiv, pdf, cication: -1Tim Valicenti, Justice Vidal, Ritik Patnaik
-
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning,
arXiv, 2310.06694
, arxiv, pdf, cication: 2Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen · (qbitai) · (xiamengzhou.github) · (llm-shearing - princeton-nlp)
-
wanda - locuslab
A simple and effective LLM pruning approach.
-
ResiDual: Transformer with Dual Residual Connections,
arXiv, 2304.14802
, arxiv, pdf, cication: 7Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany Hassan Awadalla, Arul Menezes, Tao Qin, Rui Yan · (ResiDual - microsoft)
-
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention,
arXiv, 2407.02490
, arxiv, pdf, cication: -1Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin
· (MInference - microsoft)
· (aka)
-
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters,
arXiv, 2406.16758
, arxiv, pdf, cication: -1Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun · (Multilingual-SpecBench - Kthyeon)
-
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees,
arXiv, 2406.16858
, arxiv, pdf, cication: -1Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
-
A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression,
arXiv, 2406.11430
, arxiv, pdf, cication: -1Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini
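A minimal sketch of the heuristic, assuming the paper's observation that tokens whose key vectors have a low L2 norm tend to attract high attention, so those entries are kept and the rest evicted; the per-head cache layout is illustrative:

```python
import torch

def compress_kv(keys: torch.Tensor, values: torch.Tensor, keep: int):
    """keys, values: (seq, head_dim); retain the `keep` lowest-norm keys."""
    idx = keys.norm(dim=-1).topk(keep, largest=False).indices.sort().values
    return keys[idx], values[idx]
```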
-
PowerInfer-2: Fast Large Language Model Inference on a Smartphone,
arXiv, 2406.06282
, arxiv, pdf, cication: -1Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen
-
PowerInfer-2: Fast Large Language Model Inference on a Smartphone | PowerInfer
-
Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference,
proceedings of the chi conference on human factors in computing systems, 2024
, arxiv, pdf, cication: 1Fred Hohman, Chaoqun Wang, Jinmook Lee, Jochen Görtler, Dominik Moritz, Jeffrey P Bigham, Zhile Ren, Cecile Foret, Qi Shan, Xiaoyi Zhang · (machinelearning.apple)
-
Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters,
arXiv, 2406.05955
, arxiv, pdf, cication: -1Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, Haibo Chen · (huggingface)
-
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable,
arXiv, 2405.19888
, arxiv, pdf, cication: 1Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu
-
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution,
arXiv, 2405.19325
, arxiv, pdf, cication: -1Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin
-
Distributed Speculative Inference of Large Language Models,
arXiv, 2405.14105
, arxiv, pdf, cication: -1Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel
-
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention,
arXiv, 2405.12981
, arxiv, pdf, cication: -1William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly
-
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference,
arXiv, 2405.12532
, arxiv, pdf, cication: -1Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao
-
Layer-Condensed KV Cache for Efficient Inference of Large Language Models,
arXiv, 2405.10637
, arxiv, pdf, cication: -1Haoyi Wu, Kewei Tu · (LCKV - whyNLP)
-
vidur - microsoft
A large-scale simulation framework for LLM inference
-
Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers,
arXiv, 2405.05219
, arxiv, pdf, cication: -1Jiuxiang Gu, Yingyu Liang, Heshan Liu, Zhenmei Shi, Zhao Song, Junze Yin
-
ThunderKittens - HazyResearch
Tile primitives for speedy kernels · (hazyresearch.stanford)
-
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention,
arXiv, 2405.04437
, arxiv, pdf, cication: -1Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar
-
Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing,
arXiv, 2404.14618
, arxiv, pdf, cication: -1Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah
-
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge,
arXiv, 2405.00263
, arxiv, pdf, cication: -1Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui
-
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting,
arXiv, 2404.18911
, arxiv, pdf, cication: -1Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang · (Kangaroo - Equationliu)
-
Better & Faster Large Language Models via Multi-token Prediction,
arXiv, 2404.19737
, arxiv, pdf, cication: -1Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve · (qbitai)
-
Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding,
arXiv, 2404.16710
, arxiv, pdf, cication: -1Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman
-
BASS: Batched Attention-optimized Speculative Sampling,
arXiv, 2404.15778
, arxiv, pdf, cication: -1Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras
-
XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference,
arXiv, 2404.15420
, arxiv, pdf, cication: -1João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
-
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding,
arXiv, 2404.11912
, arxiv, pdf, cication: -1Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen · (TriForce - Infini-AI-Lab)
-
Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models,
arXiv, 2404.09529
, arxiv, pdf, cication: -1Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover · (prepacking - siyan-zhao)
-
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving,
arXiv, 2401.09670
, arxiv, pdf, cication: -1Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang · (hao-ai-lab.github)
-
Recurrent Drafter for Fast Speculative Decoding in Large Language Models,
arXiv, 2403.09919
, arxiv, pdf, cication: -1Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng
-
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference,
arXiv, 2403.09636
, arxiv, pdf, cication: -1Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti
-
CLLMs: Consistency Large Language Models,
arXiv, 2403.00835
, arxiv, pdf, cication: -1Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, Hao Zhang · (Consistency_LLM - hao-ai-lab)
· (hao-ai-lab.github)
-
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference,
arXiv, 2403.09054
, arxiv, pdf, cication: -1Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath
-
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding,
arXiv, 2402.12374
, arxiv, pdf, cication: -1Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen · (Sequoia - Infini-AI-Lab)
-
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models,
arXiv, 2403.06764
, arxiv, pdf, cication: -1Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang
· (FastV - pkunlp-icler)
-
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM,
arXiv, 2403.05527
, arxiv, pdf, cication: -1Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao · (GEAR - HaoKang-Timmy)
-
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving,
arXiv, 2403.01876
, arxiv, pdf, cication: -1Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic
-
Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding,
arXiv, 2402.16844
, arxiv, pdf, cication: -1Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi
-
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting,
arXiv, 2402.13720
, arxiv, pdf, cication: -1Weilin Zhao, Yuxiang Huang, Xu Han, Chaojun Xiao, Zhiyuan Liu, Maosong Sun · (Ouroboros - thunlp)
-
Speculative Streaming: Fast LLM Inference without Auxiliary Models,
arXiv, 2402.11131
, arxiv, pdf, cication: -1Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi
-
Tandem Transformers for Inference Efficient LLMs,
arXiv, 2402.08644
, arxiv, pdf, cication: -1Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli
-
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models,
arXiv, 2402.07033
, arxiv, pdf, cication: -1Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci · (fiddler - efeslab)
-
SubGen: Token Generation in Sublinear Time and Memory,
arXiv, 2402.06082
, arxiv, pdf, cication: -1Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi
-
Hydragen: High-Throughput LLM Inference with Shared Prefixes,
arXiv, 2402.05099
, arxiv, pdf, cication: -1Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini
-
flashinfer - flashinfer-ai
FlashInfer: Kernel Library for LLM Serving
-
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty,
arXiv, 2401.15077
, arxiv, pdf, cication: -1Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
-
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models,
arXiv, 2401.12522
, arxiv, pdf, cication: -1Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads,
arXiv, 2401.10774
, arxiv, pdf, cication: -1Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao
· (medusa - fasterdecoding)
-
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference,
arXiv, 2401.08671
, arxiv, pdf, cication: -1Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko
-
Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models,
arXiv, 2401.08294
, arxiv, pdf, cication: -1Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li · (inferflow - inferflow)
-
PainlessInferenceAcceleration - alipay
-
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding,
arXiv, 2401.07851
, arxiv, pdf, cication: -1Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui
-
Efficient LLM inference solution on Intel GPU,
arXiv, 2401.05391
, arxiv, pdf, cication: -1Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu
-
SwiftInfer - hpcaitech
Efficient AI Inference & Serving · (qbitai)
-
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache,
arXiv, 2401.02669
, arxiv, pdf, cication: -1Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li
-
nitro - janhq
A fast, lightweight, embeddable inference engine to supercharge your apps with local AI. OpenAI-compatible API
-
jan - janhq
Jan is an open source alternative to ChatGPT that runs 100% offline on your computer
-
Fairness in Serving Large Language Models,
arXiv, 2401.00588
, arxiv, pdf, cication: -1Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica · (s-lora - s-lora)
-
tricksy - austinsilveria
Fast approximate inference on a single GPU with sparsity aware offloading
-
mixtral-offloading - dvmazur
Run Mixtral-8x7B models in Colab or consumer desktops
-
LLM in a flash: Efficient Large Language Model Inference with Limited Memory,
arXiv, 2312.11514
, arxiv, pdf, cication: -1Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar
-
Efficiently Programming Large Language Models using SGLang,
arXiv, 2312.07104
, arxiv, pdf, cication: -1Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez · (sglang - sgl-project)
· (lmsys)
-
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU,
arXiv, 2312.12456
, arxiv, pdf, cication: -1Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen · (PowerInfer - SJTU-IPADS)
-
Cascade Speculative Drafting for Even Faster LLM Inference,
arXiv, 2312.11462
, arxiv, pdf, cication: -1Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, Kevin Chen-Chuan Chang
-
LLMLingua - microsoft
To speed up LLM inference and sharpen the model's perception of key information, LLMLingua compresses the prompt and KV cache, achieving up to 20x compression with minimal performance loss (a sketch of the idea follows).
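A minimal sketch of perplexity-based prompt compression in this spirit, keeping the tokens a small reference LM finds most surprising; this illustrates the idea and is not the LLMLingua API, with gpt2 as a stand-in reference model:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids       # (1, T)
    logits = lm(ids).logits[0, :-1]                        # position t predicts t+1
    nll = F.cross_entropy(logits, ids[0, 1:], reduction="none")
    keep = max(1, int(nll.numel() * keep_ratio))
    idx = nll.topk(keep).indices.sort().values             # most surprising tokens
    return tok.decode(torch.cat([ids[0, :1], ids[0, 1:][idx]]))
```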
-
vllm - vllm-project
A high-throughput and memory-efficient inference and serving engine for LLMs
-
SparQ Attention: Bandwidth-Efficient LLM Inference,
arXiv, 2312.04985
, arxiv, pdf, cication: -1Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr
-
· (yaofu.notion)
-
Optimum-NVIDIA Unlocking blazingly fast LLM inference in just 1 line of code
-
PaSS: Parallel Speculative Sampling,
arXiv, 2311.13581
, arxiv, pdf, cication: -1Giovanni Monea, Armand Joulin, Edouard Grave
-
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | LMSYS Org
· (LookaheadDecoding - hao-ai-lab)
-
Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models,
arXiv, 2311.03687
, arxiv, pdf, cication: -1Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi
· (jiqizhixin)
-
FlashDecoding++: Faster Large Language Model Inference on GPUs,
arXiv, 2311.01282
, arxiv, pdf, cication: -1Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang
-
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time,
ICML, 2023
, arxiv, pdf, cication: 16Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re
-
TensorRT-LLM - NVIDIA
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs)
-
Approximating Two-Layer Feedforward Networks for Efficient Transformers,
arXiv, 2310.10837
, arxiv, pdf, cication: -1Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
-
deepsparse - neuralmagic
Inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application · (huggingface)
-
attention_sinks - tomaarsen
Extend existing LLMs way beyond the original training length with constant memory usage, and without retraining
-
Efficient Streaming Language Models with Attention Sinks,
arXiv, 2309.17453
, arxiv, pdf, cication: 3Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
· (streaming-llm - mit-han-lab)
· (mp.weixin.qq)
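A minimal sketch of the paper's eviction rule: always keep the first few "sink" tokens plus a recent sliding window (the re-assignment of rotary positions inside the rolling cache is omitted):

```python
import torch

def evict(keys: torch.Tensor, values: torch.Tensor,
          n_sink: int = 4, window: int = 1020):
    """keys, values: (seq, head_dim); cap the cache at n_sink + window entries."""
    if keys.size(0) <= n_sink + window:
        return keys, values
    keep = torch.cat([torch.arange(n_sink),                        # sink tokens
                      torch.arange(keys.size(0) - window, keys.size(0))])
    return keys[keep], values[keep]
```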
-
Efficient Memory Management for Large Language Model Serving with PagedAttention,
proceedings of the 29th symposium on operating systems principles, 2023
, arxiv, pdf, cication: 21Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica · (jiqizhixin)
-
llama2.mojo - tairov
Inference Llama 2 in one file of pure 🔥 · (qbitai)
-
fastllm - ztxz16
A pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile phones.
-
flexflow - flexflow
A distributed deep learning framework.
-
Accelerating LLM Inference with Staged Speculative Decoding,
arXiv, 2308.04623
, arxiv, pdf, cication: 3Benjamin Spector, Chris Re
-
CTranslate2 - OpenNMT
Fast inference engine for Transformer models
-
Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding,
arXiv, 2307.15337
, arxiv, pdf, cication: 4Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, Yu Wang
-
SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference,
arXiv, 2307.02628
, arxiv, pdf, cication: -1Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee
-
An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs,
arXiv, 2306.16601
, arxiv, pdf, cication: -1Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang
-
NeuralFuse: Learning to Improve the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes,
arXiv, 2306.16869
, arxiv, pdf, cication: -1Hao-Lun Sun, Lei Hsiung, Nandhini Chandramoorthy, Pin-Yu Chen, Tsung-Yi Ho
-
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models,
arXiv, 2306.14048
, arxiv, pdf, cication: -1Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett · (H2O - FMInference)
· (mp.weixin.qq)
-
· (zhuanlan.zhihu)
-
vllm - vllm-project
A high-throughput and memory-efficient inference and serving engine for LLMs · (mp.weixin.qq) · (jiqizhixin)
-
SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification,
arXiv, 2305.09781
, arxiv, pdf, cication: -1Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi · (FlexFlow - flexflow)
· (mp.weixin.qq)
-
llama.cpp - ggerganov
Port of Facebook's LLaMA model in C/C++ · (ggml) · (llama.cpp - ggerganov)
-
Accelerate Mixtral 8x7B with Speculative Decoding and Quantization on Amazon SageMaker
-
LLM Inference Provider Leaderboard
· (jiqizhixin)
-
Accelerating SD Turbo and SDXL Turbo Inference with ONNX Runtime and Olive
-
· (mp.weixin.qq)
-
Speculative execution for LLMs is an excellent inference-time optimization.
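A minimal sketch of one draft-and-verify round, using greedy acceptance for clarity (the lossless variant verifies with rejection sampling over draft/target probabilities) and assuming batch size 1 and two causal LMs sharing a tokenizer:

```python
import torch

@torch.no_grad()
def speculative_step(draft, target, ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    prop = ids
    for _ in range(k):                                   # k cheap small-model calls
        nxt = draft(prop).logits[:, -1].argmax(-1, keepdim=True)
        prop = torch.cat([prop, nxt], dim=-1)
    # One large-model pass scores all k drafted positions in parallel.
    tgt = target(prop).logits[:, ids.size(1) - 1 : -1].argmax(-1)
    drafted = prop[:, ids.size(1):]
    n_ok = int((tgt == drafted).cumprod(-1).sum())       # longest agreeing prefix
    # Accept the agreeing prefix, then take one corrected token from the target.
    return torch.cat([ids, drafted[:, :n_ok], tgt[:, n_ok : n_ok + 1]], dim=-1)
```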
-
tvm_mlir_learn - BBuf
A collection of compiler learning resources. · (mp.weixin.qq)
-
No need for four H100s: the 34B-parameter Code Llama runs on a Mac at 20 tokens/s and is best at code generation | retweeted approvingly by Karpathy
-
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases,
arXiv, 2402.14905
, arxiv, pdf, cication: 15Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi
· (MobileLLM - facebookresearch)
-
HARE: HumAn pRiors, a key to small language model Efficiency,
arXiv, 2406.11410
, arxiv, pdf, cication: -1Lingyun Zhang, Bin jin, Gaojian Ge, Lunhui Liu, Xuewen Shen, Mingyong Wu, Houqian Zhang, Yongneng Jiang, Shiqi Chen, Shi Pu
-
Octopus v2: On-device language model for super agent,
arXiv, 2404.01744
, arxiv, pdf, cication: -1Wei Chen, Zhiyuan Li
-
Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs,
arXiv, 2403.20041
, arxiv, pdf, cication: -1Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie
-
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT,
arXiv, 2402.16840
, arxiv, pdf, cication: -1Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan
· (MobiLlama - mbzuai-oryx)
-
MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices,
arXiv, 2312.16886
, arxiv, pdf, cication: -1Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei · (MobileVLM - Meituan-AutoML)
-
mlc-llm - mlc-ai
Enable everyone to develop, optimize and deploy AI models natively on everyone's devices. · (jiqizhixin) · (jiqizhixin)
-
transformer-heads - center-for-humans-and-machines
Toolkit for attaching, training, saving and loading of new heads for transformer models
-
quanto - huggingface
A pytorch Quantization Toolkit · (huggingface)
-
fsdp_qlora - AnswerDotAI
Training LLMs with QLoRA + FSDP · (answer) · (mp.weixin.qq)
-
GPTFast - MDK8888
Accelerate your Hugging Face Transformers 6-7x. Native to Hugging Face and PyTorch.
-
lorax - predibase
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
-
Winners 🏆 | NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day
-
gigaGPT - Cerebras
a small code base for training large models · (cerebras)
-
EAGLE - SafeAILab
EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation · (sites.google)
· (jiqizhixin)
-
optimum-nvidia - huggingface
-
unsloth - unslothai
5x faster, 50% less memory LLM finetuning
-
lit-gpt - Lightning-AI
Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.
-
gpt-fast - pytorch-labs
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
-
MS-AMP - Azure
Microsoft Automatic Mixed Precision Library
-
DeepSpeed - microsoft
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
-
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | Tri Dao
· (flash-attention - Dao-AILab)
-
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression,
arXiv, 2406.14909
, arxiv, pdf, cication: -1Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan
-
Flash Attention (Fast and Memory-Efficient Exact Attention with IO-Awareness)
-
Block Transformer: Global-to-Local Language Modeling for Fast Inference,
arXiv, 2406.02657
, arxiv, pdf, cication: -1Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun
· (block-transformer - itsnamgyu)
-
LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models,
arXiv, 2405.18377
, arxiv, pdf, cication: -1Anthony Sarah, Sharath Nittur Sridhar, Maciej Szankin, Sairam Sundaresan
-
SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization,
arXiv, 2405.11582
, arxiv, pdf, cication: -1Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang · (SLAB - xinghaochen)
-
You Only Cache Once: Decoder-Decoder Architectures for Language Models,
arXiv, 2405.05254
, arxiv, pdf, cication: -1Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei · (aka) · (unilm - microsoft)
-
Is Flash Attention Stable?,
arXiv, 2405.02803
, arxiv, pdf, cication: -1Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks
-
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models,
arXiv, 2404.07839
, arxiv, pdf, cication: -1Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi · (recurrentgemma - google-deepmind)
-
8bit HippoAttention: Up to 3X Faster Compared to FlashAttentionV2 | by HippoML Blog | Medium
-
LASP - OpenNLPLab
Linear Attention Sequence Parallelism (LASP)
-
Simple linear attention language models balance the recall-throughput tradeoff,
arXiv, 2402.18668
, arxiv, pdf, cication: -1Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré
· (based - hazyresearch)
-
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models,
arXiv, 2402.19427
, arxiv, pdf, cication: -1Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan
· (jiqizhixin)
· (huggingface) · (twitter)
-
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition,
arXiv, 2402.15220
, arxiv, pdf, cication: -1Lu Ye, Ze Tao, Yong Huang, Yang Li
-
Linear Transformers are Versatile In-Context Learners,
arXiv, 2402.14180
, arxiv, pdf, cication: -1Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge
-
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry,
arXiv, 2402.04347
, arxiv, pdf, cication: -1Michael Zhang, Kush Bhatia, Hermann Kumbong, Christopher Ré
-
flash-linear-attention - sustcsonglin
Fast implementations of causal linear attention for autoregressive language modeling (PyTorch)
-
PanGu-$π$: Enhancing Language Model Architectures via Nonlinearity Compensation,
arXiv, 2312.17276
, arxiv, pdf, cication: -1Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie, Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang
-
Agent Attention: On the Integration of Softmax and Linear Attention,
arXiv, 2312.08874
, arxiv, pdf, cication: -1Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang · (agent-attention - leaplabthu)
-
Weight subcloning: direct initialization of transformers using larger pretrained ones,
arXiv, 2312.09299
, arxiv, pdf, cication: -1Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari
-
Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models,
arXiv, 2312.07046
, arxiv, pdf, cication: -1Arnav Chavan, Nahush Lele, Deepak Gupta
-
Efficient Monotonic Multihead Attention,
arXiv, 2312.04515
, arxiv, pdf, cication: -1Xutai Ma, Anna Sun, Siqi Ouyang, Hirofumi Inaguma, Paden Tomasello
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces,
arXiv, 2312.00752
, arxiv, pdf, cication: -1Albert Gu, Tri Dao · (qbitai)
-
Simplifying Transformer Blocks,
arXiv, 2311.01906
, arxiv, pdf, cication: -1Bobby He, Thomas Hofmann · (jiqizhixin)
-
Exponentially Faster Language Modelling,
arXiv, 2311.10770
, arxiv, pdf, cication: -1Peter Belcak, Roger Wattenhofer
-
Alternating Updates for Efficient Transformers,
arXiv, 2301.13310
, arxiv, pdf, cication: -1Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang
-
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,
NeurIPS, 2022
, arxiv, pdf, cication: 278Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
-
Fast Transformer Decoding: One Write-Head is All You Need,
arXiv, 1911.02150
, arxiv, pdf, cication: 61Noam Shazeer
· (zhuanlan.zhihu)
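A minimal sketch of the multi-query attention this paper introduces: H query heads share a single key/value head, shrinking the KV cache H-fold (no cache shown; the wiring is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.h, self.d = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * self.d)  # ONE shared k/v head, not H
        self.o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.h, self.d).transpose(1, 2)   # (B, H, T, d)
        k, v = self.kv(x).split(self.d, dim=-1)                    # (B, T, d) each
        k = k.unsqueeze(1).expand(B, self.h, T, self.d)            # broadcast to H
        v = v.unsqueeze(1).expand(B, self.h, T, self.d)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o(out.transpose(1, 2).reshape(B, T, -1))

x = torch.randn(2, 16, 256)
print(MultiQueryAttention(256, 8)(x).shape)  # torch.Size([2, 16, 256])
```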
-
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU,
arXiv, 2403.06504
, arxiv, pdf, cication: -1Changyue Liao, Mo Sun, Zihan Yang, Kaiqi Chen, Binhang Yuan, Fei Wu, Zeke Wang
-
Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers,
arXiv, 2402.04744
, arxiv, pdf, cication: -1Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna
-
FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs,
arXiv, 2401.03868
, arxiv, pdf, cication: -1Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang
· (jiqizhixin)
-
· (jiqizhixin)
-
Schedule | NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day
-
Dynamic LoRA loading for better performance and optimized resource usage
-
Awesome-Knowledge-Distillation-of-LLMs - Tebmer
This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & Vertical Distillation of LLMs.
-
Awesome-LLM-Compression - HuangOwen
Awesome LLM compression research papers and tools.