Awesome Efficient LLM

Survey

  • Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey, arXiv, 2403.14608, arxiv, pdf, cication: 18

    Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, Sai Qian Zhang

  • A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, arXiv, 2405.13019, arxiv, pdf, cication: -1

    Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha

  • Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models, arXiv, 2404.14897, arxiv, pdf, cication: -1

    Chen Zhang, Zhuorui Liu, Dawei Song

  • A Survey on Efficient Inference for Large Language Models, arXiv, 2404.14294, arxiv, pdf, cication: -1

    Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li

  • A Survey on Knowledge Distillation of Large Language Models, arXiv, 2402.13116, arxiv, pdf, cication: -1

    Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Tianyi Zhou · (Awesome-Knowledge-Distillation-of-LLMs - Tebmer) Star · (jiqizhixin)

  • Efficient Exploration for LLMs, arXiv, 2402.00396, arxiv, pdf, cication: -1

    Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy

  • A Comprehensive Survey of Compression Algorithms for Language Models, arXiv, 2401.15347, arxiv, pdf, cication: -1

    Seungcheol Park, Jaehyeon Choi, Sojin Lee, U Kang

  • A Survey of Resource-efficient LLM and Multimodal Foundation Models, arXiv, 2401.08092, arxiv, pdf, cication: -1

    Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang

    · (efficient_foundation_model_survey - ubiquitouslearning) Star

  • Understanding LLMs: A Comprehensive Overview from Training to Inference, arXiv, 2401.02038, arxiv, pdf, cication: -1

    Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong

  • Efficient Large Language Models: A Survey, arXiv, 2312.03863, arxiv, pdf, cication: -1

    Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury · (Efficient-LLMs-Survey - AIoT-MLSys-Lab) Star

  • Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, arXiv, 2312.15234, arxiv, pdf, cication: -1

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia · (mp.weixin.qq)

  • The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, arXiv, 2312.00678, arxiv, pdf, cication: -1

    Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang

  • Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning, arXiv, 2303.15647, arxiv, pdf, cication: -1

    Vladislav Lialin, Vijeta Deshpande, Anna Rumshisky

Efficient finetuning

  • Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models, arXiv, 2407.01906, arxiv, pdf, cication: -1

    Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, Y. Wu

  • OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models, arXiv, 2406.01775, arxiv, pdf, cication: -1

    Kerim Büyükakyüz

  • VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections, arXiv, 2405.17991, arxiv, pdf, cication: -1

    Roy Miles, Pradyumna Reddy, Ismail Elezi, Jiankang Deng

  • Trans-LoRA: towards data-free Transferable Parameter Efficient Finetuning, arXiv, 2405.17258, arxiv, pdf, cication: -1

    Runqian Wang, Soumya Ghosh, David Cox, Diego Antognini, Aude Oliva, Rogerio Feris, Leonid Karlinsky

  • Towards Modular LLMs by Building and Reusing a Library of LoRAs, arXiv, 2405.11157, arxiv, pdf, cication: -1

    Oleksiy Ostapenko, Zhan Su, Edoardo Maria Ponti, Laurent Charlin, Nicolas Le Roux, Matheus Pereira, Lucas Caccia, Alessandro Sordoni

  • MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning, arXiv, 2405.12130, arxiv, pdf, cication: -1

    Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang

  • LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report, arXiv, 2405.00732, arxiv, pdf, cication: -1

    Justin Zhao, Timothy Wang, Wael Abid, Geoffrey Angus, Arnav Garg, Jeffery Kinnison, Alex Sherstinsky, Piero Molino, Travis Addair, Devvret Rishi

  • PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models, arXiv, 2404.02948, arxiv, pdf, cication: -1

    Fanxu Meng, Zhaohui Wang, Muhan Zhang · (PiSSA - GraphPKU) Star

  • ReFT: Representation Finetuning for Language Models, arXiv, 2404.03592, arxiv, pdf, cication: -1

    Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts · (pyreft - stanfordnlp) Star

    • by manipulating a small fraction of model representations it is possible to effectively steer model behavior to achieve better downstream performance at inference time; also proposes LoReFT as a drop-in replacement for PEFTs that is 10-50x more parameter efficient.
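
A minimal sketch of the representation-intervention idea described above (hedged: class and symbol names are illustrative, not the pyreft API). It learns a rank-r edit applied to a frozen model's hidden states at chosen positions:

```python
import torch
import torch.nn as nn

class LowRankIntervention(nn.Module):
    """Sketch of a LoReFT-style edit: h' = h + R^T (W h + b - R h)."""
    def __init__(self, hidden_size: int, rank: int = 4):
        super().__init__()
        self.R = nn.Parameter(torch.empty(rank, hidden_size))  # low-rank subspace
        nn.init.orthogonal_(self.R)
        self.W = nn.Linear(hidden_size, rank)  # learned projected source (includes bias b)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Edit h only inside the subspace spanned by R; the base model stays frozen.
        return h + (self.W(h) - h @ self.R.T) @ self.R
```
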
  • Model Stock: All we need is just a few fine-tuned models, arXiv, 2403.19522, arxiv, pdf, cication: -1

    Dong-Hwan Jang, Sangdoo Yun, Dongyoon Han

    • uses just two models for layer-wise weight averaging.
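
A hedged sketch of the layer-wise averaging idea above; Model Stock additionally interpolates each layer toward the pretrained weights with a ratio derived from the angle between the two fine-tuned models, which this simplification omits:

```python
import torch

def layerwise_average(state_a: dict, state_b: dict) -> dict:
    """Average two fine-tuned checkpoints parameter-by-parameter (simplified sketch)."""
    return {name: (state_a[name] + state_b[name]) / 2 for name in state_a}

# merged = layerwise_average(model_a.state_dict(), model_b.state_dict())
# base_model.load_state_dict(merged)
```
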
  • DiJiang: Efficient Large Language Models through Compact Kernelization, arXiv, 2403.19928, arxiv, pdf, cication: -1

    Hanting Chen, Zhicheng Liu, Xutao Wang, Yuchuan Tian, Yunhe Wang · (DiJiang - YuchuanTian) Star

  • LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning, arXiv, 2403.17919, arxiv, pdf, cication: -1

    Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, Tong Zhang

    · (jiqizhixin) · (LMFlow - OptimalScale) Star

    • randomly freezing middle layers during training based on importance sampling, which is efficient and can outperform both LoRA and full LLM finetuning by a noticeable margin in terms of model performance.
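
A minimal sketch of the layer-freezing step described above, assuming a list of transformer blocks; the function name and sampling scheme are illustrative, not the LMFlow API:

```python
import random
import torch.nn as nn

def lisa_resample(layers: nn.ModuleList, n_active: int = 2) -> None:
    """Freeze all intermediate layers, then unfreeze a random subset (LISA-style).

    Embeddings and the LM head (not shown) would stay trainable throughout;
    call this every K optimizer steps to rotate which layers receive gradients.
    """
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad = False
    for idx in random.sample(range(len(layers)), n_active):
        for p in layers[idx].parameters():
            p.requires_grad = True
```
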
  • Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models, arXiv, 2403.03432, arxiv, pdf, cication: -1

    Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Yu Han, Hao Wang

  • GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection, arXiv, 2403.03507, arxiv, pdf, cication: -1

    Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian · (galore - jiaweizzhao) Star · (huggingface)

    • Gradient Low-Rank Projection (GaLore) is a new training strategy that significantly reduces memory usage by up to 65.5% for optimizer states during the training of LLMs, without sacrificing performance.

    · (mp.weixin.qq)
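
A rough sketch of the gradient low-rank projection idea, assuming a 2-D weight gradient; the actual implementation refreshes the projection periodically and integrates with the optimizer (see the galore repo linked above):

```python
import torch

def galore_project(grad: torch.Tensor, rank: int = 128):
    """Project an (m, n) gradient onto its top-r left singular vectors.

    Optimizer states (e.g. Adam moments) are kept on the (r, n) projected
    gradient instead of the full (m, n) one, which is where the memory saving comes from.
    """
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    r = min(rank, U.shape[1])
    P = U[:, :r]                 # (m, r) projection matrix, refreshed every few hundred steps
    return P, P.T @ grad         # projected gradient of shape (r, n)

def galore_project_back(P: torch.Tensor, low_rank_update: torch.Tensor) -> torch.Tensor:
    # Map the optimizer's low-rank update back to the full weight shape (m, n).
    return P @ low_rank_update
```
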

  • LoRA+: Efficient Low Rank Adaptation of Large Models, arXiv, 2402.12354, arxiv, pdf, cication: -1

    Soufiane Hayou, Nikhil Ghosh, Bin Yu

  • DoRA: Weight-Decomposed Low-Rank Adaptation, arXiv, 2402.09353, arxiv, pdf, cication: -1

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Min-Hung Chen · (dora - catid) Star

    · (magazine.sebastianraschka)

  • Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models, arXiv, 2401.00788, arxiv, pdf, cication: -1

    Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu, Niklas Muennighoff

  • Parameter Efficient Tuning Allows Scalable Personalization of LLMs for Text Entry: A Case Study on Abbreviation Expansion, arXiv, 2312.14327, arxiv, pdf, cication: -1

    Katrin Tomanek, Shanqing Cai, Subhashini Venugopalan

  • Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)

    · (jiqizhixin)

  • MultiLoRA: Democratizing LoRA for Better Multi-Task Learning, arXiv, 2311.11501, arxiv, pdf, cication: -1

    Yiming Wang, Yu Lin, Xiaodong Zeng, Guannan Zhang

  • Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying, arXiv, 2311.09578, arxiv, pdf, cication: -1

    Adithya Renduchintala, Tugrul Konuk, Oleksii Kuchaiev

  • SiRA: Sparse Mixture of Low Rank Adaptation, arXiv, 2311.09179, arxiv, pdf, cication: -1

    Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, Han Lu, Canoee Liu, Liangchen Luo, Jindong Chen

  • Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization, arXiv, 2311.06243, arxiv, pdf, cication: -1

    Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng · (boft.wyliu)

  • Punica: Multi-Tenant LoRA Serving, arXiv, 2310.18547, arxiv, pdf, cication: -1

    Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy · (punica - punica-ai) Star

  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters, arXiv, 2311.03285, arxiv, pdf, cication: -1

    Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer · (s-lora - s-lora) Star

  • LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery, arXiv, 2310.18356, arxiv, pdf, cication: -1

    Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, Luming Liang

  • VeRA: Vector-based Random Matrix Adaptation, arXiv, 2310.11454, arxiv, pdf, cication: -1

    Dawid Jan Kopiczko, Tijmen Blankevoort, Yuki Markus Asano · (mp.weixin.qq)

  • LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models, arXiv, 2310.08659, arxiv, pdf, cication: 1

    Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao

    · (peft - huggingface) Star

  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models, arXiv, 2309.14717, arxiv, pdf, cication: -1

    Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian · (qa-lora - yuhuixu1993) Star

  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, arXiv, 2309.12307, arxiv, pdf, cication: 5

    Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia · (LongLoRA - dvlab-research) Star

  • LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition, arXiv, 2307.13269, arxiv, pdf, cication: 6

    Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, Min Lin

  • Stack More Layers Differently: High-Rank Training Through Low-Rank Updates, arXiv, 2307.05695, arxiv, pdf, cication: 2

    Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky · (peft_pretraining - guitaricet) Star

  • LLaMA-Efficient-Tuning - hiyouga Star

    Fine-tuning LLaMA with PEFT (PT+SFT+RLHF with QLoRA)

  • InRank: Incremental Low-Rank Learning, arXiv, 2306.11250, arxiv, pdf, cication: 2

    Jiawei Zhao, Yifei Zhang, Beidi Chen, Florian Schäfer, Anima Anandkumar · (inrank - jiaweizzhao) Star

  • One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning, arXiv, 2306.07967, arxiv, pdf, cication: -1

    Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, Zhiqiang Shen · (ViT-Slim - Arnav0400) Star

  • Full Parameter Fine-tuning for Large Language Models with Limited Resources, arXiv, 2306.09782, arxiv, pdf, cication: -1

    Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, Xipeng Qiu · (LOMO - OpenLMLab) Star

  • PockEngine: Sparse and Efficient Fine-tuning in a Pocket, arXiv, 2310.17752, arxiv, pdf, cication: -1

    Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han

  • AI and Memory Wall | by Amir Gholami | riselab | Medium

Adapter

  • Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning, arXiv, 2311.11077, arxiv, pdf, cication: -1

    Clifton Poth, Hannah Sterz, Indraneil Paul, Sukannya Purkayastha, Leon Engländer, Timo Imhof, Ivan Vulić, Sebastian Ruder, Iryna Gurevych, Jonas Pfeiffer · (adapterhub)

  • Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, arXiv, 2305.15023, arxiv, pdf, cication: 18

    Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji · (LaVIN - luogen1996) Star · (mp.weixin.qq)

  • LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, arXiv, 2304.15010, arxiv, pdf, cication: 82

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue · (mp.weixin.qq)

Other

Quantization

Survey

  • A Performance Evaluation of a Quantized Large Language Model on Various Smartphones, arXiv, 2312.12472, arxiv, pdf, cication: -1

    Tolga Çöplü, Marc Loedi, Arto Bendiken, Mykhailo Makohin, Joshua J. Bouw, Stephen Cobb

  • A Survey on Model Compression for Large Language Models, arXiv, 2308.07633, arxiv, pdf, cication: -1

    Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang · (jiqizhixin)

Papers

  • 4-bit Shampoo for Memory-Efficient Network Training, arXiv, 2405.18144, arxiv, pdf, cication: -1

    Sike Wang, Jia Li, Pan Zhou, Hua Huang

  • QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, arXiv, 2405.04532, arxiv, pdf, cication: -1

    Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han · (qserve - mit-han-lab) Star

  • QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs, arXiv, 2404.00456, arxiv, pdf, cication: -1

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman · (QuaRot - spcl) Star

  • EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs, arXiv, 2403.02775, arxiv, pdf, cication: -1

    Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang

  • The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits, arXiv, 2402.17764, arxiv, pdf, cication: -1

    Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei

    · (unilm - microsoft) Star · (unilm - microsoft) Star

  • GPTVQ: The Blessing of Dimensionality for LLM Quantization, arXiv, 2402.15319, arxiv, pdf, cication: -1

    Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough

  • OneBit: Towards Extremely Low-bit Large Language Models, arXiv, 2402.11295, arxiv, pdf, cication: -1

    Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che

  • Extreme Compression of Large Language Models via Additive Quantization, arXiv, 2401.06118, arxiv, pdf, cication: -1

    Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh · (aqlm - vahe1994) Star

  • KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache, arXiv, 2402.02750, arxiv, pdf, cication: -1

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu · (kivi - jy-yuan) Star

  • Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers, arXiv, 2402.08958, arxiv, pdf, cication: -1

    Junhan Kim, Kyungphil Park, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon

  • TP-Aware Dequantization, arXiv, 2402.04925, arxiv, pdf, cication: -1

    Adnan Hoque, Mudhakar Srivatsa, Chih-Chieh Yang, Raghu Ganti

  • BiLLM: Pushing the Limit of Post-Training Quantization for LLMs, arXiv, 2402.04291, arxiv, pdf, cication: -1

    Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi

  • FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design, arXiv, 2401.14112, arxiv, pdf, cication: -1

    Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou

  • ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks, arXiv, 2312.08583, arxiv, pdf, cication: -1

    Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Yuxiong He, Olatunji Ruwase, Leon Song

  • LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning, arXiv, 2311.12023, arxiv, pdf, cication: -1

    Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim · (lq-lora - hanguo97) Star

  • QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models, arXiv, 2310.09259, arxiv, pdf, cication: -1

    Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh · (quik - ist-daslab) Star

  • FP8-LM: Training FP8 Large Language Models, arXiv, 2310.18313, arxiv, pdf, cication: -1

    Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu

  • LLM-FP4: 4-Bit Floating-Point Quantized Transformers, arXiv, 2310.16836, arxiv, pdf, cication: -1

    Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng

  • QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models, arXiv, 2310.16795, arxiv, pdf, cication: -1

    Elias Frantar, Dan Alistarh · (mp.weixin.qq)

  • BitNet: Scaling 1-bit Transformers for Large Language Models, arXiv, 2310.11453, arxiv, pdf, cication: -1

    Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei

  • Atom: Low-bit Quantization for Efficient and Accurate LLM Serving, arXiv, 2310.19102, arxiv, pdf, cication: -1

    Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci · (atom - efeslab) Star

  • TEQ: Trainable Equivalent Transformation for Quantization of LLMs, arXiv, 2310.10944, arxiv, pdf, cication: 1

    Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen

  • Efficient Post-training Quantization with FP8 Formats, arXiv, 2309.14592, arxiv, pdf, cication: -1

    Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang · (neural-compressor - intel) Star

  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models, arXiv, 2309.14717, arxiv, pdf, cication: -1

    Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian

  • Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs, arXiv, 2309.05516, arxiv, pdf, cication: -1

    Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv

  • Memory Efficient Optimizers with 4-bit States, arXiv, 2309.01507, arxiv, pdf, cication: 1

    Bingrui Li, Jianfei Chen, Jun Zhu · (jiqizhixin)

  • OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models, arXiv, 2308.13137, arxiv, pdf, cication: 2

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo · (OmniQuant - OpenGVLab) Star

  • FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search, arXiv, 2308.03290, arxiv, pdf, cication: -1

    Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi

  • QuIP: 2-Bit Quantization of Large Language Models With Guarantees, arXiv, 2307.13304, arxiv, pdf, cication: -1

    Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa · (quip - jerry-chee) Star

  • Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing, arXiv, 2306.12929, arxiv, pdf, cication: -1

    Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort

  • Training Transformers with 4-bit Integers, arXiv, 2306.11987, arxiv, pdf, cication: -1

    Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu · (jiqizhixin)

  • SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression, arXiv, 2306.03078, arxiv, pdf, cication: -1

    Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh · (jiqizhixin)

  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, arXiv, 2306.00978, arxiv, pdf, cication: -1

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Chuang Gan, Song Han

  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, arXiv, 2210.17323, arxiv, pdf, cication: -1

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh · (gptq - IST-DASLab) Star

  • [2208.07339] LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    · (bitsandbytes - timdettmers) Star

Projects

  • lmstudio-community (LM Studio Community)

  • BitMat - astramind-ai Star

    An efficient implementation of the method proposed in "The Era of 1-bit LLMs"

  • 1bitLLM (1bitLLM)

  • QLLM - wejoncy Star

    A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime.

  • hqq - mobiusml Star

    Official implementation of Half-Quadratic Quantization (HQQ) · (mobiusml.github)

  • exllamav2 - turboderp Star

    A fast inference library for running LLMs locally on modern consumer-class GPUs · (mp.weixin.qq)

  • PB-LLM - hahnyuan Star

    PB-LLM: Partially Binarized Large Language Models

  • AttentionIsOFFByOne - kyegomez Star

    Implementation of "Attention Is Off By One" by Evan Miller · (evanmiller) · (jiqizhixin)

  • llama.cpp - ggerganov Star

    Port of Facebook's LLaMA model in C/C++ · (finbarr)

  • llama2-webui - liltom-eth Star

    Run Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Supporting Llama-2-7B/13B/70B with 8-bit, 4-bit. Supporting GPU inference (6 GB VRAM) and CPU inference.

  • neural-compressor - intel Star

    Provide unified APIs for SOTA model compression techniques, such as low precision (INT8/INT4/FP4/NF4) quantization, sparsity, pruning, and knowledge distillation on mainstream AI frameworks such as TensorFlow, PyTorch, and ONNX Runtime. · (neural-compressor - intel) Star · (mp.weixin.qq)

  • exllama - turboderp Star

    A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

  • squeezellm - squeezeailab Star

    SqueezeLLM: Dense-and-Sparse Quantization

Other

Distillation

  • PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs, arXiv, 2406.02886, arxiv, pdf, cication: -1

    Rongzhi Zhang, Jiaming Shen, Tianqi Liu, Haorui Wang, Zhen Qin, Feng Han, Jialu Liu, Simon Baumgartner, Michael Bendersky, Chao Zhang

  • Divide-or-Conquer? Which Part Should You Distill Your LLM?, arXiv, 2402.15000, arxiv, pdf, cication: -1

    Zhuofeng Wu, He Bai, Aonan Zhang, Jiatao Gu, VG Vinod Vydiswaran, Navdeep Jaitly, Yizhe Zhang

  • DistiLLM: Towards Streamlined Distillation for Large Language Models, arXiv, 2402.03898, arxiv, pdf, cication: -1

    Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun · (distillm - jongwooko) Star

  • Scavenging Hyena: Distilling Transformers into Long Convolution Models, arXiv, 2401.17574, arxiv, pdf, cication: -1

    Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Mohammad Sami Nur Islam, Wassim Jabbour, Laurence Liang

  • Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning - ACL Anthology

    · (twitter)

  • Initializing Models with Larger Ones, arXiv, 2311.18823, arxiv, pdf, cication: -1

    Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu · (weight-selection - oscarxzq) Star

  • Tailoring Self-Rationalizers with Multi-Reward Distillation, arXiv, 2311.02805, arxiv, pdf, cication: -1

    Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren

  • Co-training and Co-distillation for Quality Improvement and Compression of Language Models, arXiv, 2311.02849, arxiv, pdf, cication: -1

    Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Ju Hwang, Alexander Min

  • TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise, arXiv, 2310.19019, arxiv, pdf, cication: -1

    Nan He, Hanyu Lai, Chenyang Zhao, Zirui Cheng, Junting Pan, Ruoyu Qin, Ruofan Lu, Rui Lu, Yunchen Zhang, Gangming Zhao

  • Farzi Data: Autoregressive Data Distillation, arXiv, 2310.09983, arxiv, pdf, cication: -1

    Noveen Sachdeva, Zexue He, Wang-Cheng Kang, Jianmo Ni, Derek Zhiyuan Cheng, Julian McAuley

  • Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes, arXiv, 2305.02301, arxiv, pdf, cication: 48

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister

  • Composable Function-preserving Expansions for Transformer Architectures, arXiv, 2308.06103, arxiv, pdf, cication: 1

    Andrea Gesmundo, Kaitlin Maile

  • UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition, arXiv, 2308.03279, arxiv, pdf, cication: 2

    Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, Hoifung Poon

  • Generalized Knowledge Distillation for Auto-regressive Language Models, arXiv, 2306.13649, arxiv, pdf, cication: -1

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem

  • Knowledge Distillation of Large Language Models, arXiv, 2306.08543, arxiv, pdf, cication: -1

    Yuxian Gu, Li Dong, Furu Wei, Minlie Huang

Pruning

  • ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization, arXiv, 2406.05981, arxiv, pdf, cication: -1

    Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Lin · (ShiftAddLLM - GATECH-EIC) Star

  • Scalable MatMul-free Language Modeling, arXiv, 2406.02528, arxiv, pdf, cication: -1

    Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian · (matmulfreellm - ridgerchu) Star

  • Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models, arXiv, 2405.20541, arxiv, pdf, cication: -1

    Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul

  • PruneGPT - nyunAI Star

    · (huggingface)

  • The Unreasonable Ineffectiveness of the Deeper Layers, arXiv, 2403.17887, arxiv, pdf, cication: -1

    Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts

    • selectively pruning up to half the layers of pretrained LLMs, followed by strategic finetuning with quantization and QLoRA, minimally impacts performance on question-answering tasks.
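
A hedged sketch of the layer-dropping step described above, assuming a Hugging Face LLaMA-style model that exposes `model.model.layers`; the subsequent QLoRA "healing" finetune is not shown:

```python
import torch.nn as nn

def drop_layer_block(model, start: int, end: int):
    """Remove the contiguous decoder layers [start, end) from a LLaMA-style model."""
    kept = [layer for i, layer in enumerate(model.model.layers) if not (start <= i < end)]
    model.model.layers = nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)
    return model
```
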
  • ShortGPT: Layers in Large Language Models are More Redundant Than You Expect, arXiv, 2403.03853, arxiv, pdf, cication: -1

    Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, Weipeng Chen

    • This study introduces the Block Influence (BI) metric to assess each layer's importance in LLMs and proposes ShortGPT, a pruning approach that removes redundant layers based on BI scores.
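
A minimal sketch of the Block Influence score: one minus the mean cosine similarity between a layer's input and output hidden states on a calibration set (symbol names are illustrative):

```python
import torch
import torch.nn.functional as F

def block_influence(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """BI for one layer: layers that barely change their input score low and are pruned first."""
    cos = F.cosine_similarity(hidden_in, hidden_out, dim=-1)  # per-token similarity
    return (1.0 - cos).mean().item()
```
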
  • Shortened LLaMA: A Simple Depth Pruning for Large Language Models, arXiv, 2402.02834, arxiv, pdf, cication: -1

    Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song

  • SliceGPT: Compress Large Language Models by Deleting Rows and Columns, arXiv, 2401.15024, arxiv, pdf, cication: -1

    Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman

  • Fast Llama 2 on CPUs With Sparse Fine-Tuning and DeepSparse - Neural Magic

  • The LLM Surgeon, arXiv, 2312.17244, arxiv, pdf, cication: -1

    Tycho F. A. van der Ouderaa, Markus Nagel, Mart van Baalen, Yuki M. Asano, Tijmen Blankevoort

  • Mini-GPTs: Efficient Large Language Models through Contextual Pruning, arXiv, 2312.12682, arxiv, pdf, cication: -1

    Tim Valicenti, Justice Vidal, Ritik Patnaik

  • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, arXiv, 2310.06694, arxiv, pdf, cication: 2

    Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen · (qbitai) · (xiamengzhou.github) · (llm-shearing - princeton-nlp) Star

  • wanda - locuslab Star

    A simple and effective LLM pruning approach.

  • ResiDual: Transformer with Dual Residual Connections, arXiv, 2304.14802, arxiv, pdf, cication: 7

    Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany Hassan Awadalla, Arul Menezes, Tao Qin, Rui Yan · (ResiDual - microsoft) Star

Efficient Inference

  • MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, arXiv, 2407.02490, arxiv, pdf, cication: -1

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin

    · (MInference - microsoft) Star · (aka)

  • Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters, arXiv, 2406.16758, arxiv, pdf, cication: -1

    Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun · (Multilingual-SpecBench - Kthyeon) Star

  • EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees, arXiv, 2406.16858, arxiv, pdf, cication: -1

    Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

  • Optimizing AI Inference at Character.AI

  • A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression, arXiv, 2406.11430, arxiv, pdf, cication: -1

    Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini

  • PowerInfer-2: Fast Large Language Model Inference on a Smartphone, arXiv, 2406.06282, arxiv, pdf, cication: -1

    Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen

  • PowerInfer-2: Fast Large Language Model Inference on a Smartphone | PowerInfer

  • Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference, proceedings of the chi conference on human factors in computing systems, 2024, arxiv, pdf, cication: 1

    Fred Hohman, Chaoqun Wang, Jinmook Lee, Jochen Görtler, Dominik Moritz, Jeffrey P Bigham, Zhile Ren, Cecile Foret, Qi Shan, Xiaoyi Zhang · (machinelearning.apple)

  • Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters, arXiv, 2406.05955, arxiv, pdf, cication: -1

    Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, Haibo Chen · (huggingface)

  • Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, arXiv, 2405.19888, arxiv, pdf, cication: 1

    Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu

  • Nearest Neighbor Speculative Decoding for LLM Generation and Attribution, arXiv, 2405.19325, arxiv, pdf, cication: -1

    Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin

  • Hamel’s Blog - Optimizing latency

  • Distributed Speculative Inference of Large Language Models, arXiv, 2405.14105, arxiv, pdf, cication: -1

    Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel

  • Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, arXiv, 2405.12981, arxiv, pdf, cication: -1

    William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly

  • PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference, arXiv, 2405.12532, arxiv, pdf, cication: -1

    Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao

  • Layer-Condensed KV Cache for Efficient Inference of Large Language Models, arXiv, 2405.10637, arxiv, pdf, cication: -1

    Haoyi Wu, Kewei Tu · (LCKV - whyNLP) Star

  • vidur - microsoft Star

    A large-scale simulation framework for LLM inference

  • Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers, arXiv, 2405.05219, arxiv, pdf, cication: -1

    Jiuxiang Gu, Yingyu Liang, Heshan Liu, Zhenmei Shi, Zhao Song, Junze Yin

  • ThunderKittens - HazyResearch Star

    Tile primitives for speedy kernels · (hazyresearch.stanford)

  • vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, arXiv, 2405.04437, arxiv, pdf, cication: -1

    Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar

  • Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, arXiv, 2404.14618, arxiv, pdf, cication: -1

    Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah

  • Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge, arXiv, 2405.00263, arxiv, pdf, cication: -1

    Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui

  • Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, arXiv, 2404.18911, arxiv, pdf, cication: -1

    Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang · (Kangaroo - Equationliu) Star

  • Better & Faster Large Language Models via Multi-token Prediction, arXiv, 2404.19737, arxiv, pdf, cication: -1

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve · (qbitai)

  • Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, arXiv, 2404.16710, arxiv, pdf, cication: -1

    Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman

  • BASS: Batched Attention-optimized Speculative Sampling, arXiv, 2404.15778, arxiv, pdf, cication: -1

    Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras

  • XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference, arXiv, 2404.15420, arxiv, pdf, cication: -1

    João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian

  • TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, arXiv, 2404.11912, arxiv, pdf, cication: -1

    Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen · (TriForce - Infini-AI-Lab) Star

  • Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, arXiv, 2404.09529, arxiv, pdf, cication: -1

    Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover · (prepacking - siyan-zhao) Star

  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving, arXiv, 2401.09670, arxiv, pdf, cication: -1

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang · (hao-ai-lab.github)

  • Recurrent Drafter for Fast Speculative Decoding in Large Language Models, arXiv, 2403.09919, arxiv, pdf, cication: -1

    Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng

  • Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, arXiv, 2403.09636, arxiv, pdf, cication: -1

    Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti

  • CLLMs: Consistency Large Language Models, arXiv, 2403.00835, arxiv, pdf, cication: -1

    Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, Hao Zhang · (Consistency_LLM - hao-ai-lab) Star · (hao-ai-lab.github)

  • Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference, arXiv, 2403.09054, arxiv, pdf, cication: -1

    Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath

  • Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding, arXiv, 2402.12374, arxiv, pdf, cication: -1

    Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen · (Sequoia - Infini-AI-Lab) Star

  • An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models, arXiv, 2403.06764, arxiv, pdf, cication: -1

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang

    · (FastV - pkunlp-icler) Star

  • GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM, arXiv, 2403.05527, arxiv, pdf, cication: -1

    Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao · (GEAR - HaoKang-Timmy) Star

  • DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving, arXiv, 2403.01876, arxiv, pdf, cication: -1

    Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic

  • Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding, arXiv, 2402.16844, arxiv, pdf, cication: -1

    Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi

  • Ouroboros: Speculative Decoding with Large Model Enhanced Drafting, arXiv, 2402.13720, arxiv, pdf, cication: -1

    Weilin Zhao, Yuxiang Huang, Xu Han, Chaojun Xiao, Zhiyuan Liu, Maosong Sun · (Ouroboros - thunlp) Star

  • Speculative Streaming: Fast LLM Inference without Auxiliary Models, arXiv, 2402.11131, arxiv, pdf, cication: -1

    Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi

  • Tandem Transformers for Inference Efficient LLMs, arXiv, 2402.08644, arxiv, pdf, cication: -1

    Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli

  • Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models, arXiv, 2402.07033, arxiv, pdf, cication: -1

    Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci · (fiddler - efeslab) Star

  • SubGen: Token Generation in Sublinear Time and Memory, arXiv, 2402.06082, arxiv, pdf, cication: -1

    Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi

  • Hydragen: High-Throughput LLM Inference with Shared Prefixes, arXiv, 2402.05099, arxiv, pdf, cication: -1

    Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini

  • flashinfer - flashinfer-ai Star

    FlashInfer: Kernel Library for LLM Serving

  • EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, arXiv, 2401.15077, arxiv, pdf, cication: -1

    Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

  • BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models, arXiv, 2401.12522, arxiv, pdf, cication: -1

    Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao

  • Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, arXiv, 2401.10774, arxiv, pdf, cication: -1

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao

    · (medusa - fasterdecoding) Star

  • DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference, arXiv, 2401.08671, arxiv, pdf, cication: -1

    Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko

  • Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, arXiv, 2401.08294, arxiv, pdf, cication: -1

    Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li · (inferflow - inferflow) Star

  • PainlessInferenceAcceleration - alipay Star

  • Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, arXiv, 2401.07851, arxiv, pdf, cication: -1

    Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

  • Efficient LLM inference solution on Intel GPU, arXiv, 2401.05391, arxiv, pdf, cication: -1

    Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu

  • SwiftInfer - hpcaitech Star

    Efficient AI Inference & Serving · (qbitai)

  • Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache, arXiv, 2401.02669, arxiv, pdf, cication: -1

    Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li

  • nitro - janhq Star

    A fast, lightweight, embeddable inference engine to supercharge your apps with local AI. OpenAI-compatible API

  • jan - janhq Star

    Jan is an open source alternative to ChatGPT that runs 100% offline on your computer

  • Fairness in Serving Large Language Models, arXiv, 2401.00588, arxiv, pdf, cication: -1

    Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica · (s-lora - s-lora) Star

  • tricksy - austinsilveria Star

    Fast approximate inference on a single GPU with sparsity aware offloading

  • mixtral-offloading - dvmazur Star

    Run Mixtral-8x7B models in Colab or consumer desktops

  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory, arXiv, 2312.11514, arxiv, pdf, cication: -1

    Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar

  • Efficiently Programming Large Language Models using SGLang, arXiv, 2312.07104, arxiv, pdf, cication: -1

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez · (sglang - sgl-project) Star · (lmsys)

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, arXiv, 2312.12456, arxiv, pdf, cication: -1

    Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen · (PowerInfer - SJTU-IPADS) Star

  • Cascade Speculative Drafting for Even Faster LLM Inference, arXiv, 2312.11462, arxiv, pdf, cication: -1

    Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, Kevin Chen-Chuan Chang

  • LLMLingua - microsoft Star

    Speeds up LLM inference and enhances the model's perception of key information by compressing the prompt and KV cache, achieving up to 20x compression with minimal performance loss.

  • SparQ Attention: Bandwidth-Efficient LLM Inference, arXiv, 2312.04985, arxiv, pdf, cication: -1

    Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr

  • Yao Fu's Notion post on full-stack transformer inference optimization

    · (yaofu.notion)

  • Optimum-NVIDIA: Unlocking blazingly fast LLM inference in just 1 line of code

  • PaSS: Parallel Speculative Sampling, arXiv, 2311.13581, arxiv, pdf, cication: -1

    Giovanni Monea, Armand Joulin, Edouard Grave

  • Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | LMSYS Org

    · (LookaheadDecoding - hao-ai-lab) Star

  • Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models, arXiv, 2311.03687, arxiv, pdf, cication: -1

    Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi

    · (jiqizhixin)

  • FlashDecoding++: Faster Large Language Model Inference on GPUs, arXiv, 2311.01282, arxiv, pdf, cication: -1

    Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang

  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time, ICML, 2023, arxiv, pdf, cication: 16

    Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re

  • TensorRT-LLM - NVIDIA Star

    TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs)

  • Approximating Two-Layer Feedforward Networks for Efficient Transformers, arXiv, 2310.10837, arxiv, pdf, cication: -1

    Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber

  • deepsparse - neuralmagic Star

    Inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application · (huggingface)

  • attention_sinks - tomaarsen Star

    Extend existing LLMs way beyond the original training length with constant memory usage, and without retraining

  • Efficient Streaming Language Models with Attention Sinks, arXiv, 2309.17453, arxiv, pdf, cication: 3

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis

    · (streaming-llm - mit-han-lab) Star

    · (mp.weixin.qq)
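
A hedged sketch of the KV-cache eviction policy behind attention sinks: keep the first few "sink" tokens plus a rolling window of recent tokens, so the cache stays constant-size for arbitrarily long streams (function name and defaults are illustrative, not the streaming-llm API):

```python
def streaming_kv_keep_indices(seq_len: int, n_sink: int = 4, window: int = 1020) -> list[int]:
    """Indices of KV entries to retain under an attention-sink + sliding-window policy."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))
```
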

  • Efficient Memory Management for Large Language Model Serving with PagedAttention, proceedings of the 29th symposium on operating systems principles, 2023, arxiv, pdf, cication: 21

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica · (jiqizhixin)

  • llama2.mojo - tairov Star

    Inference Llama 2 in one file of pure 🔥 · (qbitai)

  • fastllm - ztxz16 Star

    A pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile phones.

  • flexflow - flexflow Star

    A distributed deep learning framework.

  • Accelerating LLM Inference with Staged Speculative Decoding, arXiv, 2308.04623, arxiv, pdf, cication: 3

    Benjamin Spector, Chris Re

  • CTranslate2 - OpenNMT Star

    Fast inference engine for Transformer models

  • Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding, arXiv, 2307.15337, arxiv, pdf, cication: 4

    Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, Yu Wang

  • SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, arXiv, 2307.02628, arxiv, pdf, cication: -1

    Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee

  • An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs, arXiv, 2306.16601, arxiv, pdf, cication: -1

    Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang

  • NeuralFuse: Learning to Improve the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes, arXiv, 2306.16869, arxiv, pdf, cication: -1

    Hao-Lun Sun, Lei Hsiung, Nandhini Chandramoorthy, Pin-Yu Chen, Tsung-Yi Ho

  • H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, arXiv, 2306.14048, arxiv, pdf, cication: -1

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett · (H2O - FMInference) Star

    · (mp.weixin.qq)

  • DeepSpeed ZeRO++: A leap in speed for LLM and chat model training with 4X less communication - Microsoft Research

    · (zhuanlan.zhihu)

  • vllm - vllm-project Star

    A high-throughput and memory-efficient inference and serving engine for LLMs · (mp.weixin.qq) · (jiqizhixin)

  • SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification, arXiv, 2305.09781, arxiv, pdf, cication: -1

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi · (FlexFlow - flexflow) Star · (mp.weixin.qq)

  • llama.cpp - ggerganov Star

    Port of Facebook's LLaMA model in C/C++ · (ggml) · (llama.cpp - ggerganov) Star

Other

Mobile

  • MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, arXiv, 2402.14905, arxiv, pdf, cication: 15

    Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi

    · (MobileLLM - facebookresearch) Star

  • HARE: HumAn pRiors, a key to small language model Efficiency, arXiv, 2406.11410, arxiv, pdf, cication: -1

    Lingyun Zhang, Bin jin, Gaojian Ge, Lunhui Liu, Xuewen Shen, Mingyong Wu, Houqian Zhang, Yongneng Jiang, Shiqi Chen, Shi Pu

  • Octopus v2: On-device language model for super agent, arXiv, 2404.01744, arxiv, pdf, cication: -1

    Wei Chen, Zhiyuan Li

  • Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, arXiv, 2403.20041, arxiv, pdf, cication: -1

    Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie

  • MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, arXiv, 2402.16840, arxiv, pdf, cication: -1

    Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan

    · (MobiLlama - mbzuai-oryx) Star

  • MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices, arXiv, 2312.16886, arxiv, pdf, cication: -1

    Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei · (MobileVLM - Meituan-AutoML) Star

  • mlc-llm - mlc-ai Star

    Enable everyone to develop, optimize and deploy AI models natively on everyone's devices. · (jiqizhixin) · (jiqizhixin)

Toolkits

  • transformer-heads - center-for-humans-and-machines Star

    Toolkit for attaching, training, saving and loading of new heads for transformer models

  • quanto - huggingface Star

    A pytorch Quantization Toolkit · (huggingface)

  • fsdp_qlora - AnswerDotAI Star

    Training LLMs with QLoRA + FSDP · (answer) · (mp.weixin.qq)

  • GPTFast - MDK8888 Star

    Accelerate your Hugging Face Transformers 6-7x. Native to Hugging Face and PyTorch.

  • vllm - vllm-project Star

  • lorax - predibase Star

    Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

  • Winners 🏆 | NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day

  • gigaGPT - Cerebras Star

    a small code base for training large models · (cerebras)

  • EAGLE - SafeAILab Star

    EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation · (sites.google)

    · (jiqizhixin)

  • optimum-nvidia - huggingface Star

  • unsloth - unslothai Star

    5X faster, 50% less memory LLM finetuning

  • lit-gpt - Lightning-AI Star

    Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.

  • gpt-fast - pytorch-labs Star

    Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

  • MS-AMP - Azure Star

    Microsoft Automatic Mixed Precision Library

  • DeepSpeed - microsoft Star

    DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Efficient transformer

  • FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | Tri Dao

    · (flash-attention - Dao-AILab) Star

  • MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression, arXiv, 2406.14909, arxiv, pdf, cication: -1

    Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan

  • Flash Attention (Fast and Memory-Efficient Exact Attention with IO-Awareness)

  • Block Transformer: Global-to-Local Language Modeling for Fast Inference, arXiv, 2406.02657, arxiv, pdf, cication: -1

    Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun

    · (block-transformer - itsnamgyu) Star

  • LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models, arXiv, 2405.18377, arxiv, pdf, cication: -1

    Anthony Sarah, Sharath Nittur Sridhar, Maciej Szankin, Sairam Sundaresan

  • SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization, arXiv, 2405.11582, arxiv, pdf, cication: -1

    Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang · (SLAB - xinghaochen) Star

  • You Only Cache Once: Decoder-Decoder Architectures for Language Models, arXiv, 2405.05254, arxiv, pdf, cication: -1

    Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei · (aka) · (unilm - microsoft) Star

  • Is Flash Attention Stable?, arXiv, 2405.02803, arxiv, pdf, cication: -1

    Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks

  • RecurrentGemma: Moving Past Transformers for Efficient Open Language Models, arXiv, 2404.07839, arxiv, pdf, cication: -1

    Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi · (recurrentgemma - google-deepmind) Star

  • 8bit HippoAttention: Up to 3X Faster Compared to FlashAttentionV2 | by HippoML Blog | Medium

  • LASP - OpenNLPLab Star

    Linear Attention Sequence Parallelism (LASP)

  • Simple linear attention language models balance the recall-throughput tradeoff, arXiv, 2402.18668, arxiv, pdf, cication: -1

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré

    · (based - hazyresearch) Star

  • Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models, arXiv, 2402.19427, arxiv, pdf, cication: -1

    Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan

    · (jiqizhixin)

    · (huggingface) · (twitter)

  • ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition, arXiv, 2402.15220, arxiv, pdf, cication: -1

    Lu Ye, Ze Tao, Yong Huang, Yang Li

  • Linear Transformers are Versatile In-Context Learners, arXiv, 2402.14180, arxiv, pdf, cication: -1

    Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge

  • The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry, arXiv, 2402.04347, arxiv, pdf, cication: -1

    Michael Zhang, Kush Bhatia, Hermann Kumbong, Christopher Ré

  • FireAttention — Serving Open Source Models 4x faster than vLLM by quantizing with ~no tradeoffs | by Fireworks.ai | Jan, 2024 | Medium

  • flash-linear-attention - sustcsonglin Star

    Fast implementations of causal linear attention for autoregressive language modeling (PyTorch)

  • PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation, arXiv, 2312.17276, arxiv, pdf, cication: -1

    Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie, Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang

  • Agent Attention: On the Integration of Softmax and Linear Attention, arXiv, 2312.08874, arxiv, pdf, cication: -1

    Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang · (agent-attention - leaplabthu) Star

  • Weight subcloning: direct initialization of transformers using larger pretrained ones, arXiv, 2312.09299, arxiv, pdf, cication: -1

    Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari

  • Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models, arXiv, 2312.07046, arxiv, pdf, cication: -1

    Arnav Chavan, Nahush Lele, Deepak Gupta

  • Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers

  • Efficient Monotonic Multihead Attention, arXiv, 2312.04515, arxiv, pdf, cication: -1

    Xutai Ma, Anna Sun, Siqi Ouyang, Hirofumi Inaguma, Paden Tomasello

  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces, arXiv, 2312.00752, arxiv, pdf, cication: -1

    Albert Gu, Tri Dao · (qbitai)

  • Simplifying Transformer Blocks, arXiv, 2311.01906, arxiv, pdf, cication: -1

    Bobby He, Thomas Hofmann · (jiqizhixin)

  • Exponentially Faster Language Modelling, arXiv, 2311.10770, arxiv, pdf, cication: -1

    Peter Belcak, Roger Wattenhofer

  • Alternating Updates for Efficient Transformers, arXiv, 2301.13310, arxiv, pdf, cication: -1

    Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang

  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS, 2022, arxiv, pdf, cication: 278

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

  • Fast Transformer Decoding: One Write-Head is All You Need, arXiv, 1911.02150, arxiv, pdf, cication: 61

    Noam Shazeer

    · (zhuanlan.zhihu)

Hardware

  • Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU, arXiv, 2403.06504, arxiv, pdf, cication: -1

    Changyue Liao, Mo Sun, Zihan Yang, Kaiqi Chen, Binhang Yuan, Fei Wu, Zeke Wang

  • Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers, arXiv, 2402.04744, arxiv, pdf, cication: -1

    Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna

  • FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, arXiv, 2401.03868, arxiv, pdf, cication: -1

    Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang

    · (jiqizhixin)

  • The fastest large-model inference chip changes hands overnight: 500 tokens per second, crushing GPUs, built by a team of former Google TPU engineers

Other

Courses

EfficientML

Extra Reference

  • Awesome-Knowledge-Distillation-of-LLMs - Tebmer Star

    This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & Vertical Distillation of LLMs.

  • Awesome-LLM-Compression - HuangOwen Star

    Awesome LLM compression research papers and tools.