📖A curated list of Awesome LLM Inference Papers with codes: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.

clevercool/Awesome-LLM-Inference

 
 


📒Introduction

Awesome-LLM-Inference: A curated list of 📙Awesome LLM Inference Papers with Codes. For Awesome SD Distributed Inference (Multi-GPUs), please check 📖Awesome-SD-Distributed-Inference.

©️Citations

@misc{Awesome-LLM-Inference@2024,
  title={Awesome-LLM-Inference: A curated list of Awesome LLM Inference Papers with codes},
  url={https://github.com/DefTruth/Awesome-LLM-Inference},
  note={Open-source software available at https://github.com/DefTruth/Awesome-LLM-Inference},
  author={DefTruth and liyucheng09 and others},
  year={2024}
}

📙Awesome LLM Inference Papers with Codes

🎉Download PDFs

Awesome LLM Inference for Beginners.pdf: 500 pages, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), PagedAttention, RoPE, SmoothQuant, WINT8/4, Continuous Batching, ZeroQuant 1/2/FP, AWQ etc.

📖Trending LLM/LVLMs Topics

Date Title Paper Code Recom
2024.04 🔥🔥🔥[Open-Sora] Open-Sora: Democratizing Efficient Video Production for All(@hpcaitech) [docs] [Open-Sora] ⭐️⭐️
2024.04 🔥🔥🔥[Open-Sora Plan] Open-Sora Plan: This project aims to reproduce Sora (OpenAI T2V model)(@PKU) [report] [Open-Sora-Plan] ⭐️⭐️
2024.05 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI) [pdf] [DeepSeek-V2] ⭐️⭐️
2024.05 🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft) [pdf] [unilm-YOCO] ⭐️⭐️
2024.06 🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI) [pdf] [Mooncake] ⭐️⭐️
2024.07 🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc) [pdf] [flash-attention] ⭐️⭐️
2024.07 🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft) [pdf] [MInference 1.0] ⭐️⭐️

📖LLM Algorithmic/Eval Survey

Date Title Paper Code Recom
2023.10 [Evaluating] Evaluating Large Language Models: A Comprehensive Survey(@tju.edu.cn) [pdf] [Awesome-LLMs-Evaluation] ⭐️
2023.11 🔥[Runtime Performance] Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models(@hkust-gz.edu.cn) [pdf] ⚠️ ⭐️⭐️
2023.11 [ChatGPT Anniversary] ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?(@e.ntu.edu.sg) [pdf] ⚠️ ⭐️
2023.12 [Algorithmic Survey] The Efficiency Spectrum of Large Language Models: An Algorithmic Survey(@Microsoft) [pdf] ⚠️ ⭐️
2023.12 [Security and Privacy] A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly(@Drexel University) [pdf] ⚠️ ⭐️
2023.12 🔥[LLMCompass] A Hardware Evaluation Framework for Large Language Model Inference(@princeton.edu) [pdf] ⚠️ ⭐️⭐️
2023.12 🔥[Efficient LLMs] Efficient Large Language Models: A Survey(@Ohio State University etc) [pdf] [Efficient-LLMs-Survey] ⭐️⭐️
2023.12 [Serving Survey] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems(@Carnegie Mellon University) [pdf] ⚠️ ⭐️⭐️
2024.01 [Understanding LLMs] Understanding LLMs: A Comprehensive Overview from Training to Inference(@Shaanxi Normal University etc) [pdf] ⚠️ ⭐️⭐️
2024.02 [LLM-Viewer] LLM Inference Unveiled: Survey and Roofline Model Insights(@Zhihang Yuan etc) [pdf] [LLM-Viewer] ⭐️⭐️
2024.07 [Internal Consistency & Self-Feedback] Internal Consistency and Self-Feedback in Large Language Models: A Survey [pdf] [ICSF-Survey] ⭐️⭐️

📖LLM Train/Inference Framework

Date Title Paper Code Recom
2020.05 🔥[Megatron-LM] Training Multi-Billion Parameter Language Models Using Model Parallelism(@NVIDIA) [pdf] [Megatron-LM] ⭐️⭐️
2023.03 [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU(@Stanford University etc) [pdf] [FlexGen] ⭐️
2023.05 [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification(@Peking University etc) [pdf] [FlexFlow] ⭐️
2023.05 [FastServe] Fast Distributed Inference Serving for Large Language Models(@Peking University etc) [pdf] ⚠️ ⭐️
2023.09 🔥[vLLM] Efficient Memory Management for Large Language Model Serving with PagedAttention(@UC Berkeley etc) [pdf] [vllm] ⭐️⭐️
2023.09 [StreamingLLM] EFFICIENT STREAMING LANGUAGE MODELS WITH ATTENTION SINKS(@Meta AI etc) [pdf] [streaming-llm] ⭐️
2023.09 [Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads(@Tianle Cai etc) [blog] [Medusa] ⭐️
2023.10 🔥[TensorRT-LLM] NVIDIA TensorRT LLM(@NVIDIA) [docs] [TensorRT-LLM] ⭐️⭐️
2023.11 🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference(@Microsoft) [pdf] [deepspeed-fastgen] ⭐️⭐️
2023.12 🔥[PETALS] Distributed Inference and Fine-tuning of Large Language Models Over The Internet(@HSE University etc) [pdf] [petals] ⭐️⭐️
2023.10 [LightSeq] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers(@UC Berkeley etc) [pdf] [LightSeq] ⭐️
2023.12 [PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU(@SJTU) [pdf] [PowerInfer] ⭐️
2024.01 [inferflow] INFERFLOW: AN EFFICIENT AND HIGHLY CONFIGURABLE INFERENCE ENGINE FOR LARGE LANGUAGE MODELS(@Tencent AI Lab) [pdf] [inferflow] ⭐️
2024.06 🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI) [pdf] [Mooncake] ⭐️⭐️
2023.06 🔥[LMDeploy] LMDeploy: LMDeploy is a toolkit for compressing, deploying, and serving LLMs(@InternLM) [docs] [lmdeploy] ⭐️⭐️
2023.05 🔥[MLC-LLM] Universal LLM Deployment Engine with ML Compilation(@mlc-ai) [docs] [mlc-llm] ⭐️⭐️
2023.08 🔥[LightLLM] LightLLM is a Python-based LLM (Large Language Model) inference and serving framework(@ModelTC) [docs] [lightllm] ⭐️⭐️
2023.03 🔥[llama.cpp] llama.cpp: Inference of Meta's LLaMA model (and others) in pure C/C++(@ggerganov) [docs] [llama.cpp] ⭐️⭐️

📖Continuous/In-flight Batching

Date Title Paper Code Recom
2022.07 🔥[Continuous Batching] Orca: A Distributed Serving System for Transformer-Based Generative Models(@Seoul National University etc) [pdf] ⚠️ ⭐️⭐️
2023.10 🔥[In-flight Batching] NVIDIA TensorRT LLM Batch Manager(@NVIDIA) [docs] [TensorRT-LLM] ⭐️⭐️
2023.11 🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference(@Microsoft) [blog] [deepspeed-fastgen] ⭐️⭐️
2023.11 [Splitwise] Splitwise: Efficient Generative LLM Inference Using Phase Splitting(@Microsoft etc) [pdf] ⚠️ ⭐️
2023.12 [SpotServe] SpotServe: Serving Generative Large Language Models on Preemptible Instances(@cmu.edu etc) [pdf] [SpotServe] ⭐️
2023.10 [LightSeq] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers(@UC Berkeley etc) [pdf] [LightSeq] ⭐️
2024.05 🔥[vAttention] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention(@Microsoft Research India) [pdf] ⚠️ ⭐️⭐️
2024.07 🔥🔥[vTensor] vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving(@Shanghai Jiao Tong University etc) [pdf] ⚠️ ⭐️⭐️

📖Weight/Activation Quantize/Compress

Date Title Paper Code Recom
2022.06 🔥[ZeroQuant] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers(@Microsoft) [pdf] [DeepSpeed] ⭐️⭐️
2022.08 [FP8-Quantization] FP8 Quantization: The Power of the Exponent(@Qualcomm AI Research) [pdf] [FP8-quantization] ⭐️
2022.08 [LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale(@Facebook AI Research etc) [pdf] [bitsandbytes] ⭐️
2022.10 🔥[GPTQ] GPTQ: ACCURATE POST-TRAINING QUANTIZATION FOR GENERATIVE PRE-TRAINED TRANSFORMERS(@IST Austria etc) [pdf] [gptq] ⭐️⭐️
2022.11 🔥[WINT8/4] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production(@NVIDIA&Microsoft) [pdf] [FasterTransformer] ⭐️⭐️
2022.11 🔥[SmoothQuant] Accurate and Efficient Post-Training Quantization for Large Language Models(@MIT etc) [pdf] [smoothquant] ⭐️⭐️
2023.03 [ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation(@Microsoft) [pdf] [DeepSpeed] ⭐️
2023.06 🔥[AWQ] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration(@MIT etc) [pdf] [llm-awq] ⭐️⭐️
2023.06 [SpQR] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression(@University of Washington etc) [pdf] [SpQR] ⭐️
2023.06 [SqueezeLLM] SQUEEZELLM: DENSE-AND-SPARSE QUANTIZATION(@berkeley.edu) [pdf] [SqueezeLLM] ⭐️
2023.07 [ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats(@Microsoft) [pdf] [DeepSpeed] ⭐️
2023.09 [KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization(@HPC4AI) [blog] ⚠️ ⭐️
2023.10 [FP8-LM] FP8-LM: Training FP8 Large Language Models(@Microsoft etc) [pdf] [MS-AMP] ⭐️
2023.10 [LLM-Shearing] SHEARED LLAMA: ACCELERATING LANGUAGE MODEL PRE-TRAINING VIA STRUCTURED PRUNING(@cs.princeton.edu etc) [pdf] [LLM-Shearing] ⭐️
2023.10 [LLM-FP4] LLM-FP4: 4-Bit Floating-Point Quantized Transformers(@ust.hk&meta etc) [pdf] [LLM-FP4] ⭐️
2023.11 [2-bit LLM] Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization(@Shanghai Jiao Tong University etc) [pdf] ⚠️ ⭐️
2023.12 [SmoothQuant+] SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM(@ZTE Corporation) [pdf] [smoothquantplus] ⭐️
2023.11 [OdysseyLLM W4A8] A Speed Odyssey for Deployable Quantization of LLMs(@meituan.com) [pdf] ⚠️ ⭐️
2023.12 🔥[SparQ] SPARQ ATTENTION: BANDWIDTH-EFFICIENT LLM INFERENCE(@graphcore.ai) [pdf] ⚠️ ⭐️⭐️
2023.12 [Agile-Quant] Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge(@Northeastern University&Oracle) [pdf] ⚠️ ⭐️
2023.12 [CBQ] CBQ: Cross-Block Quantization for Large Language Models(@ustc.edu.cn) [pdf] ⚠️ ⭐️
2023.10 [QLLM] QLLM: ACCURATE AND EFFICIENT LOW-BITWIDTH QUANTIZATION FOR LARGE LANGUAGE MODELS(@ZIP Lab&SenseTime Research etc) [pdf] ⚠️ ⭐️
2024.01 [FP6-LLM] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design(@Microsoft etc) [pdf] ⚠️ ⭐️
2024.05 🔥🔥[W4A8KV4] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving(@MIT&NVIDIA) [pdf] [qserve] ⭐️⭐️
2024.05 🔥[SpinQuant] SpinQuant: LLM Quantization with Learned Rotations(@Meta) [pdf] ⚠️ ⭐️
2024.05 🔥[I-LLM] I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models(@Houmo AI) [pdf] ⚠️ ⭐️
2024.06 🔥[OutlierTune] OutlierTune: Efficient Channel-Wise Quantization for Large Language Models(@Beijing University) [pdf] ⚠️ ⭐️
2024.06 🔥[GPTQT] GPTQT: Quantize Large Language Models Twice to Push the Efficiency(@zju) [pdf] ⚠️ ⭐️

📖IO/FLOPs-Aware/Sparse Attention

Date Title Paper Code Recom
2018.05 [Online Softmax] Online normalizer calculation for softmax(@NVIDIA) [pdf] ⚠️ ⭐️
2019.11 🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need(@Google) [pdf] ⚠️ ⭐️⭐️
2020.10 [Hash Attention] REFORMER: THE EFFICIENT TRANSFORMER(@Google) [pdf] [reformer] ⭐️⭐️
2022.05 🔥[FlashAttention] Fast and Memory-Efficient Exact Attention with IO-Awareness(@Stanford University etc) [pdf] [flash-attention] ⭐️⭐️
2022.10 [Online Softmax] SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY(@Google) [pdf] ⚠️ ⭐️
2023.05 [FlashAttention] From Online Softmax to FlashAttention(@cs.washington.edu) [pdf] ⚠️ ⭐️⭐️
2023.05 [FLOP, I/O] Dissecting Batching Effects in GPT Inference(@Lequn Chen) [blog] ⚠️ ⭐️
2023.05 🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(@Google) [pdf] [flaxformer] ⭐️⭐️
2023.06 [Sparse FlashAttention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention(@EPFL etc) [pdf] [dynamic-sparse-flash-attention] ⭐️
2023.07 🔥[FlashAttention-2] Faster Attention with Better Parallelism and Work Partitioning(@Stanford University etc) [pdf] [flash-attention] ⭐️⭐️
2023.10 🔥[Flash-Decoding] Flash-Decoding for long-context inference(@Stanford University etc) [blog] [flash-attention] ⭐️⭐️
2023.11 [Flash-Decoding++] FLASHDECODING++: FASTER LARGE LANGUAGE MODEL INFERENCE ON GPUS(@Tsinghua University&Infinigence-AI) [pdf] ⚠️ ⭐️
2023.01 [SparseGPT] SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot(@ISTA etc) [pdf] [sparsegpt] ⭐️
2023.12 🔥[GLA] Gated Linear Attention Transformers with Hardware-Efficient Training(@MIT-IBM Watson AI) [pdf] [gated_linear_attention] ⭐️⭐️
2023.12 [SCCA] SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion(@Beihang University) [pdf] ⚠️ ⭐️
2023.12 🔥[FlashLLM] LLM in a flash: Efficient Large Language Model Inference with Limited Memory(@Apple) [pdf] ⚠️ ⭐️⭐️
2024.03 🔥🔥[CHAI] CHAI: Clustered Head Attention for Efficient LLM Inference(@cs.wisc.edu etc) [pdf] ⚠️ ⭐️⭐️
2024.04 🔥🔥[DeFT] DeFT: Decoding with Flash Tree-Attention for Efficient Tree-structured LLM Inference(@Westlake University etc) [pdf] ⚠️ ⭐️⭐️
2024.04 [MoA] MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression(@thu et al.) [pdf] [MoA] ⭐️
2024.07 🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc) [pdf] [flash-attention] ⭐️⭐️
2024.07 🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft) [pdf] [MInference 1.0] ⭐️⭐️
2024.07 🔥🔥[Shared Attention] Beyond KV Caching: Shared Attention for Efficient LLMs(@Kyushu University etc) [pdf] [shareAtt] ⭐️

📖KV Cache Scheduling/Quantize/Dropping

Date Title Paper Code Recom
2019.11 🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need(@Google) [pdf] ⚠️ ⭐️⭐️
2022.06 [LTP] Learned Token Pruning for Transformers(@UC Berkeley etc) [pdf] [LTP] ⭐️
2023.05 🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(@Google) [pdf] [flaxformer] ⭐️⭐️
2023.05 [KV Cache Compress] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time(@) [pdf] ⚠️ ⭐️⭐️
2023.06 [H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models(@Rice University etc) [pdf] [H2O] ⭐️
2023.06 [QK-Sparse/Dropping Attention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention(@EPFL etc) [pdf] [dynamic-sparse-flash-attention] ⭐️
2023.08 🔥🔥[Chunked Prefills] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills(@Microsoft etc) [pdf] ⚠️ ⭐️⭐️
2023.09 🔥🔥[PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention(@UC Berkeley etc) [pdf] [vllm] ⭐️⭐️
2023.09 [KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization(@HPC4AI) [blog] ⚠️ ⭐️
2023.10 🔥[TensorRT-LLM KV Cache FP8] NVIDIA TensorRT LLM(@NVIDIA) [docs] [TensorRT-LLM] ⭐️⭐️
2023.10 🔥[Adaptive KV Cache Compress] MODEL TELLS YOU WHAT TO DISCARD: ADAPTIVE KV CACHE COMPRESSION FOR LLMS(@illinois.edu&microsoft) [pdf] ⚠️ ⭐️⭐️
2023.10 [CacheGen] CacheGen: Fast Context Loading for Language Model Applications(@Chicago University&Microsoft) [pdf] ⚠️ ⭐️
2023.12 [KV-Cache Optimizations] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO(@Haim Barad etc) [pdf] ⚠️ ⭐️
2023.12 [KV Cache Compress with LoRA] Compressed Context Memory for Online Language Model Interaction (@SNU & NAVER AI) [pdf] [Compressed-Context-Memory] ⭐️⭐️
2023.12 🔥🔥[RadixAttention] Efficiently Programming Large Language Models using SGLang(@Stanford University etc) [pdf] [sglang] ⭐️⭐️
2024.01 🔥🔥[DistKV-LLM] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache(@Alibaba etc) [pdf] ⚠️ ⭐️⭐️
2024.02 🔥🔥[Prompt Caching] Efficient Prompt Caching via Embedding Similarity(@UC Berkeley) [pdf] ⚠️ ⭐️⭐️
2024.02 🔥🔥[Less] Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference(@CMU etc) [pdf] ⚠️ ⭐️
2024.02 🔥🔥[MiKV] No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization(@KAIST) [pdf] ⚠️ ⭐️
2024.02 🔥🔥[Shared Prefixes] Hydragen: High-Throughput LLM Inference with Shared Prefixes [pdf] ⚠️ ⭐️⭐️
2024.02 🔥🔥[ChunkAttention] ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition(@microsoft.com) [pdf] [chunk-attention] ⭐️⭐️
2024.03 🔥[QAQ] QAQ: Quality Adaptive Quantization for LLM KV Cache(@smail.nju.edu.cn) [pdf] [QAQ-KVCacheQuantization] ⭐️⭐️
2024.03 🔥🔥[DMC] Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference(@NVIDIA etc) [pdf] ⚠️ ⭐️⭐️
2024.03 🔥🔥[Keyformer] Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference(@ece.ubc.ca etc) [pdf] [Keyformer] ⭐️⭐️
2024.03 [FASTDECODE] FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines(@Tsinghua University) [pdf] ⚠️ ⭐️⭐️
2024.03 [Sparsity-Aware KV Caching] ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching(@ucf.edu) [pdf] ⚠️ ⭐️⭐️
2024.03 🔥[GEAR] GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM(@gatech.edu) [pdf] [GEAR] ⭐️
2024.04 [SqueezeAttention] SQUEEZEATTENTION: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget(@lzu.edu.cn etc) [pdf] [SqueezeAttention] ⭐️⭐️
2024.04 [SnapKV] SnapKV: LLM Knows What You are Looking for Before Generation(@UIUC) [pdf] [SnapKV] ⭐️
2024.05 🔥[vAttention] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention(@Microsoft Research India) [pdf] ⚠️ ⭐️⭐️
2024.05 🔥[KVCache-1Bit] KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization(@Rice University) [pdf] ⚠️ ⭐️⭐️
2024.05 🔥[KV-Runahead] KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation(@Apple etc) [pdf] ⚠️ ⭐️⭐️
2024.05 🔥[ZipCache] ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification(@Zhejiang University etc) [pdf] ⚠️ ⭐️⭐️
2024.05 🔥[MiniCache] MiniCache: KV Cache Compression in Depth Dimension for Large Language Models(@ZIP Lab) [pdf] ⚠️ ⭐️⭐️
2024.05 🔥[CacheBlend] CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion(@University of Chicago) [pdf] ⚠️ ⭐️⭐️
2024.06 🔥[CompressKV] Effectively Compress KV Heads for LLM(@alibaba etc) [pdf] ⚠️ ⭐️⭐️
2024.06 🔥[MemServe] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool(@Huawei Cloud etc) [pdf] ⚠️ ⭐️⭐️
2024.07 [MLKV] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding(@Institut Teknologi Bandung) [pdf] [pythia-mlkv] ⭐️
2024.07 🔥[ElasticCache-LVLM] Efficient Inference of Vision Instruction-Following Models with Elastic Cache(@Tsinghua University etc) [pdf] [ElasticCache] ⭐️
2024.07 🔥[RazorAttention] RAZORATTENTION: EFFICIENT KV CACHE COMPRESSION THROUGH RETRIEVAL HEADS(@Huawei Technologies Co., Ltd) [pdf] ⚠️ ⭐️⭐️

📖Prompt/Context Compression

Date Title Paper Code Recom
2023.04 🔥[Selective-Context] Compressing Context to Enhance Inference Efficiency of Large Language Models(@Surrey) [pdf] [Selective-Context] ⭐️⭐️
2023.05 [AutoCompressor] Adapting Language Models to Compress Contexts(@Princeton) [pdf] [AutoCompressor] ⭐️
2023.10 🔥[LLMLingua] LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models(@Microsoft) [pdf] [LLMLingua] ⭐️⭐️
2023.10 🔥🔥[LongLLMLingua] LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression(@Microsoft) [pdf] [LLMLingua] ⭐️⭐️
2024.03 🔥[LLMLingua-2] LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression(@Microsoft) [pdf] [LLMLingua series] ⭐️

📖Long Context Attention/KV Cache Optimization

Date Title Paper Code Recom
2023.05 🔥🔥[Blockwise Attention] Blockwise Parallel Transformer for Large Context Models(@UC Berkeley) [pdf] ⚠️ ⭐️⭐️
2023.05 🔥[Landmark Attention] Random-Access Infinite Context Length for Transformers(@epfl.ch) [pdf] [landmark-attention] ⭐️⭐️
2023.07 🔥[LightningAttention-1] TRANSNORMERLLM: A FASTER AND BETTER LARGE LANGUAGE MODEL WITH IMPROVED TRANSNORMER(@OpenNLPLab) [pdf] [TransnormerLLM] ⭐️⭐️
2023.07 🔥[LightningAttention-2] Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models(@OpenNLPLab) [pdf] [lightning-attention] ⭐️⭐️
2023.10 🔥🔥[RingAttention] Ring Attention with Blockwise Transformers for Near-Infinite Context(@UC Berkeley) [pdf] [RingAttention] ⭐️⭐️
2023.11 🔥[HyperAttention] HyperAttention: Long-context Attention in Near-Linear Time(@yale&Google) [pdf] [hyper-attn] ⭐️⭐️
2023.11 [Streaming Attention] One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space(@Adobe Research etc) [pdf] ⚠️ ⭐️
2023.11 🔥[Prompt Cache] PROMPT CACHE: MODULAR ATTENTION REUSE FOR LOW-LATENCY INFERENCE(@Yale University etc) [pdf] ⚠️ ⭐️⭐️
2023.11 🔥🔥[StripedAttention] STRIPED ATTENTION: FASTER RING ATTENTION FOR CAUSAL TRANSFORMERS(@MIT etc) [pdf] [striped_attention] ⭐️⭐️
2024.01 🔥🔥[KVQuant] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization(@UC Berkeley) [pdf] [KVQuant] ⭐️⭐️
2024.02 🔥[RelayAttention] RelayAttention for Efficient Large Language Model Serving with Long System Prompts(@sensetime.com etc) [pdf] ⚠️ ⭐️⭐️
2024.04 🔥🔥[Infini-attention] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention(@Google) [pdf] ⚠️ ⭐️⭐️
2024.04 🔥🔥[RAGCache] RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation(@Peking University&ByteDance Inc) [pdf] ⚠️ ⭐️⭐️
2024.04 🔥🔥[KCache] EFFICIENT LLM INFERENCE WITH KCACHE(@Qiaozhi He, Zhihua Wu) [pdf] ⚠️ ⭐️⭐️
2024.05 🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft) [pdf] [unilm-YOCO] ⭐️⭐️
2024.05 🔥🔥[SKVQ] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models(@Shanghai AI Laboratory) [pdf] ⚠️ ⭐️⭐️
2024.05 🔥🔥[CLA] Reducing Transformer Key-Value Cache Size with Cross-Layer Attention(@MIT-IBM) [pdf] ⚠️ ⭐️⭐️
2024.06 🔥[LOOK-M] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference(@osu.edu etc) [pdf] [LOOK-M] ⭐️⭐️
2024.06 🔥🔥[MInference] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft etc) [pdf] [MInference] ⭐️⭐️
2024.06 🔥🔥[InfiniGen] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management(@snu) [pdf] ⚠️ ⭐️⭐️
2024.06 🔥🔥[Quest] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference(@mit-han-lab etc) [pdf] [Quest] ⭐️⭐️
2024.07 🔥[PQCache] PQCache: Product Quantization-based KVCache for Long Context LLM Inference(@PKU etc) [pdf] ⚠️ ⭐️⭐️

📖Early-Exit/Intermediate Layer Decoding

Date Title Paper Code Recom
2020.04 [DeeBERT] DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference(@uwaterloo.ca) [pdf] ⚠️ ⭐️
2021.06 [BERxiT] BERxiT: Early Exiting for BERT with Better Fine-Tuning and Extension to Regression(@uwaterloo.ca) [pdf] [berxit] ⭐️
2023.10 🔥[LITE] Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE(@Arizona State University) [pdf] ⚠️ ⭐️⭐️
2023.12 🔥🔥[EE-LLM] EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism(@alibaba-inc.com) [pdf] [EE-LLM] ⭐️⭐️
2023.10 🔥[FREE] Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding(@KAIST AI&AWS AI) [pdf] [fast_robust_early_exit] ⭐️⭐️
2024.07 [Skip Attention] Attention Is All You Need But You Don’t Need All Of It For Inference of Large Language Models(@University College London) [pdf] ⚠️ ⭐️⭐️

📖Parallel Decoding/Sampling

Date Title Paper Code Recom
2018.11 🔥[Parallel Decoding] Blockwise Parallel Decoding for Deep Autoregressive Models(@Berkeley&Google) [pdf] ⚠️ ⭐️⭐️
2023.02 🔥[Speculative Sampling] Accelerating Large Language Model Decoding with Speculative Sampling(@DeepMind) [pdf] ⚠️ ⭐️⭐️
2023.05 🔥[Speculative Sampling] Fast Inference from Transformers via Speculative Decoding(@Google Research etc) [pdf] [LLMSpeculativeSampling] ⭐️⭐️
2023.09 🔥[Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads(@Tianle Cai etc) [pdf] [Medusa] ⭐️⭐️
2023.10 [OSD] Online Speculative Decoding(@UC Berkeley etc) [pdf] ⚠️ ⭐️⭐️
2023.12 [Cascade Speculative] Cascade Speculative Drafting for Even Faster LLM Inference(@illinois.edu) [pdf] ⚠️ ⭐️
2024.02 🔥[LookaheadDecoding] Break the Sequential Dependency of LLM Inference Using LOOKAHEAD DECODING(@UCSD&Google&UC Berkeley) [pdf] [LookaheadDecoding] ⭐️⭐️
2024.02 🔥🔥[Speculative Decoding] Decoding Speculative Decoding(@cs.wisc.edu) [pdf] [Decoding Speculative Decoding] ⭐️
2024.04 🔥🔥[TriForce] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding(@cmu.edu&Meta AI) [pdf] [TriForce] ⭐️⭐️
2024.04 🔥🔥[Hidden Transfer] Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration(@pku.edu.cn etc) [pdf] ⚠️ ⭐️
2024.05 🔥[Instructive Decoding] INSTRUCTIVE DECODING: INSTRUCTION-TUNED LARGE LANGUAGE MODELS ARE SELF-REFINER FROM NOISY INSTRUCTIONS(@KAIST AI) [pdf] [Instructive-Decoding] ⭐️
2024.05 🔥[S3D] S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs(@lge.com) [pdf] ⚠️ ⭐️
2024.06 🔥[Parallel Decoding] Exploring and Improving Drafts in Blockwise Parallel Decoding(@KAIST&Google Research) [pdf] ⚠️ ⭐️⭐️
2024.07 [Multi-Token Speculative Decoding] Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference(@University of California, etc) [pdf] ⚠️ ⭐️⭐️

📖Structured Prune/KD/Weight Sparse

Date Title Paper Code Recom
2023.12 [FLAP] Fluctuation-based Adaptive Structured Pruning for Large Language Models(@Chinese Academy of Sciences etc) [pdf] [FLAP] ⭐️⭐️
2023.12 🔥[LASER] The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction(@mit.edu) [pdf] [laser] ⭐️⭐️
2023.12 [PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU(@SJTU) [pdf] [PowerInfer] ⭐️
2024.01 [Admm Pruning] Fast and Optimal Weight Update for Pruned Large Language Models(@fmph.uniba.sk) [pdf] [admm-pruning] ⭐️
2024.01 [FFSplit] FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference(@Rice University etc) [pdf] ⚠️ ⭐️

📖Mixture-of-Experts(MoE) LLM Inference

Date Title Paper Code Recom
2022.11 🔥[WINT8/4] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production(@NVIDIA&Microsoft) [pdf] [FasterTransformer] ⭐️⭐️
2023.12 🔥[Mixtral Offloading] Fast Inference of Mixture-of-Experts Language Models with Offloading(@Moscow Institute of Physics and Technology etc) [pdf] [mixtral-offloading] ⭐️⭐️
2024.01 [MoE-Mamba] MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts(@uw.edu.pl) [pdf] ⚠️ ⭐️
2024.04 [MoE Inference] Toward Inference-optimal Mixture-of-Expert Large Language Models(@UC San Diego etc) [pdf] ⚠️ ⭐️
2024.05 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI) [pdf] [DeepSeek-V2] ⭐️⭐️
2024.06 [MoE] A Survey on Mixture of Experts(@HKU) [pdf] ⚠️ ⭐️

📖CPU/Single GPU/FPGA/Mobile Inference

Date Title Paper Code Recom
2023.03 [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU(@Stanford University etc) [pdf] [FlexGen] ⭐️
2023.11 [LLM CPU Inference] Efficient LLM Inference on CPUs(@intel) [pdf] [intel-extension-for-transformers] ⭐️
2023.12 [LinguaLinked] LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices(@University of California Irvine) [pdf] ⚠️ ⭐️
2023.12 [OpenVINO] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO(@Haim Barad etc) [pdf] ⚠️ ⭐️
2024.03 [FlightLLM] FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs(@Infinigence-AI) [pdf] ⚠️ ⭐️
2024.03 [Transformer-Lite] Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs(@OPPO) [pdf] ⚠️ ⭐️
2024.07 🔥🔥[xFasterTransformer] Inference Performance Optimization for Large Language Models on CPUs(@Intel) [pdf] [xFasterTransformer] ⭐️
2024.07 [Summary] Inference Optimization of Foundation Models on AI Accelerators(@AWS AI) [pdf] ⚠️ ⭐️

📖Non Transformer Architecture

Date Title Paper Code Recom
2023.05 🔥🔥[RWKV] RWKV: Reinventing RNNs for the Transformer Era(@Bo Peng etc) [pdf] [RWKV-LM] ⭐️⭐️
2023.12 🔥🔥[Mamba] Mamba: Linear-Time Sequence Modeling with Selective State Spaces(@cs.cmu.edu etc) [pdf] [mamba] ⭐️⭐️
2024.06 🔥🔥[RWKV-CLIP] RWKV-CLIP: A Robust Vision-Language Representation Learner(@DeepGlint etc) [pdf] [RWKV-CLIP] ⭐️⭐️

📖GEMM/Tensor Cores/WMMA/Parallel

Date Title Paper Code Recom
2018.03 🔥🔥[Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision(@KTH Royal etc) [pdf] ⚠️ ⭐️
2021.05 🔥[Intra-SM Parallelism] Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks(@sjtu.edu.cn) [pdf] ⚠️ ⭐️
2022.06 [Microbenchmark] Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors(@tue.nl etc) [pdf] [DissectingTensorCores] ⭐️
2022.09 🔥🔥[FP8] FP8 FORMATS FOR DEEP LEARNING(@NVIDIA) [pdf] ⚠️ ⭐️
2023.08 🔥[Tensor Cores] Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library(@Tokyo Institute etc) [pdf] [wmma_extension] ⭐️
2023.03 🔥🔥[cutlass/cute] Graphene: An IR for Optimized Tensor Computations on GPUs(@NVIDIA) [pdf] [cutlass] ⭐️
2024.02 [QUICK] QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference(@SqueezeBits Inc) [pdf] [QUICK] ⭐️⭐️
2024.02 [Tensor Parallel] TP-AWARE DEQUANTIZATION(@IBM T.J. Watson Research Center) [pdf] ⚠️ ⭐️
2024.07 🔥🔥[flute] Fast Matrix Multiplications for Lookup Table-Quantized LLMs(@mit.edu etc) [pdf] [flute] ⭐️⭐️

📖Position Embed/Others

Date Title Paper Code Recom
2021.04 🔥[RoPE] ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING(@Zhuiyi Technology Co., Ltd.) [pdf] [transformers] ⭐️
2022.10 [ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs(@ByteDance&NVIDIA) [pdf] [ByteTransformer] ⭐️

©️License

GNU General Public License v3.0

🎉Contribute

Welcome to star & submit a PR to this repo!
