From 46b805436c1498eb6c149458cfc1d5aadb666c6e Mon Sep 17 00:00:00 2001
From: Lingyun Yang
Date: Mon, 6 Jan 2025 13:34:52 +0000
Subject: [PATCH] GITBOOK-209: Organize the papers of MLSys'24

---
 README.md                                |   2 +-
 reading-notes/conference/README.md       |   2 +-
 reading-notes/conference/hotnets-2024.md |   6 +-
 reading-notes/conference/mlsys-2024.md   | 113 +++++++++++++++++++----
 reading-notes/conference/osdi-2024.md    |   4 +-
 5 files changed, 103 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index bb7fff4..8bb6ae4 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ Specifically, I have a broad interest in systems (e.g., OSDI, SOSP, NSDI, ATC, E

 ## Changelogs

-* 01/2025: Update the paper list of [Research Skills](paper-list/research-skills.md); organize the papers of [HotNets 2024](reading-notes/conference/hotnets-2024.md).
+* 01/2025: Update the paper list of [Research Skills](paper-list/research-skills.md); organize the papers of [HotNets 2024](reading-notes/conference/hotnets-2024.md), [MLSys 2024](reading-notes/conference/mlsys-2024.md).
 * 12/2024: Briefly organize the papers of [EuroSys 2025](reading-notes/conference/eurosys-2025.md) (only Spring cycle); organize the papers of [SoCC 2024](reading-notes/conference/socc-2024.md), [SC 2024](reading-notes/conference/sc-2024.md); update the reading notes of [SOSP 2024](reading-notes/conference/sosp-2024.md).
 * 09/2024: Organize the papers of [SOSP 2024](reading-notes/conference/sosp-2024.md).
 * 08/2024: Organize the papers of [VLDB 2024](reading-notes/conference/vldb-2024.md); update the reading notes of [SIGCOMM 2024](reading-notes/conference/sigcomm-2024.md); create new paper lists of [diffusion models](paper-list/artificial-intelligence/diffusion-models.md), [language models](paper-list/artificial-intelligence/language-models.md), and [deep learning recommendation models](paper-list/artificial-intelligence/dlrm.md).
diff --git a/reading-notes/conference/README.md b/reading-notes/conference/README.md
index 1e4f16c..6e1f9fa 100644
--- a/reading-notes/conference/README.md
+++ b/reading-notes/conference/README.md
@@ -23,7 +23,7 @@
 | [OSDI 2024](osdi-2024.md) | Jul 10-12, 2024 | Santa Clara, CA, USA | 🧐; co-located with [ATC 2024](atc-2024.md) |
 | [ISCA 2024](isca-2024.md) | Jun 29-Jul 3, 2024 | Buenos Aires, Argentina | 🧐 |
 | [CVPR 2024](cvpr-2024.md) | Jun 17-21, 2024 | Seattle Convention Center, Seattle, WA, USA | 🧐 |
-| [MLSys 2024](mlsys-2024.md) | May 13-16, 2024 | Santa Clara Convention Center, USA | |
+| [MLSys 2024](mlsys-2024.md) | May 13-16, 2024 | Santa Clara Convention Center, USA | 🧐 |
 | [ASPLOS 2024](asplos-2024/) | Apr 27-May 1, 2024 | Hilton La Jolla Torrey Pines, San Diego, USA | 🧐 |
 | [EuroSys 2024](eurosys-2024/) | Apr 23-26, 2024 | Athens, Greece | |
 | [NSDI 2024](nsdi-2024.md) | Apr 16-18, 2024 | Santa Clara, CA, USA | 🧐 |
diff --git a/reading-notes/conference/hotnets-2024.md b/reading-notes/conference/hotnets-2024.md
index 8b95518..cd90bff 100644
--- a/reading-notes/conference/hotnets-2024.md
+++ b/reading-notes/conference/hotnets-2024.md
@@ -10,14 +10,14 @@ Paper list: [https://conferences.sigcomm.org/hotnets/2024/program.html](https://

 ### Large Language Models (LLMs)

-* Networking for LLM Training
+* Networking for LLM training
   * I’ve Got 99 Problems But FLOPS Ain’t One \[[Paper](https://conferences.sigcomm.org/hotnets/2024/papers/hotnets24-333.pdf)]
     * University Politehnica of Bucharest
     * The future of large-scale AI infrastructure requires
       * (1) novel wide-area transports for inter-DC communication;
       * (2) a multipath transport and novel datacenter topologies for intra-datacenter communication;
       * (3) high-speed scale-up networks and transport.
-* LLM for Networking
+* LLM for networking
   * Designing Network Algorithms via Large Language Models \[[Paper](https://conferences.sigcomm.org/hotnets/2024/papers/hotnets24-88.pdf)]
     * MSR
     * **NADA**: Network Algorithm Design Automation via LLMs
@@ -62,5 +62,5 @@ Paper list: [https://conferences.sigcomm.org/hotnets/2024/program.html](https://
 ## Acronyms

 * TTL: Time-To-Live
-* DNN: Deep Neural Networks
+* DNN: Deep Neural Network
 * DC: Datacenter
diff --git a/reading-notes/conference/mlsys-2024.md b/reading-notes/conference/mlsys-2024.md
index 71b1df0..9c62b57 100644
--- a/reading-notes/conference/mlsys-2024.md
+++ b/reading-notes/conference/mlsys-2024.md
@@ -8,20 +8,99 @@ Paper list: [https://mlsys.org/Conferences/2024/AcceptedPapers](https://mlsys.or

 ## Papers

-* S-LoRA: Serving Thousands of Concurrent LoRA Adapters \[[arXiv](https://arxiv.org/abs/2311.03285)] \[[Code](https://github.com/S-LoRA/S-LoRA)]
-  * UC Berkeley
-  * A system to serve many LoRA adapters
-    * Store all adapters in the main memory and fetch the adapters used by the currently running queries to the GPU memory
-    * Unified Paging — a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths
-    * Employ a tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation
-  * Built on top of [LightLLM](https://github.com/ModelTC/lightllm)
-* Punica: Multi-Tenant LoRA Serving \[[arXiv](https://arxiv.org/abs/2310.18547)] \[[Code](https://github.com/punica-ai/punica)]
-  * UW & Duke
-  * A system to serve multiple LoRA models in a shared GPU cluster
-    * A CUDA kernel — Segmented Gather Matrix-Vector Multiplication (SGMV)
-      * Batch GPU operations for concurrent execution of different LoRA models
-      * A GPU only needs to store a single copy of the pre-trained model
-    * A request scheduling mechanism to consolidate multi-tenant LoRA serving workloads
-      * Route the new request to a small set of active GPUs
-      * Allocate additional GPU resources when the existing GPUs are fully utilized
-      * Periodically migrate existing requests for consolidation
+### Large Language Models (LLMs)
+
+* LoRA serving
+  * S-LoRA: Serving Thousands of Concurrent LoRA Adapters \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/906419cd502575b617cc489a1a696a67-Paper-Conference.pdf)] \[[arXiv](https://arxiv.org/abs/2311.03285)] \[[Code](https://github.com/S-LoRA/S-LoRA)]
+    * UC Berkeley
+    * A system to serve many LoRA adapters
+      * Store all adapters in the main memory and fetch the adapters used by the currently running queries to the GPU memory
+      * Unified Paging — a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths
+      * Employ a tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation
+    * Built on top of [LightLLM](https://github.com/ModelTC/lightllm)
+  * Punica: Multi-Tenant LoRA Serving \[[arXiv](https://arxiv.org/abs/2310.18547)] \[[Code](https://github.com/punica-ai/punica)]
+    * UW & Duke
+    * A system to serve multiple LoRA models in a shared GPU cluster
+      * A CUDA kernel — Segmented Gather Matrix-Vector Multiplication (SGMV); see the batching sketch at the end of this section
+        * Batch GPU operations for concurrent execution of different LoRA models
+        * A GPU only needs to store a single copy of the pre-trained model
+      * A request scheduling mechanism to consolidate multi-tenant LoRA serving workloads; see the toy scheduler sketch at the end of this section
+        * Route the new request to a small set of active GPUs
+        * Allocate additional GPU resources when the existing GPUs are fully utilized
+        * Periodically migrate existing requests for consolidation
+* LLM inference
+  * Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/48fecef47b19fe501d27d338b6d52582-Paper-Conference.pdf)] \[[Code](https://github.com/d-matrix-ai/keyformer-llm)]
+    * UBC & d-Matrix
+  * Prompt Cache: Modular Attention Reuse for Low-Latency Inference \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/a66caa1703fe34705a4368c3014c1966-Paper-Conference.pdf)]
+    * Yale & Google
+  * HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/5431dca75a8d2abc1fb51e89e8324f10-Paper-Conference.pdf)]
+    * NUS
+  * Vidur: A Large-scale Simulation Framework for LLM Inference \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/b74a8de47d2b3c928360e0a011f48351-Paper-Conference.pdf)] \[[Code](https://github.com/microsoft/vidur)]
+    * GaTech & MSR India
+  * FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf)]
+    * THU & Infinigence-AI
+* LLM fine-tuning
+  * Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/b0131b6ee02a00b03fc3320176fec8f5-Paper-Conference.pdf)]
+    * UT-Austin
+* LLM for data manipulation
+  * UniDM: A Unified Framework for Data Manipulation with Large Language Models \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/dcb38c6ad7911842ab31081be9540b89-Paper-Conference.pdf)]
+    * Alibaba & USTC
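+
+To make the heterogeneous-batching idea behind S-LoRA and Punica's SGMV kernel concrete, here is a minimal PyTorch-style sketch (illustrative only: the function and variable names are invented here, and both systems actually fuse this per-adapter loop into custom CUDA kernels):
+
+```python
+# Reference sketch, not the papers' kernels: one GEMM over the shared base
+# weight, plus a segmented LoRA update that groups token rows by adapter id.
+import torch
+
+def batched_lora_forward(x, w_base, adapters, adapter_ids):
+    """x: [n_tokens, d_in]; w_base: [d_out, d_in];
+    adapters: list of (A: [d_in, r_i], B: [r_i, d_out]), one pair per adapter;
+    adapter_ids: [n_tokens] tensor mapping each row to its adapter."""
+    y = x @ w_base.t()                     # shared base model, single copy on the GPU
+    for i, (A, B) in enumerate(adapters):  # one segment per adapter; ranks r_i may differ
+        rows = (adapter_ids == i).nonzero(as_tuple=True)[0]
+        if rows.numel() == 0:
+            continue
+        y[rows] += (x[rows] @ A) @ B       # low-rank update for this adapter's rows
+    return y
+
+# Tiny usage example: 5 tokens, 2 adapters with different ranks.
+torch.manual_seed(0)
+d_in, d_out = 16, 8
+x = torch.randn(5, d_in)
+w_base = torch.randn(d_out, d_in)
+adapters = [(torch.randn(d_in, 4), torch.randn(4, d_out)),
+            (torch.randn(d_in, 2), torch.randn(2, d_out))]
+adapter_ids = torch.tensor([0, 1, 0, 1, 1])
+print(batched_lora_forward(x, w_base, adapters, adapter_ids).shape)  # torch.Size([5, 8])
+```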
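+
+Punica's request scheduling (route to a small set of active GPUs, scale out when saturated, migrate for consolidation) can be sketched with a toy policy like the one below (a rough sketch; the capacity model, class name, and migration rule are assumptions, not the paper's actual scheduler):
+
+```python
+# Toy consolidation-style scheduler: pack new requests onto already-active GPUs,
+# allocate a new GPU only when all active ones are full, and periodically drain
+# the least-loaded GPU so it can be released.
+class ConsolidatingScheduler:
+    def __init__(self, capacity_per_gpu=32):
+        self.capacity = capacity_per_gpu
+        self.active = []                  # per-GPU running-request counts
+
+    def route(self):
+        candidates = [i for i, load in enumerate(self.active) if load < self.capacity]
+        if not candidates:
+            self.active.append(0)         # all active GPUs are full: allocate one more
+            candidates = [len(self.active) - 1]
+        gpu = max(candidates, key=lambda i: self.active[i])  # pack, don't spread
+        self.active[gpu] += 1
+        return gpu
+
+    def finish(self, gpu):
+        self.active[gpu] -= 1
+
+    def consolidate(self):
+        if len(self.active) < 2:
+            return
+        src = min(range(len(self.active)), key=lambda i: self.active[i])
+        for dst in range(len(self.active)):
+            if dst != src:
+                move = min(self.active[src], self.capacity - self.active[dst])
+                self.active[src] -= move
+                self.active[dst] += move
+        if self.active[src] == 0:
+            self.active.pop(src)          # release the drained GPU
+
+sched = ConsolidatingScheduler(capacity_per_gpu=2)
+gpus = [sched.route() for _ in range(5)]  # -> [0, 0, 1, 1, 2]
+sched.finish(gpus[0]); sched.finish(gpus[2])
+sched.consolidate()                       # drains the least-loaded GPU
+print(sched.active)                       # -> [2, 1]
+```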
+
+### Mixture-of-Experts (MoEs)
+
+* MoE training
+  * Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/339caf45a6fa281cae8adc6465343464-Paper-Conference.pdf)]
+    * HKU & AWS & Boson AI
+* MoE inference
+  * QMoE: Sub-1-Bit Compression of Trillion Parameter Models \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/c74b624843218d9b6713fcf299d6d5e4-Paper-Conference.pdf)] \[[Code](https://github.com/IST-DASLab/qmoe)]
+    * Institute of Science and Technology Austria
+  * SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
+
+### Diffusion Models
+
+* DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/45c1f6a8cbf2da59ebf2c802b4f742cd-Paper-Conference.pdf)]
+  * HKU & AWS
+
+### Deep Learning Recommendation Models (DLRMs)
+
+* Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/78834433edc3291f4c6cbbd2759324db-Paper-Conference.pdf)]
+  * Meta AI
+
+### ML Compilation
+
+* ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/096b1019463f34eb241e87cfce8dfe16-Paper-Conference.pdf)]
+  * CMU
+  * Perform hybrid static+dynamic compiler optimizations and end-to-end tensor code generation
+
+### Quantization
+
+* FP8
+  * Efficient Post-training Quantization with FP8 Formats \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/dea9b4b6f55ae611c54065d6fc750755-Paper-Conference.pdf)]
+    * Intel
+* LLM
+  * AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/42a452cbafa9dd64e9ba4aa95cc1ef21-Paper-Conference.pdf)] \[[Code](https://github.com/mit-han-lab/llm-awq)]
+    * MIT
+    * **Best Paper Award**
+  * Atom: Low-bit Quantization for Efficient and Accurate LLM Serving \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/5edb57c05c81d04beb716ef1d542fe9e-Paper-Conference.pdf)] \[[Code](https://github.com/efeslab/Atom)] \[[Slides](https://github.com/efeslab/Atom/blob/main/figures/atom_mlsys_slides.pdf)] \[[Poster](https://github.com/efeslab/Atom/blob/main/figures/atom_mlsys_poster.pdf)]
+    * UW
+  * Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf)] \[[Code](https://github.com/VITA-Group/Q-Hitter)]
+    * UT-Austin & Oxford & Eindhoven University of Technology & Lawrence Livermore National Laboratory & CMU
+* ML training
+  * JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training \[[Paper](https://arxiv.org/pdf/2311.05034)] \[[Slides](https://mlsys.org/media/mlsys-2024/Slides/2660.pdf)]
+    * AMD
+
+### Model Adaptation
+
+* FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms \[Paper] \[[Code](https://gitlab.engr.illinois.edu/DEPEND/flash)] \[[Slides](https://haoran-qiu.com/slides/flash-slides.pdf)]
+
+### Cloud Configuration Generation
+
+* CloudEval-YAML: A Practical Benchmark for Cloud Native YAML Configuration Generation \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/554e056fe2b6d9fd27ffcd3367ae1267-Paper-Conference.pdf)] \[[Homepage](https://cloudeval-yaml.github.io)] \[[Code](https://github.com/alibaba/CloudEval-YAML)] \[[Benchmark](https://huggingface.co/datasets/ai4cloud/CloudEval-YAML)]
+  * Alibaba Cloud & UMich & UCLA & UC Merced
+
+## Acronyms
+
+* ML: Machine Learning
+* LLM: Large Language Model
+* LoRA: Low-Rank Adaptation
+* MoE: Mixture-of-Experts
diff --git a/reading-notes/conference/osdi-2024.md b/reading-notes/conference/osdi-2024.md
index b368adf..088853e 100644
--- a/reading-notes/conference/osdi-2024.md
+++ b/reading-notes/conference/osdi-2024.md
@@ -8,7 +8,7 @@ Paper list: [https://www.usenix.org/conference/osdi24/technical-sessions](https:

 ## Papers

-### Serving Large Language Models (LLMs)
+### Large Language Models (LLMs)

 * Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve \[[Paper](https://www.usenix.org/conference/osdi24/presentation/agrawal)] \[[Code](https://github.com/microsoft/sarathi-serve)]
   * MSR India & GaTech
@@ -97,7 +97,7 @@ Paper list: [https://www.usenix.org/conference/osdi24/technical-sessions](https:
 * Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhai)] \[[Code](https://github.com/zhaiyi000/tlm)]
   * USTC & Huawei & ByteDance & Hunan University
   * Tensor Language Model (TLM)
-* Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wang-lei)] \[[Code](https://github.com/microsoft/BitBLAS/tree/osdi24\_ladder\_artifact)]
+* Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wang-lei)] \[[Code](https://github.com/microsoft/BitBLAS/tree/osdi24_ladder_artifact)]
   * MSRA
 * MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhuang)] \[[Code](https://github.com/AlibabaResearch/mononn)]
   * Sydney & Alibaba