From 46b805436c1498eb6c149458cfc1d5aadb666c6e Mon Sep 17 00:00:00 2001
From: Lingyun Yang
Date: Mon, 6 Jan 2025 13:34:52 +0000
Subject: [PATCH] GITBOOK-209: Organize the papers of MLSys'24

---
 README.md                                |   2 +-
 reading-notes/conference/README.md       |   2 +-
 reading-notes/conference/hotnets-2024.md |   6 +-
 reading-notes/conference/mlsys-2024.md   | 113 +++++++++++++++++++----
 reading-notes/conference/osdi-2024.md    |   4 +-
 5 files changed, 103 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index bb7fff4..8bb6ae4 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ Specifically, I have a broad interest in systems (e.g., OSDI, SOSP, NSDI, ATC, E

 ## Changelogs

-* 01/2025: Update the paper list of [Research Skills](paper-list/research-skills.md); organize the papers of [HotNets 2024](reading-notes/conference/hotnets-2024.md).
+* 01/2025: Update the paper list of [Research Skills](paper-list/research-skills.md); organize the papers of [HotNets 2024](reading-notes/conference/hotnets-2024.md), [MLSys 2024](reading-notes/conference/mlsys-2024.md).
 * 12/2024: Briefly organize the papers of [EuroSys 2025](reading-notes/conference/eurosys-2025.md) (only Spring cycle); organize the papers of [SoCC 2024](reading-notes/conference/socc-2024.md), [SC 2024](reading-notes/conference/sc-2024.md); update the reading notes of [SOSP 2024](reading-notes/conference/sosp-2024.md).
 * 09/2024: Organize the papers of [SOSP 2024](reading-notes/conference/sosp-2024.md).
 * 08/2024: Organize the papers of [VLDB 2024](reading-notes/conference/vldb-2024.md); update the reading notes of [SIGCOMM 2024](reading-notes/conference/sigcomm-2024.md); create new paper lists of [diffusion models](paper-list/artificial-intelligence/diffusion-models.md), [language models](paper-list/artificial-intelligence/language-models.md), and [deep learning recommendation models](paper-list/artificial-intelligence/dlrm.md).
diff --git a/reading-notes/conference/README.md b/reading-notes/conference/README.md
index 1e4f16c..6e1f9fa 100644
--- a/reading-notes/conference/README.md
+++ b/reading-notes/conference/README.md
@@ -23,7 +23,7 @@
 | [OSDI 2024](osdi-2024.md) | Jul 10-12, 2024 | Santa Clara, CA, USA | 🧐; co-located with [ATC 2024](atc-2024.md) |
 | [ISCA 2024](isca-2024.md) | Jun 29-Jul 3, 2024 | Buenos Aires, Argentina | 🧐 |
 | [CVPR 2024](cvpr-2024.md) | Jun 17-21, 2024 | Seattle Convention Center, Seattle, WA, USA | 🧐 |
-| [MLSys 2024](mlsys-2024.md) | May 13-16, 2024 | Santa Clara Convention Center, USA | |
+| [MLSys 2024](mlsys-2024.md) | May 13-16, 2024 | Santa Clara Convention Center, USA | 🧐 |
 | [ASPLOS 2024](asplos-2024/) | Apr 27-May 1, 2024 | Hilton La Jolla Torrey Pines, San Diego, USA | 🧐 |
 | [EuroSys 2024](eurosys-2024/) | Apr 23-26, 2024 | Athens, Greece | |
 | [NSDI 2024](nsdi-2024.md) | Apr 16-18, 2024 | Santa Clara, CA, USA | 🧐 |
diff --git a/reading-notes/conference/hotnets-2024.md b/reading-notes/conference/hotnets-2024.md
index 8b95518..cd90bff 100644
--- a/reading-notes/conference/hotnets-2024.md
+++ b/reading-notes/conference/hotnets-2024.md
@@ -10,14 +10,14 @@ Paper list: [https://conferences.sigcomm.org/hotnets/2024/program.html](https://

 ### Large Language Models (LLMs)

-* Networking for LLM Training
+* Networking for LLM training
   * I’ve Got 99 Problems But FLOPS Ain’t One \[[Paper](https://conferences.sigcomm.org/hotnets/2024/papers/hotnets24-333.pdf)]
     * University Politehnica of Bucharest
     * The future of large-scale AI infrastructure requires
       * (1) novel wide-area transports for inter-DC communication;
       * (2) a multipath transport and novel datacenter topologies for intra-datacenter communication;
       * (3) high-speed scale-up networks and transport.
-* LLM for Networking
+* LLM for networking
   * Designing Network Algorithms via Large Language Models \[[Paper](https://conferences.sigcomm.org/hotnets/2024/papers/hotnets24-88.pdf)]
     * MSR
     * **NADA**: Network Algorithm Design Automation via LLMs
@@ -62,5 +62,5 @@ Paper list: [https://conferences.sigcomm.org/hotnets/2024/program.html](https://
 ## Acronyms

 * TTL: Time-To-Live
-* DNN: Deep Neural Networks
+* DNN: Deep Neural Network
 * DC: Datacenter
diff --git a/reading-notes/conference/mlsys-2024.md b/reading-notes/conference/mlsys-2024.md
index 71b1df0..9c62b57 100644
--- a/reading-notes/conference/mlsys-2024.md
+++ b/reading-notes/conference/mlsys-2024.md
@@ -8,20 +8,99 @@ Paper list: [https://mlsys.org/Conferences/2024/AcceptedPapers](https://mlsys.or

 ## Papers

-* S-LoRA: Serving Thousands of Concurrent LoRA Adapters \[[arXiv](https://arxiv.org/abs/2311.03285)] \[[Code](https://github.com/S-LoRA/S-LoRA)]
-  * UC Berkeley
-  * A system to serve many LoRA adapters
-    * Store all adapters in the main memory and fetch the adapters used by the currently running queries to the GPU memory
-    * Unified Paging — a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths
-    * Employ a tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation
-  * Built on top of [LightLLM](https://github.com/ModelTC/lightllm)
-* Punica: Multi-Tenant LoRA Serving \[[arXiv](https://arxiv.org/abs/2310.18547)] \[[Code](https://github.com/punica-ai/punica)]
-  * UW & Duke
-  * A system to serve multiple LoRA models in a shared GPU cluster
-    * A CUDA kernel — Segmented Gather Matrix-Vector Multiplication (SGMV)
-      * Batch GPU operations for concurrent execution of different LoRA models
-      * A GPU only needs to store a single copy of the pre-trained model
-    * A request scheduling mechanism to consolidate multi-tenant LoRA serving workloads
-      * Route the new request to a small set of active GPUs
-      * Allocate additional GPU resources when the existing GPUs are fully utilized
-      * Periodically migrate existing requests for consolidation
+### Large Language Models (LLMs)
+
+* LoRA serving
+  * S-LoRA: Serving Thousands of Concurrent LoRA Adapters \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/906419cd502575b617cc489a1a696a67-Paper-Conference.pdf)] \[[arXiv](https://arxiv.org/abs/2311.03285)] \[[Code](https://github.com/S-LoRA/S-LoRA)]
+    * UC Berkeley
+    * A system to serve many LoRA adapters
+      * Store all adapters in the main memory and fetch the adapters used by the currently running queries to the GPU memory
+      * Unified Paging — a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths
+      * Employ a tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation
+    * Built on top of [LightLLM](https://github.com/ModelTC/lightllm)
+  * Punica: Multi-Tenant LoRA Serving \[[arXiv](https://arxiv.org/abs/2310.18547)] \[[Code](https://github.com/punica-ai/punica)]
+    * UW & Duke
+    * A system to serve multiple LoRA models in a shared GPU cluster
+      * A CUDA kernel — Segmented Gather Matrix-Vector Multiplication (SGMV); see the batching sketch at the end of this section
+        * Batch GPU operations for concurrent execution of different LoRA models
+        * A GPU only needs to store a single copy of the pre-trained model
+      * A request scheduling mechanism to consolidate multi-tenant LoRA serving workloads; see the toy scheduler sketch at the end of this section
+        * Route the new request to a small set of active GPUs
+        * Allocate additional GPU resources when the existing GPUs are fully utilized
+        * Periodically migrate existing requests for consolidation
+* LLM inference
+  * Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/48fecef47b19fe501d27d338b6d52582-Paper-Conference.pdf)] \[[Code](https://github.com/d-matrix-ai/keyformer-llm)]
+    * UBC & d-Matrix
+  * Prompt Cache: Modular Attention Reuse for Low-Latency Inference \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/a66caa1703fe34705a4368c3014c1966-Paper-Conference.pdf)]
+    * Yale & Google
+  * HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/5431dca75a8d2abc1fb51e89e8324f10-Paper-Conference.pdf)]
+    * NUS
+  * Vidur: A Large-scale Simulation Framework for LLM Inference \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/b74a8de47d2b3c928360e0a011f48351-Paper-Conference.pdf)] \[[Code](https://github.com/microsoft/vidur)]
+    * GaTech & MSR India
+  * FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf)]
+    * THU & Infinigence-AI
+* LLM fine-tuning
+  * Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/b0131b6ee02a00b03fc3320176fec8f5-Paper-Conference.pdf)]
+    * UT-Austin
+* LLM for data manipulation
+  * UniDM: A Unified Framework for Data Manipulation with Large Language Models \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/dcb38c6ad7911842ab31081be9540b89-Paper-Conference.pdf)]
+    * Alibaba & USTC
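+
+To make the heterogeneous-batching idea behind S-LoRA and Punica's SGMV kernel concrete, here is a minimal PyTorch-style sketch (illustrative only: the function and variable names are invented here, and both systems actually fuse this per-adapter loop into custom CUDA kernels):
+
+```python
+# Reference sketch, not the papers' kernels: one GEMM over the shared base
+# weight, plus a segmented LoRA update that groups token rows by adapter id.
+import torch
+
+def batched_lora_forward(x, w_base, adapters, adapter_ids):
+    """x: [n_tokens, d_in]; w_base: [d_out, d_in];
+    adapters: list of (A: [d_in, r_i], B: [r_i, d_out]), one pair per adapter;
+    adapter_ids: [n_tokens] tensor mapping each row to its adapter."""
+    y = x @ w_base.t()                     # shared base model, single copy on the GPU
+    for i, (A, B) in enumerate(adapters):  # one segment per adapter; ranks r_i may differ
+        rows = (adapter_ids == i).nonzero(as_tuple=True)[0]
+        if rows.numel() == 0:
+            continue
+        y[rows] += (x[rows] @ A) @ B       # low-rank update for this adapter's rows
+    return y
+
+# Tiny usage example: 5 tokens, 2 adapters with different ranks.
+torch.manual_seed(0)
+d_in, d_out = 16, 8
+x = torch.randn(5, d_in)
+w_base = torch.randn(d_out, d_in)
+adapters = [(torch.randn(d_in, 4), torch.randn(4, d_out)),
+            (torch.randn(d_in, 2), torch.randn(2, d_out))]
+adapter_ids = torch.tensor([0, 1, 0, 1, 1])
+print(batched_lora_forward(x, w_base, adapters, adapter_ids).shape)  # torch.Size([5, 8])
+```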
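+
+Punica's request scheduling (route to a small set of active GPUs, scale out when saturated, migrate for consolidation) can be sketched with a toy policy like the one below (a rough sketch; the capacity model, class name, and migration rule are assumptions, not the paper's actual scheduler):
+
+```python
+# Toy consolidation-style scheduler: pack new requests onto already-active GPUs,
+# allocate a new GPU only when all active ones are full, and periodically drain
+# the least-loaded GPU so it can be released.
+class ConsolidatingScheduler:
+    def __init__(self, capacity_per_gpu=32):
+        self.capacity = capacity_per_gpu
+        self.active = []                  # per-GPU running-request counts
+
+    def route(self):
+        candidates = [i for i, load in enumerate(self.active) if load < self.capacity]
+        if not candidates:
+            self.active.append(0)         # all active GPUs are full: allocate one more
+            candidates = [len(self.active) - 1]
+        gpu = max(candidates, key=lambda i: self.active[i])  # pack, don't spread
+        self.active[gpu] += 1
+        return gpu
+
+    def finish(self, gpu):
+        self.active[gpu] -= 1
+
+    def consolidate(self):
+        if len(self.active) < 2:
+            return
+        src = min(range(len(self.active)), key=lambda i: self.active[i])
+        for dst in range(len(self.active)):
+            if dst != src:
+                move = min(self.active[src], self.capacity - self.active[dst])
+                self.active[src] -= move
+                self.active[dst] += move
+        if self.active[src] == 0:
+            self.active.pop(src)          # release the drained GPU
+
+sched = ConsolidatingScheduler(capacity_per_gpu=2)
+gpus = [sched.route() for _ in range(5)]  # -> [0, 0, 1, 1, 2]
+sched.finish(gpus[0]); sched.finish(gpus[2])
+sched.consolidate()                       # drains the least-loaded GPU
+print(sched.active)                       # -> [2, 1]
+```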
+
+### Mixture-of-Experts (MoEs)
+
+* MoE training
+  * Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/339caf45a6fa281cae8adc6465343464-Paper-Conference.pdf)]
+    * HKU & AWS & Boson AI
+* MoE inference
+  * QMoE: Sub-1-Bit Compression of Trillion Parameter Models \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/c74b624843218d9b6713fcf299d6d5e4-Paper-Conference.pdf)] \[[Code](https://github.com/IST-DASLab/qmoe)]
+    * Institute of Science and Technology Austria
+  * SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
+
+### Diffusion Models
+
+* DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/45c1f6a8cbf2da59ebf2c802b4f742cd-Paper-Conference.pdf)]
+  * HKU & AWS
+
+### Deep Learning Recommendation Models (DLRMs)
+
+* Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/78834433edc3291f4c6cbbd2759324db-Paper-Conference.pdf)]
+  * Meta AI
+
+### ML Compilation
+
+* ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/096b1019463f34eb241e87cfce8dfe16-Paper-Conference.pdf)]
+  * CMU
+  * Perform hybrid static+dynamic compiler optimizations and end-to-end tensor code generation
+
+### Quantization
+
+* FP8
+  * Efficient Post-training Quantization with FP8 Formats \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/dea9b4b6f55ae611c54065d6fc750755-Paper-Conference.pdf)]
+    * Intel
+* LLM
+  * AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/42a452cbafa9dd64e9ba4aa95cc1ef21-Paper-Conference.pdf)] \[[Code](https://github.com/mit-han-lab/llm-awq)]
+    * MIT
+    * **Best Paper Award**
+  * Atom: Low-bit Quantization for Efficient and Accurate LLM Serving \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/5edb57c05c81d04beb716ef1d542fe9e-Paper-Conference.pdf)] \[[Code](https://github.com/efeslab/Atom)] \[[Slides](https://github.com/efeslab/Atom/blob/main/figures/atom_mlsys_slides.pdf)] \[[Poster](https://github.com/efeslab/Atom/blob/main/figures/atom_mlsys_poster.pdf)]
+    * UW
+  * Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf)] \[[Code](https://github.com/VITA-Group/Q-Hitter)]
+    * UT-Austin & Oxford & Eindhoven University of Technology & Lawrence Livermore National Laboratory & CMU
+* ML training
+  * JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training \[[Paper](https://arxiv.org/pdf/2311.05034)] \[[Slides](https://mlsys.org/media/mlsys-2024/Slides/2660.pdf)]
+    * AMD
+
+### Model Adaptation
+
+* FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms \[Paper] \[[Code](https://gitlab.engr.illinois.edu/DEPEND/flash)] \[[Slides](https://haoran-qiu.com/slides/flash-slides.pdf)]
+
+### Cloud Configuration Generation
+
+* CloudEval-YAML: A Practical Benchmark for Cloud Native YAML Configuration Generation \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/554e056fe2b6d9fd27ffcd3367ae1267-Paper-Conference.pdf)] \[[Homepage](https://cloudeval-yaml.github.io)] \[[Code](https://github.com/alibaba/CloudEval-YAML)] \[[Benchmark](https://huggingface.co/datasets/ai4cloud/CloudEval-YAML)]
+  * Alibaba Cloud & UMich & UCLA & UC Merced
+
+## Acronyms
+
+* ML: Machine Learning
+* LLM: Large Language Model
+* LoRA: Low-Rank Adaptation
+* MoE: Mixture-of-Experts
diff --git a/reading-notes/conference/osdi-2024.md b/reading-notes/conference/osdi-2024.md
index b368adf..088853e 100644
--- a/reading-notes/conference/osdi-2024.md
+++ b/reading-notes/conference/osdi-2024.md
@@ -8,7 +8,7 @@ Paper list: [https://www.usenix.org/conference/osdi24/technical-sessions](https:

 ## Papers

-### Serving Large Language Models (LLMs)
+### Large Language Models (LLMs)

 * Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve \[[Paper](https://www.usenix.org/conference/osdi24/presentation/agrawal)] \[[Code](https://github.com/microsoft/sarathi-serve)]
   * MSR India & GaTech
@@ -97,7 +97,7 @@ Paper list: [https://www.usenix.org/conference/osdi24/technical-sessions](https:
 * Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhai)] \[[Code](https://github.com/zhaiyi000/tlm)]
   * USTC & Huawei & ByteDance & Hunan University
   * Tensor Language Model (TLM)
-* Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wang-lei)] \[[Code](https://github.com/microsoft/BitBLAS/tree/osdi24\_ladder\_artifact)]
+* Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wang-lei)] \[[Code](https://github.com/microsoft/BitBLAS/tree/osdi24_ladder_artifact)]
   * MSRA
 * MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhuang)] \[[Code](https://github.com/AlibabaResearch/mononn)]
   * Sydney & Alibaba