GITBOOK-209: Organize the papers of MLSys'24
mental2008 authored and gitbook-bot committed Jan 6, 2025
1 parent ebd6155 commit 46b8054
Showing 5 changed files with 103 additions and 24 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -18,7 +18,7 @@ Specifically, I have a broad interest in systems (e.g., OSDI, SOSP, NSDI, ATC, E

## Changelogs

* 01/2025: Update the paper list of [Research Skills](paper-list/research-skills.md); organize the papers of [HotNets 2024](reading-notes/conference/hotnets-2024.md).
* 01/2025: Update the paper list of [Research Skills](paper-list/research-skills.md); organize the papers of [HotNets 2024](reading-notes/conference/hotnets-2024.md), [MLSys 2024](reading-notes/conference/mlsys-2024.md).
* 12/2024: Briefly organize the papers of [EuroSys 2025](reading-notes/conference/eurosys-2025.md) (only Spring cycle); organize the papers of [SoCC 2024](reading-notes/conference/socc-2024.md), [SC 2024](reading-notes/conference/sc-2024.md); update the reading notes of [SOSP 2024](reading-notes/conference/sosp-2024.md).
* 09/2024: Organize the papers of [SOSP 2024](reading-notes/conference/sosp-2024.md).
* 08/2024: Organize the papers of [VLDB 2024](reading-notes/conference/vldb-2024.md); update the reading notes of [SIGCOMM 2024](reading-notes/conference/sigcomm-2024.md); create new paper lists of [diffusion models](paper-list/artificial-intelligence/diffusion-models.md), [language models](paper-list/artificial-intelligence/language-models.md), and [deep learning recommendation models](paper-list/artificial-intelligence/dlrm.md).
2 changes: 1 addition & 1 deletion reading-notes/conference/README.md
@@ -23,7 +23,7 @@
| [OSDI 2024](osdi-2024.md) | Jul 10-12, 2024 | Santa Clara, CA, USA | 🧐; co-located with [ATC 2024](atc-2024.md) |
| [ISCA 2024](isca-2024.md) | Jun 29-Jul 3, 2024 | Buenos Aires, Argentina | 🧐 |
| [CVPR 2024](cvpr-2024.md) | Jun 17-21, 2024 | Seattle Convention Center, Seattle, WA, USA | 🧐 |
| [MLSys 2024](mlsys-2024.md) | May 13-16, 2024 | Santa Clara Convention Center, USA | |
| [MLSys 2024](mlsys-2024.md) | May 13-16, 2024 | Santa Clara Convention Center, USA | 🧐 |
| [ASPLOS 2024](asplos-2024/) | Apr 27-May 1, 2024 | Hilton La Jolla Torrey Pines, San Diego, USA | 🧐 |
| [EuroSys 2024](eurosys-2024/) | Apr 23-26, 2024 | Athens, Greece | |
| [NSDI 2024](nsdi-2024.md) | Apr 16-18, 2024 | Santa Clara, CA, USA | 🧐 |
6 changes: 3 additions & 3 deletions reading-notes/conference/hotnets-2024.md
@@ -10,14 +10,14 @@ Paper list: [https://conferences.sigcomm.org/hotnets/2024/program.html](https://

### Large Language Models (LLMs)

* Networking for LLM Training
* Networking for LLM training
* I’ve Got 99 Problems But FLOPS Ain’t One \[[Paper](https://conferences.sigcomm.org/hotnets/2024/papers/hotnets24-333.pdf)]
* University Politehnica of Bucharest
* The future of large-scale AI infrastructure requires
* (1) novel wide-area transports for inter-DC communication;
* (2) a multipath transport and novel datacenter topologies for intra-datacenter communication;
* (3) high-speed scale-up networks and transport.
* LLM for Networking
* LLM for networking
* Designing Network Algorithms via Large Language Models \[[Paper](https://conferences.sigcomm.org/hotnets/2024/papers/hotnets24-88.pdf)]
* MSR
* **NADA**: Network Algorithm Design Automation via LLMs
@@ -62,5 +62,5 @@ Paper list: [https://conferences.sigcomm.org/hotnets/2024/program.html](https://
## Acronyms

* TTL: Time-To-Live
* DNN: Deep Neural Networks
* DNN: Deep Neural Network
* DC: Datacenter
113 changes: 96 additions & 17 deletions reading-notes/conference/mlsys-2024.md
@@ -8,20 +8,99 @@ Paper list: [https://mlsys.org/Conferences/2024/AcceptedPapers](https://mlsys.or

## Papers

* S-LoRA: Serving Thousands of Concurrent LoRA Adapters \[[arXiv](https://arxiv.org/abs/2311.03285)] \[[Code](https://github.com/S-LoRA/S-LoRA)]
* UC Berkeley
* A system to serve many LoRA adapters
* Store all adapters in the main memory and fetch the adapters used by the currently running queries to the GPU memory
* Unified Paging — a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths
* Employ a tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation
* Built on top of [LightLLM](https://github.com/ModelTC/lightllm)
* Punica: Multi-Tenant LoRA Serving \[[arXiv](https://arxiv.org/abs/2310.18547)] \[[Code](https://github.com/punica-ai/punica)]
* UW & Duke
* A system to serve multiple LoRA models in a shared GPU cluster
* A CUDA kernel — Segmented Gather Matrix-Vector Multiplication (SGMV)
* Batch GPU operations for concurrent execution of different LoRA models
* A GPU only needs to store a single copy of the pre-trained model
* A request scheduling mechanism to consolidate multi-tenant LoRA serving workloads
* Route the new request to a small set of active GPUs
* Allocate additional GPU resources when the existing GPUs are fully utilized
* Periodically migrate existing requests for consolidation
### Large Language Models (LLMs)

* LoRA serving
* S-LoRA: Serving Thousands of Concurrent LoRA Adapters \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/906419cd502575b617cc489a1a696a67-Paper-Conference.pdf)] \[[arXiv](https://arxiv.org/abs/2311.03285)] \[[Code](https://github.com/S-LoRA/S-LoRA)]
* UC Berkeley
* A system to serve many LoRA adapters
* Store all adapters in the main memory and fetch the adapters used by the currently running queries to the GPU memory (see the sketch after this list)
* Unified Paging — a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths
* Employ a tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation
* Built on top of [LightLLM](https://github.com/ModelTC/lightllm)
* Punica: Multi-Tenant LoRA Serving \[[arXiv](https://arxiv.org/abs/2310.18547)] \[[Code](https://github.com/punica-ai/punica)]
* UW & Duke
* A system to serve multiple LoRA models in a shared GPU cluster
* A CUDA kernel — Segmented Gather Matrix-Vector Multiplication (SGMV)
* Batch GPU operations for concurrent execution of different LoRA models
* A GPU only needs to store a single copy of the pre-trained model
* A request scheduling mechanism to consolidate multi-tenant LoRA serving workloads
* Route the new request to a small set of active GPUs
* Allocate additional GPU resources when the existing GPUs are fully utilized
* Periodically migrate existing requests for consolidation
* LLM inference
* Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/48fecef47b19fe501d27d338b6d52582-Paper-Conference.pdf)] \[[Code](https://github.com/d-matrix-ai/keyformer-llm)]
* UBC & d-Matrix
* Prompt Cache: Modular Attention Reuse for Low-Latency Inference \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/a66caa1703fe34705a4368c3014c1966-Paper-Conference.pdf)]
* Yale & Google
* HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/5431dca75a8d2abc1fb51e89e8324f10-Paper-Conference.pdf)]
* NUS
* Vidur: A Large-scale Simulation Framework for LLM Inference \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/b74a8de47d2b3c928360e0a011f48351-Paper-Conference.pdf)] \[[Code](https://github.com/microsoft/vidur)]
* GaTech & MSR India
* FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf)]
* THU & Infinigence-AI
* LLM fine-tuning
* Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/b0131b6ee02a00b03fc3320176fec8f5-Paper-Conference.pdf)]
* UT-Austin
* LLM for data manipulation
* UniDM: A Unified Framework for Data Manipulation with Large Language Models \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/dcb38c6ad7911842ab31081be9540b89-Paper-Conference.pdf)]
* Alibaba & USTC
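
A minimal, illustrative sketch of the multi-adapter LoRA serving pattern described above: adapters stay in host memory, only those needed by the current batch are copied to the GPU, and each request's low-rank update is applied on top of a single shared base weight. This is not the S-LoRA or Punica implementation; the names, shapes, and the per-request loop are assumptions for illustration, and a real system would replace the loop with a batched kernel such as Punica's SGMV.

```python
import torch

HIDDEN, RANK = 64, 8
device = "cuda" if torch.cuda.is_available() else "cpu"

# One shared copy of the base (pre-trained) weight on the GPU.
base_weight = torch.randn(HIDDEN, HIDDEN, device=device)

# Many LoRA adapters (A, B factor pairs), kept in host memory.
cpu_adapters = {
    f"adapter_{i}": (torch.randn(HIDDEN, RANK), torch.randn(RANK, HIDDEN))
    for i in range(3)
}

def serve_batch(x, adapter_names):
    """x: (batch, HIDDEN); adapter_names[i] names the adapter for request i."""
    # Fetch only the adapters used by this batch to the GPU.
    active = {
        name: (a.to(device), b.to(device))
        for name, (a, b) in cpu_adapters.items()
        if name in set(adapter_names)
    }
    out = x @ base_weight  # shared base computation for the whole batch
    for i, name in enumerate(adapter_names):
        a, b = active[name]
        # Per-request low-rank update; a real system batches these
        # heterogeneous updates with a custom (SGMV-style) kernel.
        out[i] += x[i] @ a @ b
    return out

x = torch.randn(2, HIDDEN, device=device)
print(serve_batch(x, ["adapter_1", "adapter_0"]).shape)  # torch.Size([2, 64])
```

The property this sketch preserves is that the GPU holds a single copy of the base weights while adapter traffic scales only with the active batch, which is what makes serving many concurrent adapters feasible.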

### Mixture-of-Experts (MoEs)

* MoE training
* Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/339caf45a6fa281cae8adc6465343464-Paper-Conference.pdf)]
* HKU & AWS & Boson AI
* MoE inference
* QMoE: Sub-1-Bit Compression of Trillion Parameter Models \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/c74b624843218d9b6713fcf299d6d5e4-Paper-Conference.pdf)] \[[Code](https://github.com/IST-DASLab/qmoe)]
* Institute of Science and Technology Austria
* SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

### Diffusion Models

* DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/45c1f6a8cbf2da59ebf2c802b4f742cd-Paper-Conference.pdf)]
* HKU & AWS

### Deep Learning Recommendation Models (DLRMs)

* Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/78834433edc3291f4c6cbbd2759324db-Paper-Conference.pdf)]
* Meta AI

### ML Compilation

* ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/096b1019463f34eb241e87cfce8dfe16-Paper-Conference.pdf)]
* CMU
* Perform hybrid static+dynamic compiler optimizations and end-to-end tensor code generation

### Quantization

* FP8
* Efficient Post-training Quantization with FP8 Formats \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/dea9b4b6f55ae611c54065d6fc750755-Paper-Conference.pdf)]
* Intel
* LLM quantization (see the sketch after this list)
* AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/42a452cbafa9dd64e9ba4aa95cc1ef21-Paper-Conference.pdf)] \[[Code](https://github.com/mit-han-lab/llm-awq)]
* MIT
* **Best Paper Award**
* Atom: Low-bit Quantization for Efficient and Accurate LLM Serving \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/5edb57c05c81d04beb716ef1d542fe9e-Paper-Conference.pdf)] \[[Code](https://github.com/efeslab/Atom)] \[[Slides](https://github.com/efeslab/Atom/blob/main/figures/atom_mlsys_slides.pdf)] \[[Poster](https://github.com/efeslab/Atom/blob/main/figures/atom_mlsys_poster.pdf)]
* UW
* Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf)] \[[Code](https://github.com/VITA-Group/Q-Hitter)]
* UT-Austin & Oxford & Eindhoven University of Technology & Lawrence Livermore National Laboratory & CMU
* ML training
* JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training \[[Paper](https://arxiv.org/pdf/2311.05034)] \[[Slides](https://mlsys.org/media/mlsys-2024/Slides/2660.pdf)]
* AMD
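
As a rough illustration of the weight-only, per-group low-bit quantization these papers build on, here is a generic round-trip sketch in NumPy. It is not AWQ, Atom, or JIT-Q: it omits AWQ's activation-aware scale search, Atom's outlier/mixed-precision handling, and any kernel or processing-in-memory concerns; the group size and bit width are arbitrary choices for the example.

```python
import numpy as np

def quantize_per_group(w, group_size=128, n_bits=4):
    """Symmetric per-group weight quantization (generic sketch, not AWQ/Atom).

    Assumes w.size is divisible by group_size.
    """
    shape = w.shape
    w = w.reshape(-1, group_size)
    qmax = 2 ** (n_bits - 1) - 1                       # e.g. 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax + 1e-8
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q.reshape(shape), scale

def dequantize_per_group(q, scale, group_size=128):
    w_hat = q.astype(np.float32).reshape(-1, group_size) * scale
    return w_hat.reshape(q.shape)

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_per_group(w)
w_hat = dequantize_per_group(q, s)
print("mean abs error:", np.abs(w - w_hat).mean())
```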

### Model Adaptation

* FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms \[Paper] \[[Code](https://gitlab.engr.illinois.edu/DEPEND/flash)] \[[Slides](https://haoran-qiu.com/slides/flash-slides.pdf)]

### Cloud Configuration Generation

* CloudEval-YAML: A Practical Benchmark for Cloud Native YAML Configuration Generation \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2024/file/554e056fe2b6d9fd27ffcd3367ae1267-Paper-Conference.pdf)] \[[Homepage](https://cloudeval-yaml.github.io)] \[[Code](https://github.com/alibaba/CloudEval-YAML)] \[[Benchmark](https://huggingface.co/datasets/ai4cloud/CloudEval-YAML)]
* Alibaba Cloud & UMich & UCLA & UC Merced

## Acronyms

* ML: Machine Learning
* LLM: Large Language Model
* LoRA: Low-Rank Adaptation
* MoE: Mixture-of-Experts
4 changes: 2 additions & 2 deletions reading-notes/conference/osdi-2024.md
@@ -8,7 +8,7 @@ Paper list: [https://www.usenix.org/conference/osdi24/technical-sessions](https:

## Papers

### Serving Large Language Models (LLMs)
### Large Language Models (LLMs)

* Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve \[[Paper](https://www.usenix.org/conference/osdi24/presentation/agrawal)] \[[Code](https://github.com/microsoft/sarathi-serve)]
* MSR India & GaTech
@@ -97,7 +97,7 @@ Paper list: [https://www.usenix.org/conference/osdi24/technical-sessions](https:
* Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhai)] \[[Code](https://github.com/zhaiyi000/tlm)]
* USTC & Huawei & ByteDance & Hunan University
* Tensor Language Model (TLM)
* Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wang-lei)] \[[Code](https://github.com/microsoft/BitBLAS/tree/osdi24\_ladder\_artifact)]
* Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wang-lei)] \[[Code](https://github.com/microsoft/BitBLAS/tree/osdi24_ladder_artifact)]
* MSRA
* MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhuang)] \[[Code](https://github.com/AlibabaResearch/mononn)]
* Sydney & Alibaba
