GITBOOK-178: Update reading notes of OSDI '24 papers
mental2008 authored and gitbook-bot committed Jul 21, 2024
1 parent c555e54 commit aa4dcf4
Showing 1 changed file with 32 additions and 3 deletions.
35 changes: 32 additions & 3 deletions reading-notes/conference/osdi-2024.md
@@ -22,18 +22,32 @@ Paper list: [https://www.usenix.org/conference/osdi24/technical-sessions](https://www.usenix.org/conference/osdi24/technical-sessions)
* Use cost models to estimate the time of loading checkpoints from different tiers in the storage hierarchy and the time of migrating an LLM inference to another server; choose the best server to minimize model startup latency.
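
A minimal sketch of the idea above, assuming toy cost models and made-up server fields (`tier_bandwidths_gbps`, `victim_kv_gb`, etc.); this is not ServerlessLLM's implementation, only the estimate-and-pick-minimum pattern it describes:

```python
# Hypothetical sketch: compare loading a checkpoint from each server's fastest
# storage tier against live-migrating an in-flight inference away from a server
# that already has the model in GPU memory, and pick the cheaper option.

def estimated_load_time(ckpt_size_gb, tier_bandwidth_gbps):
    """Cost model: time to load a checkpoint from one storage tier."""
    return ckpt_size_gb / tier_bandwidth_gbps

def estimated_migration_time(kv_cache_gb, tokens_to_recompute, net_gbps, tok_per_s):
    """Cost model: time to move an in-flight inference to another server."""
    return kv_cache_gb / net_gbps + tokens_to_recompute / tok_per_s

def choose_server(servers, ckpt_size_gb):
    """Return (server_name, latency) with the smallest estimated startup latency."""
    best = None
    for s in servers:
        if s["has_model_in_gpu"]:
            # Migrating the blocking inference away frees this server's GPU copy.
            cost = estimated_migration_time(
                s["victim_kv_gb"], s["victim_tokens"], s["net_gbps"], s["tok_per_s"])
        else:
            # Otherwise load the checkpoint from the fastest local tier.
            cost = min(estimated_load_time(ckpt_size_gb, bw)
                       for bw in s["tier_bandwidths_gbps"])
        if best is None or cost < best[1]:
            best = (s["name"], cost)
    return best
```
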
* InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lee)]
* Seoul National University
* **InfiniGen**: a _KV cache management_ framework for _long-text generation_.
* Key insight: A few important tokens can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer.
* Prefetch only the essential KV cache entries instead of fetching them all, which mitigates the overhead of fetching the KV cache from host memory.
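
A rough sketch of the speculation step, assuming NumPy arrays and an invented `top_k` knob; the partial projection and scoring here only illustrate the rehearsal idea, not InfiniGen's actual kernels:

```python
# Hypothetical sketch: approximate the next layer's attention with a partial
# query projection and a partial key cache, then prefetch only the KV entries
# of the tokens with the highest speculated scores.

import numpy as np

def speculate_important_tokens(hidden, partial_wq, partial_keys, top_k):
    """Rank cached tokens by an approximate attention score and keep the top_k."""
    q = hidden @ partial_wq             # cheap partial query for the next layer
    scores = partial_keys @ q           # one approximate score per cached token
    return np.argsort(scores)[-top_k:]  # indices of the speculated-important tokens

def prefetch_kv(host_kv_cache, token_ids):
    """Copy only the selected KV entries from host memory to the GPU."""
    return {t: host_kv_cache[t] for t in token_ids}
```
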
* Llumnix: Dynamic Scheduling for Large Language Model Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sun-biao)] \[[Code](https://github.com/AlibabaPAI/llumnix)]
* Alibaba
* _Reschedule requests_ to improve load-balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs.
* Live migration for requests and the in-memory states (tokens).
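
A toy sketch of rescheduling by migration, with an assumed `imbalance_threshold` and per-request `load_share`; Llumnix's real policy also handles isolation, fragmentation, and priorities:

```python
# Hypothetical sketch: periodically migrate a request, together with its
# in-memory token state (KV cache), from the most loaded replica to the least
# loaded one whenever the load imbalance exceeds a threshold.

def rebalance(replicas, imbalance_threshold=0.2):
    """replicas: list of dicts with 'load' (0..1) and a 'requests' queue."""
    src = max(replicas, key=lambda r: r["load"])
    dst = min(replicas, key=lambda r: r["load"])
    if src["load"] - dst["load"] > imbalance_threshold and src["requests"]:
        req = src["requests"].pop()      # pick a victim request
        dst["requests"].append(req)      # live-migrate the request and its KV state
        dst["load"] += req["load_share"]
        src["load"] -= req["load_share"]
```
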
* DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin)] \[[Code](https://github.com/LLMServe/DistServe)]
* PKU & UCSD
* Disaggregate the prefill and decoding computation.
* Co-optimize the resource allocation and parallelism strategy for each phase; consider the cluster's bandwidth to minimize the communication overhead.
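
A simplified sketch of the co-optimization loop, assuming per-configuration estimates (`ttft_s`, `tpot_s`, `reqs_per_s`) and a single KV-transfer link; it is a stand-in for DistServe's planner, not the real one:

```python
# Hypothetical sketch: enumerate parallelism configurations separately for the
# prefill and decoding phases and keep the pair with the best estimated
# goodput, skipping pairs whose end-to-end latency (including the prefill ->
# decode KV-cache transfer over the cluster link) violates the SLO.

from itertools import product

def plan(prefill_configs, decode_configs, kv_gb_per_req, link_gbps, slo_latency_s):
    best = None
    for p, d in product(prefill_configs, decode_configs):
        transfer_s = kv_gb_per_req / link_gbps
        latency = p["ttft_s"] + transfer_s + d["tpot_s"]
        if latency > slo_latency_s:
            continue                                      # violates the latency SLO
        goodput = min(p["reqs_per_s"], d["reqs_per_s"])   # pipeline bottleneck
        if best is None or goodput > best[2]:
            best = (p, d, goodput)
    return best
```
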
* dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wu-bingyang)]
* PKU & Shanghai AI Lab
* A credit-based batching algorithm to decide when to _merge and unmerge_ LoRA adapters with the base model.
* A request-adapter co-migration algorithm to decide when to _migrate_ between different worker replicas.
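
A hypothetical credit-based policy in the spirit of the bullets above; the credit formula and `merge_threshold` are invented for illustration and are not dLoRA's algorithm:

```python
# Hypothetical sketch: accumulate "credits" reflecting how strongly the pending
# batch is skewed toward one LoRA adapter; merge that adapter into the base
# weights when its credit is high, and unmerge when traffic becomes mixed.

from collections import Counter

class CreditBatcher:
    def __init__(self, merge_threshold=8):
        self.credits = Counter()
        self.merged = None
        self.merge_threshold = merge_threshold

    def step(self, pending_adapters):
        """pending_adapters: list of adapter ids for the requests queued this round."""
        if not pending_adapters:
            return self.merged
        counts = Counter(pending_adapters)
        for adapter, n in counts.items():
            # Credit grows when one adapter dominates the queue, shrinks otherwise.
            self.credits[adapter] += n - (len(pending_adapters) - n)
        top, credit = max(self.credits.items(), key=lambda kv: kv[1])
        if self.merged is None and credit >= self.merge_threshold:
            self.merged = top      # merge this adapter's weights into the base model
        elif self.merged is not None and self.credits[self.merged] < 0:
            self.merged = None     # unmerge and serve mixed adapters separately
        return self.merged
```
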
* Parrot: Efficient Serving of LLM-based Applications with Semantic Variable \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lin-chaofan)] \[[Code](https://github.com/microsoft/ParrotServe)]
* SJTU & MSRA
* **Semantic Variable**: a unified abstraction to expose application-level knowledge to public LLM services.
* Annotate an input/output variable in the prompt of a request.
* Create the data pipeline when connecting multiple LLM requests.
* Enable conventional data flow analysis to uncover correlations across multiple LLM requests.
* Implemented in Python.
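
A sketch of how Semantic Variables might look from application code; the `SemanticVariable` class and `llm_request` helper are hypothetical, not Parrot's API, but they show how annotated inputs/outputs expose the request pipeline to the service:

```python
# Hypothetical sketch: mark prompt inputs/outputs as named variables so the
# serving system can see the data flow between LLM requests and schedule the
# whole pipeline instead of treating each request independently.

class SemanticVariable:
    def __init__(self, name, value=None):
        self.name, self.value = name, value

def llm_request(template, **variables):
    """Placeholder for a call to an LLM service; returns an output SemanticVariable."""
    prompt = template.format(**{k: v.value for k, v in variables.items()})
    out = SemanticVariable(name="out_of_" + "_".join(variables))
    # A real system would submit `prompt` asynchronously and fill `out.value`;
    # exposing the variables lets it build the request DAG before execution.
    out.value = f"<llm({prompt})>"
    return out

# Two connected requests form a pipeline the scheduler can analyze:
article = SemanticVariable("article", "Some long article text ...")
summary = llm_request("Summarize: {article}", article=article)
answer = llm_request("Answer using this summary: {summary}. Q: Is it positive?",
                     summary=summary)
```
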
* Fairness in Serving Large Language Models \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sheng)] \[[Code](https://github.com/Ying1123/VTC-artifact)]
* UC Berkeley
* This is the _first_ work to discuss the _fair serving_ of LLMs.
* Propose a fair-serving algorithm called **Virtual Token Counter** (**VTC**).
* Track the service received by each client.
* Prioritize the clients that have received the least service.
* Only manipulate the dispatch order and don't reject a request if it can fit in the batch.
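
A simplified, least-served-first interpretation of VTC; the token accounting and batching details are reduced to a toy scheduler and do not reproduce the paper's exact counter updates:

```python
# Hypothetical sketch: track tokens served per client and fill each batch in
# least-served-first order, never rejecting a request that still fits.

from collections import defaultdict

class VTCScheduler:
    def __init__(self):
        self.served_tokens = defaultdict(int)   # virtual counter per client
        self.queues = defaultdict(list)         # waiting requests per client

    def submit(self, client, request):
        self.queues[client].append(request)

    def dispatch(self, batch_budget_tokens):
        """Fill the batch in least-served-first order; only the order is manipulated."""
        batch = []
        while batch_budget_tokens > 0:
            waiting = [c for c, q in self.queues.items() if q]
            if not waiting:
                break
            client = min(waiting, key=lambda c: self.served_tokens[c])
            req = self.queues[client][0]
            if req["tokens"] > batch_budget_tokens:
                break                            # does not fit; stop filling the batch
            self.queues[client].pop(0)
            batch.append((client, req))
            batch_budget_tokens -= req["tokens"]
            self.served_tokens[client] += req["tokens"]
        return batch
```
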
@@ -42,26 +42,41 @@ Paper list: [https://www.usenix.org/conference/osdi24/technical-sessions](https://www.usenix.org/conference/osdi24/technical-sessions)

* Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences \[[Paper](https://www.usenix.org/conference/osdi24/presentation/kumar)]
* Meta Platforms
* Main challenges for a resource-allocation framework.
* Usability: how to translate real-life policies into precise mathematical formulas.
* Scalability: the underlying allocation problems are NP-hard and cannot be solved efficiently by commercial solvers.
* **Rebalancer**: Meta's resource-allocation framework.
* An expression graph that enables its optimization algorithm to run more efficiently than past algorithms (for scalability).
* A high-level specification language to lower the barrier for adoption by system practitioners (for usability).

### Job Scheduling

* When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling \[[Paper](https://www.usenix.org/conference/osdi24/presentation/bin-faisal)] \[[Code](https://github.com/TuftsNATLab/PCS)]
* Tufts
* PCS: Predictability-Centric Scheduling
* Use Weighted Fair Queueing (WFQ) and find a suitable configuration of the WFQ parameters (e.g., queue weights).
* Use a simulation-aided search strategy to discover WFQ configurations.
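
A toy version of the simulation-aided search, assuming an oversimplified WFQ simulator and an invented predictability/speed score; the actual PCS system uses a much richer simulator:

```python
# Hypothetical sketch: score candidate WFQ weight vectors with a simple
# simulator and keep the configuration that best trades off predictability
# (bounded worst-case completion time) against average completion time.

import random

def simulate(weights, jobs, capacity=1.0):
    """Toy WFQ model: each queue gets capacity proportional to its weight.
    jobs: list of (queue_index, work). Returns per-job completion times."""
    total = sum(weights)
    return [work / (capacity * weights[q] / total) for q, work in jobs]

def search_weights(jobs, num_queues=3, trials=200, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        weights = [rng.uniform(0.1, 1.0) for _ in range(num_queues)]
        times = simulate(weights, jobs)
        avg, worst = sum(times) / len(times), max(times)
        score = 0.5 * avg + 0.5 * worst   # proxy: balance speed and predictability
        if best is None or score < best[0]:
            best = (score, weights)
    return best[1]
```
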
* MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale \[[Paper](https://www.usenix.org/conference/osdi24/presentation/choudhury)]
* Meta Platforms
* MAST: ML Application Scheduler on Twine
* Provide a global-scheduling abstraction to all ML training workloads.
* Three design principles: temporal decoupling, scope decoupling, and exhaustive search.

### Auto Parallelization

* nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lin-zhiqi)] \[[Code](https://github.com/microsoft/nnscaler)]
* USTC & MSRA & xAI & BaseBit Technologies
* Empower domain experts to construct their own search space through three primitives, `op-trans`, `op-assign`, and `op-order`.
* Allow the application of constraints to those primitives during space construction.
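
A hypothetical rendering of the three primitives as Python helpers (`op_trans`, `op_assign`, `op_order`); the data structures and the example constraint are made up to show how constraints prune the constructed space:

```python
# Hypothetical sketch: a domain expert partitions an operator (op-trans), pins
# the partitions to devices (op-assign), and fixes their temporal order
# (op-order); a constraint restricts which plans are admissible.

def op_trans(op, dim, parts):
    """Partition an operator along one dimension into `parts` sub-operators."""
    return [{"op": op, "dim": dim, "shard": i, "parts": parts} for i in range(parts)]

def op_assign(sub_ops, devices):
    """Assign each sub-operator to a device (round-robin here)."""
    return [dict(s, device=devices[i % len(devices)]) for i, s in enumerate(sub_ops)]

def op_order(assigned, priority):
    """Fix the temporal order of sub-operators."""
    return sorted(assigned, key=priority)

# Example constraint: only allow sharding matmuls along the batch dimension.
def allowed(plan):
    return all(s["dim"] == "batch" for s in plan if s["op"] == "matmul")

plan = op_order(op_assign(op_trans("matmul", "batch", 4), ["gpu0", "gpu1"]),
                priority=lambda s: s["shard"])
assert allowed(plan)
```
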

### Machine Learning Inference

* Usher: Holistic Interference Avoidance for Resource Optimized ML Inference \[[Paper](https://www.usenix.org/conference/osdi24/presentation/shubha)]
* Usher: Holistic Interference Avoidance for Resource Optimized ML Inference \[[Paper](https://www.usenix.org/conference/osdi24/presentation/shubha)] \[[Code](https://github.com/ss7krd/Usher)]
* UVA & GaTech
* Usher: an interference-aware ML serving system to maximize resource utilization (GPU spatial multiplexing).
* GPU kernel-based model resource requirement estimator.
* Heuristic-based, interference-aware scheduler that maximizes resource utilization by deciding the batch size, model replication degree, and model placement.
* Operator graph merger that combines the operator graphs of multiple models to minimize interference in the GPU cache.
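
A greedy placement sketch with assumed per-model estimates (`est_util`, `est_mem_gb`, `qps_per_replica`); Usher's actual scheduler is more holistic, and this only illustrates the utilization-packing idea:

```python
# Hypothetical sketch: place replicas of the most compute-hungry models first,
# packing each replica onto the GPU with the most spare capacity that still
# fits its estimated compute and memory footprint.

def schedule(models, gpus):
    """models: dicts with 'name', 'est_util', 'est_mem_gb', 'target_qps', 'qps_per_replica'.
    gpus: dicts with 'name', 'free_util' (0..1), 'free_mem_gb'. Returns (model, gpu) pairs."""
    placements = []
    for m in sorted(models, key=lambda m: m["est_util"], reverse=True):
        replicas_needed = int(-(-m["target_qps"] // m["qps_per_replica"]))  # ceil division
        for _ in range(replicas_needed):
            fits = [g for g in gpus
                    if g["free_util"] >= m["est_util"] and g["free_mem_gb"] >= m["est_mem_gb"]]
            if not fits:
                break                                    # out of capacity for this model
            g = max(fits, key=lambda g: g["free_util"])  # pack onto the emptiest GPU
            g["free_util"] -= m["est_util"]
            g["free_mem_gb"] -= m["est_mem_gb"]
            placements.append((m["name"], g["name"]))
    return placements
```
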

### Tensor Program Generation
