From aa4dcf46dcf55f3e9f9ed2dd5042d94961055b69 Mon Sep 17 00:00:00 2001
From: Lingyun Yang
Date: Sun, 21 Jul 2024 13:05:38 +0000
Subject: [PATCH] GITBOOK-178: Update reading notes of OSDI '24 papers

---
 reading-notes/conference/osdi-2024.md | 35 ++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/reading-notes/conference/osdi-2024.md b/reading-notes/conference/osdi-2024.md
index 5b75253..b368adf 100644
--- a/reading-notes/conference/osdi-2024.md
+++ b/reading-notes/conference/osdi-2024.md
@@ -22,18 +22,32 @@ Paper list: [https://www.usenix.org/conference/osdi24/technical-sessions](https:
   * Use cost models to estimate the time of loading checkpoints from different tiers in the storage hierarchy and the time of migrating an LLM inference to another server; choose the best server to minimize model startup latency.
 * InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lee)]
   * Seoul National University
+  * **InfiniGen**: a _KV cache management_ framework for _long-text generation_.
+  * Key insight: a few important tokens can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer.
+  * Prefetch only the essential KV cache entries instead of fetching them all -> mitigate the fetch overhead from the host memory.
 * Llumnix: Dynamic Scheduling for Large Language Model Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sun-biao)] \[[Code](https://github.com/AlibabaPAI/llumnix)]
   * Alibaba
+  * _Reschedule requests_ to improve load balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs.
+  * Live migration of requests and their in-memory states (tokens).
 * DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin)] \[[Code](https://github.com/LLMServe/DistServe)]
   * PKU & UCSD
+  * Disaggregate the prefill and decoding computation.
+  * Co-optimize the resource allocation and parallelism strategy for each phase; consider the cluster's bandwidth to minimize the communication overhead.
 * dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wu-bingyang)]
   * PKU & Shanghai AI Lab
+  * A credit-based batching algorithm to decide when to _merge and unmerge_ LoRA adapters with the base model.
+  * A request-adapter co-migration algorithm to decide when to _migrate_ requests and adapters between different worker replicas.
 * Parrot: Efficient Serving of LLM-based Applications with Semantic Variable \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lin-chaofan)] \[[Code](https://github.com/microsoft/ParrotServe)]
   * SJTU & MSRA
+  * **Semantic Variable**: a unified abstraction to expose application-level knowledge to public LLM services.
+    * Annotate an input/output variable in the prompt of a request.
+    * Create the data pipeline when connecting multiple LLM requests.
+    * Enable conventional data-flow analysis to uncover the correlations across multiple LLM requests.
+  * Implemented in Python.
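+  * A minimal sketch of the idea, assuming a hypothetical `SemanticVariable`/`LLMRequest` API (not Parrot's actual interface): two requests share one variable object, so the service can recover the data-flow edge between them instead of seeing two opaque prompts.
+
+    ```python
+    # Hypothetical sketch, not Parrot's real API: the output variable of one
+    # LLM request is the input variable of the next, exposing the pipeline.
+    from dataclasses import dataclass, field
+
+    @dataclass
+    class SemanticVariable:
+        name: str
+        value: str | None = None  # filled in once the producing request finishes
+
+    @dataclass
+    class LLMRequest:
+        template: str                                  # prompt with {placeholders}
+        inputs: dict[str, SemanticVariable] = field(default_factory=dict)
+        output: SemanticVariable | None = None
+
+    summary = SemanticVariable("summary")
+    req1 = LLMRequest("Summarize the article: {article}",
+                      inputs={"article": SemanticVariable("article", "...")},
+                      output=summary)
+    req2 = LLMRequest("Translate this summary into French: {summary}",
+                      inputs={"summary": summary})
+
+    # req2 consumes the very object that req1 produces, so a dependency edge
+    # req1 -> req2 (and a cross-request data-flow graph) can be derived.
+    print("req2 depends on req1:", req2.inputs["summary"] is req1.output)  # True
+    ```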
 * Fairness in Serving Large Language Models \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sheng)] \[[Code](https://github.com/Ying1123/VTC-artifact)]
   * UC Berkeley
   * This is the _first_ work to discuss the _fair serving_ of LLMs.
-  * Propose a fair-serving algorithm called Virtual Token Counter (VTC).
+  * Propose a fair-serving algorithm called **Virtual Token Counter** (**VTC**).
     * Track the services received by each client.
     * Prioritize the ones with the least service received.
     * Only manipulate the dispatch order and don't reject a request if it can fit in the batch.
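+    * A toy sketch of the counter-based dispatch idea (heavily simplified: a single token count per request stands in for the paper's weighted input/output token accounting):
+
+      ```python
+      # Toy illustration of Virtual-Token-Counter-style dispatch, not the
+      # paper's full algorithm: serve the least-served client first, and only
+      # reorder; a request that fits in the batch is never rejected.
+      from collections import defaultdict
+
+      class VTCScheduler:
+          def __init__(self):
+              self.counter = defaultdict(int)   # tokens served so far, per client
+              self.queue = []                   # pending (client_id, n_tokens)
+
+          def submit(self, client_id, n_tokens):
+              self.queue.append((client_id, n_tokens))
+
+          def dispatch(self, budget):
+              """Fill the next batch, least-served clients first; requests that
+              do not fit are skipped (kept queued), never rejected."""
+              batch, remaining = [], budget
+              for req in sorted(self.queue, key=lambda r: self.counter[r[0]]):
+                  client, n_tokens = req
+                  if n_tokens <= remaining:
+                      batch.append(req)
+                      remaining -= n_tokens
+                      self.counter[client] += n_tokens   # account the service
+              for req in batch:
+                  self.queue.remove(req)
+              return batch
+
+      sched = VTCScheduler()
+      sched.counter["alice"] = 500        # alice has already received much service
+      sched.submit("alice", 80)
+      sched.submit("bob", 30)
+      print(sched.dispatch(budget=100))   # [('bob', 30)]; alice stays queued
+      ```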
@@ -42,26 +56,41 @@ Paper list: [https://www.usenix.org/conference/osdi24/technical-sessions](https:
 * Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences \[[Paper](https://www.usenix.org/conference/osdi24/presentation/kumar)]
   * Meta Platforms
-  * **Rebalancer**
+  * Main challenges for a resource-allocation framework:
+    * Usability: how to translate real-life policies into precise mathematical formulas.
+    * Scalability: the underlying problems are NP-hard and cannot be solved efficiently by commercial solvers.
+  * **Rebalancer**: Meta's resource-allocation framework.
+    * An expression graph that enables its optimization algorithm to run more efficiently than past algorithms (for scalability).
+    * A high-level specification language to lower the barrier for adoption by system practitioners (for usability).
 
 ### Job Scheduling
 
 * When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling \[[Paper](https://www.usenix.org/conference/osdi24/presentation/bin-faisal)] \[[Code](https://github.com/TuftsNATLab/PCS)]
   * Tufts
   * PCS: Predictability-Centric Scheduling
+  * Use Weighted Fair Queueing (WFQ) and find a suitable configuration of the WFQ parameters (e.g., queue weights).
+  * Use a simulation-aided search strategy to discover WFQ configurations.
 * MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale \[[Paper](https://www.usenix.org/conference/osdi24/presentation/choudhury)]
   * Meta Platforms
+  * MAST: ML Application Scheduler on Twine
+  * Provide a global-scheduling abstraction to all ML training workloads.
   * Three design principles: temporal decoupling, scope decoupling, and exhaustive search.
 
 ### Auto Parallelization
 
 * nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lin-zhiqi)] \[[Code](https://github.com/microsoft/nnscaler)]
   * USTC & MSRA & xAI & BaseBit Technologies
+  * Empower domain experts to construct their own search space through three primitives: `op-trans`, `op-assign`, and `op-order`.
+  * Allow constraints to be applied to those primitives during space construction.
 
 ### Machine Learning Inference
 
-* Usher: Holistic Interference Avoidance for Resource Optimized ML Inference \[[Paper](https://www.usenix.org/conference/osdi24/presentation/shubha)]
+* Usher: Holistic Interference Avoidance for Resource Optimized ML Inference \[[Paper](https://www.usenix.org/conference/osdi24/presentation/shubha)] \[[Code](https://github.com/ss7krd/Usher)]
   * UVA & GaTech
+  * Usher: an interference-aware ML serving system that maximizes resource utilization (via GPU spatial multiplexing).
+  * A GPU kernel-based estimator of each model's resource requirements.
+  * A heuristic-based, interference-aware scheduler that maximizes resource utilization by deciding the batch size, model replication degree, and model placement.
+  * An operator graph merger that merges multiple models to minimize interference in the GPU cache.
 
 ### Tensor Program Generation