From aa4dcf46dcf55f3e9f9ed2dd5042d94961055b69 Mon Sep 17 00:00:00 2001
From: Lingyun Yang
Date: Sun, 21 Jul 2024 13:05:38 +0000
Subject: [PATCH] GITBOOK-178: Update reading notes of OSDI '24 papers

---
 reading-notes/conference/osdi-2024.md | 35 ++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/reading-notes/conference/osdi-2024.md b/reading-notes/conference/osdi-2024.md
index 5b75253..b368adf 100644
--- a/reading-notes/conference/osdi-2024.md
+++ b/reading-notes/conference/osdi-2024.md
@@ -22,18 +22,32 @@ Paper list: [https://www.usenix.org/conference/osdi24/technical-sessions](https:
   * Use cost models to estimate the time of loading checkpoints from different tiers in the storage hierarchy and the time of migrating an LLM inference to another server; choose the best server to minimize model startup latency.
 * InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lee)]
   * Seoul National University
+  * **InfiniGen**: a _KV cache management_ framework for _long-text generation_.
+  * Key insight: a few important tokens can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer.
+  * Prefetch only the essential KV cache entries instead of fetching them all -> mitigate the fetch overhead from the host memory.
 * Llumnix: Dynamic Scheduling for Large Language Model Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sun-biao)] \[[Code](https://github.com/AlibabaPAI/llumnix)]
   * Alibaba
+  * _Reschedule requests_ to improve load balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs.
+  * Live migration of requests and their in-memory states (tokens).
 * DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin)] \[[Code](https://github.com/LLMServe/DistServe)]
   * PKU & UCSD
+  * Disaggregate the prefill and decoding computation.
+  * Co-optimize the resource allocation and parallelism strategy for each phase; consider the cluster's bandwidth to minimize the communication overhead.
 * dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wu-bingyang)]
   * PKU & Shanghai AI Lab
+  * A credit-based batching algorithm to decide when to _merge and unmerge_ LoRA adapters with the base model.
+  * A request-adapter co-migration algorithm to decide when to _migrate_ requests and adapters between different worker replicas.
 * Parrot: Efficient Serving of LLM-based Applications with Semantic Variable \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lin-chaofan)] \[[Code](https://github.com/microsoft/ParrotServe)]
   * SJTU & MSRA
+  * **Semantic Variable**: a unified abstraction to expose application-level knowledge to public LLM services.
+    * Annotate an input/output variable in the prompt of a request.
+    * Create the data pipeline when connecting multiple LLM requests.
+    * Enable conventional data-flow analysis to uncover the correlations across multiple LLM requests.
+  * Implemented in Python.
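+  * A minimal sketch of the idea, assuming a hypothetical `SemanticVariable`/`LLMRequest` API (not Parrot's actual interface): two requests share one variable object, so the service can recover the data-flow edge between them instead of seeing two opaque prompts.
+
+    ```python
+    # Hypothetical sketch, not Parrot's real API: the output variable of one
+    # LLM request is the input variable of the next, exposing the pipeline.
+    from dataclasses import dataclass, field
+
+    @dataclass
+    class SemanticVariable:
+        name: str
+        value: str | None = None  # filled in once the producing request finishes
+
+    @dataclass
+    class LLMRequest:
+        template: str                                  # prompt with {placeholders}
+        inputs: dict[str, SemanticVariable] = field(default_factory=dict)
+        output: SemanticVariable | None = None
+
+    summary = SemanticVariable("summary")
+    req1 = LLMRequest("Summarize the article: {article}",
+                      inputs={"article": SemanticVariable("article", "...")},
+                      output=summary)
+    req2 = LLMRequest("Translate this summary into French: {summary}",
+                      inputs={"summary": summary})
+
+    # req2 consumes the very object that req1 produces, so a dependency edge
+    # req1 -> req2 (and a cross-request data-flow graph) can be derived.
+    print("req2 depends on req1:", req2.inputs["summary"] is req1.output)  # True
+    ```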
 * Fairness in Serving Large Language Models \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sheng)] \[[Code](https://github.com/Ying1123/VTC-artifact)]
   * UC Berkeley
   * This is the _first_ work to discuss the _fair serving_ of LLMs.
-  * Propose a fair-serving algorithm called Virtual Token Counter (VTC).
+  * Propose a fair-serving algorithm called **Virtual Token Counter** (**VTC**).
     * Track the services received by each client.
     * Prioritize the ones with the least service received.
     * Only manipulate the dispatch order and don't reject a request if it can fit in the batch.
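+    * A toy sketch of the counter-based dispatch idea (heavily simplified: a single token count per request stands in for the paper's weighted input/output token accounting):
+
+      ```python
+      # Toy illustration of Virtual-Token-Counter-style dispatch, not the
+      # paper's full algorithm: serve the least-served client first, and only
+      # reorder; a request that fits in the batch is never rejected.
+      from collections import defaultdict
+
+      class VTCScheduler:
+          def __init__(self):
+              self.counter = defaultdict(int)   # tokens served so far, per client
+              self.queue = []                   # pending (client_id, n_tokens)
+
+          def submit(self, client_id, n_tokens):
+              self.queue.append((client_id, n_tokens))
+
+          def dispatch(self, budget):
+              """Fill the next batch, least-served clients first; requests that
+              do not fit are skipped (kept queued), never rejected."""
+              batch, remaining = [], budget
+              for req in sorted(self.queue, key=lambda r: self.counter[r[0]]):
+                  client, n_tokens = req
+                  if n_tokens <= remaining:
+                      batch.append(req)
+                      remaining -= n_tokens
+                      self.counter[client] += n_tokens   # account the service
+              for req in batch:
+                  self.queue.remove(req)
+              return batch
+
+      sched = VTCScheduler()
+      sched.counter["alice"] = 500        # alice has already received much service
+      sched.submit("alice", 80)
+      sched.submit("bob", 30)
+      print(sched.dispatch(budget=100))   # [('bob', 30)]; alice stays queued
+      ```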
@@ -42,26 +56,41 @@ Paper list: [https://www.usenix.org/conference/osdi24/technical-sessions](https:
 * Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences \[[Paper](https://www.usenix.org/conference/osdi24/presentation/kumar)]
   * Meta Platforms
-  * **Rebalancer**
+  * Main challenges for a resource-allocation framework:
+    * Usability: how to translate real-life policies into precise mathematical formulas.
+    * Scalability: the underlying problems are NP-hard and cannot be solved efficiently by commercial solvers.
+  * **Rebalancer**: Meta's resource-allocation framework.
+    * An expression graph that enables its optimization algorithm to run more efficiently than past algorithms (for scalability).
+    * A high-level specification language to lower the barrier for adoption by system practitioners (for usability).
 
 ### Job Scheduling
 
 * When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling \[[Paper](https://www.usenix.org/conference/osdi24/presentation/bin-faisal)] \[[Code](https://github.com/TuftsNATLab/PCS)]
   * Tufts
   * PCS: Predictability-Centric Scheduling
+  * Use Weighted Fair Queueing (WFQ) and find a suitable configuration of the WFQ parameters (e.g., queue weights).
+  * Use a simulation-aided search strategy to discover WFQ configurations.
 * MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale \[[Paper](https://www.usenix.org/conference/osdi24/presentation/choudhury)]
   * Meta Platforms
+  * MAST: ML Application Scheduler on Twine
+  * Provide a global-scheduling abstraction to all ML training workloads.
   * Three design principles: temporal decoupling, scope decoupling, and exhaustive search.
 
 ### Auto Parallelization
 
 * nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lin-zhiqi)] \[[Code](https://github.com/microsoft/nnscaler)]
   * USTC & MSRA & xAI & BaseBit Technologies
+  * Empower domain experts to construct their own search space through three primitives: `op-trans`, `op-assign`, and `op-order`.
+  * Allow constraints to be applied to those primitives during space construction.
 
 ### Machine Learning Inference
 
-* Usher: Holistic Interference Avoidance for Resource Optimized ML Inference \[[Paper](https://www.usenix.org/conference/osdi24/presentation/shubha)]
+* Usher: Holistic Interference Avoidance for Resource Optimized ML Inference \[[Paper](https://www.usenix.org/conference/osdi24/presentation/shubha)] \[[Code](https://github.com/ss7krd/Usher)]
   * UVA & GaTech
+  * Usher: an interference-aware ML serving system that maximizes resource utilization (via GPU spatial multiplexing).
+  * A GPU kernel-based estimator of each model's resource requirements.
+  * A heuristic-based, interference-aware scheduler that maximizes resource utilization by deciding the batch size, model replication degree, and model placement.
+  * An operator graph merger that merges multiple models to minimize interference in the GPU cache.
 
 ### Tensor Program Generation