diff --git a/README.md b/README.md
index 740368b..42cd7e3 100644
--- a/README.md
+++ b/README.md
@@ -18,6 +18,7 @@ Specifically, I have a broad interest in systems (e.g., OSDI, SOSP, NSDI, ATC, E
 
 ## Changelogs
 
+* 08/2024: Update the reading notes of [SIGCOMM 2024](reading-notes/conference/sigcomm-2024.md).
 * 07/2024: Organize the papers of [SIGCOMM 2024](reading-notes/conference/sigcomm-2024.md), [ICML 2024](reading-notes/conference/icml-2024.md), [ATC 2024](reading-notes/conference/atc-2024.md), [OSDI 2024](reading-notes/conference/osdi-2024.md), [NSDI 2024](reading-notes/conference/nsdi-2024.md), [CVPR 2024](reading-notes/conference/cvpr-2024.md), [ISCA 2024](reading-notes/conference/isca-2024.md); create a new paper list of [Systems for diffusion models](paper-list/systems-for-ml/diffusion-models.md); update the paper list of [Systems for LLMs](paper-list/systems-for-ml/llm.md), [Systems for DLRMs](paper-list/systems-for-ml/dlrm.md), [Resource Scheduler](paper-list/systems-for-ml/resource-scheduler.md).
 
 ## Epilogue
diff --git a/paper-list/systems-for-ml/llm.md b/paper-list/systems-for-ml/llm.md
index b260426..98017b2 100644
--- a/paper-list/systems-for-ml/llm.md
+++ b/paper-list/systems-for-ml/llm.md
@@ -26,7 +26,7 @@ I am actively maintaining this list.
 
 ## LLM Inference
 
-* CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving ([SIGCOMM 2024](../../reading-notes/conference/sigcomm-2024.md)) \[[arXiv](https://arxiv.org/abs/2310.07240)] \[[Code](https://github.com/UChi-JCL/CacheGen)]
+* CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving ([SIGCOMM 2024](../../reading-notes/conference/sigcomm-2024.md)) \[[arXiv](https://arxiv.org/abs/2310.07240)] \[[Code](https://github.com/UChi-JCL/CacheGen)] \[[Video](https://www.youtube.com/watch?v=H4\_OUWvdiNo)]
   * UChicago & Microsoft & Stanford
 * Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve ([OSDI 2024](../../reading-notes/conference/osdi-2024.md)) \[[Paper](https://www.usenix.org/conference/osdi24/presentation/agrawal)] \[[Code](https://github.com/microsoft/sarathi-serve)] \[[arXiv](https://arxiv.org/abs/2403.02310)]
   * MSR India & GaTech
diff --git a/reading-notes/conference/README.md b/reading-notes/conference/README.md
index 0f9887e..bd3118d 100644
--- a/reading-notes/conference/README.md
+++ b/reading-notes/conference/README.md
@@ -7,8 +7,8 @@
 | SoCC 2024 | Nov 22-24, 2024 | Seattle, Washington, USA | **Upcoming** |
 | SC 2024 | Nov 17-22, 2024 | Atlanta, GA, USA | **Upcoming** |
 | SOSP 2024 | Nov 4-6, 2024 | Hilton Austin, Texas, USA | **Upcoming** |
-| [SIGCOMM 2024](sigcomm-2024.md) | Aug 4-8, 2024 | Sydney, Australia | **Upcoming** |
-| [ICML 2024](icml-2024.md) | Jul 21-27, 2024 | Messe Wien Exhibition Congress Center, Vienna, Austria | 👀**Ongoing!** |
+| [SIGCOMM 2024](sigcomm-2024.md) | Aug 4-8, 2024 | Sydney, Australia | 🧐 |
+| [ICML 2024](icml-2024.md) | Jul 21-27, 2024 | Messe Wien Exhibition Congress Center, Vienna, Austria | |
 | [ATC 2024](atc-2024.md) | Jul 10-12, 2024 | Santa Clara, CA, USA | 🧐; co-located with [OSDI 2024](osdi-2024.md) |
 | [OSDI 2024](osdi-2024.md) | Jul 10-12, 2024 | Santa Clara, CA, USA | 🧐; co-located with [ATC 2024](atc-2024.md) |
 | [ISCA 2024](isca-2024.md) | Jun 29-Jul 3, 2024 | Buenos Aires, Argentina | 🧐 |
diff --git a/reading-notes/conference/sigcomm-2024.md b/reading-notes/conference/sigcomm-2024.md
index a235070..2be8a46 100644
--- a/reading-notes/conference/sigcomm-2024.md
+++ b/reading-notes/conference/sigcomm-2024.md
@@ -4,43 +4,68 @@
 
 Homepage: [https://conferences.sigcomm.org/sigcomm/2024/](https://conferences.sigcomm.org/sigcomm/2024/)
 
-Paper list: [https://conferences.sigcomm.org/sigcomm/2024/program/](https://conferences.sigcomm.org/sigcomm/2024/program/)
+### Paper list
+
+* [https://conferences.sigcomm.org/sigcomm/2024/program/](https://conferences.sigcomm.org/sigcomm/2024/program/)
+* [https://dl.acm.org/doi/proceedings/10.1145/3651890](https://dl.acm.org/doi/proceedings/10.1145/3651890)
 
 ## Papers
 
 ### Large Language Models (LLMs)
 
 * Systems/Networking for LLM
-  * CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving \[[arXiv](https://arxiv.org/abs/2310.07240)] \[[Code](https://github.com/UChi-JCL/CacheGen)]
+  * CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving \[[Paper](https://dl.acm.org/doi/10.1145/3651890.3672274)] \[[arXiv](https://arxiv.org/abs/2310.07240)] \[[Code](https://github.com/UChi-JCL/CacheGen)] \[[Video](https://www.youtube.com/watch?v=H4\_OUWvdiNo)]
     * UChicago & Microsoft & Stanford
+    * **CacheGen**: A context-loading module for LLM systems.
     * Use a custom tensor encoder to encode a KV cache into more compact bitstream representations with negligible decoding overhead.
     * Adapt the compression level of different parts of a KV cache to cope with changes in available bandwidth.
-    * Focus on reducing the network delay in fetching the KV cache. → TTFT reduction.
+    * Objective: Reduce the network delay in fetching the KV cache → TTFT reduction (an illustrative sketch of the bandwidth-adaptive idea follows this section).
-  * Alibaba HPN: A Data Center Network for Large Language Model Training
+  * Alibaba HPN: A Data Center Network for Large Language Model Training \[[Paper](https://doi.org/10.1145/3651890.3672265)] \[[Video](https://www.youtube.com/watch?v=s-3VLs9sd10)]
     * Alibaba Cloud
     * Experience Track
+    * Characteristics of LLM training:
+      * Produce a small number of periodic, bursty flows (e.g., 400 Gbps) on each host.
+      * Require GPUs to complete iterations in synchronization, making training more sensitive to single-point failures.
+    * Alibaba High-Performance Network (**HPN**): Introduce a 2-tier, dual-plane architecture capable of interconnecting 15K GPUs within one Pod.
+      * Benefits: eliminate hash polarization; simplify optimal path selection.
+  * RDMA over Ethernet for Distributed Training at Meta Scale \[[Paper](https://dl.acm.org/doi/10.1145/3651890.3672233)] \[[Blog](https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/)]
+    * Meta
+    * Experience Track
+    * Deploy a combination of centralized traffic engineering and an Enhanced ECMP (Equal-Cost Multi-Path) scheme to achieve optimal load distribution for training workloads.
+    * Design receiver-driven traffic admission via the collective library -> Co-tune both the collective library configuration and the underlying network configuration.
 * LLMs for Networking
-  * NetLLM: Adapting Large Language Models for Networking
+  * NetLLM: Adapting Large Language Models for Networking \[[Paper](https://dl.acm.org/doi/10.1145/3651890.3672268)]
     * CUHK-Shenzhen & Tsinghua SIGS & UChicago
+    * **NetLLM**: Empower the LLM to process multimodal data in networking and generate task-specific answers.
+    * Study three networking-related use cases: viewport prediction, adaptive bitrate streaming, and cluster job scheduling.
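
The CacheGen notes above describe adapting the compression level of different parts of the KV cache to the currently available bandwidth so that the fetch fits a TTFT budget. The snippet below is only a minimal, illustrative sketch of that idea; the `Chunk` structure, the per-level sizes, the greedy downgrade loop, and the cost model are assumptions for illustration, not CacheGen's actual encoder or algorithm.

```python
# Illustrative sketch (not CacheGen's actual code): pick a compression level per
# KV-cache chunk so that the estimated fetch time stays within a TTFT budget.
# All names, sizes, and the cost model here are assumptions for illustration.

from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """One slice of the KV cache, available at several compression levels."""
    sizes_bytes: List[int]  # index 0 = highest quality (largest), last = most compressed


def pick_levels(chunks: List[Chunk], bandwidth_bps: float, ttft_budget_s: float) -> List[int]:
    """Start every chunk at the highest-quality encoding, then step chunks down
    to smaller encodings until the estimated transfer time fits the budget."""
    levels = [0] * len(chunks)

    def transfer_time() -> float:
        total_bytes = sum(c.sizes_bytes[l] for c, l in zip(chunks, levels))
        return total_bytes * 8 / bandwidth_bps

    while transfer_time() > ttft_budget_s:
        # Greedily downgrade the chunk whose next level saves the most bytes.
        candidates = [
            (c.sizes_bytes[l] - c.sizes_bytes[l + 1], i)
            for i, (c, l) in enumerate(zip(chunks, levels))
            if l + 1 < len(c.sizes_bytes)
        ]
        if not candidates:
            break  # already at the most compressed encoding everywhere
        _, idx = max(candidates)
        levels[idx] += 1
    return levels


if __name__ == "__main__":
    # Hypothetical numbers: 8 chunks, 3 encodings each, 200 Mbps link, 0.5 s budget.
    chunks = [Chunk([4_000_000, 2_500_000, 1_200_000]) for _ in range(8)]
    print(pick_levels(chunks, bandwidth_bps=200e6, ttft_budget_s=0.5))
```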
 
 ### Distributed Training
 
-* Crux: GPU-Efficient Communication Scheduling for Deep Learning Training \[[Dataset](https://github.com/alibaba/alibaba-lingjun-dataset-2023)]
+* Crux: GPU-Efficient Communication Scheduling for Deep Learning Training \[[Paper](https://dl.acm.org/doi/10.1145/3651890.3672239)] \[[Dataset](https://github.com/alibaba/alibaba-lingjun-dataset-2023)]
   * Alibaba Cloud
-* RDMA over Ethernet for Distributed Training at Meta Scale
-  * Meta
-  * Experience Track
-* Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs
+  * Observation: Communication contention among different deep learning training (DLT) jobs seriously degrades overall GPU computation utilization -> Low efficiency of the training cluster.
+  * **Crux**: A communication scheduler
+    * Objective: Mitigate the communication contention among DLT jobs -> Maximize GPU computation utilization.
+    * Designs: reduce the GPU utilization problem to a flow optimization problem; GPU intensity-aware communication scheduling; prioritize DLT flows with high GPU computation intensity (see the illustrative sketch at the end of these notes).
+* Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs \[[Paper](https://dl.acm.org/doi/10.1145/3651890.3672228)]
   * KAIST & UC Irvine & VMware Research
+  * Cache-aware gradient compression; a CPU-based sparse optimizer.
+  * Adapt training configurations to fluctuating network bandwidth -> Enable co-training across on-premises and cloud clusters.
 
 ### Data Processing
 
-* Turbo: Efficient Communication Framework for Large-scale Data Processing Cluster
+* Turbo: Efficient Communication Framework for Large-scale Data Processing Cluster \[[Paper](https://dl.acm.org/doi/10.1145/3651890.3672241)]
   * Tencent & FDU & NVIDIA & THU
-  * Experience Track
+  * Experience Track
+  * Network throughput & scalability: A dynamic block-level flowlet transmission mechanism; a non-blocking communication middleware.
+  * System reliability: Utilize an external shuffle service, with TCP serving as a backup.
+  * Integrated into Apache Spark.
 
 ### Data Transfers
 
-* An exabyte a day: Throughput-oriented, Large-scale, Managed Data Transfers with Effingo
+* An exabyte a day: Throughput-oriented, Large-scale, Managed Data Transfers with Effingo \[[Paper](https://dl.acm.org/doi/10.1145/3651890.3672262)]
   * Google
   * Experience Track
+  * **Effingo**: A copy system, integrated with resource management and authorization systems.
+  * Per-cluster deployments -> Limit failure domains to individual clusters.
+  * Separation from the bandwidth management layer (BwE) -> A modular design that reduces dependencies.
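
The Crux notes above mention GPU intensity-aware communication scheduling that prioritizes the flows of DLT jobs with high GPU computation intensity. Below is a minimal, hypothetical sketch of what such a priority assignment could look like; the `DLTJob` fields, the intensity metric, and the bucketing into priority classes are assumptions for illustration, not Crux's actual design.

```python
# Illustrative sketch (assumptions, not Crux's algorithm): rank DLT jobs by a
# simple "GPU computation intensity" metric and map their flows onto a small
# number of network priority classes, so that jobs whose GPUs waste the most
# compute while blocked on communication get their traffic scheduled first.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DLTJob:
    name: str
    num_gpus: int
    compute_time_per_iter_s: float  # time GPUs spend computing per iteration
    comm_time_per_iter_s: float     # time blocked on communication per iteration


def gpu_intensity(job: DLTJob) -> float:
    """GPU-seconds of compute at stake per second of communication: a higher
    value means contention on this job's flows wastes more GPU time."""
    return job.num_gpus * job.compute_time_per_iter_s / max(job.comm_time_per_iter_s, 1e-9)


def assign_priorities(jobs: List[DLTJob], num_classes: int = 4) -> Dict[str, int]:
    """Sort jobs by intensity and bucket them into priority classes
    (0 = highest priority, num_classes - 1 = lowest)."""
    ranked = sorted(jobs, key=gpu_intensity, reverse=True)
    priorities: Dict[str, int] = {}
    for rank, job in enumerate(ranked):
        priorities[job.name] = min(rank * num_classes // max(len(ranked), 1), num_classes - 1)
    return priorities


if __name__ == "__main__":
    # Hypothetical jobs used only to exercise the sketch.
    jobs = [
        DLTJob("llm-pretrain", num_gpus=1024, compute_time_per_iter_s=2.0, comm_time_per_iter_s=0.5),
        DLTJob("rec-model", num_gpus=64, compute_time_per_iter_s=0.3, comm_time_per_iter_s=0.4),
        DLTJob("vision-finetune", num_gpus=8, compute_time_per_iter_s=0.2, comm_time_per_iter_s=0.1),
    ]
    print(assign_priorities(jobs))
```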