SC 2024

Meta Info

Homepage: https://sc24.conference-program.com

Paper list: https://dl.acm.org/doi/proceedings/10.5555/3703596

Papers

AI Infrastructure

  • Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning [Paper] [HAI Platform Code]
    • DeepSeek AI
    • Include network co-design, HFReduce (a collective communication library), HaiScale (optimized parallelism methods), the 3FS distributed file system, and the HAI Platform (task scheduling, fault tolerance).

Large Language Models (LLMs)

  • LLM inference
    • PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation [Paper] [Code]
      • Iowa State University & TU Darmstadt
      • Continuous Asynchronous Speculation: run single-token inference simultaneously with several speculative runs.
      • Early Inference Cancellation: skip the computation of invalidated runs (a minimal sketch of both ideas follows this list).
    • LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services [Paper] [Benchmark] [Code]
      • IBM Research
      • Learn a predictive model to recommend the most cost-effective hardware for a previously unseen LLM.
  • LLM fine-tuning
    • Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity [Paper] [Code]
      • MSRA & THU
  • LLM for anomaly detection
    • Large Language Models for Anomaly Detection in Computational Workflows: From Supervised Fine-Tuning to In-Context Learning [Paper] [Code] [Benchmark]
      • Argonne National Laboratory & USC & Oak Ridge National Laboratory
      • Investigate two approaches: (1) supervised fine-tuning, where pre-trained LLMs are fine-tuned on labeled data for sentence classification to identify anomalies; (2) in-context learning, where prompts containing task descriptions and examples guide LLMs in few-shot anomaly detection without fine-tuning (a prompt sketch follows this list).
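
A minimal toy sketch of PipeInfer's two ideas referenced above, continuous asynchronous speculation and early inference cancellation. The `draft_model`/`target_model` functions, the token arithmetic, and the thread-pool scheduling are illustrative stand-ins, not the paper's pipelined multi-node implementation.

```python
# Toy sketch: run authoritative single-token inference while speculative
# continuations are verified in the background; cancel them when invalidated.
from concurrent.futures import ThreadPoolExecutor

def draft_model(prefix):
    # Cheap, possibly wrong guess for the next token (toy stand-in).
    return (sum(prefix) + 1) % 50

def target_model(prefix):
    # Expensive, authoritative next-token computation (toy stand-in).
    return (sum(prefix) * 7 + 3) % 50

def generate(prompt, num_tokens=8, num_speculative=3):
    tokens = list(prompt)
    with ThreadPoolExecutor(max_workers=num_speculative + 1) as pool:
        while len(tokens) < len(prompt) + num_tokens:
            # Draft a short chain of guessed continuations.
            chain = []
            for _ in range(num_speculative):
                chain.append(draft_model(tokens + chain))
            # Continuous asynchronous speculation: verify speculative prefixes
            # in the background while the single-token inference runs.
            spec_futures = [
                pool.submit(target_model, tokens + chain[:k + 1])
                for k in range(num_speculative)
            ]
            true_next = target_model(tokens)  # non-speculative single token
            tokens.append(true_next)
            if chain[0] == true_next:
                # First guess verified: reuse the first speculative result.
                tokens.append(spec_futures[0].result())
            else:
                # Early inference cancellation: the speculative prefixes are
                # now invalid, so cancel their remaining work (best effort).
                for f in spec_futures:
                    f.cancel()
    return tokens

print(generate([1, 2, 3]))
```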
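
For the in-context learning approach described in the anomaly-detection entry above, the sketch below assembles a few-shot prompt from labeled workflow records. The record format, labels, and the `query_llm` stub are illustrative assumptions, not the paper's benchmark data or prompts.

```python
# Sketch of few-shot in-context anomaly detection for workflow execution
# records. The record format, labels, and query_llm() stub are illustrative.
FEW_SHOT_EXAMPLES = [
    ("task=align_reads runtime=312s exit_code=0 retries=0", "normal"),
    ("task=align_reads runtime=4980s exit_code=0 retries=3", "anomalous"),
    ("task=merge_bam runtime=95s exit_code=137 retries=1", "anomalous"),
]

def build_prompt(new_record: str) -> str:
    """Assemble a prompt with a task description plus labeled examples."""
    lines = [
        "You are given execution records from a computational workflow.",
        "Label each record as 'normal' or 'anomalous'.",
        "",
    ]
    for record, label in FEW_SHOT_EXAMPLES:
        lines += [f"Record: {record}", f"Label: {label}", ""]
    lines += [f"Record: {new_record}", "Label:"]
    return "\n".join(lines)

def query_llm(prompt: str) -> str:
    # Placeholder: plug in any chat/completions client here. No fine-tuning
    # is involved; the labeled examples live entirely in the prompt.
    raise NotImplementedError

if __name__ == "__main__":
    print(build_prompt("task=variant_call runtime=60s exit_code=139 retries=2"))
```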

Mixture-of-Experts (MoEs)

  • APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes [Paper] [Code]
    • SYSU

Deep Learning Recommendation Models (DLRMs)

  • Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching [Paper] [Code]
    • WHU & NVIDIA & UMacau
    • EcoRec: eliminate redundancy in TT (Tensor-Train) operations; use micro-batching with sorted indices to reduce memory (see the TT sketch after this list).
  • Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression [Paper] [Code]
    • Indiana University, Bloomington & Meta & University of Rochester & ICT, CAS
    • In-depth analysis of embedding data features; employ error-bounded lossy compression to reduce the communication data size (see the compression sketch after this list).
  • Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link [Paper] [Code]
    • UC Merced & SK Hynix
    • TECO: Tensor-CXL-Offload
    • Introduce a CXL-based interconnect that builds a cache-coherence domain spanning CPU memory and accelerator memory; offload tensors to CPU memory to save accelerator memory.
  • RecFlex: Enabling Feature Heterogeneity-Aware Optimization for Deep Recommendation Models with Flexible Schedules [Paper] [Code]
    • RUC & Microsoft & UCSD
    • Create fused kernels with distinct schedules for different feature fields.
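
As background for the EcoRec entry above, the sketch below shows a generic tensor-train (TT) embedding table: a large V x D table is stored as three small 3-D cores, and one row is reconstructed by contracting a slice of each core. The shapes, ranks, and naive reconstruction loop are illustrative; EcoRec's optimized, redundancy-free TT kernels are not reproduced here.

```python
# Generic TT embedding sketch: a V x D table with V = v1*v2*v3 and
# D = d1*d2*d3 is stored as three small cores instead of one dense matrix.
import numpy as np

v_dims, d_dims, ranks = (8, 10, 12), (4, 4, 8), (1, 16, 16, 1)
cores = [
    np.random.randn(ranks[k], v_dims[k], d_dims[k], ranks[k + 1]) * 0.1
    for k in range(3)
]

def tt_embedding_lookup(row: int) -> np.ndarray:
    """Reconstruct one embedding row (length d1*d2*d3) from the TT cores."""
    # Mixed-radix factorization of the row index into (i1, i2, i3).
    idx = []
    for v in reversed(v_dims):
        idx.append(row % v)
        row //= v
    idx.reverse()
    # Contract the selected core slices over the TT ranks.
    result = np.ones((1, 1))                        # shape (1, r0)
    for k, i in enumerate(idx):
        result = result @ cores[k][:, i, :, :].reshape(ranks[k], -1)
        result = result.reshape(-1, ranks[k + 1])   # (partial_dim, r_{k+1})
    return result.reshape(-1)                       # length d1*d2*d3

dense_params = np.prod(v_dims) * np.prod(d_dims)
tt_params = sum(c.size for c in cores)
print(f"dense: {dense_params} params, TT: {tt_params} params")
print(tt_embedding_lookup(123).shape)               # (128,)
```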
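
For the dual-level adaptive lossy compression entry above, the sketch below shows the basic building block of error-bounded lossy compression: uniform quantization with an absolute error bound, applied to an embedding-gradient tensor before it would be communicated. The bound, dtypes, and encoding are illustrative; the paper's adaptive, dual-level scheme is more involved.

```python
# Sketch of error-bounded lossy compression for embedding-gradient
# communication: uniform quantization with a user-chosen absolute bound.
import numpy as np

def compress(grad: np.ndarray, error_bound: float):
    # Quantize so that |decompress(compress(x)) - x| <= error_bound.
    step = 2.0 * error_bound
    codes = np.round(grad / step).astype(np.int16)   # 16-bit codes vs. fp32
    return codes, step

def decompress(codes: np.ndarray, step: float) -> np.ndarray:
    return codes.astype(np.float32) * step

grad = np.random.randn(4, 64).astype(np.float32) * 0.01
codes, step = compress(grad, error_bound=1e-3)
recon = decompress(codes, step)
assert np.max(np.abs(recon - grad)) <= 1e-3 + 1e-7   # bound is respected
print(f"compressed {grad.nbytes} bytes down to {codes.nbytes} bytes")
```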

Graph Transformer

  • TorchGT: A Holistic System for Large-Scale Graph Transformer Training [Paper] [Code]
    • NTU & Shanghai AI Lab & ZJU & SenseTime

Reinforcement Learning (RL)

  • Stellaris: Staleness-Aware Distributed Reinforcement Learning with Serverless Computing [Paper] [Code]
    • Stevens Institute of Technology & NEU & Stony Brook University & Missouri University of Science and Technology
    • Introduce a generic asynchronous learning paradigm.

Job Scheduling

  • PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters [Paper] [Code]
    • UW-Madison
    • Characterize which applications are more likely to suffer from performance variability; balance performance variability against locality so that jobs are placed on as few nodes as possible (see the placement sketch below).
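
A rough illustration of the trade-off PAL navigates: choosing GPUs with low measured variability versus packing a job onto as few nodes as possible. The cost model below (per-GPU variability scores, an alpha weight, exhaustive search) is a hypothetical stand-in, not the paper's actual policy.

```python
# Illustrative GPU placement balancing performance variability with locality.
from itertools import combinations

def place_job(gpus_needed, gpu_pool, alpha=0.5):
    """gpu_pool: list of (node_id, gpu_id, variability_score) for free GPUs.

    Return the GPU subset minimizing
        sum(variability) + alpha * (#distinct nodes used).
    Exhaustive search keeps the sketch simple; a real scheduler would not.
    """
    best, best_cost = None, float("inf")
    for subset in combinations(gpu_pool, gpus_needed):
        variability = sum(v for _, _, v in subset)
        nodes_used = len({node for node, _, _ in subset})
        cost = variability + alpha * nodes_used
        if cost < best_cost:
            best, best_cost = subset, cost
    return best

pool = [("n0", 0, 0.05), ("n0", 1, 0.40), ("n1", 0, 0.10), ("n1", 1, 0.12)]
print(place_job(2, pool))   # packs both GPUs on low-variability node n1
```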

Distributed Training

  • Optimizing Distributed ML Communication with Fused Computation-Collective Operations [Paper]
    • AMD
    • Develop three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address the communication overheads in DLRM, Transformer, and MoE model architectures (see the overlap sketch below).
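
A generic way to visualize the fused computation-collective idea: instead of finishing the whole matrix product and then calling AllReduce, the output is produced and reduced chunk by chunk so communication overlaps with the remaining computation. The PyTorch chunking sketch below is a host-side approximation under that assumption, not AMD's fused GPU kernels.

```python
# Overlap a matrix product with AllReduce by chunking the output rows:
# communication for finished chunks runs while later chunks are computed.
import os
import torch
import torch.distributed as dist

def matmul_allreduce_overlapped(a, b, num_chunks=4):
    rows = a.shape[0]
    out = torch.empty(rows, b.shape[1])
    handles = []
    chunk = (rows + num_chunks - 1) // num_chunks
    for start in range(0, rows, chunk):
        end = min(start + chunk, rows)
        out[start:end] = a[start:end] @ b            # compute one chunk
        handles.append(dist.all_reduce(out[start:end], async_op=True))
    for h in handles:                                # drain outstanding comm
        h.wait()
    return out

if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)
    a, b = torch.randn(64, 32), torch.randn(32, 16)
    ref = a @ b                                      # world_size=1: identity
    torch.testing.assert_close(matmul_allreduce_overlapped(a, b), ref)
    dist.destroy_process_group()
```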

Serverless Computing

  • SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing [Paper] [Code]
    • SIAT, CAS & UMacau
    • Integrate adaptive pre-warming windows; built on top of OpenFaaS.

GPU Sharing

  • ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments [Paper] [Code]
    • Chung-Ang University & Electronics and Telecommunications Research Institute & Virginia Tech
    • Integrate MIG and MPS to enhance GPU utilization.

Performance Analysis

  • GVARP: Detecting Performance Variance on Large-Scale Heterogeneous Systems [Paper] [Code]
    • Beihang University
    • Employ static analysis to identify the performance-critical parameters of kernel functions; segment program execution at external library calls and asynchronous kernel operations; construct a state transfer graph and estimate the workload of each program segment.

Interconnects

  • Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects [Paper] [Benchmark]
    • Sapienza University of Rome & University of Trento & Vrije Universiteit Amsterdam & ETH & CINECA & University of Antwerp & HPE & NVIDIA
    • Characterize three supercomputers: Alps, Leonardo, and LUMI.

Acronyms

  • LLM: Large Language Model
  • MoE: Mixture-of-Experts
  • DLRM: Deep Learning Recommendation Model
  • PEFT: Parameter-Efficient Fine-Tuning
  • MIG: Multi-Instance GPU
  • MPS: Multi-Process Service
  • CXL: Compute Express Link