| description |
| --- |
| #distributed_serving_system #batch_serving #selective_batching #transformer-based_model #iteration-level_scheduling |
Presented in OSDI 2022.
Authors: Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun (Seoul National University, FriendliAI)
This paper presents Orca, a distributed serving system that applies iteration-level scheduling (instead of scheduling whole requests, the engine runs only a single iteration of the model at a time) and selective batching to Transformer-based generative models.
- ML inference serving: serving system + execution engine
- Example: Triton (group multiple client requests into a batch) + FasterTransformer (conduct the inference procedure in the batched manner)
Transformer-based generative models produce the next token in an autoregressive manner, so the model has to be executed many times (one iteration per generated token) to process a single inference request.
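A minimal sketch of this autoregressive loop (plain Python; `model_step` is a hypothetical one-iteration forward pass, not Orca's API):

```python
# Minimal sketch of autoregressive generation.
# Each call to model_step() is one "iteration": it consumes the tokens
# produced so far and emits exactly one new token, so a single request
# needs as many iterations as it generates tokens.
def generate(model_step, prompt_tokens, eos_token, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model_step(tokens)   # one forward pass over the sequence
        tokens.append(next_token)
        if next_token == eos_token:       # the request may finish early
            break
    return tokens
```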
- Key designs
- Schedule the execution at the granularity of iteration instead of request
- Selective batching
- Split the batch and process each request individually for the Attention operation (see the NumPy sketch after this list)
- The decision not to batch the executions of the Attention operation has only a small impact on efficiency
- Apply batching to other operations
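A minimal NumPy sketch of selective batching as described above (shapes and helper names are illustrative, not Orca's actual kernels): non-Attention ops run on one flattened [total_tokens, hidden] matrix across requests of different lengths, and only the Attention op is split per request.

```python
import numpy as np

def linear(x, w):
    # Non-Attention op: runs on the flattened batch of all requests' tokens.
    return x @ w

def attention_per_request(q, k, v):
    # Attention is computed per request, since each request has its own length.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

hidden = 8
lengths = [3, 5, 2]                        # three requests with different lengths
x = np.random.randn(sum(lengths), hidden)  # flattened input: [total_tokens, hidden]
w_qkv = np.random.randn(hidden, 3 * hidden)

qkv = linear(x, w_qkv)                     # batched QKV projection for all tokens
outs, start = [], 0
for n in lengths:                          # split only around the Attention op
    q, k, v = np.split(qkv[start:start + n], 3, axis=-1)
    outs.append(attention_per_request(q, k, v))
    start += n
y = np.concatenate(outs, axis=0)           # merge back into the flat batch
```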
- Others
- Simple first-come-first-served (FCFS) scheduling algorithm (see the toy scheduler sketch after this list)
- Adopt intra-layer and inter-layer model parallelism
- Reserve "max tokens" slots of GPU memory for storing the keys & values in advance
- Tune the maximum batch size to maximize throughput while satisfying one’s latency budget
- Separate the communication channels for control messages (plus tokens) and tensor data transfer
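A toy sketch of the iteration-level, FCFS scheduling loop described above (`engine.run_one_iteration` and the `Request` object are hypothetical, not Orca's actual interface):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:                        # hypothetical request object for the sketch
    tokens: list = field(default_factory=list)

def serve(engine, arrivals, max_batch_size, eos_token):
    # Every loop step runs exactly ONE model iteration for up to max_batch_size
    # requests; finished requests are returned immediately, and the others rejoin
    # the queue for the next iteration. In a real server, newly arrived requests
    # would be appended to the queue here and could join the very next iteration.
    # Orca's scheduler additionally checks that enough reserved "max tokens"
    # key/value slots remain on the GPU before admitting a request.
    queue = deque(arrivals)           # first-come-first-served order
    finished = []
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch_size, len(queue)))]
        new_tokens = engine.run_one_iteration(batch)   # a single iteration only
        still_running = []
        for req, tok in zip(batch, new_tokens):
            req.tokens.append(tok)
            (finished if tok == eos_token else still_running).append(req)
        queue.extendleft(reversed(still_running))      # keep their place in line
    return finished
```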
- Implemented in 13K lines of C++, based on CUDA.
- Used gRPC for communication in the control plane.
- Used NCCL in the data plane.
- Implemented fused kernels for LayerNorm, Attention, and GeLU operators.
- Models
- GPT-3 models (13B, 101B, 175B, 341B)
- No actual model checkpoint
- Synthesized the trace of client requests
- Orca outperforms NVIDIA FasterTransformer: 36.9x throughput improvement at the same level of latency.