Summary of GSoC 2019 (Run GPU sharing workloads with Kubernetes + Kubeflow)

Student: Jianbo Ma (majb2114@zju.edu.cn)

Mentors: Harry Zhang (@resouer), Kai Zhang (@wsxiaozhang), Jian He (@jian-he)

Overview

I feel lucky to have participated in GSoC. Over the past three months, I have improved my engineering skills with the help of my mentors, and I am very grateful to them. Now that GSoC 2019 is nearly over, this is a summary of my work during the program.

Project description

GPUSharing is an open source project that shares GPUs among workloads by leveraging Kubernetes scheduling and Device Plugin extensibility.
Arena is a command-line interface that lets data scientists run and monitor machine learning training jobs and check their results easily. Its backend is built on Kubernetes, Helm, and Kubeflow, but data scientists need very little knowledge of Kubernetes to use it. Its goal is to make data scientists feel as if they are working on a single machine while actually using the power of GPU clusters.

Goals

Integrate Arena with GPUSharing for the TensorFlow Serving scenario.
Integrate NVIDIA MPS as an option for isolation.

Stage 1: Integrate Arena with GPUSharing for the TensorFlow Serving scenario

Achievement

Finished an end-to-end tf-serving task using GPUShare.
Checked the GPUMemory resources of the Kubernetes cluster.
Wrote a user guide for tf-serving with GPUShare.

Design

1. per_process_gpu_memory_fraction

per_process_gpu_memory_fraction is the fraction of the GPU memory space that each process occupies. The value is between 0.0 and 1.0 (with 0.0 as the default).
If 1.0, the server allocates all of the memory when it starts.
If 0.0, TensorFlow automatically selects a value.

For example, if we want the serving job to occupy half of the GPU memory, we can set per_process_gpu_memory_fraction to 0.5.
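As a minimal sketch of the 0.5 example above, the helper below assembles a tensorflow_model_server command line that carries the fraction. The function name serving_args, the model name and the model path are hypothetical and chosen only for illustration; this is not how Arena itself builds the command.

```python
from typing import List


def serving_args(fraction: float, model_name: str, model_base_path: str) -> List[str]:
    """Build a tensorflow_model_server command line with a GPU memory cap."""
    # The fraction must stay within the documented range of 0.0 to 1.0.
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"fraction must be within [0.0, 1.0], got {fraction}")
    return [
        "tensorflow_model_server",
        f"--model_name={model_name}",
        f"--model_base_path={model_base_path}",
        f"--per_process_gpu_memory_fraction={fraction}",
    ]


# Hypothetical model name and path, used only for illustration.
print(" ".join(serving_args(0.5, "mymodel", "/models/mymodel")))
```

Validating the range before building the command keeps an out-of-range value from reaching the server at startup.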

2. The design process.

Goal: After a user submits a serving task, we need to calculate the correct per_process_gpu_memory_fraction and pass it as a parameter of the serving task.

per_process_gpu_memory_fraction = (required GPUMemory) / (total GPUMemory of the allocated GPU card)

The GPU memory that the serving task requires is translated into the aliyun.com/gpu-mem resource limit under spec.containers[].resources.limits.
After the GPUShare scheduler-extender and device-plugin have handled the pod, environment variables are generated in the container.
The required GPUMemory equals ALIYUN_COM_GPU_MEM_CONTAINER, and the total GPUMemory of the GPU card equals ALIYUN_COM_GPU_MEM_DEV.
per_process_gpu_memory_fraction = $ALIYUN_COM_GPU_MEM_CONTAINER / $ALIYUN_COM_GPU_MEM_DEV
When the task runs under GPUShare, this computed value is set as per_process_gpu_memory_fraction in the task.
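The calculation above can be sketched in a few lines. The function name gpu_memory_fraction is hypothetical, and the sketch assumes that both environment variables hold plain numbers in the same unit; when either variable is missing, it treats the container as not running under GPUShare and returns None.

```python
import os
from typing import Mapping, Optional


def gpu_memory_fraction(env: Mapping[str, str] = os.environ) -> Optional[float]:
    """Derive per_process_gpu_memory_fraction from the GPUShare variables."""
    container = env.get("ALIYUN_COM_GPU_MEM_CONTAINER")
    device = env.get("ALIYUN_COM_GPU_MEM_DEV")
    # Without both variables, assume the task is not a GPUShare task.
    if container is None or device is None:
        return None
    required, total = float(container), float(device)
    # Assumption: both values use the same unit and the request fits on the card.
    if total <= 0 or required <= 0 or required > total:
        raise ValueError(f"invalid GPU memory values: {required}/{total}")
    return required / total


# Example with illustrative values: 4 units requested on a 16-unit card.
example_env = {
    "ALIYUN_COM_GPU_MEM_CONTAINER": "4",
    "ALIYUN_COM_GPU_MEM_DEV": "16",
}
print(gpu_memory_fraction(example_env))
```

Returning None instead of 0.0 for the non-GPUShare case lets the caller decide whether to omit the parameter entirely and fall back to the TensorFlow default.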

3. The design diagram.

Code

Stage 2: Integrate NVIDIA MPS as an option for isolation

Achievement

Investigated how to use MPS.
Tested the capacity of MPS.
Integrated MPS with GPUShare to simplify user operations.

Design and result

Use MPS
Test result
Integration design

Code

User_guide and Integration

To do

Test whether GPU threads are controlled by MPS.

Reference:

MPS
nvprof


Sakuralbj/CNCF-GSoC
