Proposal for a XGBOOST operator #247

merlintang · 2019-04-04T13:43:36Z

Motivation

XGBOOST is the state of art approach for machine learning. Additionally to deploy XGboost over Yarn, or Spark, it is necessary to provide Kubernetes with the ability to handle distributed XGBoost training and prediction. The Kubernetes Operator for XGBOOST reduces the gap to build distributed XGBOOST over Kubernetes, and allow XGBOOST applications to be specified, run, and monitored idiomatically on Kubernetes.

The operator allows ML applications based XGBOOST to be specified in a declarative manner (e.g., in a YAML file) and run without the need to deal with the XGBOOST submission process. It also enables status of XGBOOST job to be tracked and presented idiomatically like other types of workloads on Kubernetes. This document discusses the design and architecture of the operator.

Goals

Provide a common custom resource definition (CRD) for defining a single-node or multiple node XGBOOST training and predication job.
Implement a custom controller to manage the CRD, create dependency resources, and reconcile the desired states.

More details
A XGBOOST operator
A way to deploy the operator
A single pod XGBOOST example
A distributed XGBOOST example

Non-Goals

Issues or changes not being addressed by this proposal.

UI or API

Custom Resource Definition
The custom resource submitted to the Kubernetes API would look something like this:

apiVersion: "kubeflow.org/v1alpha1"
kind: "XGBoostJob"
metadata:
  name: "xgboost-example-job"
  command: ["xgboost"]
        args: [
          "-bind-to", "none",
          '-distributed', "yes",
          "-job-type", "train",
          "python", "scripts/xgboost_test/xgboost_test.py",
          "--model", "modelname"
        ]
spec:
  backend: "rabit"
  masterPort: "23456"
  replicaSpecs:
    - replicas: 1
      ReplicaType: MASTER
      template:
        spec:
          containers:
            - image: xgboost/xgboost:latest
              name: master
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    - replicas: 2
      ReplicaType: WORKER
      template:
        spec:
          containers:
            - image: xgboost/xgboost:latest
              name: worker
          restartPolicy: OnFailure

This XGBoostJob resembles the existing TFJob for the tf-operator. backend defines the protocol the XGboost workers to communicate when initializing the worker group. masterPort defines the port the group will use to communicate with the master's Kubernetes service.

Resulting Master

kind: Service
apiVersion: v1
metadata:
  name: xgboost-master-${job_id}
spec:
  selector:
    app: xgboost-master-${job_id}
  ports:
  - port: 23456
    targetPort: 23456

Details

apiVersion: v1
kind: Pod
metadata:
  name: xgboost-master-${job_id}
  labels:
    app: xgboost-${job_id}
spec:
  containers:
  - image: xgboost/xgboost:latest
    imagePullPolicy: IfNotPresent
    name: master
    env:
      - name: MASTER_PORT
        value: "23456"
      - name: MASTER_ADDR
        value: "localhost"
      - name: WORLD_SIZE
        value: "3"
        # Rank 0 is the master
      - name: RANK
        value: "0"
    ports:
      - name: masterPort
        containerPort: 23456
  restartPolicy: OnFailure

Resulting Worker

apiVersion: v1
kind: Pod
metadata:
  name: xgboost-worker-${job_id}
spec:
  containers:
  - image: xboost/xgboost:latest
    imagePullPolicy: IfNotPresent
    name: worker
    env:
    - name: MASTER_PORT
      value: "23456"
    - name: MASTER_ADDR
      value: xgboost-master-${job_id}
    - name: WORLD_SIZE
      value: "3"
    - name: RANK
      value: "1"
  restartPolicy: OnFailure

The worker spec generates a pod. They will communicate to the master through the master's service name.

Design

The design of distributed XGBOOST follow the Rabit protocol of XGBOOST. The rabit design can be found here. Thus, XGBoost operator is coming to provide the framework for start the master node and slave nodes for Rabit as following way.

The master node of Rabit is initialized, and each slave node comes to connect with master node via the provided port and IP. Each work in pods to read data from locally, and map the input data into Dmatrix format for XGBoost.
a. For training job: One of worker is selected as Host, and other workers use the IP and port of number of HOST to build the rabit network for training as the Figure 1.
b. For predication job: the trained model is popugate to each work node, and use the local validation data for predication..

Alternatives Considered

Description of possible alternative solutions and the reasons they were not chosen.

jlewi · 2019-04-08T00:47:56Z

Thanks for writing this up!

The link to Rabit isn't working.

What does the heavy lifting of splitting work among the different workers; is that Rabit?

I'd suggest sharing this issue in slack to make sure others are aware. @johnugeorge and @richardsliu are the primary owners of our job operators so they would probably be the best points of contact in terms of reviewing this proposal.

/assign @johnugeorge
/assign @richardsliu

johnugeorge · 2019-04-08T05:33:37Z

Thanks for this proposal.

I see that your proposed CRD is similar to v1alpha1 version of job operators. See sample current v1beta2 (latest version) of operator. See https://github.com/kubeflow/pytorch-operator/blob/master/examples/mnist/v1beta2/pytorch_job_mnist_gloo.yaml

We tried to keep consistent behavior across operators. See tf and pytorch operator v1beta2 spec. Do you see any issues in keeping the same format of v1beta2? This would help in using the common implementation and having consistent behavior.

Just to add: we are moving the common code(currently, it resides in tf-operator) to a separate repo(https://github.com/kubeflow/common). This will contain the common code and apis that can be used across all operators(https://github.com/kubeflow/common/blob/master/operator/v1/types.go)

gaocegege · 2019-04-08T06:29:19Z

Thanks for the awesome proposal. Same opinion with johnugeorge

I suggest using v1alpha2 style API (See https://github.com/kubeflow/pytorch-operator/blob/master/examples/mnist/v1beta2/pytorch_job_mnist_gloo.yaml).

terrytangyuan · 2019-04-08T13:42:53Z

Link to Rabit here: https://github.com/dmlc/rabit

+1 to use kubeflow/common. Feel free to create new issues there if you come across anything else that can be reused by XGBoost operator.

richardsliu · 2019-04-09T02:55:56Z

Yes please use the kubeflow/common library and file issues if you find any problems. Thanks!

merlintang · 2019-04-10T14:38:44Z

Thanks for your comments and update. I totally agree that we can share the code of kuberflow/common. I would check the diff between common and xgboost operator. It would be excited to use xgboost for the test bed for common operator. I would send a followup PR for the xgboost over common operator.

…

On Tue, Apr 9, 2019 at 10:55 AM Richard Liu ***@***.***> wrote: Yes please use the kubeflow/common library and file issues if you find any problems. Thanks! — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#247 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABXY-awxLeXjXCPlZNGghi4OtRvWldS3ks5vfAE9gaJpZM4cc1kL> .

merlintang · 2019-04-17T01:06:44Z

@jlewi should we create a repo under kubeflow for xgboost operator ?

See proposal: kubeflow/community#247

jlewi · 2019-04-22T14:23:33Z

I have created https://github.com/kubeflow/xgboost

Should we close this issue out and open up more specific issues in that repo?

A good place to start would be to define a ROADMAP.md in that repository.

Can we aim to have an alpha version of the custom resource as part of 0.6 which will be released end of June?

terrytangyuan · 2019-04-23T03:41:59Z

@jlewi Thanks! Would it be better to change the repo name to be xgboost-operator so it's more consistent with repos for the other operators?

gaocegege · 2019-04-23T03:46:24Z

+1 for xgboost-operator

merlintang · 2019-04-23T08:49:00Z

xgboost-operator would be a better name. +1

…

On Apr 23, 2019, at 11:46 AM, Ce Gao ***@***.***> wrote: +1 for xgboost-operator — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

richardsliu · 2019-07-17T23:35:55Z

This can be closed right? @merlintang

terrytangyuan · 2019-07-17T23:44:58Z

Yes.

/close

k8s-ci-robot · 2019-07-17T23:44:59Z

@terrytangyuan: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

merlintang · 2019-07-18T00:22:35Z

yes thanks richard

…

On Jul 17, 2019, at 4:35 PM, Richard Liu ***@***.***> wrote: This can be closed right? @merlintang — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

See proposal: kubeflow/community#247

* Improve the instructions for setting up the OAuth client. * Add screenshots * Explain that Authorized Domains may not be an option and explain they don't need to set it in that case. Fix kubeflow#177 * Address comments.

k8s-ci-robot assigned johnugeorge and richardsliu Apr 8, 2019

jlewi added a commit to kubeflow/xgboost-operator that referenced this issue Apr 22, 2019

Create an initial owners file

eb0ac85

See proposal: kubeflow/community#247

hcho3 mentioned this issue Apr 22, 2019

Can I have an example for xgboost run on kubernets dmlc/xgboost#4270

Closed

terrytangyuan mentioned this issue Apr 30, 2019

project setup kubeflow/xgboost-operator#3

Merged

merlintang mentioned this issue May 5, 2019

Change repo name from xgboost to xgboost-operator kubeflow/xgboost-operator#4

Closed

k8s-ci-robot closed this as completed Jul 17, 2019

xfate123 pushed a commit to xfate123/xgboost-operator that referenced this issue May 16, 2020

Create an initial owners file

0ab723e

See proposal: kubeflow/community#247

ryantd mentioned this issue Apr 16, 2021

DGL Operator: Leverage DGL on K8s dmlc/dgl#2843

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal for a XGBOOST operator #247

Proposal for a XGBOOST operator #247

merlintang commented Apr 4, 2019

jlewi commented Apr 8, 2019

johnugeorge commented Apr 8, 2019

gaocegege commented Apr 8, 2019

terrytangyuan commented Apr 8, 2019

richardsliu commented Apr 9, 2019

merlintang commented Apr 10, 2019 via email

merlintang commented Apr 17, 2019 •

edited

Loading

jlewi commented Apr 22, 2019

terrytangyuan commented Apr 23, 2019

gaocegege commented Apr 23, 2019

merlintang commented Apr 23, 2019 via email

richardsliu commented Jul 17, 2019

terrytangyuan commented Jul 17, 2019 •

edited

Loading

k8s-ci-robot commented Jul 17, 2019

merlintang commented Jul 18, 2019 via email

Proposal for a XGBOOST operator #247

Proposal for a XGBOOST operator #247

Comments

merlintang commented Apr 4, 2019

Motivation

Goals

Non-Goals

UI or API

Resulting Master

Resulting Worker

Design

Alternatives Considered

jlewi commented Apr 8, 2019

johnugeorge commented Apr 8, 2019

gaocegege commented Apr 8, 2019

terrytangyuan commented Apr 8, 2019

richardsliu commented Apr 9, 2019

merlintang commented Apr 10, 2019 via email

merlintang commented Apr 17, 2019 • edited Loading

jlewi commented Apr 22, 2019

terrytangyuan commented Apr 23, 2019

gaocegege commented Apr 23, 2019

merlintang commented Apr 23, 2019 via email

richardsliu commented Jul 17, 2019

terrytangyuan commented Jul 17, 2019 • edited Loading

k8s-ci-robot commented Jul 17, 2019

merlintang commented Jul 18, 2019 via email

merlintang commented Apr 17, 2019 •

edited

Loading

terrytangyuan commented Jul 17, 2019 •

edited

Loading