kfp-dist-train provides utilities for Kubeflow Pipelines that let you write distributed training code directly with the Kubeflow Pipelines SDK.
- Set up a Kubeflow environment (for example, with https://github.com/alauda/kubeflow-chart).
- Upload the example kfp-dist-train.ipynb to a Notebook instance, or set up local pipeline submission.
- Execute the example to submit a workflow. You can configure the number of workers in the Kubeflow web UI. The job should look like below:
- support the `kfpdist.component(dist=True)` decorator as a wrapper of `dsl.component`
- support parameter server strategy
- support PyTorch
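
The decorator item in the list above describes a wrapper around `dsl.component` that takes a `dist` flag. The general wrapping pattern can be sketched as follows. This is only an illustration, not the actual kfp-dist-train implementation: `base_component` is a hypothetical stand-in for `dsl.component` so the snippet runs without Kubeflow installed, and the `is_distributed` and `component_options` attributes are assumptions made up for this example.

```python
def base_component(func=None, **kwargs):
    """Hypothetical stand-in for dsl.component: records its options on the function."""
    def decorate(f):
        f.component_options = dict(kwargs)
        return f
    # Support both @base_component and @base_component(...)
    return decorate(func) if func is not None else decorate


def component(func=None, *, dist=False, **kwargs):
    """Hypothetical wrapper: forwards options to base_component and tags the step."""
    def decorate(f):
        wrapped = base_component(**kwargs)(f)
        # Assumed marker that a pipeline compiler could read to fan the step
        # out over several workers; the real library may work differently.
        wrapped.is_distributed = dist
        return wrapped
    return decorate(func) if func is not None else decorate


@component(dist=True, base_image="python:3.10")
def train(epochs: int) -> str:
    return f"trained for {epochs} epochs"
```

In this sketch the wrapper consumes `dist` itself and passes every other keyword argument through unchanged, so a step written for the plain decorator would only need the extra flag to be marked as distributed.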