Docs: reference architecture for fault tolerance capabilities #2157

StefanoFioravanzo · 2024-07-04T11:54:40Z

What you would like to be added?

Unfortunately, we don't currently have good docs for our ElasticPolicy API (https://github.com/kubeflow/training-operator/blob/master/pkg/apis/kubeflow.org/v1/pytorch_types.go#L98) or our restart policy API (https://github.com/kubeflow/training-operator/blob/master/pkg/apis/kubeflow.org/v1/common_types.go#L170).

We should write some reference architecture docs to expose these features to our users.

Why is this needed?

Users do not have a reference to understand and appreciate the fault tolerance capabilities offered by the Training Operator.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

andreyvelich · 2024-07-05T22:17:53Z

Thank you for creating this @StefanoFioravanzo!
/good-first-issue

google-oss-prow · 2024-07-05T22:17:56Z

@andreyvelich:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

LogicalGuy77 · 2024-07-08T23:15:28Z

Hi, I'm new to Kubeflow and would like to work on this. A couple of questions:

1. Will the docs be contributed to the kubeflow-website repo (so the PR will be created there)?
2. Should I create a Google Doc first and have it reviewed iteratively?
3. I'm new to the Kubeflow ecosystem, so it would be great if you could point me towards resources to fully understand the policies mentioned above.

Thank you for your time. @andreyvelich @StefanoFioravanzo

aryan-py · 2024-07-09T10:10:36Z

I want to take this up and would love any advice, @StefanoFioravanzo @andreyvelich.

StefanoFioravanzo · 2024-07-09T11:57:22Z

@LogicalGuy77 @aryan-py Thanks for stepping up!

Docs will be contributed to the Kubeflow website, more specifically under Katib's Reference section here: https://www.kubeflow.org/docs/components/katib/reference/

I would recommend starting with a Google Doc, especially since you are not familiar with these concepts. This will allow the project owners to review faster. You can then move the content to a PR once it's in a good state. I suggest sharing the Google Doc with the whole Google discuss group, with commenter privilege (if you do that, remember to un-tick the option to notify the recipients, otherwise everyone in the Kubeflow Google group will be spammed).

andreyvelich · 2024-07-09T12:09:36Z

Thanks for your interest @LogicalGuy77 and @aryan-py!

Yes, @StefanoFioravanzo is right, we are planning to contribute these docs to the Kubeflow website: https://github.com/kubeflow/website

Just a small correction: we should use the Training Operator user-guides section to explain how various APIs work with the Training Operator to achieve fault tolerance: https://www.kubeflow.org/docs/components/training/user-guides/

I would suggest starting with the RestartPolicy API to handle ML training Pod restarts, and the ElasticPolicy API for fault-tolerant PyTorch on Kubernetes.
cc @kubeflow/wg-training-leads
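
For readers who have not used these two APIs, here is a minimal sketch of where they sit in a PyTorchJob manifest, built as a plain Python dict so it can be dumped to JSON and passed to `kubectl apply -f`. The field names (`elasticPolicy`, `minReplicas`, `maxReplicas`, `maxRestarts`, `rdzvBackend`, `restartPolicy`) follow the `kubeflow.org/v1` types linked in the issue description. The helper name `build_elastic_pytorchjob`, the job name, the image, and all values are hypothetical placeholders, not recommended settings.

```python
import json


def build_elastic_pytorchjob(name, image, min_replicas, max_replicas):
    """Return a PyTorchJob manifest (as a dict) that sets both policies.

    Illustrative sketch only: the values are assumptions, not defaults
    recommended by the Training Operator maintainers.
    """
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("expected 1 <= min_replicas <= max_replicas")

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            # ElasticPolicy: bounds on the number of workers, plus the
            # rendezvous backend used by PyTorch elastic.
            "elasticPolicy": {
                "rdzvBackend": "c10d",
                "minReplicas": min_replicas,
                "maxReplicas": max_replicas,
                "maxRestarts": 3,  # assumed value for illustration
            },
            "pytorchReplicaSpecs": {
                "Worker": {
                    "replicas": max_replicas,
                    # RestartPolicy is set per replica type. Accepted values
                    # are "Always", "OnFailure", "Never", and "ExitCode".
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                {"name": "pytorch", "image": image}
                            ]
                        }
                    },
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, used only to show the output shape.
    manifest = build_elastic_pytorchjob(
        "elastic-demo", "example.com/my-training-image:latest", 1, 3
    )
    print(json.dumps(manifest, indent=2))
```

The sketch keeps `restartPolicy` on the replica spec and `elasticPolicy` at the top level of `spec`, which is the split the two linked type files suggest.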

StefanoFioravanzo · 2024-07-09T12:12:36Z

@andreyvelich oops, sorry, we are indeed talking about training-operator. But shouldn't this go under Reference? I think we are talking about how fault tolerance is designed in the operator.

What kind of user guides are you thinking about?

andreyvelich · 2024-07-09T14:10:38Z

I guess we can add two things:

1. Explain to users how they can leverage the various training job run policy APIs for large-scale distributed training.
2. Add diagrams to the reference docs to explain how it works.

LogicalGuy77 · 2024-07-12T19:18:54Z

Update:

I've been going through lots of code and documentation and have prepared an initial draft for Restart Policy: Google Doc. I've given commenter access to the kubeflow-discuss Google group. I would love to have your guidance to improve it further.
I was thinking of dividing the task into three parts:

1. Restart Policy
2. Elastic Policy
3. Job Run Policies for Large-Scale Distributed Training, such as clean pod policy and scheduling policy.

Could you elaborate on what kind of diagrams you are looking for?

Thank you for your time. @andreyvelich @StefanoFioravanzo
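
As a rough illustration of the third part, the sketch below assembles a `runPolicy` fragment with a clean pod policy and an optional scheduling policy. The field names (`cleanPodPolicy`, `backoffLimit`, `schedulingPolicy.minAvailable`) and the three clean pod policy values come from the `kubeflow.org/v1` common types; the helper name `build_run_policy` and the default values are hypothetical. It also assumes a gang scheduler is installed in the cluster, since `schedulingPolicy` is only meaningful in that setup.

```python
# Values accepted by cleanPodPolicy in the kubeflow.org/v1 API.
VALID_CLEAN_POD_POLICIES = ("All", "Running", "None")


def build_run_policy(clean_pod_policy="Running", backoff_limit=3,
                     min_available=None):
    """Return a runPolicy fragment for a training job spec.

    Illustrative sketch: the defaults here are assumptions chosen for
    the example, not the operator's own defaults.
    """
    if clean_pod_policy not in VALID_CLEAN_POD_POLICIES:
        raise ValueError(
            f"cleanPodPolicy must be one of {VALID_CLEAN_POD_POLICIES}"
        )
    if backoff_limit < 0:
        raise ValueError("backoff_limit must be non-negative")

    run_policy = {
        # Which Pods to delete once the job finishes.
        "cleanPodPolicy": clean_pod_policy,
        # How many retries are allowed before the job is marked failed.
        "backoffLimit": backoff_limit,
    }
    if min_available is not None:
        # Gang scheduling: minimum number of Pods that must be
        # schedulable together before any of them start.
        run_policy["schedulingPolicy"] = {"minAvailable": min_available}
    return run_policy


if __name__ == "__main__":
    print(build_run_policy(clean_pod_policy="All", min_available=4))
```

The returned dict would sit under `spec.runPolicy` of a training job manifest.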

StefanoFioravanzo added kind/feature lifecycle/needs-triage labels Jul 4, 2024

andreyvelich added area/docs and removed lifecycle/needs-triage labels Jul 5, 2024

google-oss-prow bot added good first issue help wanted labels Jul 5, 2024
