
Training: Reorganized Training Operator Docs #3719

Merged 15 commits on Apr 26, 2024

Conversation

andreyvelich
Member

@andreyvelich andreyvelich commented Apr 23, 2024

Related: kubeflow/training-operator#1998.
I created the following sections for Training Operator docs:

  • Overview
  • Installation
  • Getting Started
  • User Guides
  • Reference

A few points:

  • @StefanoFioravanzo @kubeflow/wg-training-leads Any ideas on what we could add to the Why Training Operator? section? Initially, we can just add some basic info.
  • I didn't move CRDs to Reference in this PR since we don't have time to discuss how we are going to generate them. What do you think we should do in this PR?
  • Do we need to have a working example on the Getting Started page? Would it be too complicated to consume?

/hold for review

/assign @StefanoFioravanzo @kubeflow/wg-training-leads @hbelmiro @kuizhiqing @droctothorpe @franciscojavierarceo
Looking for your feedback!

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

import torch

# Create model.
class Net(torch.nn.Module):
    """Create the PyTorch model"""
    ...
Contributor

nit: my recommendation would be to populate this with even a trivial single-layer NN that would actually run for this example. It helps users who may just be copy-pasting to get started.
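As a sketch of what that could look like (the input shape and layer sizes here are hypothetical, chosen only to make the snippet runnable):

```python
import torch

# A trivial single-layer network, so the docs example actually runs when copy-pasted.
class Net(torch.nn.Module):
    """A minimal PyTorch model: one linear layer."""

    def __init__(self):
        super().__init__()
        # Assumes a flattened 28x28 single-channel input mapped to 10 classes.
        self.fc = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        # Flatten everything except the batch dimension, then apply the layer.
        return self.fc(x.view(x.size(0), -1))

model = Net()
out = model(torch.randn(2, 1, 28, 28))  # batch of 2 fake images
print(out.shape)  # torch.Size([2, 10])
```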

Member Author

Sure, let me try to add something simple.

@StefanoFioravanzo
Member

@andreyvelich thanks for this!

Any ideas on what we could add to the Why Training Operator? section?

Let's start with something simple and iterate in future PRs. We can start by answering questions like:

  • How does the Training Operator simplify distributed training compared to a more traditional approach?
  • How does Kubernetes help in solving these problems?
  • Why is Training Operator part of the Kubeflow ecosystem?

I didn't move CRDs to Reference in this PR since we don't have time to discuss how we are going to generate them. What do you think we should do in this PR?

Makes sense. I'd keep the scope of this PR to the restructuring you already implemented. Let's iterate on content separately. We can address each framework's user guide in dedicated PRs.

Do we need to have a working example on the Getting Started page? Would it be too complicated to consume?

Getting Started should have an end-to-end working (yet simple) example. Generally, people just want to copy-paste some stuff, run it, and see results. Then you typically link to some more advanced tutorials or user guides at the end.
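For what it's worth, an end-to-end Getting Started example for the Training Operator often boils down to a small job manifest. The sketch below assumes the standard `kubeflow.org/v1` PyTorchJob API; the container image and training script path are hypothetical placeholders, not part of this PR:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch  # PyTorchJob expects the container to be named "pytorch"
              image: docker.io/example/pytorch-mnist:latest  # hypothetical image
              command: ["python3", "/opt/mnist/train.py"]    # hypothetical script
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/example/pytorch-mnist:latest  # hypothetical image
              command: ["python3", "/opt/mnist/train.py"]    # hypothetical script
```

Users could then `kubectl apply -f` the manifest and watch the job's pods to see results, which matches the copy-paste-and-run flow described above.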

@andreyvelich
Member Author

That makes sense @StefanoFioravanzo. I added initial ideas for Why Training Operator?, and also added the AI/ML lifecycle diagram that we can reuse across Kubeflow components to explain which stage of the lifecycle each component addresses (e.g. Spark Operator, Model Registry, Katib, Notebooks, KServe).
Please let me know what you think @kubeflow/wg-training-leads @StefanoFioravanzo.


<img src="/docs/components/training/images/distributed-tfjob.drawio.svg"
  alt="Distributed TFJob">
<img src="/docs/components/training/images/ml-lifecycle-training-operator.drawio.svg"
  alt="ML Lifecycle with Training Operator">
Contributor
@franciscojavierarceo Apr 24, 2024

I love this diagram. Based on my comment here, it may make sense to also include feature serving along with model serving (though I note in my comment that there are some architectural choices to be made about the order in which serving should happen). Regardless, I think it's worth stating explicitly that feature serving is a component of model serving.

Member Author

That makes sense, do we want to update the diagram once we discuss the architecture for Feature Serving?
For this diagram, I just took the diagram that we worked on together with @ronaldpetty @zanetworker for the CNCF AI WG white paper:
https://www.cncf.io/wp-content/uploads/2024/03/cloud_native_ai24_031424a-2.pdf

Contributor

Oh nice, looks like that was very recent. @ronaldpetty @zanetworker let me know if you have any thoughts/opinions there. Would love for feature serving to be included in this, as I think it is becoming increasingly important.

Member Author

@franciscojavierarceo Not sure how to make it simple in this diagram. What do you think about this:
Screenshot 2024-04-24 at 23 20 02

Contributor
@franciscojavierarceo Apr 25, 2024

This may be overkill...but here's my attempt at it. It's technically missing the measurement required to actually repeat the process (e.g., a click for a recommendation engine).

Screenshot 2024-04-24 at 11 55 19 PM

Member Author

That makes sense, thanks @franciscojavierarceo!
A few questions:

  1. Isn't the feature store also used for Model Training and Optimization?
  2. Should we split the Model Development, Iteration, and Optimization with:
    Model Experimentation and Development (Kubeflow Notebooks) --> Model Optimization and Hyperparameter Tuning (Katib) (similar to this flow: https://www.kubeflow.org/docs/started/architecture/#introducing-the-ml-workflow).
    Since I also want to use this diagram in other docs: Kubeflow Notebooks, Kubeflow Katib.

Contributor
@franciscojavierarceo Apr 25, 2024

Isn't the feature store also used for Model Training and Optimization?

In some sense yes. Feature selection is often done during Model Development, Iteration, and Optimization but that is downstream of Feature Extraction. In short, you have to pull all of the features you want first before you can select which ones are best for your model, so this diagram is more representative of how that flow actually works. Let me know if you have additional thoughts there.

Should we split the Model Development, Iteration, and Optimization with:
Model Experimentation and Development (Kubeflow Notebooks) --> Model Optimization and Hyperparameter Tuning (Katib) (similar to this flow: https://www.kubeflow.org/docs/started/architecture/#introducing-the-ml-workflow).

How about this:
image

Member Author

That looks good. I would also split Model Experimentation into Choose ML Algorithms + Code Model and HP Tuning + Architecture Search.
That will allow us to explain to users at which stage Kubeflow Notebooks are used and at which Kubeflow Katib is used.

Contributor

Will send you a message on Slack.

@zanetworker Apr 26, 2024

@franciscojavierarceo this sounds like a good discussion to have on #wg-artificial-intelligence :)

If we are looking for abstractions, I'd categorize it as Data Prep, Model Building (training, tuning, experimentation, feature engineering, ...), Model Serving & Deployment (pull from registry, deploying inference, ...), then Operations & Eval (monitoring, ...).

I.e., there are phases in the lifecycle (building, serving, operations, and iteration), components (feature stores, registries, infra, ...), and personas (highlighted in the white paper, but also here: https://tag-runtime.cncf.io/wgs/cnaiwg/glossary/#data-engineers).

Aligning and reusing the existing language across these docs would be great, and if there are language gaps we could patch them in CNAI as well :)

@andreyvelich
Member Author

@franciscojavierarceo I added a working example, does it look good?

| TensorFlow | [TFJob](/docs/components/training/user-guides/tensorflow/) |
| XGBoost | [XGBoostJob](/docs/components/training/user-guides/xgboost/) |
| MPI | [MPIJob](/docs/components/training/user-guides/mpi/) |
| PaddlePaddle | [PaddleJob](/docs/components/training/user-guides/paddlepaddle/) |
Contributor

Broken link.

Suggested change
| PaddlePaddle | [PaddleJob](/docs/components/training/user-guides/paddlepaddle/) |
| PaddlePaddle | [PaddleJob](/docs/components/training/user-guides/paddle/) |

Member Author

Nice catch!

@andreyvelich
Member Author

@franciscojavierarceo and I made a few changes to the AI/ML lifecycle diagram so it will be easier to use in other Kubeflow component docs (e.g. Katib, Feast, Model Registry, KServe, Notebooks, Spark Operator).

@andreyvelich
Member Author

This PR should be ready unless you have any other comments.
/hold cancel

Contributor
@hbelmiro left a comment

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Apr 26, 2024
@google-oss-prow google-oss-prow bot merged commit 8fe2bd2 into kubeflow:master Apr 26, 2024
6 checks passed
@andreyvelich andreyvelich deleted the training-improve-docs branch April 26, 2024 15:06