ml-platform release from development branch #715

Merged: 58 commits, Jul 3, 2024

Commits
d45b633
Added project flag to commands
arueth May 28, 2024
ef19c56
Added IAP brand creation to initialize feature
arueth May 28, 2024
08301dc
Updated google providers to 5.31.0
arueth May 29, 2024
729e49b
Standardized GitOps scripts and added Kueue
arueth May 30, 2024
472b451
Renamed acm-template folder to configsync
arueth May 30, 2024
3e13659
Added initial test harness
arueth May 30, 2024
989637a
Added h100 DWS node pool
arueth May 30, 2024
f9fd86c
Moved node pools to a separate file and cleaned up node pool and Con…
arueth Jun 2, 2024
d921d99
Add notebook packaging guide to docs (#690)
kenthua Jun 3, 2024
dc55edc
Added enhancements to the dataprocessing use cases
arueth Jun 3, 2024
d5f1917
Moved ConfigSync GitOps resources to a new file
arueth Jun 4, 2024
064be50
Moved container_cluster to a new file
arueth Jun 4, 2024
bad5124
Updated Kueue to use the 0.7.0 manifests
arueth Jun 4, 2024
7689a6c
Added additional logic
arueth Jun 4, 2024
250d762
Added additional logging
arueth Jun 4, 2024
fe0f011
Increased the cluster resource limits
arueth Jun 4, 2024
e9b7d14
Updated documentation and added TimeoutError exception handling
arueth Jun 5, 2024
4c8acce
Added products and features outline
arueth Jun 5, 2024
7845cf2
Moved outputs to new file
arueth Jun 5, 2024
8408c15
Moved google_clients to a new file
arueth Jun 5, 2024
2979785
Moved project resources to a new file
arueth Jun 5, 2024
ff85617
Moved networking resources to a new file
arueth Jun 5, 2024
db204fe
Moved Kubernetes resources to a new file
arueth Jun 5, 2024
e9a9387
Moved fleet resources to a new file
arueth Jun 5, 2024
0064ca4
Moved git resources to a new file
arueth Jun 5, 2024
c0b31af
Cleaned up locals
arueth Jun 5, 2024
c221dad
Updated Kueue manifests to remove namespace creation
arueth Jun 5, 2024
2cc9ab1
Moved system node pool to the default node pool
arueth Jun 5, 2024
c4b45f9
Refactored test scripts
arueth Jun 6, 2024
2cfbf23
Ignore changes to node pool
arueth Jun 6, 2024
45da948
Add the downloaded KUBECONFIG
arueth Jun 6, 2024
3900a6d
Add notebook packaging link (#698)
kenthua Jun 6, 2024
9b898a8
Updated unit test scripts
arueth Jun 7, 2024
4630f79
Upgraded google provider to 5.33.0
arueth Jun 11, 2024
21f3562
Tweaked test scripts
arueth Jun 11, 2024
73eed59
Added Secret Manager add-on to the cluster
arueth Jun 11, 2024
ffe688d
Changed configsync git repository name to allow for easier use of mu…
arueth Jun 11, 2024
25ccc2d
Added a GitLab project module
arueth Jun 11, 2024
1cae8d2
Standardized git variables to support GitHub or GitLab
arueth Jun 12, 2024
1d3957d
Added a100 40GB node pools
arueth Jun 12, 2024
3a8b96e
Updated variable name
arueth Jun 14, 2024
7840119
Moved cpu node pool from n2 to n4 machines
arueth Jun 18, 2024
92bec17
Cleaned up MLP_ENVIRONMENT_NAME logic
arueth Jun 18, 2024
e428f62
Add environment_name to the Ray dashboard endpoint
arueth Jun 18, 2024
62615cc
Renamed repository files to help with automation
arueth Jun 18, 2024
0f26aab
Removed base e2e test script
arueth Jun 18, 2024
350cee9
Added missing prefix for environment_name
arueth Jun 18, 2024
772b386
Removed fleet level configmanagement and Google service accounts for …
arueth Jun 20, 2024
debcb96
Added Config Controller Terraform module
arueth Jun 26, 2024
6b9fffa
Upgraded providers
arueth Jun 26, 2024
56189c3
change order for file copy and files copied
kenthua Jun 26, 2024
8fdf90c
Added local provider to versions file
arueth Jun 27, 2024
828973f
Terraform formatting
arueth Jun 27, 2024
ab064db
Added NVIDIA DCGM
arueth Jun 28, 2024
7ed570d
Set default disk type to pd-balanced
arueth Jul 2, 2024
15fed50
Added allow KubeRay Operator to the namespace network policy
arueth Jul 2, 2024
44e827b
Updated test README
arueth Jul 3, 2024
f79396a
Fixed formatting
arueth Jul 3, 2024
2 changes: 1 addition & 1 deletion benchmarks/infra/stage-1/sample-tfvars/gpu-sample.tfvars
@@ -1,4 +1,4 @@
project_id = "$PROJECT_ID"
project_id = "$PROJECT_ID"
cluster_name = "ai-benchmark"
region = "us-central1"
gke_location = "us-central1-a"
2 changes: 2 additions & 0 deletions best-practices/ml-platform/.gitignore
@@ -0,0 +1,2 @@
test/log/*.log
test/scripts/locks/*.lock
6 changes: 6 additions & 0 deletions best-practices/ml-platform/README.md
@@ -10,6 +10,8 @@ This reference architecture demonstrates how to build a GKE platform that facili
- Platform admins will create a namespace per application and provide the application team member full access to it.
- The namespace-scoped resources will be created by the Application/ML teams either via [Config Sync][config-sync] or through a deployment tool like [Cloud Deploy][cloud-deploy].

For an outline of products and features used in the platform, see the [Platform Products and Features](/best-practices/ml-platform/docs/platform/products-and-features.md) document.

## Critical User Journeys (CUJs)

### Persona : Platform Admin
@@ -60,6 +62,10 @@ This reference architecture demonstrates how to build a GKE platform that facili

- [Distributed Data Processing with Ray](examples/use-case/ray/dataprocessing/README.md): Run a distributed data processing job using Ray.

## Resources

- [Packaging Jupyter notebooks](docs/notebook/packaging.md): Patterns and tools to get your `ipynb` files ready for deployment in a container runtime.

[gitops]: https://about.gitlab.com/topics/gitops/
[repo-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields
[root-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields
93 changes: 93 additions & 0 deletions best-practices/ml-platform/docs/notebook/packaging.md
@@ -0,0 +1,93 @@
# Packaging Jupyter notebook as deployable code

Jupyter notebooks are widely used by data scientists and machine learning practitioners for interactive, iterative development in their day-to-day work. However, the `ipynb` format is typically not used as a deployable or packageable artifact. There are two common scenarios in which notebooks are converted to deployable artifacts:
1. Model training tasks that need to be converted to batch jobs so they can scale up with more computational resources
1. Model inference tasks that need to be converted to an API server to serve end-user requests

In this guide we showcase two different tools that help facilitate converting your notebook into a deployable, packageable raw Python script.

This process can also be automated using Continuous Integration (CI) tools such as [Cloud Build](https://cloud.google.com/build/).
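For example, the conversion and image build could be chained as Cloud Build steps. The sketch below is illustrative only; the notebook name, Artifact Registry path, and image tag are placeholders, not part of this repository:

```yaml
steps:
  # Convert the notebook to a percent-format Python script (notebook name is an example)
  - name: 'python:3.11'
    entrypoint: 'bash'
    args: ['-c', 'pip install jupytext && jupytext --to py:percent gpt-j-online.ipynb']
  # Build the container image from the generated script and the repository's Dockerfile
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'us-docker.pkg.dev/$PROJECT_ID/my-repo/notebook-job:latest', '.']
images:
  - 'us-docker.pkg.dev/$PROJECT_ID/my-repo/notebook-job:latest'
```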

## Use jupytext to convert a notebook to raw Python and containerize it

1. Update the notebook to use `Pair Notebook with Percent Format`

   Jupytext ships with recent versions of Jupyter Notebook and JupyterLab. In addition to converting from `ipynb` to Python, it can pair the two formats, so updates made to the `ipynb` file are propagated to the Python file and vice versa.

To pair the notebook, simply use the pair function in the File menu:

![jupyter-pairing](../images/notebook/jupyter-pairing.png)

In this example we use the file [gpt-j-online.ipynb](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/examples/notebooks/gpt-j-online.ipynb):

![jupyter-gpt-j-online-ipynb](../images/notebook/jupyter-gpt-j-online-ipynb.png)

1. After pairing, we get the generated raw Python file:

![jupyter-gpt-j-online-py](../images/notebook/jupyter-gpt-j-online-py.png)

**NOTE**: This conversion can also be performed via the `jupytext` CLI with the following command:

```sh
jupytext --set-formats ipynb,py:percent \
  --to py gpt-j-online.ipynb
```
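For reference, jupytext's percent format marks each cell with a `# %%` comment marker. A minimal, hypothetical paired file might look like this (illustrative cells, not taken from gpt-j-online.ipynb):

```python
# %% [markdown]
# # Tokenize a prompt

# %%
prompt = "Hello from a paired notebook"
tokens = prompt.split()

# %%
# Each `# %%` marker above delimits one notebook cell
print(len(tokens))
```

Edits made to either file of the pair are synced to the other the next time it is saved.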

1. Extract the module dependencies

In the notebook environment, users typically install required Python modules using `pip install` commands, but in the container environment these dependencies need to be installed into the image before the Python script is executed.

We can use the `pipreqs` tool to generate the dependency list. Add the following snippet in a new cell of your notebook and run it:

```sh
!pip install pipreqs
!pipreqs --scan-notebooks
```

The following is an example output:

![jupyter-generate-requirements](../images/notebook/jupyter-generate-requirements.png)

**NOTE**: The `!cat requirements.txt` output in the screenshot shows an example of the generated `requirements.txt`.
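Under the hood, this kind of dependency extraction amounts to scanning the source for top-level `import` statements. A rough stdlib sketch of the idea (an illustration only, not how pipreqs is actually implemented):

```python
import ast

def top_level_imports(source: str) -> set:
    """Collect the top-level package names imported by a piece of Python source."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute `from X import Y` statements; relative imports are local modules
            packages.add(node.module.split(".")[0])
    return packages

sample = "import torch\nfrom transformers import pipeline\nimport os.path\n"
print(sorted(top_level_imports(sample)))  # → ['os', 'torch', 'transformers']
```

pipreqs additionally maps import names to PyPI package names and pins installed versions, which a sketch like this does not.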

1. Create the Dockerfile

To build the Docker image for your generated raw Python, we need to create a `Dockerfile`; below is an example. Replace `_THE_GENERATED_PYTHON_FILE_` with the name of your generated Python file:

```Dockerfile
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
    rm -rf /var/lib/apt/lists/*

# With multiple sources, the COPY destination must be a directory
COPY requirements.txt _THE_GENERATED_PYTHON_FILE_ /

RUN pip3 install --no-cache-dir -r /requirements.txt

ENV PYTHONUNBUFFERED=1

CMD ["python3", "/_THE_GENERATED_PYTHON_FILE_"]
```

1. [Optional] Lint and remove unused code

Using `pylint` to validate the generated code is a good practice. Pylint can detect out-of-order `import` statements and unused code, and provide code readability suggestions.

To use `pylint`, create a new cell in your notebook, run the code below, and replace `_THE_GENERATED_PYTHON_FILE_` with your filename:

```sh
!pip install pylint
!pylint _THE_GENERATED_PYTHON_FILE_
```

## Use nbconvert to convert a notebook to raw Python

We can convert a Jupyter notebook to a Python script using the nbconvert tool. nbconvert is available inside the Jupyter notebook environment in Google Colab Enterprise. If it is not available in your environment, it can be installed from [PyPI](https://pypi.org/project/nbconvert/).

1. Run the nbconvert command in your notebook. In this example, the notebook `Fine-tune-Llama-Google-Colab.ipynb` was first copied into the Colab Enterprise environment with `gsutil`.

```sh
!jupyter nbconvert --to python Fine-tune-Llama-Google-Colab.ipynb
```

Below is an example of the command and its output:
![jupyter-nbconvert](../images/notebook/jupyter-nbconvert.png)
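Conceptually, `nbconvert --to python` reads the notebook's JSON and emits its code cells as a script. A minimal stdlib sketch of that idea (an illustration, not the real nbconvert implementation):

```python
import json

def notebook_to_python(nb_json: str) -> str:
    """Concatenate the code cells of a notebook (as a JSON string) into one script."""
    nb = json.loads(nb_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            # A cell's source is stored as a list of lines
            chunks.append("".join(cell["source"]))
    return "\n\n".join(chunks)

# Tiny hypothetical notebook with one markdown cell and one code cell
nb = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n"]},
        {"cell_type": "code", "source": ["x = 1\n", "print(x)\n"]},
    ]
})
print(notebook_to_python(nb))
```

The real tool also handles notebook magics, cell metadata, and output stripping.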