[SDK] Create Unify Training Client #1719

Merged
1 change: 0 additions & 1 deletion .gitattributes

This file was deleted.

14 changes: 10 additions & 4 deletions docs/development/developer_guide.md
@@ -96,10 +96,16 @@ This command will re-generate the api and model files together with the document
The following files/folders in `sdk/python` are auto-generated and should not be modified directly:

```
docs
kubeflow/training/models
kubeflow/training/*.py
test/*.py
sdk/python/docs
sdk/python/kubeflow/training/models
sdk/python/kubeflow/training/*.py
sdk/python/test/*.py
```

The Training Operator client and public APIs are located here:

```
sdk/python/kubeflow/training/api
```

## Code Style
18 changes: 10 additions & 8 deletions docs/testing/e2e_debugging.md
@@ -1,7 +1,9 @@
# How to debug an E2E test for Kubeflow Training Operator

[E2E Testing](./e2e_testing.md) gives an overview of writing e2e tests. This guidance concentrates more on the e2e failure debugging.
TODO (andreyvelich): This doc is outdated. Currently, E2Es are located here:
[`sdk/python/test/e2e`](../../sdk/python/test/e2e)

[E2E Testing](./e2e_testing.md) gives an overview of writing e2e tests. This guidance concentrates more on the e2e failure debugging.

## Prerequsite

@@ -16,7 +18,8 @@ wget https://github.com/ksonnet/ksonnet/releases/download/v0.13.1/ks_0.13.1_linu
tar -xvzf ks_0.13.1_linux_amd64.tar.gz
sudo cp ks_0.13.1_linux_amd64/ks /usr/local/bin/ks-13
```
> We would like to deprecate `ksonnet` but may takes some time. Feel free to pick up [the issue](https://github.com/kubeflow/training-operator/issues/1468) if you are interested in it.

> We would like to deprecate `ksonnet` but may takes some time. Feel free to pick up [the issue](https://github.com/kubeflow/training-operator/issues/1468) if you are interested in it.
> If your platform is darwin or windows, feel free to download binaries in [ksonnet v0.13.1](https://github.com/ksonnet/ksonnet/releases/tag/v0.13.1)

4. Deploy HEAD training operator version in your environment
@@ -33,23 +36,24 @@ kubectl set image deployment.v1.apps/training-operator training-operator=kubeflo
## Run E2E Tests locally

1. Set environments

```
export KUBEFLOW_PATH=$GOPATH/src/github.com/kubeflow
export KUBEFLOW_TRAINING_REPO=$KUBEFLOW_PATH/training-operator
export KUBEFLOW_TESTING_REPO=$KUBEFLOW_PATH/testing
export PYTHONPATH=$KUBEFLOW_TRAINING_REPO:$KUBEFLOW_TRAINING_REPO/py:$KUBEFLOW_TESTING_REPO/py:$KUBEFLOW_TRAINING_REPO/sdk/python
```


2. Install python dependencies

```
pip3 install -r $KUBEFLOW_TESTING_REPO/py/kubeflow/testing/requirements.txt
```

> Note: if you have meet problem install requirement, you may need to `sudo apt-get install libffi-dev`. Feel free to share error logs if you don't know how to handle it.


3. Run Tests

```
# enter the ksonnet app to run tests
cd $KUBEFLOW_TRAINING_REPO/test/workflows
@@ -60,10 +64,9 @@ python3 -m kubeflow.tf_operator.cleanpod_policy_tests --app_dir=$KUBEFLOW_TRAINI
python3 -m kubeflow.tf_operator.simple_tfjob_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=simple-tfjob-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=2 --artifacts_path=/tmp/output/artifact
```


## Check results

You can either check logs or check results in `/tmp/output/artifact`.
You can either check logs or check results in `/tmp/output/artifact`.

```
$ ls -al /tmp/output/artifact
@@ -75,7 +78,7 @@ $ cat /tmp/output/artifact/junit_test_simple_tfjob_cpu.xml

## Common issues

1. ksonnet is not installed
1. ksonnet is not installed
@tenzen-y (Member), Jan 13, 2023:

I guess we removed all configuration files for ksonnet in this PR, so can we remove this section?

Member Author:

We need to update these docs for the new E2Es. Do we want to do that in follow-up PRs?

Member:

I see. I'm ok with either PR.


```
ERROR|2021-11-16T03:06:06|/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py|57| There was a problem running the job; Exception [Errno 2] No such file or directory: 'ks-13': 'ks-13'
@@ -97,7 +100,6 @@ FileNotFoundError: [Errno 2] No such file or directory: 'ks-13': 'ks-13'

Please check `Prerequsite` section to install ksonnet.


2. TypeError: load() missing 1 required positional argument: 'Loader'

```
3 changes: 3 additions & 0 deletions docs/testing/e2e_testing.md
@@ -1,5 +1,8 @@
# How to Write an E2E Test for Kubeflow Training Operator

TODO (andreyvelich): This doc is outdated. Currently, E2Es are located here:
[`sdk/python/test/e2e`](../../sdk/python/test/e2e)

The E2E tests for Kubeflow Training operator are implemented as Argo workflows. For more background and details
about Argo (not required for understanding the rest of this document), please take a look at
[this link](https://github.com/kubeflow/testing/blob/master/README.md).
33 changes: 21 additions & 12 deletions hack/python-sdk/post_gen.py
@@ -46,21 +46,30 @@ def fix_test_files() -> None:
    for test_file in test_files:
        print(f"Precessing file {test_file}")
        if test_file.endswith(".py"):
            with fileinput.FileInput(os.path.join(test_folder_dir, test_file), inplace=True) as file:
            with fileinput.FileInput(
                os.path.join(test_folder_dir, test_file), inplace=True
            ) as file:
                for line in file:
                    print(_apply_regex(line), end='')
                    print(_apply_regex(line), end="")


def add_imports() -> None:
    with open(os.path.join(sdk_dir, "kubeflow/training/__init__.py"), "a") as init_file:
        init_file.write("from kubeflow.training.api.tf_job_client import TFJobClient\n")
        init_file.write("from kubeflow.training.api.py_torch_job_client import PyTorchJobClient\n")
        init_file.write("from kubeflow.training.api.xgboost_job_client import XGBoostJobClient\n")
        init_file.write("from kubeflow.training.api.mpi_job_client import MPIJobClient\n")
        init_file.write("from kubeflow.training.api.mx_job_client import MXJobClient\n")
        init_file.write("from kubeflow.training.api.paddle_job_client import PaddleJobClient\n")
    with open(os.path.join(sdk_dir, "kubeflow/__init__.py"), "a") as init_file:
        init_file.write("__path__ = __import__('pkgutil').extend_path(__path__, __name__)")
    with open(os.path.join(sdk_dir, "kubeflow/training/__init__.py"), "a") as f:
        f.write("from kubeflow.training.api.training_client import TrainingClient\n")
    with open(os.path.join(sdk_dir, "kubeflow/__init__.py"), "a") as f:
        f.write("__path__ = __import__('pkgutil').extend_path(__path__, __name__)")

    # Add Kubernetes models to proper deserialization of Training models.
    with open(os.path.join(sdk_dir, "kubeflow/training/models/__init__.py"), "r") as f:
        new_lines = []
        for line in f.readlines():
            new_lines.append(line)
            if line.startswith("from __future__ import absolute_import"):
                new_lines.append("\n")
                new_lines.append("# Import Kubernetes models.\n")
                new_lines.append("from kubernetes.client import *\n")
    with open(os.path.join(sdk_dir, "kubeflow/training/models/__init__.py"), "w") as f:
        f.writelines(new_lines)
Comment on lines +62 to +72
Member:

It might be better to generate a swagger.json that contains the Kubernetes APIs instead of importing them into the Python SDK.
That would let users generate SDKs for other languages themselves.

For example, we can generate swagger.json with the Kubernetes core API as follows:

$ openapi-gen --input-dirs github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1,github.com/kubeflow/common/pkg/apis/common/v1,k8s.io/api/core/v1 --report-filename=hack/violation_exception.list \
    --output-package github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1 \
    --go-header-file hack/boilerplate/boilerplate.go.txt \
    --output-base "${TEMP_DIR}"

WDYT?

Member Author:

Makes sense, but in that case we are going to store all the Kubernetes models in our repo.
Do we want that, @tenzen-y @johnugeorge?

Member:

Hmm... You're correct.
It might be better to import the Kubernetes models into the Python SDK in this PR and then create an issue about this.
WDYT?

Member:

I feel that we can keep it the way it is for this PR and create a separate issue.

Member Author:

Sure, let's discuss the long term plan for it separately.
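
For readers following this thread, the approach kept in this PR (patching the generated `models/__init__.py` so that it also imports the Kubernetes models) can be sketched in isolation as below. This is a minimal illustration, not the code in `post_gen.py`: the helper name `insert_after_marker`, the sample file content, and the temporary path are assumptions made for the example.

```python
import os
import tempfile

MARKER = "from __future__ import absolute_import"
EXTRA_LINES = [
    "\n",
    "# Import Kubernetes models.\n",
    "from kubernetes.client import *\n",
]


def insert_after_marker(lines, marker=MARKER, extra=EXTRA_LINES):
    """Return a copy of `lines` with `extra` added after each line that starts with `marker`."""
    out = []
    for line in lines:
        out.append(line)
        if line.startswith(marker):
            out.extend(extra)
    return out


# Hypothetical stand-in for a generated models/__init__.py.
sample = [
    "# coding: utf-8\n",
    "from __future__ import absolute_import\n",
    "\n",
    "from kubeflow.training.models.example import Example\n",
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "__init__.py")
    with open(path, "w") as f:
        f.writelines(sample)

    # Read everything first, then rewrite the same file with the patched lines.
    with open(path, "r") as f:
        patched = insert_after_marker(f.readlines())
    with open(path, "w") as f:
        f.writelines(patched)

    with open(path) as f:
        result = f.read()

print(result)
```

A read-then-rewrite pass like this is not idempotent: running it twice on the same file inserts the import twice, so it presumably relies on the file being freshly generated before each run.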



def _apply_regex(input_str: str) -> str:
@@ -69,5 +78,5 @@ def _apply_regex(input_str: str) -> str:
    return input_str


if __name__ == '__main__':
if __name__ == "__main__":
    main()
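
The `fix_test_files` hunk above relies on `fileinput.FileInput(..., inplace=True)`, which redirects `print` output back into the file being read. A small self-contained sketch of that pattern follows. The regex rule, the file name, and the helper names `apply_regex` and `rewrite_in_place` are hypothetical; the body of the real `_apply_regex` is not shown in this diff.

```python
import fileinput
import os
import re
import tempfile


def apply_regex(line: str) -> str:
    # Hypothetical rewrite rule for illustration only:
    # replace the module name "old_pkg" with "new_pkg".
    return re.sub(r"\bold_pkg\b", "new_pkg", line)


def rewrite_in_place(path: str) -> None:
    # With inplace=True, fileinput points stdout at the file being read,
    # so whatever is printed inside the loop becomes the new file content.
    with fileinput.FileInput(path, inplace=True) as file:
        for line in file:
            # end="" avoids doubling the newline that `line` already carries.
            print(apply_regex(line), end="")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "test_sample.py")
    with open(target, "w") as f:
        f.write("import old_pkg\nx = old_pkg.value\n")

    rewrite_in_place(target)

    with open(target) as f:
        rewritten = f.read()
```

Because stdout is redirected while the loop runs, any diagnostic `print` placed inside the `with` block would end up in the file, which is why the progress message in the hunk is printed before the file is opened.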
17 changes: 6 additions & 11 deletions hack/python-sdk/swagger_config.json
@@ -1,13 +1,8 @@
{
"packageName" : "kubeflow.training",
"projectName" : "training",
"packageVersion": "1.5.0",
"importMappings": {
"V1Container": "from kubernetes.client import V1Container",
"V1ObjectMeta": "from kubernetes.client import V1ObjectMeta",
"V1ListMeta": "from kubernetes.client import V1ListMeta",
"V1ResourceRequirements": "from kubernetes.client import V1ResourceRequirements",
"V1JobCondition": "from kubernetes.client import V1JobCondition",
"V1PodTemplateSpec": "from kubernetes.client import V1PodTemplateSpec"
}
"packageName": "kubeflow.training",
"projectName": "training",
"packageVersion": "1.5.0",
"typeMappings": {
"V1Time": "datetime"
}
}
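
The new `typeMappings` entry maps the `V1Time` schema to Python's `datetime`, so time-valued fields would presumably surface as `datetime` objects instead of a generated `V1Time` model. The sketch below only illustrates what working with such a value looks like. The helper `parse_k8s_time` and the sample condition payload are assumptions for the example and are not the generated client's deserialization code.

```python
from datetime import datetime, timezone


def parse_k8s_time(value: str) -> datetime:
    # Kubernetes serializes metav1.Time as an RFC 3339 timestamp in UTC,
    # for example "2023-01-13T10:00:00Z".
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Hypothetical job condition as it might appear in a job status.
condition = {"type": "Succeeded", "lastTransitionTime": "2023-01-13T10:00:00Z"}

transition = parse_k8s_time(condition["lastTransitionTime"])

# A datetime value supports arithmetic directly, which a plain string would not.
age = datetime(2023, 1, 13, 10, 5, tzinfo=timezone.utc) - transition
```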
1 change: 0 additions & 1 deletion py/kubeflow/__init__.py

This file was deleted.

18 changes: 0 additions & 18 deletions py/kubeflow/tf_operator/Pipfile

This file was deleted.
