Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sdk tests with papermill #2448

Draft
wants to merge 27 commits into
base: master
Choose a base branch
from

Conversation

yehudit1987
Copy link

What this PR does / why we need it:
This PR creates E2E tests for katib examples to run with papermill.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2417

Checklist:

  • Docs included if any changes are user facing

Yehudit Kerido added 5 commits October 27, 2024 12:49
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Electronic-Waste
Copy link
Member

/rerun-all

@Electronic-Waste
Copy link
Member

Electronic-Waste commented Oct 28, 2024

@yehudit1987 Can you please fix these CI errors?

@Electronic-Waste
Copy link
Member

@yehudit1987 Can you sign your commits with git commit -s? The DCO checks failed due to this reason.

@Electronic-Waste
Copy link
Member

Yehudit Kerido added 10 commits October 29, 2024 21:29
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
… trigger it

Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
… trigger it 2

Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
@yehudit1987 yehudit1987 force-pushed the sdk-tests-with-papermill branch from 963d367 to 6633aa5 Compare October 29, 2024 19:29
@yehudit1987 yehudit1987 marked this pull request as ready for review October 29, 2024 19:31
@Electronic-Waste
Copy link
Member

/rerun-all

@yehudit1987 yehudit1987 marked this pull request as ready for review November 5, 2024 13:00
@Electronic-Waste
Copy link
Member

/rerun-all

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review @yehudit1987. I was very busy recently.

I left some comments for you. Thanks for your great contributions!

Comment on lines 351 to 376
"from kubernetes import client, config\n",
"\n",
"# Initialize KatibClient\n",
"kclient = KatibClient(namespace=namespace)\n",
"\n",
"# Load kubeconfig\n",
"config.load_kube_config()\n",
"\n",
"# Kubernetes API for managing namespaces\n",
"core_v1_api = client.CoreV1Api()\n",
"\n",
"# Function to add label to namespace if it doesn't have the required one\n",
"def add_katib_label_to_namespace(namespace):\n",
" ns = core_v1_api.read_namespace(namespace)\n",
" labels = ns.metadata.labels or {}\n",
" if labels.get(\"katib.kubeflow.org/metrics-collector-injection\") != \"enabled\":\n",
" print(f\"Adding label to namespace {namespace}...\")\n",
" labels[\"katib.kubeflow.org/metrics-collector-injection\"] = \"enabled\"\n",
" body = {\"metadata\": {\"labels\": labels}}\n",
" core_v1_api.patch_namespace(namespace, body)\n",
" print(f\"Label added to namespace {namespace}.\")\n",
" else:\n",
" print(f\"Namespace {namespace} already has the required label.\")\n",
"\n",
"# Add the required label to the namespace\n",
"add_katib_label_to_namespace(namespace)\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to add these lines?

Comment on lines +9 to +12
papermill-args-yaml:
description: 'Additional arguments to pass to Papermill in yaml format'
required: false
default: ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we didn't pass parameters to these two notebook examples. Not sure if it meets our requirements

cc👀 @andreyvelich @tenzen-y @yehudit1987

@@ -65,10 +65,12 @@ echo "Deploying Katib"
cd ../../../../../ && WITH_DATABASE_TYPE=$WITH_DATABASE_TYPE make deploy && cd -

# Wait until all Katib pods is running.
TIMEOUT=120s
TIMEOUT=180s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to change TIMEOUT to 180s?

kubectl wait --for=condition=ContainersReady=True --timeout=${TIMEOUT} -l "katib.kubeflow.org/component in ($WITH_DATABASE_TYPE,controller,db-manager,ui)" -n kubeflow pod ||
(kubectl get pods -n kubeflow && kubectl describe pods -n kubeflow && exit 1)
echo "Waiting for pods to be ready for $TIMEOUT seconds..."
sleep $TIMEOUT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary since we already have kubectl wait instruction? Please let me know your thought.

Comment on lines 96 to 99
if ! kubectl get namespaces | grep -q "kubeflow-user-example-com"; then
kubectl create namespace kubeflow-user-example-com
fi

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use default namespace instead of creating a new one:)

Comment on lines 26 to 35
- name: Install dependencies
shell: bash
run: |
python -m pip install --upgrade pip
pip install papermill kubeflow-katib jupyter ipykernel
python -m ipykernel install --user --name python3 --display-name "Python 3"

- name: Setup Minikube Cluster
shell: bash
run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh true true "" "" "cmaes"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to create minikube cluster between these two steps?

- name: Setup Minikube Cluster
uses: medyagh/setup-minikube@v0.0.18
with:
network-plugin: cni
cni: flannel
driver: none
kubernetes-version: ${{ inputs.kubernetes-version }}
minikube-version: 1.31.1
start-args: --wait-timeout=120s

Comment on lines 30 to 40
function check_minikube() {
if minikube status >/dev/null 2>&1; then
echo "Minikube is already running."
else
echo "Minikube is not running. Starting Minikube..."
minikube start
fi
}

echo "Checking Minikube Kubernetes Cluster"
check_minikube
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, we do not check the status of minikube cluster here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be better if you do not delete the original content.

And I think it's necessary to rewrite this example since TFJobClient() is outdated in the newest SDK in training-operator. WDYT👀 @kubeflow/wg-automl-leads

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't apply to many changes to that notebook beside as you mentioned replacing the TFJobClient with TrainingClient and specifying the job type. For some reason it looks like I rewrite the whole example. Anyway I fixed the "corrupted" notebook. Also I fixed the other issues you pointed at.

Regarding default namespace, it failed to create experiments without specifying the namespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yehudit1987!

I know some code lines were generated by pycharm or vscode. The "original content" I meant is some output and images in the notebook, not those code lines.

As for the namespace, could we specify default namespace for the experiment like namespace=default? And it will be better if we could specify namespace for papermill like: kubeflow/training-operator#2274 .

Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Yehudit Kerido added 2 commits November 18, 2024 07:59
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
@yehudit1987 yehudit1987 marked this pull request as ready for review November 19, 2024 09:33
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of effort @yehudit1987 ! Thanks for your contribution.

I left some comments for you. cc👀 @kubeflow/wg-automl-leads

@@ -671,4 +671,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
}

kubectl wait --for=condition=ContainersReady=True --timeout=${TIMEOUT} -l "katib.kubeflow.org/component in ($WITH_DATABASE_TYPE,controller,db-manager,ui)" -n kubeflow pod ||
(kubectl get pods -n kubeflow && kubectl describe pods -n kubeflow && exit 1)
kubectl wait --for=condition=ContainersReady=True --timeout=${TIMEOUT} -l "katib.kubeflow.org/component in ($WITH_DATABASE_TYPE,controller,db-manager,ui)" -n kubeflow pod || (kubectl get pods -n kubeflow && kubectl describe pods -n kubeflow && exit 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better if we could adjust the format of this line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Recover its original state)

Comment on lines 101 to 112
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Experiment name and namespace.\n",
"namespace = \"kubeflow-user-example-com\"\n",
"namespace = \"kubeflow\"\n",
"experiment_name = \"cmaes-example\"\n",
"\n",
"metadata = V1ObjectMeta(\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add parameters tag in metadata and allow args in papermill rewrite them like: kubeflow/training-operator#2274?

@@ -314,7 +342,8 @@
"\n",
"# Start the Katib Experiment.\n",
"exp_name = \"tune-mnist\"\n",
"katib_client = katib.KatibClient()\n",
"namespace=\"kubeflow\"\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like above

Comment on lines 444 to 446
"import time\n",
"time.sleep(120)\n",
"status = katib_client.is_experiment_succeeded(exp_name, namespace=namespace)\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace fixed-time sleep with wait_for_experiment_condition()?

def wait_for_experiment_condition(
self,
name: str,
namespace: Optional[str] = None,
expected_condition: str = constants.EXPERIMENT_CONDITION_SUCCEEDED,
timeout: int = 600,
polling_interval: int = 15,
apiserver_timeout: int = constants.DEFAULT_TIMEOUT,
):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we can use this API here.

@Electronic-Waste
Copy link
Member

/rerun-all

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this effort and updating broken Notebooks in Katib @yehudit1987!
Let's finalize this PR once we design the testing script in the Training Operator.

Comment on lines 47 to 59
- name: Run Jupyter Notebook with Papermill
shell: bash
run: |
IFS=',' read -r -a NOTEBOOK_ARRAY <<< "${{ inputs.notebook-input }}"
# Loop through each notebook path
for NOTEBOOK in "${NOTEBOOK_ARRAY[@]}"; do
OUTPUT_FILE="${NOTEBOOK%.ipynb}_output.ipynb"
echo "Running notebook: $NOTEBOOK"
papermill "$NOTEBOOK" "$OUTPUT_FILE" --log-output --kernel python3 || {
echo "Papermill failed for notebook: $NOTEBOOK"
exit 1
}
done
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed with @saileshd1402 in the Training Operator PR: kubeflow/training-operator#2274 (comment), we might want to create script to run those Notebooks with papermill rather than adding the script in the GitHub action directly.
I think, once we finalize it, we can use the same approach for Katib tests.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @andreyvelich for now I fix all the previous comments.
One of the notebooks is again seems to be rewritten (even though using jupyter lab) as you suggest.
Anyway I will fix those notebooks together with the decision about using the script or not.
Please keep me update on that.

Comment on lines 444 to 446
"import time\n",
"time.sleep(120)\n",
"status = katib_client.is_experiment_succeeded(exp_name, namespace=namespace)\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we can use this API here.

Comment on lines 467 to 469
"pycharm": {
"name": "#%% md\n"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to edit these Notebooks using JupyterLab directly.
In that case, the JSON format will be correctly rendered for every IDE.
E.g. you can just run JupyterLab locally to edit them:

pip install jupyterlab
jupyter lab

Yehudit Kerido added 2 commits December 1, 2024 12:53
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
@google-oss-prow google-oss-prow bot added size/XXL and removed size/XL labels Dec 2, 2024
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
@google-oss-prow google-oss-prow bot added size/XL and removed size/XXL labels Dec 2, 2024
Signed-off-by: Yehudit Kerido <yehudit.kerido@nokia.com>
@google-oss-prow google-oss-prow bot added size/XXL and removed size/XL labels Dec 2, 2024
@Electronic-Waste
Copy link
Member

/rerun-all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Test] E2e Tests for Notebook Examples
4 participants