[Umbrella] Ray Autoscaling tests #2173

kevin85421 · 2024-05-29T17:34:42Z

Search before asking

I searched the issues and found no similar feature request.

Description

Currently, there are some autoscaling end-to-end tests for KubeRay in the Ray repository. However, these tests have several problems:

They are written in Python.
To run the tests, you need to build Ray from source to get the test utilities.
The Ray version of the Pods in the K8s cluster is coupled to the Ray version in your local environment, because the tests use kubectl port-forward and interact with the Ray head via Ray client and ray job submit.
They only test Ray nightly against the latest stable KubeRay release. They don't test the compatibility between Ray nightly and KubeRay nightly.

Here, we plan to build new autoscaling end-to-end tests in the KubeRay repository.

Use Golang instead of Python so that we can leverage the K8s ecosystem in Golang to add new tests easily.
Test KubeRay nightly with Ray nightly.
The tests shouldn't require building Ray.
The tests shouldn't depend on any match between the Ray version in the Pods and the local Ray version.

Then, we will replace the existing Ray autoscaling e2e tests with the new ones.
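
To illustrate the last two goals: the version coupling described above comes from driving the Ray head through kubectl port-forward with a local Ray client, so one way to avoid it is to run commands inside the Pod itself. The following is a minimal shell sketch of that idea only, not the planned Go implementation. The helper name exec_in_pod is hypothetical, and the namespace, label selector, and container name are passed in as arguments because they depend on the cluster under test.

```shell
#!/usr/bin/env bash
# Run a command inside the first Pod matching a label selector.
# Usage: exec_in_pod <namespace> <label-selector> <container> <command...>
exec_in_pod() {
  local ns=$1 selector=$2 container=$3
  shift 3
  local pod
  # Resolve the Pod name from the label selector.
  pod=$(kubectl get pod -n "$ns" -l "$selector" \
    -o jsonpath='{.items[0].metadata.name}') || return 1
  [ -n "$pod" ] || { echo "no Pod matches $selector" >&2; return 1; }
  # No -it here: the helper is meant for non-interactive test scripts.
  kubectl exec "$pod" -n "$ns" -c "$container" -- "$@"
}
```

For example, assuming the head Pod carries the label ray.io/node-type=head and its container is named ray-head, `exec_in_pod default ray.io/node-type=head ray-head ray status` would run ray status with the Ray version installed in the Pod, so no local Ray installation is involved.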

Progress

Use case

No response

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Irvingwangjr · 2024-06-06T13:08:04Z

https://kuttl.dev/docs/#pre-requisites
maybe we can consider using this tool?

kevin85421 · 2024-06-09T05:32:53Z

@Irvingwangjr Kuttl looks cool! I read the README. It seems suitable for testing some sample YAMLs, but it doesn't seem to support well the case where we need to execute some commands in the Pods to trigger certain behaviors.

Irvingwangjr · 2024-06-09T05:43:56Z

> @Irvingwangjr Kuttl looks cool! I read the README. It seems suitable for testing some sample YAMLs, but it doesn't seem to support well the case where we need to execute some commands in the Pods to trigger certain behaviors.

Yeah, that might be a problem. Right now we use a TestStep to execute scripts that trigger behaviors.
Here is an example: we simulate the eviction of the Ray head and check whether the RayJob eventually enters the failed state:

apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  - script: |
      # Find the head Pod by its label.
      pod_name=$(kubectl get pod -l "xx.com/ray-pod-name=mlrayjob-head-failed-on-running-head-0" -n=ns -o jsonpath="{.items[0].metadata.name}")

      # Write 100000 MiB of zeros to /tmp in the head container to trigger the eviction.
      cmd="kubectl exec -it $pod_name -n=ns -c ray-container -- dd if=/dev/zero of=/tmp/test.txt bs=1M count=100000"
      $cmd &
      cmd_pid=$!

      # Wait for the command to finish and capture its exit code.
      wait $cmd_pid
      exit_code=$?

      if [ "$exit_code" -eq 137 ]; then
          echo "the process was killed, exit with return code of 137"
      else
          echo "the process exited with unexpected return code $exit_code (expected 137)"
      fi

Then we assert that the RayJob is in the Failed state:

apiVersion: kuttl.dev/v1beta1
kind: TestAssert
commands:
- command: kubectl assert exist-enhanced rayjob/mlrayjob-head-failed-on-running -n=ns --field-selector status.phase=Failed

We also use kube-assert here (the kubectl assert plugin used above); it might be helpful.
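
On the 137 check in the TestStep above: by shell convention, an exit status greater than 128 means the process was terminated by the signal numbered status minus 128, so 137 corresponds to signal 9 (SIGKILL). A small sketch of that decoding, where describe_exit is a hypothetical helper name and not part of kuttl or kube-assert:

```shell
# Describe an exit status using the shell convention that
# statuses above 128 encode "terminated by signal (status - 128)".
describe_exit() {
  local status=$1
  if [ "$status" -eq 0 ]; then
    echo "exited successfully"
  elif [ "$status" -gt 128 ]; then
    echo "terminated by signal $((status - 128))"
  else
    echo "exited with status $status"
  fi
}

describe_exit 137   # prints "terminated by signal 9" (SIGKILL)
```

Logging the decoded status in the else branch of the script would make it easier to tell a kill by a different signal from an ordinary command failure.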

Irvingwangjr · 2024-06-09T05:46:27Z

https://github.com/open-feature/open-feature-operator
OpenFeature (a CNCF project) has adopted this tool, and its repository also provides some examples.

kevin85421 added the enhancement, triage, autoscaler, 1.2.0, and ci labels and removed the triage label on May 29, 2024

kevin85421 mentioned this issue May 29, 2024

[Test][Autoscaler][1/n] Add Ray Autoscaler e2e tests #2168

Merged

4 tasks

kevin85421 self-assigned this May 29, 2024

rueian mentioned this issue Jun 6, 2024

[Test][Autoscaler][2/n] Add Ray Autoscaler e2e tests for GPU workers #2181

Merged

4 tasks

MortalHappiness mentioned this issue Jun 14, 2024

[Test][Autoscaling] Add custom resource test #2193

Merged

4 tasks

kevin85421 added 1.3.0 and removed 1.2.0 labels Aug 16, 2024

kevin85421 mentioned this issue Dec 11, 2024

[Umbrella] Autoscaler improvements #2600

Open

28 tasks
