[Umbrella] Ray Autoscaling tests #2173

Open
8 of 15 tasks
Tracked by #2600
kevin85421 opened this issue May 29, 2024 · 4 comments

@kevin85421
Member

kevin85421 commented May 29, 2024

Search before asking

  • I searched the issues and found no similar feature request.

Description

Currently, there are some autoscaling end-to-end tests for KubeRay in the Ray repository. However, there are some issues with the tests:

  • They are written in Python.
  • To run the tests, you need to build Ray from source to get the test utilities.
  • The Ray version of the Pods in the K8s cluster is coupled to the Ray version in your local environment, because the tests use kubectl port-forward and interact with the Ray head via Ray client and ray job submit.
  • They only test Ray nightly against the latest stable KubeRay release. They don't test the compatibility between Ray nightly and KubeRay nightly.

Here, we plan to build new autoscaling end-to-end tests in the KubeRay repository.

  • Use Golang instead of Python so that we can leverage the K8s ecosystem in Golang to add new tests easily.
  • Test KubeRay nightly with Ray nightly.
  • It shouldn't require building Ray.
  • It shouldn't have any dependencies between the Ray Pods and the local Ray version.
  • Then, we will replace the existing Ray autoscaling e2e tests with this one.
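
As a rough illustration of what a Go-based test could look like: autoscaling checks usually have to wait until the cluster reaches an expected state. Below is a minimal sketch of a polling helper that uses only the Go standard library. The names eventually and errTimeout and the fake worker counter are hypothetical and not part of KubeRay; a real test would query the Kubernetes API and could use an existing helper such as Gomega's Eventually instead.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is returned when the condition never became true in time.
var errTimeout = errors.New("condition not met before timeout")

// eventually calls cond every interval until it returns true, returns an
// error, or the next poll would fall past the timeout.
func eventually(timeout, interval time.Duration, cond func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		ok, err := cond()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	// Hypothetical stand-in for "count the worker Pods of the RayCluster".
	workers := 0
	enoughWorkers := func() (bool, error) {
		workers++ // pretend the autoscaler adds one worker per poll
		return workers >= 3, nil
	}
	err := eventually(time.Second, 10*time.Millisecond, enoughWorkers)
	fmt.Println("scaled up:", err == nil)
}
```

Passing the condition as a function keeps the helper independent of how the cluster state is read, so the same loop could wait for scale-up and scale-down.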

Progress

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@Irvingwangjr

https://kuttl.dev/docs/#pre-requisites
maybe we can consider using this tool?

@kevin85421
Member Author

@Irvingwangjr Kuttl looks cool! I read the README. It seems suitable for testing some sample YAMLs, but it doesn't seem to support well the case where we need to execute some commands in the Pods to trigger certain behaviors.
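
For context, "executing commands in the Pods" from a Go test could look roughly like the sketch below, which shells out to kubectl exec through os/exec. This is only an assumption about one possible approach, not KubeRay's actual test utilities; the helper names kubectlExecArgs and execInPod and the namespace, Pod, and container names are hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
)

// kubectlExecArgs builds the argument list for `kubectl exec`.
func kubectlExecArgs(namespace, pod, container string, command ...string) []string {
	args := []string{"exec", pod, "-n", namespace, "-c", container, "--"}
	return append(args, command...)
}

// execInPod runs command inside the given container and returns the combined
// stdout and stderr. It requires kubectl on PATH and access to a cluster.
func execInPod(namespace, pod, container string, command ...string) (string, error) {
	args := kubectlExecArgs(namespace, pod, container, command...)
	out, err := exec.Command("kubectl", args...).CombinedOutput()
	return string(out), err
}

func main() {
	// Only print the command here so the sketch runs without a cluster.
	// All names below are placeholders.
	args := kubectlExecArgs("default", "raycluster-head", "ray-head", "ray", "status")
	fmt.Println("kubectl", args)
}
```

Keeping the argument construction separate from the actual exec call makes the command easy to check without a running cluster.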

@Irvingwangjr

Irvingwangjr commented Jun 9, 2024

> @Irvingwangjr Kuttl looks cool! I read the README. It seems suitable for testing some sample YAMLs, but it doesn't seem to support well the case where we need to execute some commands in the Pods to trigger certain behaviors.

Yeah, that might be a problem. Right now we use a TestStep to execute scripts that trigger behaviors.
Here is an example: we simulate the eviction of the Ray head and check whether the RayJob eventually enters the failed state:

apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  - script: |
      # Look up the Ray head Pod by its label.
      pod_name=$(kubectl get pod -l "xx.com/ray-pod-name=mlrayjob-head-failed-on-running-head-0" -n=ns -o jsonpath="{.items[0].metadata.name}")

      # Write up to ~100 GB of zeros to /tmp in the Ray container to simulate the eviction.
      # No -it here: the script runs without a TTY.
      kubectl exec "$pod_name" -n=ns -c ray-container -- dd if=/dev/zero of=/tmp/test.txt bs=1M count=100000
      exit_code=$?

      # 137 = 128 + 9, i.e. the process was terminated by SIGKILL.
      if [ $exit_code -eq 137 ]; then
          echo "the process was killed, exit with return code of 137"
      else
          echo "the process was not killed with SIGKILL, exit with return code of $exit_code"
      fi
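
The check for 137 in the script relies on the shell convention that a process terminated by signal N is reported with exit code 128+N, so 137 corresponds to signal 9 (SIGKILL). A minimal sketch of that decoding in Go follows; the function name signalFromExitCode is hypothetical. Note that this is a heuristic, since a process can also call exit with such a code on its own.

```go
package main

import "fmt"

// signalFromExitCode reports the signal number encoded in a shell-style exit
// code. Shells report 128+N when a process is terminated by signal N.
func signalFromExitCode(code int) (int, bool) {
	if code > 128 && code < 256 {
		return code - 128, true
	}
	return 0, false
}

func main() {
	for _, code := range []int{0, 1, 137} {
		if sig, ok := signalFromExitCode(code); ok {
			fmt.Printf("exit code %d: terminated by signal %d\n", code, sig)
		} else {
			fmt.Printf("exit code %d: not a signal exit code\n", code)
		}
	}
}
```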

Then we assert that the RayJob is in the Failed state:

apiVersion: kuttl.dev/v1beta1
kind: TestAssert
commands:
- command: kubectl assert exist-enhanced rayjob/mlrayjob-head-failed-on-running -n=ns --field-selector status.phase=Failed
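
A comparable check in a Go test could read the object as JSON (for example the output of kubectl get ... -o json) and compare one field. The sketch below uses only the standard library; the helper name fieldEquals and the sample document are hypothetical, and the status.phase path is simply taken from the assert command above rather than from the RayJob API definition.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// fieldEquals reports whether the string found at the dotted path inside a
// JSON object equals want. A missing or non-string field yields false.
func fieldEquals(doc []byte, path, want string) (bool, error) {
	var cur interface{}
	if err := json.Unmarshal(doc, &cur); err != nil {
		return false, err
	}
	for _, key := range strings.Split(path, ".") {
		obj, ok := cur.(map[string]interface{})
		if !ok {
			return false, nil
		}
		cur, ok = obj[key]
		if !ok {
			return false, nil
		}
	}
	got, ok := cur.(string)
	return ok && got == want, nil
}

func main() {
	// Hand-written sample, not real cluster output.
	doc := []byte(`{"kind":"RayJob","status":{"phase":"Failed"}}`)
	failed, err := fieldEquals(doc, "status.phase", "Failed")
	fmt.Println(failed, err)
}
```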

We also use kube-assert here (it provides the kubectl assert command above); it might be helpful.

@Irvingwangjr

https://github.com/open-feature/open-feature-operator
OpenFeature (a CNCF project) has adopted this tool, and its repository also provides some examples.
