
Add field to expose entrypoint num cpus in rayjob #1359

Merged · 16 commits · Aug 29, 2023

Conversation

shubhscoder
Contributor

@shubhscoder commented Aug 22, 2023

Why are these changes needed?

Adds the entrypoint_num_cpus, entrypoint_num_gpus, and entrypoint_resources fields to the KubeRay RayJob spec. The Ray Job API supports specifying these resources, but before this change the KubeRay RayJob spec did not expose them.
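
For illustration, a minimal sketch of how these fields might look in a RayJob manifest (the field names entrypointNumCpus, entrypointNumGpus, and entrypointResources are assumed here from this PR's flag names, and the values are placeholders):

    # Hypothetical RayJob snippet showing the new entrypoint resource fields.
    apiVersion: ray.io/v1alpha1
    kind: RayJob
    metadata:
      name: rayjob-entrypoint-resources
    spec:
      entrypoint: python /home/ray/samples/sample_code.py
      entrypointNumCpus: 1      # assumed field name; CPUs reserved for the entrypoint command
      entrypointNumGpus: 0.5    # assumed field name; Ray allows fractional GPUs
      entrypointResources: '{"Custom_1": 1, "Custom_2": 5.5}'   # assumed field name; custom logical resources
      rayClusterSpec:
        # ... RayCluster definition omitted ...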

Related issue number

Closes #1266

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@architkulkarni self-assigned this Aug 22, 2023
@shubhscoder changed the title from "[WIP] Add field to expose entrypoint num cpus in rayjob" to "Add field to expose entrypoint num cpus in rayjob" on Aug 23, 2023
@shubhscoder
Contributor Author

@architkulkarni I think the code changes are ready for review. I have tried to complete all the steps given in the development guide; apologies if something is missing. Also, I am not sure why the tests are failing; I don't seem to have access to the logs. Thanks in advance for the help!

Contributor

@architkulkarni left a comment


Looks good to me, but there's a typo. To catch this we would have needed an integration test with Ray. Can you add one? Here's one idea.

  • Add a new sample YAML file which has the job specify some cpus, gpus, and resources. I think there should be some way to also specify these logical resources in the RayCluster spec for the job (since we don't have physical GPUs, we don't want to autodetect the number of GPUs, but we can just define that the cluster has 4 GPUs or something.)
  • In the entrypoint script, use ray.available_resources() and ray.cluster_resources() to check that the expected resources are taken up by the currently running script (see the sketch after this list).
  • Add the name of the new YAML file to test_sample_rayjob_yamls.py so that it's tested in CI.

What do you think?
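
For example, the entrypoint check could be a small Python script along these lines (a rough sketch; the expected totals and the custom resource name are placeholders that would need to match whatever the sample YAML defines):

    # Rough sketch of an entrypoint resource check; the expected numbers
    # below are placeholders tied to the (hypothetical) sample YAML.
    import ray

    ray.init()

    cluster = ray.cluster_resources()      # total logical resources in the cluster
    available = ray.available_resources()  # logical resources not currently reserved

    # If the cluster advertises 4 logical GPUs and the entrypoint reserved 0.5,
    # roughly 3.5 should still be available while this script runs.
    assert cluster.get("GPU", 0) == 4, cluster
    assert available.get("GPU", 0) == 3.5, available

    # Same idea for a custom resource passed via entrypoint_resources.
    assert cluster.get("Custom_1", 0) >= 1, cluster

    print("Entrypoint resource check passed.")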

Comment on lines 99 to 101

    "--entrypoint_num_cpus", "1.000000",
    "--entrypoint_num_gpus", "0.500000",
    "--entrypoint_resources", `{"Custom_1": 1, "Custom_2": 5.5}`,

Suggested change:

-   "--entrypoint_num_cpus", "1.000000",
-   "--entrypoint_num_gpus", "0.500000",
-   "--entrypoint_resources", `{"Custom_1": 1, "Custom_2": 5.5}`,
+   "--entrypoint-num-cpus", "1.000000",
+   "--entrypoint-num-gpus", "0.500000",
+   "--entrypoint-resources", `{"Custom_1": 1, "Custom_2": 5.5}`,

Comment on lines 114 to 123

    if entrypointNumCpus > 0 {
        k8sJobCommand = append(k8sJobCommand, "--entrypoint_num_cpus", fmt.Sprintf("%f", entrypointNumCpus))
    }

    if entrypointNumGpus > 0 {
        k8sJobCommand = append(k8sJobCommand, "--entrypoint_num_gpus", fmt.Sprintf("%f", entrypointNumGpus))
    }

    if len(entrypointResources) > 0 {
        k8sJobCommand = append(k8sJobCommand, "--entrypoint_resources", entrypointResources)

Suggested change:

-   if entrypointNumCpus > 0 {
-       k8sJobCommand = append(k8sJobCommand, "--entrypoint_num_cpus", fmt.Sprintf("%f", entrypointNumCpus))
-   }
-   if entrypointNumGpus > 0 {
-       k8sJobCommand = append(k8sJobCommand, "--entrypoint_num_gpus", fmt.Sprintf("%f", entrypointNumGpus))
-   }
-   if len(entrypointResources) > 0 {
-       k8sJobCommand = append(k8sJobCommand, "--entrypoint_resources", entrypointResources)
+   if entrypointNumCpus > 0 {
+       k8sJobCommand = append(k8sJobCommand, "--entrypoint-num-cpus", fmt.Sprintf("%f", entrypointNumCpus))
+   }
+   if entrypointNumGpus > 0 {
+       k8sJobCommand = append(k8sJobCommand, "--entrypoint-num-gpus", fmt.Sprintf("%f", entrypointNumGpus))
+   }
+   if len(entrypointResources) > 0 {
+       k8sJobCommand = append(k8sJobCommand, "--entrypoint-resources", entrypointResources)

@architkulkarni
Contributor

Can you say more about "don't have access to the logs"? You should be able to run the tests locally using kind, and if logs are missing it could be a bug, or we might need to improve the development guide.

@shubhscoder
Contributor Author

shubhscoder commented Aug 23, 2023

Looks good to me, but there's a typo. To catch this we would have needed an integration test with Ray. Can you add one? Here's one idea.

  • Add a new sample YAML file which has the job specify some cpus, gpus, and resources. I think there should be some way to also specify these logical resources in the RayCluster spec for the job (since we don't have physical GPUs, we don't want to autodetect the number of GPUs, but we can just define that the cluster has 4 GPUs or something.)
  • In the entrypoint script, use ray.available_resources() and ray.cluster_resources() to ensure the expected number of resources are taken up by the currently running script.
  • Add the name of the new YAML file to test_sample_rayjob_yamls.py so that it's tested in CI.

What do you think?

Yes! That sounds great, let me try adding that test!

@architkulkarni
Contributor

architkulkarni commented Aug 23, 2023 via email

@shubhscoder
Contributor Author

shubhscoder commented Aug 23, 2023

Actually, what do you mean by this: "I think there should be some way to also specify these logical resources in the RayCluster spec for the job (since we don't have physical GPUs, we don't want to autodetect the number of GPUs, but we can just define that the cluster has 4 GPUs or something)"?

The following is my understanding:

  1. RayClusterSpec specifies the desired state of the cluster. I see we can specify ResourceRequirements like CPU / memory and maybe even GPUs, etc.
  2. Then Kubernetes would try to schedule the pods of the Ray cluster onto nodes that have those resources?
  3. So if we specify the resources something like this:
      resources:
        limits:
          cpu: "1"     # Limit to 1 CPU core
          nvidia.com/gpu: 1  # Limit to 1 GPU
        requests:
          cpu: "200m"  # Request 200m CPU (0.2 CPU cores)
          nvidia.com/gpu: 1  # Request 1 GPU

In that case, the infra on which our tests run would need to have a GPU (which I assume it doesn't). Wouldn't the cluster creation then remain in a pending state? I understand we are somehow trying to mock the presence of a GPU by specifying it in the cluster spec, but I am not sure exactly how. It would be very helpful if you could clarify this! Thanks for your help!


@architkulkarni
Contributor

Yeah exactly, we don't have physical GPUs, but from Ray's perspective the resources are logical, not physical. So we can just tell Ray that a certain node has 4 GPUs even if it doesn't, and it will schedule tasks and actors (and in this case, the entrypoint script) accordingly. https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#specifying-node-resources (you can see the "KubeRay" tab)
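
Concretely, something along these lines in the worker group spec should do it (a rough sketch; the group name, image tag, and the number of logical GPUs are placeholders):

    # Rough sketch: advertise logical GPUs to Ray via rayStartParams,
    # without requesting any physical nvidia.com/gpu from Kubernetes.
    workerGroupSpecs:
      - groupName: fake-gpu-group      # placeholder name
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        rayStartParams:
          num-gpus: "4"                # Ray schedules against 4 logical GPUs
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.6.3    # placeholder image tag
                resources:
                  limits:
                    cpu: "1"
                    memory: 2Gi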

@shubhscoder
Contributor Author

shubhscoder commented Aug 26, 2023

@architkulkarni I have pushed the tests. However, I was facing one issue while running them. The RayJob used to start and fail because the Ray cluster was not ready (at least that is what I interpreted from the logs). However, the next attempt (a resubmission, I guess) used to pass, maybe because the cluster was up and running by this time. I think the intended behavior is that the cluster comes up, it is ready, and only then do we start/submit the job.

Do you think this is a bug?
Also, one problem is that the behavior is not consistently reproducible. I was seeing this behavior the day before yesterday and yesterday, but I re-created my kind cluster and ran the submission again, and this time the behavior was as expected. So, unfortunately, I don't have an exact way to reproduce this issue.

@architkulkarni
Contributor

@shubhscoder Ah thanks, that sounds like a bug. If you see it again or if you have the logs from last time would you mind filing an issue? Thanks!

@architkulkarni
Contributor

Please also add the new YAML file to test_sample_rayjob_yamls.py so that it gets tested in CI. You should also be able to run the test locally, but let me know if you run into issues and we can update the documentation so that it's easier to use.

@shubhscoder
Contributor Author

shubhscoder commented Aug 28, 2023

@architkulkarni I tried running make sync locally and I keep getting this error:

test -s /vagrant/kuberay/ray-operator/bin/controller-gen || GOBIN=/vagrant/kuberay/ray-operator/bin/controller-gen/.. go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.6.0
/vagrant/kuberay/ray-operator/bin/controller-gen "crd:maxDescLen=100,trivialVersions=true,preserveUnknownFields=false,generateEmbeddedObjectMeta=true,allowDangerousTypes=true" rbac:roleName=kuberay-operator webhook paths="./..." output:crd:artifacts:config=config/crd/bases
test -s /vagrant/kuberay/ray-operator/bin/kustomize ||  GOBIN=/vagrant/kuberay/ray-operator/bin/kustomize/.. go install sigs.k8s.io/kustomize/kustomize/v3@v3.10.0
go: sigs.k8s.io/kustomize/kustomize/v3@v3.10.0 (in sigs.k8s.io/kustomize/kustomize/v3@v3.10.0):
        The go.mod file for the module providing named packages contains one or
        more exclude directives. It must not contain directives that would cause
        it to be interpreted differently than if it were the main module.
Makefile:117: recipe for target 'kustomize' failed
make: *** [kustomize] Error 1

Looks like I am missing something.

@architkulkarni
Contributor

@architkulkarni I tried running make sync locally and I keep getting this error:

Oh weird, I'm not super familiar with this part... cc @kevin85421 in case you have any ideas

Worst case I can just check out your PR and run the command and push to your branch, if you have that enabled.

@architkulkarni
Contributor

Would you mind creating an issue for the kustomize failure?

Also, another option is to "manually" sync the files, using ray-project/ray#38857 as an example

Contributor

@architkulkarni left a comment


The code and the test LGTM otherwise!

@shubhscoder
Contributor Author

shubhscoder commented Aug 29, 2023

@architkulkarni Sure, I can file a bug for the Kustomize failure. This looks related to kubernetes-sigs/kustomize#3618.

Specifically this comment: kubernetes-sigs/kustomize#3618 (comment), where the user tried to install Kustomize 4.0.1 and got the exact same error that I am getting (Kustomize in KubeRay seems to be pinned to 3.10.0).
According to this comment, the bug was fixed and the user was able to install 4.5.2: kubernetes-sigs/kustomize#3618 (comment)

My guess is that this issue started surfacing after the upgrade to Go 1.19; maybe users who installed Kustomize with older versions of Go did not face this issue, and their old installations still seem to be working. (Just a guess.)

I got around it for now by changing the Kustomize version to 4.5.2 locally. However, I have created this issue to investigate other effects of upgrading the kustomize version: #1368
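
For reference, my local workaround was roughly along these lines, installing a newer Kustomize into the operator's bin directory in place of the pinned 3.10.0 (the path below is specific to my Vagrant setup):

    GOBIN=/vagrant/kuberay/ray-operator/bin go install sigs.k8s.io/kustomize/kustomize/v4@v4.5.2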

@architkulkarni merged commit aa17363 into ray-project:master Aug 29, 2023
19 checks passed
@shubhscoder
Contributor Author

@architkulkarni Thanks for all the help on this one! Also, apologies for so much to and fro on this fairly simple issue! I will try my best to make future code changes more concise and better tested.

@architkulkarni
Contributor

Not at all, I think it's normal. Thanks for the contribution!

@architkulkarni
Contributor

@architkulkarni I have pushed the tests. However, I was facing one issue while running them. The RayJob used to start and fail because the Ray cluster was not ready (at least that is what I interpreted from the logs). However, the next attempt (a resubmission, I guess) used to pass, maybe because the cluster was up and running by this time. I think the intended behavior is that the cluster comes up, it is ready, and only then do we start/submit the job.

Do you think this is a bug? Also, one problem is that the behavior is not consistently reproducible. I was seeing this behavior the day before yesterday and yesterday, but I re-created my kind cluster and ran the submission again, and this time the behavior was as expected. So, unfortunately, I don't have an exact way to reproduce this issue.

Saw it a couple times, filed an issue here #1381

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Successfully merging this pull request may close these issues.

[Feature] [RayJob] Support entrypoint_num_cpus, gpus, resources arguments