
Add field to expose entrypoint num cpus in rayjob #1359

Merged · 16 commits · Aug 29, 2023

Conversation

shubhscoder
Contributor

@shubhscoder commented Aug 22, 2023

Why are these changes needed?

Adds the entrypoint_num_cpus, entrypoint_num_gpus, and entrypoint_resources fields to the KubeRay RayJob spec. The Ray Job API supports specifying these resources, but before this change the KubeRay RayJob spec did not expose them.
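
For illustration, a minimal sketch of how these fields might look in a RayJob manifest (the field names entrypointNumCpus, entrypointNumGpus, and entrypointResources are assumed here from this PR's flag names, and the values are placeholders):

    # Hypothetical RayJob snippet showing the new entrypoint resource fields.
    apiVersion: ray.io/v1alpha1
    kind: RayJob
    metadata:
      name: rayjob-entrypoint-resources
    spec:
      entrypoint: python /home/ray/samples/sample_code.py
      entrypointNumCpus: 1      # assumed field name; CPUs reserved for the entrypoint command
      entrypointNumGpus: 0.5    # assumed field name; Ray allows fractional GPUs
      entrypointResources: '{"Custom_1": 1, "Custom_2": 5.5}'   # assumed field name; custom logical resources
      rayClusterSpec:
        # ... RayCluster definition omitted ...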

Related issue number

Closes #1266

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@architkulkarni self-assigned this Aug 22, 2023
@shubhscoder changed the title from "[WIP] Add field to expose entrypoint num cpus in rayjob" to "Add field to expose entrypoint num cpus in rayjob" on Aug 23, 2023
@shubhscoder
Contributor Author

@architkulkarni I think the code changes are ready for review. I have tried to complete all the steps given in the development guide; apologies if something is missing. Also, I am not sure why the tests are failing; I don't seem to have access to the logs. Thanks in advance for the help!

Contributor

@architkulkarni left a comment


Looks good to me, but there's a typo. To catch this we would have needed an integration test with Ray. Can you add one? Here's one idea.

  • Add a new sample YAML file which has the job specify some cpus, gpus, and resources. I think there should be some way to also specify these logical resources in the RayCluster spec for the job (since we don't have physical GPUs, we don't want to autodetect the number of GPUs, but we can just define that the cluster has 4 GPUs or something.)
  • In the entrypoint script, use ray.available_resources() and ray.cluster_resources() to check that the expected resources are taken up by the currently running script (see the sketch after this list).
  • Add the name of the new YAML file to test_sample_rayjob_yamls.py so that it's tested in CI.

What do you think?
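
For example, the entrypoint check could be a small Python script along these lines (a rough sketch; the expected totals and the custom resource name are placeholders that would need to match whatever the sample YAML defines):

    # Rough sketch of an entrypoint resource check; the expected numbers
    # below are placeholders tied to the (hypothetical) sample YAML.
    import ray

    ray.init()

    cluster = ray.cluster_resources()      # total logical resources in the cluster
    available = ray.available_resources()  # logical resources not currently reserved

    # If the cluster advertises 4 logical GPUs and the entrypoint reserved 0.5,
    # roughly 3.5 should still be available while this script runs.
    assert cluster.get("GPU", 0) == 4, cluster
    assert available.get("GPU", 0) == 3.5, available

    # Same idea for a custom resource passed via entrypoint_resources.
    assert cluster.get("Custom_1", 0) >= 1, cluster

    print("Entrypoint resource check passed.")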

Comment on lines 99 to 101

    "--entrypoint_num_cpus", "1.000000",
    "--entrypoint_num_gpus", "0.500000",
    "--entrypoint_resources", `{"Custom_1": 1, "Custom_2": 5.5}`,

Suggested change:

-   "--entrypoint_num_cpus", "1.000000",
-   "--entrypoint_num_gpus", "0.500000",
-   "--entrypoint_resources", `{"Custom_1": 1, "Custom_2": 5.5}`,
+   "--entrypoint-num-cpus", "1.000000",
+   "--entrypoint-num-gpus", "0.500000",
+   "--entrypoint-resources", `{"Custom_1": 1, "Custom_2": 5.5}`,

Comment on lines 114 to 123

    if entrypointNumCpus > 0 {
        k8sJobCommand = append(k8sJobCommand, "--entrypoint_num_cpus", fmt.Sprintf("%f", entrypointNumCpus))
    }

    if entrypointNumGpus > 0 {
        k8sJobCommand = append(k8sJobCommand, "--entrypoint_num_gpus", fmt.Sprintf("%f", entrypointNumGpus))
    }

    if len(entrypointResources) > 0 {
        k8sJobCommand = append(k8sJobCommand, "--entrypoint_resources", entrypointResources)

Suggested change:

-   if entrypointNumCpus > 0 {
-       k8sJobCommand = append(k8sJobCommand, "--entrypoint_num_cpus", fmt.Sprintf("%f", entrypointNumCpus))
-   }
-   if entrypointNumGpus > 0 {
-       k8sJobCommand = append(k8sJobCommand, "--entrypoint_num_gpus", fmt.Sprintf("%f", entrypointNumGpus))
-   }
-   if len(entrypointResources) > 0 {
-       k8sJobCommand = append(k8sJobCommand, "--entrypoint_resources", entrypointResources)
+   if entrypointNumCpus > 0 {
+       k8sJobCommand = append(k8sJobCommand, "--entrypoint-num-cpus", fmt.Sprintf("%f", entrypointNumCpus))
+   }
+   if entrypointNumGpus > 0 {
+       k8sJobCommand = append(k8sJobCommand, "--entrypoint-num-gpus", fmt.Sprintf("%f", entrypointNumGpus))
+   }
+   if len(entrypointResources) > 0 {
+       k8sJobCommand = append(k8sJobCommand, "--entrypoint-resources", entrypointResources)

@architkulkarni
Contributor

Can you say more about "don't have access to the logs"? You should be able to run the tests locally using kind, and if logs are missing it could be a bug, or we might need to improve the development guide.

@shubhscoder
Contributor Author

shubhscoder commented Aug 23, 2023

Looks good to me, but there's a typo. To catch this we would have needed an integration test with Ray. Can you add one? Here's one idea.

  • Add a new sample YAML file which has the job specify some cpus, gpus, and resources. I think there should be some way to also specify these logical resources in the RayCluster spec for the job (since we don't have physical GPUs, we don't want to autodetect the number of GPUs, but we can just define that the cluster has 4 GPUs or something.)
  • In the entrypoint script, use ray.available_resources() and ray.cluster_resources() to ensure the expected number of resources are taken up by the currently running script.
  • Add the name of the new YAML file to test_sample_rayjob_yamls.py so that it's tested in CI.

What do you think?

Yes! That sounds great, let me try adding that test!

@architkulkarni
Contributor

architkulkarni commented Aug 23, 2023 via email

@shubhscoder
Contributor Author

shubhscoder commented Aug 23, 2023

Actually, what do you mean by this: "I think there should be some way to also specify these logical resources in the RayCluster spec for the job (since we don't have physical GPUs, we don't want to autodetect the number of GPUs, but we can just define that the cluster has 4 GPUs or something)"?

The following is my understanding:

  1. RayClusterSpec specifies the desired state of the cluster. I see we can specify ResourceRequirements like CPU / memory and maybe even GPUs, etc.
  2. Then Kubernetes would try to schedule the pods of the Ray cluster onto nodes that have those resources?
  3. So if we specify the resources something like this:
      resources:
        limits:
          cpu: "1"     # Limit to 1 CPU core
          nvidia.com/gpu: 1  # Limit to 1 GPU
        requests:
          cpu: "200m"  # Request 200m CPU (0.2 CPU cores)
          nvidia.com/gpu: 1  # Request 1 GPU

In that case, the infra on which our tests run would need to have a GPU (which I assume it doesn't). Wouldn't the cluster creation then remain in a pending state? I understand we are somehow trying to mock the presence of a GPU by specifying it in the cluster spec, but I am not sure exactly how. It would be very helpful if you could clarify this! Thanks for your help!


@architkulkarni
Contributor

Yeah exactly, we don't have physical GPUs, but from Ray's perspective the resources are logical, not physical. So we can just tell Ray that a certain node has 4 GPUs even if it doesn't, and it will schedule tasks and actors (and in this case, the entrypoint script) accordingly. https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#specifying-node-resources (you can see the "KubeRay" tab)
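
Concretely, something along these lines in the worker group spec should do it (a rough sketch; the group name, image tag, and the number of logical GPUs are placeholders):

    # Rough sketch: advertise logical GPUs to Ray via rayStartParams,
    # without requesting any physical nvidia.com/gpu from Kubernetes.
    workerGroupSpecs:
      - groupName: fake-gpu-group      # placeholder name
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        rayStartParams:
          num-gpus: "4"                # Ray schedules against 4 logical GPUs
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.6.3    # placeholder image tag
                resources:
                  limits:
                    cpu: "1"
                    memory: 2Gi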

@shubhscoder
Contributor Author

shubhscoder commented Aug 26, 2023

@architkulkarni I have pushed the tests. However, I was facing one issue while running them. The RayJob used to start and fail because the Ray cluster was not ready (at least that is what I interpreted from the logs). However, the next attempt (a resubmission, I guess) used to pass, maybe because the cluster was up and running by this time. I think the intended behavior is that the cluster comes up, it is ready, and only then do we start/submit the job.

Do you think this is a bug?
Also, one problem is that the behavior is not consistently reproducible. I was seeing this behavior the day before yesterday and yesterday, but I re-created my kind cluster and ran the submission again, and this time the behavior was as expected. So, unfortunately, I don't have an exact way to reproduce this issue.

@architkulkarni
Contributor

@shubhscoder Ah thanks, that sounds like a bug. If you see it again or if you have the logs from last time would you mind filing an issue? Thanks!

@architkulkarni
Contributor

Please also add the new YAML file to test_sample_rayjob_yamls.py so that it gets tested in CI. You should also be able to run the test locally, but let me know if you run into issues and we can update the documentation so that it's easier to use.

@shubhscoder
Contributor Author

shubhscoder commented Aug 28, 2023

@architkulkarni I tried running make sync locally and I keep getting this error:

test -s /vagrant/kuberay/ray-operator/bin/controller-gen || GOBIN=/vagrant/kuberay/ray-operator/bin/controller-gen/.. go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.6.0
/vagrant/kuberay/ray-operator/bin/controller-gen "crd:maxDescLen=100,trivialVersions=true,preserveUnknownFields=false,generateEmbeddedObjectMeta=true,allowDangerousTypes=true" rbac:roleName=kuberay-operator webhook paths="./..." output:crd:artifacts:config=config/crd/bases
test -s /vagrant/kuberay/ray-operator/bin/kustomize ||  GOBIN=/vagrant/kuberay/ray-operator/bin/kustomize/.. go install sigs.k8s.io/kustomize/kustomize/v3@v3.10.0
go: sigs.k8s.io/kustomize/kustomize/v3@v3.10.0 (in sigs.k8s.io/kustomize/kustomize/v3@v3.10.0):
        The go.mod file for the module providing named packages contains one or
        more exclude directives. It must not contain directives that would cause
        it to be interpreted differently than if it were the main module.
Makefile:117: recipe for target 'kustomize' failed
make: *** [kustomize] Error 1

Looks like I am missing something.

@architkulkarni
Contributor

@architkulkarni I tried running make sync locally and I keep getting this error:

Oh weird, I'm not super familiar with this part... cc @kevin85421 in case you have any ideas

Worst case I can just check out your PR and run the command and push to your branch, if you have that enabled.

@architkulkarni
Contributor

Would you mind creating an issue for the kustomize failure?

Also, another option is to "manually" sync the files, using ray-project/ray#38857 as an example

Contributor

@architkulkarni left a comment


The code and the test LGTM otherwise!

@shubhscoder
Contributor Author

shubhscoder commented Aug 29, 2023

@architkulkarni Sure, I can file a bug for the Kustomize failure. This looks related to kubernetes-sigs/kustomize#3618.

Specifically this comment: kubernetes-sigs/kustomize#3618 (comment), where the user tried to install Kustomize 4.0.1 and got the exact same error that I am getting (Kustomize in KubeRay seems to be pinned to 3.10.0).
According to this comment, the bug was fixed and the user was able to install 4.5.2: kubernetes-sigs/kustomize#3618 (comment)

My guess is that this issue started surfacing after the upgrade to Go 1.19; maybe users who installed Kustomize with older versions of Go did not face this issue, and their old installations still seem to be working. (Just a guess.)

I got around it for now by changing the Kustomize version to 4.5.2 locally. However, I have created this issue to investigate other effects of upgrading the kustomize version: #1368
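
For reference, my local workaround was roughly along these lines, installing a newer Kustomize into the operator's bin directory in place of the pinned 3.10.0 (the path below is specific to my Vagrant setup):

    GOBIN=/vagrant/kuberay/ray-operator/bin go install sigs.k8s.io/kustomize/kustomize/v4@v4.5.2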

@architkulkarni merged commit aa17363 into ray-project:master Aug 29, 2023
19 checks passed
@shubhscoder
Contributor Author

@architkulkarni Thanks for all the help on this one! Also, apologies for so much to and fro on this fairly simple issue! I will try my best to make future code changes more concise and better tested.

@architkulkarni
Contributor

Not at all, I think it's normal. Thanks for the contribution!

@architkulkarni
Contributor

@architkulkarni I have pushed the tests. However, I was facing one issue while running them. The RayJob used to start and fail because the Ray cluster was not ready (at least that is what I interpreted from the logs). However, the next attempt (a resubmission, I guess) used to pass, maybe because the cluster was up and running by this time. I think the intended behavior is that the cluster comes up, it is ready, and only then do we start/submit the job.

Do you think this is a bug? Also, one problem is that the behavior is not consistently reproducible. I was seeing this behavior the day before yesterday and yesterday, but I re-created my kind cluster and ran the submission again, and this time the behavior was as expected. So, unfortunately, I don't have an exact way to reproduce this issue.

Saw it a couple times, filed an issue here #1381

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Successfully merging this pull request may close these issues.

[Feature] [RayJob] Support entrypoint_num_cpus, gpus, resources arguments