Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Azure][Cluster] Only set evictionPolicy to Delete when spot is used #46959

Merged

Conversation

kekulai-fredchang
Copy link
Contributor

previous update assumed evictionPolicy on all nodes as 'Delete', which yields error for those azure users that do not launch a Spot instance.

@gramhagen , @eisber , @ijrsvt

Why are these changes needed?

Since Spot instance availability is low in azure, many users will launch with regular instance and encounter error.

added conditional logic in master azure json file to check if priority is indeed Spot , if so, then yield evictionPolicy as 'Delete', otherwise yield json('null') to skip this policy.

Related issue number

fixes #46198

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@gramhagen
Copy link
Contributor

This makes sense. Although if the user provides this parameter, I would default to use that first before filling in a default. I'm assuming it's null if not provided? If so you should be able to check if the policy is null and the priority is for spot then use delete otherwise keep the parameter as is.

@frdchang
Copy link

This makes sense. Although if the user provides this parameter, I would default to use that first before filling in a default. I'm assuming it's null if not provided? If so you should be able to check if the policy is null and the priority is for spot then use delete otherwise keep the parameter as is.

@gramhagen new logic was added to check if the user supplied an eviction policy & a spot instance is used, otherwise if spot instance is regular, it passes null for eviction policy.

Copy link
Contributor

@gramhagen gramhagen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work? my eyes are getting crossed looking at the parentheses in the if condition 😵‍💫

@@ -78,7 +78,7 @@
},
"evictionPolicy": {
"type": "string",
"defaultValue": "Delete",
"defaultValue": "json('null')",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, didn't realize there was already a default for this, i'd suggest putting this back and just catching the error case below. that should also allow people to maintain any overrides (just not an invalid one)

Suggested change
"defaultValue": "json('null')",
"defaultValue": "Delete",

@@ -253,7 +253,7 @@
}
},
"priority": "[parameters('priority')]",
"evictionPolicy": "[parameters('evictionPolicy')]",
"evictionPolicy": "[if(and(equals(parameters('priority'), 'Spot'), or(equals(parameters('evictionPolicy'), ''), equals(parameters('evictionPolicy'), json('null')))), 'Delete', parameters('evictionPolicy'))]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"evictionPolicy": "[if(and(equals(parameters('priority'), 'Spot'), or(equals(parameters('evictionPolicy'), ''), equals(parameters('evictionPolicy'), json('null')))), 'Delete', parameters('evictionPolicy'))]",
"evictionPolicy": "[if(and(equals(parameters('priority'), 'Spot'), equals(parameters('evictionPolicy'), 'Delete')), json('null'), parameters('evictionPolicy'))]",

@anyscalesam anyscalesam added triage Needs triage (eg: priority, bug/not-bug, and owning component) data Ray Data-related issues labels Aug 12, 2024
@anyscalesam anyscalesam added core Issues that should be addressed in Ray Core and removed data Ray Data-related issues labels Aug 19, 2024
@anyscalesam
Copy link
Contributor

@kekulai-fredchang can you address comments here?

@anyscalesam anyscalesam added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 19, 2024
@frdchang
Copy link

@kekulai-fredchang can you address comments here?

I looked at @gramhagen suggestions, and from my limited azure knowledge it did not seem correct because it will pass json('null') even when 'Delete' is desired.

I redid the logic as follows, which allows both spot instances with delete default and user defined pass through, and regular instances with evictionPolicy set to json('null').

IF priority IS 'Spot' THEN
    IF evictionPolicy IS empty OR null THEN
        SET evictionPolicy TO 'Delete'
    ELSE
        USE user-defined evictionPolicy
    ENDIF
ELSE
    USE user-defined evictionPolicy
ENDIF 

@gramhagen
Copy link
Contributor

ok, lgtm

@mjd3
Copy link
Contributor

mjd3 commented Aug 22, 2024

I was running into this same issue, but I think this can be a lot simpler. The evictionPolicy parameter already defaults to Delete in the template, so you can just set it to an empty string in the case of Regular priority and use the user-defined evictionPolicy if it is provided in the Spot priority case. If the user doesn't specify an evictionPolicy and the priority is Spot then we default to Delete anyway. I just replaced this line with:

"evictionPolicy": "[if(equals(parameters('priority'), 'Spot'), parameters('evictionPolicy'), '')]",

and it works for me.

@gramhagen
Copy link
Contributor

@mjd3 that does seem like the easiest option!

@mjd3
Copy link
Contributor

mjd3 commented Aug 22, 2024

Also, as a workaround until this PR is merged in, you can add evictionPolicy: null in your cluster cfg yaml file for any nodes with priority: Regular.

@kekulai-fredchang
Copy link
Contributor Author

Also, as a workaround until this PR is merged in, you can add evictionPolicy: null in your cluster cfg yaml file for any nodes with priority: Regular.

good point @mjd3.

This either should be annotated in the docs or have this PR merged.

@mjd3
Copy link
Contributor

mjd3 commented Aug 26, 2024

@kekulai-fredchang do you agree with the suggestion above? From looking back, it seems like my suggestion mirrors @gramhagen's initial suggestion too. If so, can you update this PR? Would love to get this one merged in.

@kekulai-fredchang
Copy link
Contributor Author

@kekulai-fredchang do you agree with the suggestion above? From looking back, it seems like my suggestion mirrors @gramhagen's initial suggestion too. If so, can you update this PR? Would love to get this one merged in.

Agreed.

@mjd3
Copy link
Contributor

mjd3 commented Aug 26, 2024

Thanks @kekulai-fredchang! @anyscalesam this should be ready to merge.

@anyscalesam
Copy link
Contributor

@hongchaodeng can you please take a look?

@bthananjeyan
Copy link
Contributor

Sorry for introducing this issue in the earlier PR. I confirmed that this fix works for me locally. @ericl @architkulkarni @hongchaodeng would you be able to review this PR?

@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Sep 24, 2024
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have an Azure account to test this PR, but some folks have already verified it. Hence, I approve this PR.

@bthananjeyan
Copy link
Contributor

bthananjeyan commented Sep 24, 2024

@kekulai-fredchang It looks like you could run these instructions to pass the DCO check: https://github.com/ray-project/ray/pull/46959/checks?check_run_id=30602666160

Thanks for fixing this issue.

@kevin85421
Copy link
Member

It looks like you could run these instructions to pass the DCO check

The DCO check failing is fine with me. We just need to make sure the premerge check passes.

Simplify evictionPolicy handling based on VM priority

- Set default value of evictionPolicy to 'Delete'.
- Implement conditional logic to only pass evictionPolicy when priority is set to 'Spot'.
- If the VM priority is 'Spot', use the user-defined evictionPolicy; otherwise, default to 'Delete'.
- If the VM priority is not 'Spot', the evictionPolicy is not passed (set to empty string '').

(cherry picked from commit 3601528248a0375a6f2f0b19af697095cea665aa)
Signed-off-by: Frederick Chang <fred.chang@kekulai.com>
@kekulai-fredchang
Copy link
Contributor Author

kekulai-fredchang commented Sep 24, 2024

@bthananjeyan @kevin85421 I updated the branch to include signoff

@jjyao jjyao changed the title recent update introduces evictionPolicy error for regular node users on Azure [Azure][Cluster] Only set evictionPolicy to Delete when spot is used Sep 24, 2024
@jjyao jjyao merged commit 9df0730 into ray-project:master Sep 24, 2024
5 checks passed
@kekulai-fredchang kekulai-fredchang deleted the fix_azure_eviction_policy_update branch October 5, 2024 20:36
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests P1 Issue that should be fixed within a few weeks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Ray-Clusters] Azure Cluster Autoscaler failing to start worker nodes when using non-spot instances.
9 participants