Fix preemptibles and maxRetries on GCP Batch [AN-274] [AN-377] #7684
Merged
Changes from 31 commits
33 commits
1dabea0 Revert "WX-1595 Refactor the preemption error handling from GCP Batch… (mcovarr)
e15ff63 Merge commit 'bd2bbe5a2bbf50a529028a71c1a2f08d04c79cef' into an_274_f… (mcovarr)
194b566 Merge commit 'f9b91f753e1bc9c89cb41880b5300e1ba93021f4' into an_274_f… (mcovarr)
0e12d37 Merge commit 'b29d8005e33aadd4e9e57178101bc3ef9d0ca9bc' into an_274_f… (mcovarr)
e7b6f51 Merge commit '7bb8d1f102560b625d260db0773f92981b05141d' into an_274_f… (mcovarr)
ae832f6 Merge commit '42043d7885d837836b2cb176bf92d5b0a7198245' into an_274_f… (mcovarr)
9876753 Merge commit 'e32f6e0c4cb7a6b605d94987d5f4c93b25e5fd44' into an_274_f… (mcovarr)
46bcec1 Merge commit '985ccf5f13791fca8a109a70c5be8921b27600e9' into an_274_f… (mcovarr)
57f5113 Merge commit 'd8d9375845fa213f1550cb0fd39eac4fccfa1110' into an_274_f… (mcovarr)
c4e878b Merge commit '3753dfd787495628c37b063ce3e8756d18244ba2' into an_274_f… (mcovarr)
c4acad1 Merge commit '4eb0730e545e44a7dc989b637772342f75531e99' into an_274_f… (mcovarr)
354f007 Merge commit 'b982f6fdd7697da8c7c774a27e89459f702d11cb' into an_274_f… (mcovarr)
fc88cf4 Merge commit '081318d2fdb9f25ad5b50e7df4a8f21785f01a63' into an_274_f… (mcovarr)
4c5f482 Merge commit '80fbf59c6f4e180386af7e1ac7369d50747fe582' into an_274_f… (mcovarr)
a1d4521 Merge commit 'fb9745a7979df20246d979112e96ea74c62f08a7' into an_274_f… (mcovarr)
79994b8 Merge commit '74885eff73ddb2f6a3df9d89be4a7b32f4b741f6' into an_274_f… (mcovarr)
b75c8f6 Merge commit 'd3ded6f9150fa8a3836855dce0fb76c88d70701a' into an_274_f… (mcovarr)
aace3a9 Merge commit 'e54b32e3df56ad1168f8cd8607bf4df77c8c3bc3' into an_274_f… (mcovarr)
68ed90c Merge commit '90ff21b3587d541b407fa622b9b6c8e0698462c3' into an_274_f… (mcovarr)
40b1847 Merge commit '61a89e3add1ece4ab956b94e41ced9e041b7d232' into an_274_f… (mcovarr)
d285e1b Merge commit 'ef8745af3ead6bed4f822fc9a2dddb922728e9a9' into an_274_f… (mcovarr)
53ff0ac Merge commit '7818492ae03b9e2c932715464065d4f471e2ddf7' into an_274_f… (mcovarr)
b653ea7 Merge commit '2c134ecb731b92fb85e15266261880c9081100da' into an_274_f… (mcovarr)
4df2a07 scalafmt (mcovarr)
78bfd53 fixups (mcovarr)
4e2562b [AN-393] Add waiting for quota to quota messages (#7686) (LizBaldo)
01a294a Fix preemptible / maxRetries on GCP Batch [AN-274] (mcovarr)
4a92bbc changelog (mcovarr)
f87aff2 Retry with more memory MEM_SIZE/MEM_UNIT Centaur test (mcovarr)
37404e0 Merge branch 'develop' into an_274_fix_preemptibles (mcovarr)
506c3c5 Merge remote-tracking branch 'origin/develop' into an_274_fix_preempt… (mcovarr)
a35bc4e PR feedback, cleanup (mcovarr)
ba64037 docs (mcovarr)
2 changes: 1 addition & 1 deletion
centaur/src/main/resources/standardTestCases/checkpointing.test
70 changes: 70 additions & 0 deletions
centaur/src/main/resources/standardTestCases/checkpointing/gcpbatch_checkpointing.wdl
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
version 1.0 | ||
|
||
workflow checkpointing { | ||
call count { input: count_to = 100 } | ||
output { | ||
String preempted = count.preempted | ||
} | ||
} | ||
|
||
task count { | ||
input { | ||
Int count_to | ||
} | ||
|
||
meta { | ||
volatile: true | ||
} | ||
|
||
command <<< | ||
# Read from the my_checkpoint file if there's content there: | ||
FROM_CKPT=$(cat my_checkpoint | tail -n1 | awk '{ print $1 }') | ||
FROM_CKPT=${FROM_CKPT:-1} | ||
|
||
# We don't want any single VM to run the entire count, so work out the max counter value for this attempt: | ||
MAX="$(($FROM_CKPT + 66))" | ||
|
||
INSTANCE_NAME=$(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") | ||
echo "Discovered instance: $INSTANCE_NAME" | ||
|
||
# Run the counter: | ||
echo '--' >> my_checkpoint | ||
for i in $(seq $FROM_CKPT ~{count_to}) | ||
do | ||
echo $i | ||
echo $i ${INSTANCE_NAME} $(date) >> my_checkpoint | ||
|
||
# If we're over our max, "preempt" the VM by simulating a maintenance event: | ||
if [ "${i}" -gt "${MAX}" ] | ||
then | ||
fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone) | ||
zone=$(basename "$fully_qualified_zone") | ||
gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q | ||
sleep 60 | ||
fi | ||
|
||
sleep 1 | ||
done | ||
|
||
# Prove that we got preempted at least once: | ||
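# Skip the '--' separator lines so the first data line supplies the first instance name. | ||
FIRST_INSTANCE=$(grep -v '^--' my_checkpoint | head -n1 | awk '{ print $2 }') | ||
LAST_INSTANCE=$(cat my_checkpoint | tail -n1 | awk '{ print $2 }') | ||
if [ "${FIRST_INSTANCE}" != "${LAST_INSTANCE}" ] | ||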
then | ||
echo "GOTPREEMPTED" > preempted.txt | ||
else | ||
echo "NEVERPREEMPTED" > preempted.txt | ||
fi | ||
>>> | ||
|
||
runtime { | ||
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim" | ||
preemptible: 3 | ||
checkpointFile: "my_checkpoint" | ||
} | ||
|
||
output { | ||
File checkpoint_log = "my_checkpoint" | ||
String preempted = read_string("preempted.txt") | ||
} | ||
} |
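The resume-and-verify logic in the `count` task can be exercised locally, without a VM or the metadata server. The following is a minimal sketch rather than code from this PR: the helper names `last_counter` and `preemption_verdict` are hypothetical, and it assumes checkpoint lines have the `<counter> <instance> <date>` shape written by the task, with `--` separator lines between attempts.

```shell
# Hypothetical helpers (not in the PR) mirroring the checkpoint handling of the count task.

# Print the counter on the last data line of the checkpoint file, or 1 if there is none.
last_counter() {
  if [ -f "$1" ]; then
    awk '$1 ~ /^[0-9]+$/ { n = $1 } END { print (n == "" ? 1 : n) }' "$1"
  else
    echo 1
  fi
}

# Print GOTPREEMPTED if the data lines name more than one instance, else NEVERPREEMPTED.
# Separator lines ('--') are ignored because their first field is not a number.
preemption_verdict() {
  awk '$1 ~ /^[0-9]+$/ && !seen[$2]++ { n++ }
       END { print (n > 1 ? "GOTPREEMPTED" : "NEVERPREEMPTED") }' "$1"
}
```

Calling `last_counter my_checkpoint` on a fresh VM would give the value to resume from, and `preemption_verdict my_checkpoint` gives the string the test metadata expects once at least two distinct instances have appended to the file.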
12 changes: 0 additions & 12 deletions
centaur/src/main/resources/standardTestCases/error_10_preemptible.test
This file was deleted.
13 changes: 13 additions & 0 deletions
centaur/src/main/resources/standardTestCases/gcpbatch_checkpointing.test
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
name: gcpbatch_checkpointing | ||
testFormat: workflowsuccess | ||
backends: [GCPBATCH] | ||
|
||
files { | ||
workflow: checkpointing/gcpbatch_checkpointing.wdl | ||
} | ||
|
||
metadata { | ||
workflowName: checkpointing | ||
status: Succeeded | ||
"outputs.checkpointing.preempted": "GOTPREEMPTED" | ||
} |
13 changes: 13 additions & 0 deletions
centaur/src/main/resources/standardTestCases/gcpbatch_papi_preemptible_and_max_retries.test
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
name: gcpbatch_papi_preemptible_and_max_retries | ||
testFormat: workflowfailure | ||
backends: [GCPBATCH] | ||
|
||
files { | ||
workflow: papi_preemptible_and_max_retries/gcpbatch_papi_preemptible_and_max_retries.wdl | ||
} | ||
|
||
metadata { | ||
workflowName: papi_preemptible_and_max_retries | ||
status: Failed | ||
"papi_preemptible_and_max_retries.delete_self.-1.attempt": 3 | ||
} |
28 changes: 28 additions & 0 deletions
centaur/src/main/resources/standardTestCases/gcpbatch_preemptible_and_memory_retry.test
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
name: gcpbatch_preemptible_and_memory_retry | ||
testFormat: workflowfailure | ||
# The original version of this test was tailored to a quirk of Papi v2: it depended on Papi v2 misdiagnosing the | ||
# deletion of its own VM as a preemption event. GCP Batch, perhaps more correctly, diagnoses VM deletion as a weird | ||
# non-preemption event. The GCPBATCH version of this test uses `gcloud beta compute instances simulate-maintenance-event` | ||
# to simulate a preemption in a way that GCP Batch actually perceives as a preemption. | ||
backends: [GCPBATCH] | ||
|
||
files { | ||
workflow: retry_with_more_memory/gcpbatch/preemptible_and_memory_retry.wdl | ||
options: retry_with_more_memory/retry_with_more_memory.options | ||
} | ||
|
||
metadata { | ||
workflowName: preemptible_and_memory_retry | ||
status: Failed | ||
"failures.0.message": "Workflow failed" | ||
"failures.0.causedBy.0.message": "stderr for job `preemptible_and_memory_retry.imitate_oom_error_on_preemptible:NA:3` contained one of the `memory-retry-error-keys: [OutOfMemory,Killed]` specified in the Cromwell config. Job might have run out of memory." | ||
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.1.preemptible": "true" | ||
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.1.executionStatus": "RetryableFailure" | ||
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.1.runtimeAttributes.memory": "1 GB" | ||
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.2.preemptible": "false" | ||
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.2.executionStatus": "RetryableFailure" | ||
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.2.runtimeAttributes.memory": "1 GB" | ||
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.3.preemptible": "false" | ||
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.3.executionStatus": "Failed" | ||
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.3.runtimeAttributes.memory": "1.1 GB" | ||
} |
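The expected metadata above shows memory staying at 1 GB after the preempted first attempt and growing to 1.1 GB only after the out-of-memory failure of the second attempt. A minimal sketch of that rule follows. It is an illustration, not Cromwell's implementation: the helper `next_memory_gb` and the failure labels `preempted` and `oom` are hypothetical, and the multiplier of 1.1 is an assumption inferred from the 1 GB to 1.1 GB step (the contents of `retry_with_more_memory.options` are not shown on this page).

```shell
# Hypothetical helper: memory (in GB) for the next attempt.
# Arguments: current memory in GB, failure kind ("preempted" or "oom"), multiplier.
# Only an out-of-memory failure scales the memory; a preemption retries at the same size.
next_memory_gb() {
  awk -v mem="$1" -v kind="$2" -v factor="$3" \
    'BEGIN { if (kind == "oom") mem *= factor; printf "%g\n", mem }'
}

# Attempt 1 (preemptible, 1 GB) is preempted; attempt 2 runs with the same memory.
attempt2=$(next_memory_gb 1 preempted 1.1)
# Attempt 2 hits a memory-retry error key; attempt 3 runs with scaled memory.
attempt3=$(next_memory_gb "$attempt2" oom 1.1)
echo "attempt 2: ${attempt2} GB, attempt 3: ${attempt3} GB"
```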
11 changes: 11 additions & 0 deletions
centaur/src/main/resources/standardTestCases/gcpbatch_preemptible_basic.test
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
name: gcpbatch_preemptible_basic | ||
testFormat: workflowsuccess | ||
backends: [GCPBATCH] | ||
|
||
files { | ||
workflow: preemptible_basic/gcpbatch_preemptible_basic.wdl | ||
} | ||
|
||
metadata { | ||
status: Succeeded | ||
} |
4 changes: 2 additions & 2 deletions
centaur/src/main/resources/standardTestCases/papi_preemptible_and_max_retries.test
31 changes: 31 additions & 0 deletions
...dTestCases/papi_preemptible_and_max_retries/gcpbatch_papi_preemptible_and_max_retries.wdl
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
version 1.0 | ||
|
||
task delete_self { | ||
|
||
command { | ||
preemptible=$(curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/scheduling/preemptible") | ||
|
||
# Simulate a maintenance event on ourselves if running on a preemptible VM, otherwise delete ourselves. | ||
fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone) | ||
zone=$(basename "$fully_qualified_zone") | ||
|
||
if [ "$preemptible" = "TRUE" ]; then | ||
gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q | ||
sleep 60 | ||
else | ||
# We need to actually delete ourselves if the VM is not preemptible; simulated maintenance events don't seem to | ||
# precipitate the demise of on-demand VMs. | ||
gcloud compute instances delete $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q | ||
fi | ||
} | ||
|
||
runtime { | ||
preemptible: 1 | ||
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim" | ||
maxRetries: 1 | ||
} | ||
} | ||
|
||
workflow papi_preemptible_and_max_retries { | ||
call delete_self | ||
} |
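One way to read the `attempt: 3` expected by `gcpbatch_papi_preemptible_and_max_retries.test` for this workflow: with `preemptible: 1` and `maxRetries: 1`, the task runs once on a preemptible VM (preempted), once on an on-demand VM (deleted, counted as a retryable failure), and once more as the single allowed retry. The sketch below encodes that worst-case count. It is an assumption about how the attempts add up when every attempt fails, not Cromwell's code, and `max_attempts` is a hypothetical name.

```shell
# Hypothetical helper: worst-case number of attempts when every attempt fails.
# Arguments: the WDL runtime values for preemptible and maxRetries.
max_attempts() {
  preemptible_tries="$1"
  max_retries="$2"
  # preemptible attempts + the first on-demand attempt + the allowed retries
  echo $(( preemptible_tries + 1 + max_retries ))
}

max_attempts 1 1
```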
8 changes: 5 additions & 3 deletions
centaur/src/main/resources/standardTestCases/preemptible_and_memory_retry.test
11 changes: 11 additions & 0 deletions
centaur/src/main/resources/standardTestCases/preemptible_basic.test
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
name: preemptible_basic | ||
testFormat: workflowsuccess | ||
backends: [Papiv2, GCPBATCH_ALT] | ||
|
||
files { | ||
workflow: preemptible_basic/preemptible_basic.wdl | ||
} | ||
|
||
metadata { | ||
status: Succeeded | ||
} |
33 changes: 33 additions & 0 deletions
...aur/src/main/resources/standardTestCases/preemptible_basic/gcpbatch_preemptible_basic.wdl
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
version 1.0 | ||
|
||
task delete_self_if_preemptible { | ||
|
||
command <<< | ||
# Prepend date, time and pwd to xtrace log entries. | ||
PS4='\D{+%F %T} \w $ ' | ||
set -o errexit -o nounset -o pipefail -o xtrace | ||
|
||
preemptible=$(curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/scheduling/preemptible") | ||
|
||
# Perform a maintenance event on this VM if it is preemptible, which should cause it to be preempted. | ||
# Since `preemptible: 1` the job should be restarted on a non-preemptible VM. | ||
if [ "$preemptible" = "TRUE" ]; then | ||
fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone) | ||
zone=$(basename "$fully_qualified_zone") | ||
|
||
gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q | ||
sleep 60 | ||
fi | ||
|
||
>>> | ||
|
||
runtime { | ||
preemptible: 1 | ||
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim" | ||
} | ||
} | ||
|
||
|
||
workflow preemptible_basic { | ||
call delete_self_if_preemptible | ||
} |
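The two decisions the task makes from metadata values can be checked offline. The sketch below is illustrative only: `short_zone` and `should_simulate_preemption` are hypothetical names, and the sample zone string assumes the metadata server's `projects/<project-number>/zones/<zone>` form, which is why the task applies `basename` to it.

```shell
# Hypothetical helpers (not in the PR) for the metadata handling in delete_self_if_preemptible.

# Reduce a fully qualified zone such as projects/123456789/zones/us-central1-a
# to the short zone name expected by gcloud's --zone flag.
short_zone() {
  basename "$1"
}

# Succeed only when the scheduling/preemptible metadata value is exactly TRUE,
# matching the string comparison used in the task.
should_simulate_preemption() {
  [ "$1" = "TRUE" ]
}

if should_simulate_preemption "TRUE"; then
  echo "would simulate a maintenance event in zone $(short_zone projects/123456789/zones/us-central1-a)"
fi
```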
Neat!