Fix preemptibles and maxRetries on GCP Batch [AN-274] [AN-377] #7684

Merged
merged 33 commits into from
Feb 7, 2025
Changes from 31 commits
1dabea0
Revert "WX-1595 Refactor the preemption error handling from GCP Batch…
mcovarr Jan 31, 2025
e15ff63
Merge commit 'bd2bbe5a2bbf50a529028a71c1a2f08d04c79cef' into an_274_f…
mcovarr Jan 31, 2025
194b566
Merge commit 'f9b91f753e1bc9c89cb41880b5300e1ba93021f4' into an_274_f…
mcovarr Jan 31, 2025
0e12d37
Merge commit 'b29d8005e33aadd4e9e57178101bc3ef9d0ca9bc' into an_274_f…
mcovarr Jan 31, 2025
e7b6f51
Merge commit '7bb8d1f102560b625d260db0773f92981b05141d' into an_274_f…
mcovarr Jan 31, 2025
ae832f6
Merge commit '42043d7885d837836b2cb176bf92d5b0a7198245' into an_274_f…
mcovarr Jan 31, 2025
9876753
Merge commit 'e32f6e0c4cb7a6b605d94987d5f4c93b25e5fd44' into an_274_f…
mcovarr Jan 31, 2025
46bcec1
Merge commit '985ccf5f13791fca8a109a70c5be8921b27600e9' into an_274_f…
mcovarr Jan 31, 2025
57f5113
Merge commit 'd8d9375845fa213f1550cb0fd39eac4fccfa1110' into an_274_f…
mcovarr Jan 31, 2025
c4e878b
Merge commit '3753dfd787495628c37b063ce3e8756d18244ba2' into an_274_f…
mcovarr Jan 31, 2025
c4acad1
Merge commit '4eb0730e545e44a7dc989b637772342f75531e99' into an_274_f…
mcovarr Jan 31, 2025
354f007
Merge commit 'b982f6fdd7697da8c7c774a27e89459f702d11cb' into an_274_f…
mcovarr Jan 31, 2025
fc88cf4
Merge commit '081318d2fdb9f25ad5b50e7df4a8f21785f01a63' into an_274_f…
mcovarr Jan 31, 2025
4c5f482
Merge commit '80fbf59c6f4e180386af7e1ac7369d50747fe582' into an_274_f…
mcovarr Jan 31, 2025
a1d4521
Merge commit 'fb9745a7979df20246d979112e96ea74c62f08a7' into an_274_f…
mcovarr Jan 31, 2025
79994b8
Merge commit '74885eff73ddb2f6a3df9d89be4a7b32f4b741f6' into an_274_f…
mcovarr Jan 31, 2025
b75c8f6
Merge commit 'd3ded6f9150fa8a3836855dce0fb76c88d70701a' into an_274_f…
mcovarr Jan 31, 2025
aace3a9
Merge commit 'e54b32e3df56ad1168f8cd8607bf4df77c8c3bc3' into an_274_f…
mcovarr Jan 31, 2025
68ed90c
Merge commit '90ff21b3587d541b407fa622b9b6c8e0698462c3' into an_274_f…
mcovarr Jan 31, 2025
40b1847
Merge commit '61a89e3add1ece4ab956b94e41ced9e041b7d232' into an_274_f…
mcovarr Jan 31, 2025
d285e1b
Merge commit 'ef8745af3ead6bed4f822fc9a2dddb922728e9a9' into an_274_f…
mcovarr Jan 31, 2025
53ff0ac
Merge commit '7818492ae03b9e2c932715464065d4f471e2ddf7' into an_274_f…
mcovarr Jan 31, 2025
b653ea7
Merge commit '2c134ecb731b92fb85e15266261880c9081100da' into an_274_f…
mcovarr Jan 31, 2025
4df2a07
scalafmt
mcovarr Jan 31, 2025
78bfd53
fixups
mcovarr Feb 1, 2025
4e2562b
[AN-393] Add waiting for quota to quota messages (#7686)
LizBaldo Feb 4, 2025
01a294a
Fix preemptible / maxRetries on GCP Batch [AN-274]
mcovarr Feb 1, 2025
4a92bbc
changelog
mcovarr Feb 4, 2025
f87aff2
Retry with more memory MEM_SIZE/MEM_UNIT Centaur test
mcovarr Feb 5, 2025
37404e0
Merge branch 'develop' into an_274_fix_preemptibles
mcovarr Feb 5, 2025
506c3c5
Merge remote-tracking branch 'origin/develop' into an_274_fix_preempt…
mcovarr Feb 5, 2025
a35bc4e
PR feedback, cleanup
mcovarr Feb 6, 2025
ba64037
docs
mcovarr Feb 7, 2025
2 changes: 1 addition & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -27,7 +27,7 @@ be found [here](https://cromwell.readthedocs.io/en/stable/backends/HPC/#optional
- The `genomics` configuration entry was renamed to `batch`, see [ReadTheDocs](https://cromwell.readthedocs.io/en/stable/backends/GCPBatch/) for more information.
- Fixes a bug with not being able to recover jobs on Cromwell restart.
- Fixes machine type selection to match the Google Cloud Life Sciences backend, including default n1 non shared-core machine types and correct handling of `cpuPlatform` to select n2 or n2d machine types as appropriate.
- - Fixes the preemption error handling, now, the correct error message is printed, this also handles the other potential exit codes.
+ - Fixes preemption and maxRetries behavior. In particular, once a task has exhausted its allowed preemptible attempts, the task will be scheduled again on a non-preemptible VM.
- Fixes error message reporting for failed jobs.
- Fixes the "retry with more memory" feature.
- Fixes the reference disk feature.
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
name: checkpointing
testFormat: workflowsuccess
- backends: [Papiv2, GCPBATCH]
+ backends: [Papiv2, GCPBATCH_ALT]

files {
workflow: checkpointing/checkpointing.wdl
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
version 1.0

workflow checkpointing {
call count { input: count_to = 100 }
output {
String preempted = count.preempted
}
}

task count {
input {
Int count_to
}

meta {
volatile: true
}

command <<<
# Read from the my_checkpoint file if there's content there:
FROM_CKPT=$(cat my_checkpoint | tail -n1 | awk '{ print $1 }')
FROM_CKPT=${FROM_CKPT:-1}

# We don't want any single VM to run the entire count, so work out the max counter value for this attempt:
MAX="$(($FROM_CKPT + 66))"

INSTANCE_NAME=$(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google")
echo "Discovered instance: $INSTANCE_NAME"

# Run the counter:
echo '--' >> my_checkpoint
for i in $(seq $FROM_CKPT ~{count_to})
do
echo $i
echo $i ${INSTANCE_NAME} $(date) >> my_checkpoint

# If we're over our max, "preempt" the VM by simulating a maintenance event:
if [ "${i}" -gt "${MAX}" ]
then
fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone)
zone=$(basename "$fully_qualified_zone")
gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
sleep 60
fi

sleep 1
done

# Prove that we got preempted at least once:
# Skip the '--' separator lines so the first line read is a real counter entry:
FIRST_INSTANCE=$(grep -v '^--' my_checkpoint | head -n1 | awk '{ print $2 }')
LAST_INSTANCE=$(cat my_checkpoint | tail -n1 | awk '{ print $2 }')
if [ "${FIRST_INSTANCE}" != "${LAST_INSTANCE}" ]
then
echo "GOTPREEMPTED" > preempted.txt
else
echo "NEVERPREEMPTED" > preempted.txt
fi
>>>

runtime {
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim"
preemptible: 3
checkpointFile: "my_checkpoint"
}

output {
File checkpoint_log = "my_checkpoint"
String preempted = read_string("preempted.txt")
}
}
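The resume logic in the `count` task can be restated outside the shell. The following is a minimal sketch in Python and is not part of the PR: `resume_point` and `attempt_cap` are hypothetical names, and the sketch assumes the last checkpoint line starts with an integer counter, as the lines written by the loop do.

```python
def resume_point(checkpoint_text: str, default: int = 1) -> int:
    """Mirror `tail -n1 | awk '{ print $1 }'` plus the `${FROM_CKPT:-1}` fallback."""
    lines = checkpoint_text.splitlines()
    if not lines:
        return default  # no checkpoint content yet: start counting at 1
    fields = lines[-1].split()
    return int(fields[0]) if fields else default


def attempt_cap(from_ckpt: int, step: int = 66) -> int:
    """Mirror MAX=$(($FROM_CKPT + 66)): the highest count this attempt may pass."""
    return from_ckpt + step


# Example checkpoint content after a first attempt that stopped at 68.
# The instance name and timestamps are made up for illustration.
sample = "--\n1 vm-a Mon\n2 vm-a Mon\n68 vm-a Mon\n"
start = resume_point(sample)
print(start, attempt_cap(start))  # 68 134
```

With `count_to = 100`, a fresh attempt starts at 1 with a cap of 67, so it triggers the simulated maintenance event once the counter reaches 68. A resumed attempt starting at 68 has a cap of 134 and can finish the count.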

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
name: gcpbatch_checkpointing
testFormat: workflowsuccess
backends: [GCPBATCH]

files {
workflow: checkpointing/gcpbatch_checkpointing.wdl
}

metadata {
workflowName: checkpointing
status: Succeeded
"outputs.checkpointing.preempted": "GOTPREEMPTED"
}
Original file line number Diff line number Diff line change
@@ -11,5 +11,5 @@ metadata {
"calls.required_files.check_it.executionStatus": "Done"
"calls.required_files.do_it.executionStatus": "Failed"
"calls.required_files.do_it.retryableFailure": "false"
- "calls.required_files.do_it.failures.0.message": ~~"failed"
+ "calls.required_files.do_it.failures.0.message": ~~"Job exited without an error, exit code 0. Batch error code 0. Job failed with an unknown reason"
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
name: gcpbatch_papi_preemptible_and_max_retries
testFormat: workflowfailure
backends: [GCPBATCH]

files {
workflow: papi_preemptible_and_max_retries/gcpbatch_papi_preemptible_and_max_retries.wdl
}

metadata {
workflowName: papi_preemptible_and_max_retries
status: Failed
"papi_preemptible_and_max_retries.delete_self.-1.attempt": 3
}
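The expected `attempt: 3` in this test follows from combining `preemptible: 1` with `maxRetries: 1`. As a rough sketch (not Cromwell's implementation; `worst_case_attempts` is a hypothetical helper), assume every preemptible attempt is preempted, the task then moves to an on-demand VM, and `maxRetries` only counts failures that are not preemptions:

```python
from typing import List


def worst_case_attempts(preemptible: int, max_retries: int) -> List[str]:
    """List the VM type used by each attempt when every attempt fails.

    Assumption: preemptions consume the `preemptible` budget, and ordinary
    failures on the on-demand VM consume the `max_retries` budget.
    """
    plan = ["preemptible"] * preemptible
    plan += ["on-demand"] * (1 + max_retries)
    return plan


plan = worst_case_attempts(preemptible=1, max_retries=1)
print(len(plan), plan)  # 3 ['preemptible', 'on-demand', 'on-demand']
```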
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
name: gcpbatch_preemptible_and_memory_retry
testFormat: workflowfailure
# The original version of this test was tailored to the quirks of Papi v2 in depending on the misdiagnosis of its own
# VM deletion as a preemption event. However GCP Batch perhaps more correctly diagnoses VM deletion as a weird
# non-preemption event. The GCPBATCH version of this test uses `gcloud beta compute instances simulate-maintenance-event`
# to simulate a preemption in a way that GCP Batch actually perceives as a preemption.
backends: [GCPBATCH]

files {
workflow: retry_with_more_memory/gcpbatch/preemptible_and_memory_retry.wdl
options: retry_with_more_memory/retry_with_more_memory.options
}

metadata {
workflowName: preemptible_and_memory_retry
status: Failed
"failures.0.message": "Workflow failed"
"failures.0.causedBy.0.message": "stderr for job `preemptible_and_memory_retry.imitate_oom_error_on_preemptible:NA:3` contained one of the `memory-retry-error-keys: [OutOfMemory,Killed]` specified in the Cromwell config. Job might have run out of memory."
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.1.preemptible": "true"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.1.executionStatus": "RetryableFailure"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.1.runtimeAttributes.memory": "1 GB"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.2.preemptible": "false"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.2.executionStatus": "RetryableFailure"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.2.runtimeAttributes.memory": "1 GB"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.3.preemptible": "false"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.3.executionStatus": "Failed"
"preemptible_and_memory_retry.imitate_oom_error_on_preemptible.-1.3.runtimeAttributes.memory": "1.1 GB"
}
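The metadata expectations above encode two separate rules: a preemption moves the task off preemptible VMs without changing memory, while an out-of-memory failure raises memory for the next attempt. The sketch below replays that sequence. It is illustrative only: `simulate` is a hypothetical function, and the multiplier of 1.1 is an assumption inferred from the expected values, since the options file is not shown here.

```python
from typing import List, Tuple


def simulate(events: List[str],
             start_memory_gb: float = 1.0,
             multiplier: float = 1.1,   # assumed memory retry multiplier
             preemptible: int = 1) -> List[Tuple[bool, float]]:
    """Return (ran_on_preemptible_vm, memory_gb) for each attempt."""
    attempts = []
    memory = start_memory_gb
    preemptions = 0
    for event in events:
        attempts.append((preemptions < preemptible, memory))
        if event == "preempted":
            preemptions += 1          # memory stays the same after a preemption
        elif event == "oom":
            memory *= multiplier      # next attempt gets more memory
    return attempts


# Attempt 1 is preempted, attempts 2 and 3 hit the imitated OOM error.
print(simulate(["preempted", "oom", "oom"]))
# [(True, 1.0), (False, 1.0), (False, 1.1)]
```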
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
name: gcpbatch_preemptible_basic
testFormat: workflowsuccess
backends: [GCPBATCH]

files {
workflow: preemptible_basic/gcpbatch_preemptible_basic.wdl
}

metadata {
status: Succeeded
}
Original file line number Diff line number Diff line change
@@ -14,5 +14,5 @@ metadata {
workflowName: requester_pays_localization
status: Failed
"failures.0.message": "Workflow failed"
- "failures.0.causedBy.0.message": ~~"failed"
+ "failures.0.causedBy.0.message": ~~"The job was stopped before the command finished. Batch error code 0. Job failed with an unknown reason"
}
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
name: gcpbatch_retry_with_more_memory
- testFormat: workflowfailure
+ testFormat: workflowsuccess
backends: [GCPBATCH]

files {
@@ -9,13 +9,10 @@ files {

metadata {
workflowName: retry_with_more_memory
- status: Failed
- "failures.0.message": "Workflow failed"
- "failures.0.causedBy.0.message": "stderr for job `retry_with_more_memory.imitate_oom_error:NA:3` contained one of the `memory-retry-error-keys: [OutOfMemory,Killed]` specified in the Cromwell config. Job might have run out of memory."
+ status: Succeeded
"retry_with_more_memory.imitate_oom_error.-1.1.executionStatus": "RetryableFailure"
"retry_with_more_memory.imitate_oom_error.-1.1.runtimeAttributes.memory": "1 GB"
"retry_with_more_memory.imitate_oom_error.-1.2.executionStatus": "RetryableFailure"
"retry_with_more_memory.imitate_oom_error.-1.2.runtimeAttributes.memory": "1.1 GB"
- "retry_with_more_memory.imitate_oom_error.-1.3.executionStatus": "Failed"
- "retry_with_more_memory.imitate_oom_error.-1.3.runtimeAttributes.memory": "1.2100000000000002 GB"
+ "outputs.retry_with_more_memory.memory_output": "1.2100000000000002 GB"
}
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
name: papi_preemptible_and_max_retries
testFormat: workflowfailure
- # faking own preemption doesn't work on GCP Batch
- backends: [Papiv2, GCPBATCH_TESTING_PAPIV2_QUIRKS]
+ # Faking own preemption has to be done differently on GCP Batch
+ backends: [Papiv2, GCPBATCH_ALT]

files {
workflow: papi_preemptible_and_max_retries/papi_preemptible_and_max_retries.wdl
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
version 1.0

task delete_self {

command {
preemptible=$(curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/scheduling/preemptible")

# Simulate a maintenance event on ourselves if running on a preemptible VM, otherwise delete ourselves.
fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone)
zone=$(basename "$fully_qualified_zone")

if [ "$preemptible" = "TRUE" ]; then
gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
[Review comment from a collaborator on the line above: "Neat!"]
sleep 60
else
# We need to actually delete ourselves if the VM is not preemptible; simulated maintenance events don't seem to
# precipitate the demise of on-demand VMs.
gcloud compute instances delete $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
fi
}

runtime {
preemptible: 1
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim"
maxRetries: 1
}
}

workflow papi_preemptible_and_max_retries {
call delete_self
}
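The tasks in this PR derive the short zone name by applying `basename` to the metadata server's zone value, which comes back as a path of the form `projects/<project-number>/zones/<zone>`. A small Python sketch of the same step follows; `zone_from_metadata` is a hypothetical name and the project number is a placeholder.

```python
import posixpath


def zone_from_metadata(fully_qualified_zone: str) -> str:
    """Keep only the last path component, like `basename "$fully_qualified_zone"`."""
    return posixpath.basename(fully_qualified_zone)


# Placeholder value in the shape returned by .../computeMetadata/v1/instance/zone
print(zone_from_metadata("projects/123456789/zones/us-central1-a"))  # us-central1-a
```

The short name is what `gcloud ... --zone=$zone` expects in the commands above.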
Original file line number Diff line number Diff line change
@@ -1,8 +1,10 @@
name: preemptible_and_memory_retry
testFormat: workflowfailure
- # The original version of this test seems to have been tailored to the quirks of Papi v2 in depending on the misdiagnosis of its own VM deletion as a preemption event. GCP Batch perhaps more correctly diagnoses the VM deletion as a weird non-preemption happening, but that frustrates the logic of this test.
- # Disabling this as it's not possible to induce a real preemption.
- backends: [Papiv2, GCPBATCH_TESTING_PAPIV2_QUIRKS]
+ # The original version of this test was tailored to the quirks of Papi v2 in depending on the misdiagnosis of its own
+ # VM deletion as a preemption event. However GCP Batch perhaps more correctly diagnoses VM deletion as a weird
+ # non-preemption event. The GCPBATCH version of this test uses `gcloud beta compute instances simulate-maintenance-event`
+ # to simulate a preemption in a way that GCP Batch actually perceives as a preemption.
+ backends: [Papiv2, GCPBATCH_ALT]

files {
workflow: retry_with_more_memory/preemptible_and_memory_retry.wdl
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
name: preemptible_basic
testFormat: workflowsuccess
backends: [Papiv2, GCPBATCH_ALT]

files {
workflow: preemptible_basic/preemptible_basic.wdl
}

metadata {
status: Succeeded
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
version 1.0

task delete_self_if_preemptible {

command <<<
# Prepend date, time and pwd to xtrace log entries.
PS4='\D{+%F %T} \w $ '
set -o errexit -o nounset -o pipefail -o xtrace

preemptible=$(curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/scheduling/preemptible")

# Perform a maintenance event on this VM if it is preemptible, which should cause it to be preempted.
# Since `preemptible: 1` the job should be restarted on a non-preemptible VM.
if [ "$preemptible" = "TRUE" ]; then
fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone)
zone=$(basename "$fully_qualified_zone")

gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
sleep 60
fi

>>>

runtime {
preemptible: 1
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim"
}
}


workflow preemptible_basic {
call delete_self_if_preemptible
}
Original file line number Diff line number Diff line change
@@ -9,7 +9,7 @@ task delete_self_if_preemptible {
# Delete self if running on a preemptible VM. This should produce an "error 10" which Cromwell should treat as a preemption.
# Since `preemptible: 1` the job should be restarted on a non-preemptible VM.
if [ "$preemptible" = "TRUE" ]; then

fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone)
zone=$(basename "$fully_qualified_zone")

@@ -25,6 +25,6 @@ task delete_self_if_preemptible {
}


- workflow error_10_preemptible {
+ workflow preemptible_basic {
call delete_self_if_preemptible
}
Original file line number Diff line number Diff line change
@@ -12,13 +12,14 @@ task imitate_oom_error_on_preemptible {

preemptible=$(curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/scheduling/preemptible")

- # Delete self if running on a preemptible VM
+ # Simulate a maintenance event on ourselves if running on a preemptible VM
# Since `preemptible: 1` the job should be restarted on a non-preemptible VM.
if [ "$preemptible" = "TRUE" ]; then
fully_qualified_zone=$(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone)
zone=$(basename "$fully_qualified_zone")

- gcloud compute instances delete $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
+ gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
+ sleep 60
fi

# Should reach here on the second attempt
Original file line number Diff line number Diff line change
@@ -2,12 +2,21 @@ version 1.0

task imitate_oom_error {
command {
- printf "Exception in thread "main" java.lang.OutOfMemoryError: testing\n\tat Test.main(Test.java:1)\n" >&2 && (exit 1)
- # As a simulation of an OOM condition, do not create the 'foo' file. Cromwell should still be able to delocalize important detritus.
- # touch foo
+ echo "$MEM_SIZE $MEM_UNIT"
+
+ # Current bashes do not do floating point arithmetic, Python to the rescue.
+ LESS=$(python -c "print($MEM_SIZE < 1.21)")
+
+ if [[ "$LESS" = "True" ]]
+ then
+ printf "Exception in thread "main" java.lang.OutOfMemoryError: testing\n\tat Test.main(Test.java:1)\n" >&2
+ exit 1
+ fi
+
+ echo "$MEM_SIZE $MEM_UNIT" > memory_output.txt
}
output {
- File foo = "foo"
+ String memory_output = read_string("memory_output.txt")
}
runtime {
docker: "python:latest"
@@ -19,4 +28,8 @@ task imitate_oom_error {

workflow retry_with_more_memory {
call imitate_oom_error
+
+ output {
+ String memory_output = imitate_oom_error.memory_output
+ }
}
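The `$MEM_SIZE < 1.21` check in `imitate_oom_error` is what lets the third attempt succeed. The sketch below replays the comparison in Python. It is an illustration under assumptions rather than the test itself: `should_imitate_oom` is a hypothetical name, and the starting memory of 1 GB and multiplier of 1.1 are inferred from the expected metadata values rather than read from the options file.

```python
def should_imitate_oom(mem_size: float, threshold: float = 1.21) -> bool:
    """Same comparison the task runs via `python -c "print($MEM_SIZE < 1.21)"`."""
    return mem_size < threshold


memory = 1.0       # assumed memory of the first attempt, in GB
multiplier = 1.1   # assumed retry multiplier
history = []
for attempt in range(1, 4):
    history.append((attempt, memory, should_imitate_oom(memory)))
    memory *= multiplier  # each OOM-style failure scales memory for the next attempt

for row in history:
    print(row)
# (1, 1.0, True)
# (2, 1.1, True)
# (3, 1.2100000000000002, False)
```

Floating-point rounding makes the third value slightly larger than 1.21, so the comparison is false and the task writes `memory_output.txt` instead of failing.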