Retry conda commands if a segfault occurs. (#46)
This PR makes `rapids-conda-retry` retry when the conda command segfaults
(exit code 139). In discussion with @AyodeAwe and @stadlmax, we believe
that the segfault is a transient failure related to concurrent resource
utilization (or perhaps a network hiccup) that can be worked around by
sleeping and retrying.

Example:
```
/usr/local/bin/rapids-conda-retry: line 68:   155 Segmentation fault      (core dumped) ${condaCmd} ${args} 2>&1
       156 Done                    | tee "${outfile}"
[rapids-conda-retry] conda returned exit code: 139
[rapids-conda-retry] Exiting, no retryable mamba errors detected: 'ChecksumMismatchError:', 'ChunkedEncodingError:', 'CondaHTTPError:', 'CondaMultiError:', 'ConnectionError:', 'EOFError:', 'JSONDecodeError:', 'Multi-download failed', 'Timeout was reached'
[rapids-conda-retry] 
Error: Process completed with exit code 139.
```

https://github.com/rapidsai/cugraph-ops/actions/runs/4283919882/jobs/7460790452#step:6:387
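
For readers unfamiliar with the pattern, here is a minimal sketch of retrying a command on a segfault exit code. It is not the `rapids-conda-retry` implementation: the function name `retry_on_segfault`, its arguments, and the simulated `flaky` command are all hypothetical. It relies on the shell convention that a process killed by signal N is reported as exit code 128 + N, so SIGSEGV (signal 11) appears as 139.

```bash
#!/usr/bin/env bash
# Hypothetical helper for illustration; names are not taken from rapids-conda-retry.

SEGFAULT_EXIT_CODE=139  # 128 + 11 (SIGSEGV), as reported by the shell

# Usage: retry_on_segfault MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
# Runs COMMAND; retries only when it exits with the segfault code.
retry_on_segfault() {
    local max_attempts=$1 sleep_seconds=$2
    shift 2
    local attempt=1 exitcode=0
    while true; do
        exitcode=0
        "$@" || exitcode=$?
        # Any exit code other than 139 (including success) is returned unchanged.
        if (( exitcode != SEGFAULT_EXIT_CODE )); then
            return "${exitcode}"
        fi
        if (( attempt >= max_attempts )); then
            echo "Giving up after ${attempt} attempts (exit code ${exitcode})" >&2
            return "${exitcode}"
        fi
        echo "Retrying, command resulted in a segfault (attempt ${attempt} of ${max_attempts})..." >&2
        sleep "${sleep_seconds}"
        attempt=$(( attempt + 1 ))
    done
}

# Simulated flaky command: reports a "segfault" (exit 139) twice, then succeeds.
counter_file=$(mktemp)
echo 0 > "${counter_file}"
flaky() {
    local n
    n=$(( $(cat "${counter_file}") + 1 ))
    echo "${n}" > "${counter_file}"
    if (( n < 3 )); then
        return 139
    fi
    return 0
}

retry_on_segfault 5 0 flaky
```

The sketch retries only on exit code 139 and passes every other exit code through, so genuine failures still surface immediately. A real wrapper would likely use a non-zero sleep between attempts.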
bdice authored Feb 28, 2023
1 parent e9f730a commit e3eafb9
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion tools/rapids-conda-retry
@@ -105,6 +105,9 @@ function runConda {
 elif grep -q "Timeout was reached" "${outfile}"; then
 retryingMsg="Retrying, found 'Timeout was reached' in output..."
 needToRetry=1
+elif [[ $exitcode -eq 139 ]]; then
+retryingMsg="Retrying, command resulted in a segfault. This may be an intermittent failure..."
+needToRetry=1
 else
 rapids-echo-stderr "Exiting, no retryable ${RAPIDS_CONDA_EXE} errors detected: \
 'ChecksumMismatchError:', \
@@ -115,7 +118,8 @@ function runConda {
 'EOFError:', \
 'JSONDecodeError:', \
 'Multi-download failed', \
-'Timeout was reached'"
+'Timeout was reached', \
+segfault exit code 139"
 fi
 
 if (( needToRetry == 1 )) && \
