Retry all http calls for artifact upload and download #675
Conversation
name: string,
method: () => Promise<T>,
getStatusCode: (response: T) => number | undefined,
errorMessages: Map<number, string>,
I think it would be more flexible just to pass in:
- errorMessages: Map<number, string>,
+ getErrorMessage: (response: T) => string,
Some API responses might have an error message in the body.
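A minimal sketch of what that callback could look like, assuming a response type that carries an optional message in its body (the type and field names here are illustrative, not from the PR):

// Hedged sketch only: ExampleResponse and its body shape are assumptions.
interface ExampleResponse {
  statusCode: number
  body?: {message?: string}
}

const getErrorMessage = (response: ExampleResponse): string => {
  // Prefer a server-provided message from the response body when present
  if (response.body?.message) {
    return response.body.message
  }
  // Otherwise fall back to a generic message based on the status code
  return `Request failed with status code ${response.statusCode}`
}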
My original intention with this errorMessages map was that certain HTTP calls can produce slightly different exception messages if certain response codes are encountered. For example, during artifact creation we might get a 400, which indicates an invalid artifact name. During other calls, however, a 400 might mean something totally different, so this map lets the method caller customize what users will ultimately see in the logs if something goes wrong.
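For illustration (these are not the PR's exact call sites), the map lets two different calls explain the same status code differently:

// Illustrative sketch: per-call maps so the same status code can carry a
// different user-facing explanation depending on which call is retried.
const createArtifactErrorMessages = new Map<number, string>([
  // During artifact creation, a 400 indicates an invalid artifact name
  [400, 'Invalid artifact name']
])

const patchArtifactSizeErrorMessages = new Map<number, string>([
  // During the finalizing PATCH call, a 400 means something else entirely
  [400, 'Unable to finalize artifact upload']
])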
`${name} - Attempt ${attempt} of ${maxAttempts} failed with error: ${errorMessage}`
)

await sleep(delay)
Should we increase the delay on each attempt? I think we should also check delay > 0
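A small sketch of what that could look like, with an assumed base delay and a simple doubling scheme (neither value comes from the PR):

// Sketch of the suggestion: grow the delay on every attempt and only sleep
// when the computed delay is positive. Constants are assumptions.
async function backoffSleep(attempt: number, baseDelayMs = 1000): Promise<void> {
  const delay = baseDelayMs * 2 ** (attempt - 1) // 1s, 2s, 4s, ...
  if (delay > 0) {
    await new Promise(resolve => setTimeout(resolve, delay))
  }
}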
So, the existing methods that are retryable (the PUT calls during upload and GET calls during download, see PR description) have their own retryability, which includes exponential back-off after first checking for a retry-after header that we might send back if too many requests are being made. You can see it here:
toolkit/packages/artifact/src/internal/upload-http-client.ts
Lines 414 to 432 in 73d5917
const backOff = async (retryAfterValue?: number): Promise<void> => {
  this.uploadHttpManager.disposeAndReplaceClient(httpClientIndex)
  if (retryAfterValue) {
    core.info(
      `Backoff due to too many requests, retry #${retryCount}. Waiting for ${retryAfterValue} milliseconds before continuing the upload`
    )
    await new Promise(resolve => setTimeout(resolve, retryAfterValue))
  } else {
    const backoffTime = getExponentialRetryTimeInMilliseconds(retryCount)
    core.info(
      `Exponential backoff for retry #${retryCount}. Waiting for ${backoffTime} milliseconds before continuing the upload at offset ${start}`
    )
    await new Promise(resolve => setTimeout(resolve, backoffTime))
  }
  core.info(
    `Finished backoff for retry #${retryCount}, continuing with upload`
  )
  return
}
What I did in this PR is move the sleep method to a shared util file. The existing methods that are retried are a little more complicated because there are multiple calls being done concurrently, so they're not switching over to the new retryHttpClientRequest method in this PR. What I am planning on doing in a follow-up PR is:
- Move checking for the retry-after header into util alongside computing the exponential back-off time there
- Add exponential back-off to retryHttpClientRequest that is being added as part of this PR
- Switch over all the HTTP calls to use the same retryHttpClientRequest method and clean up the code so there aren't 2 separate retry mechanisms
The last part would require a lot more refactoring, and I don't want to make this PR too big, so I want to do things in phases. In the past, large updates to the artifact actions would take a while to get through and were harder to test, so I want to split things up into manageable chunks.
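A rough sketch of where that follow-up could land, assuming the combined helper keeps the shape of the existing backOff in upload-http-client.ts (the function name and exact util API here are assumptions):

// Honor a retry-after header first, then fall back to exponential backoff.
// getExponentialRetryTimeInMilliseconds is the existing util shown below.
export async function backOffWithRetryAfter(
  retryCount: number,
  retryAfterValue?: number
): Promise<void> {
  if (retryAfterValue) {
    // The service told us how long to wait, so respect it
    await new Promise(resolve => setTimeout(resolve, retryAfterValue))
  } else {
    // Otherwise compute an exponential backoff time for this retry
    const backoffTime = getExponentialRetryTimeInMilliseconds(retryCount)
    await new Promise(resolve => setTimeout(resolve, backoffTime))
  }
}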
Regardless of what the other code is doing, I think we should still implement exponential backoff in the new code. The reason we use exponential backoff is that any fixed delay time we choose might be too short and put too much pressure on the service.
Added! Most of the code is already there, so it was pretty simple:
toolkit/packages/artifact/src/internal/utils.ts
Lines 15 to 34 in 73d5917
/**
 * Returns a retry time in milliseconds that exponentially gets larger
 * depending on the amount of retries that have been attempted
 */
export function getExponentialRetryTimeInMilliseconds(
  retryCount: number
): number {
  if (retryCount < 0) {
    throw new Error('RetryCount should not be negative')
  } else if (retryCount === 0) {
    return getInitialRetryIntervalInMilliseconds()
  }
  const minTime =
    getInitialRetryIntervalInMilliseconds() * getRetryMultiplier() * retryCount
  const maxTime = minTime * getRetryMultiplier()
  // returns a random number between the minTime (inclusive) and the maxTime (exclusive)
  return Math.random() * (maxTime - minTime) + minTime
}
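For concreteness, a worked example under assumed config values; the initial interval and multiplier come from config getters whose defaults aren't shown in this snippet:

// Assume getInitialRetryIntervalInMilliseconds() = 3000 and getRetryMultiplier() = 1.5:
// retryCount = 0 -> exactly 3000 ms (the initial interval)
// retryCount = 2 -> minTime = 3000 * 1.5 * 2 = 9000 ms
//                   maxTime = 9000 * 1.5    = 13500 ms
//                   result: a random value in [9000, 13500)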
extraErrorInformation = errorMessages.get(statusCode)
}

isRetryable = isRetryableStatusCode(statusCode)
I think you can take this a step further to separate this into an HTTP layer and an HTTP-agnostic retry function:
- Pull out getStatusCode, isSuccessStatusCode, isRetryableStatusCode from the retry function and replace them with a single function you pass in called shouldRetry
- retryHttpClientResponse then can take the HTTP-related stuff and reduce it to a shouldRetry callback
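A minimal sketch of that separation, reusing the existing isRetryableStatusCode util and IHttpClientResponse type; the maxAttempts value and everything else here is illustrative:

// Generic retry loop: knows nothing about HTTP
async function retry<T>(
  method: () => Promise<T>,
  shouldRetry: (response: T) => boolean,
  maxAttempts: number
): Promise<T> {
  let response = await method()
  let attempt = 1
  while (shouldRetry(response) && attempt < maxAttempts) {
    attempt++
    response = await method()
  }
  return response
}

// Thin HTTP shim: all status-code knowledge is reduced to one callback
async function retryHttpClientRequest(
  method: () => Promise<IHttpClientResponse>
): Promise<IHttpClientResponse> {
  return retry(
    method,
    response => isRetryableStatusCode(response.message.statusCode),
    5
  )
}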
I was mostly going off of the existing retry function that cache uses. I feel like pulling out isSuccessStatusCode and getStatusCode could cause some repetition throughout the rest of the code. Overall I think the current pattern is sufficient for the calls that we need to make.
It shouldn't cause any repetition because retryHttpClientRequest will shim it. It might work, but I don't think it's as clear as it could be; separation of concerns will help with that, and make testing easier. I'd invoke the principle of "code is written once but read many times" here.
Spent a bit of time trying to pull out each of the methods, but I couldn't really get it in a nice enough format without breaking too much, and I don't think it works all that well. The current getStatusCode method is actually structured the way it is so that it's easier to test each of the responses:

(response: ITestResponse) => response.statusCode,
My original intention was to use mostly what cache had without changing it too much, so that if there is a fix in one package the two are relatively the same and it's easy to follow.
import {IHttpClientResponse} from '@actions/http-client/interfaces'
I saw https://github.com/actions/toolkit/blob/main/packages/cache/src/internal/requestUtils.ts which you linked. Would we be able to share the same basic retry logic?
cc @dhadka and @joshmgross to see if we can share code with the cache action
I think we'd need to move the shared code to its own package, or add this all to https://github.com/actions/http-client
A separate (small) NPM package or http-client would make sense. This feels like a long-term investment though.
Ok, if it needs a separate package then I agree we can hold off. We've tried sharing files among packages before in Azure Pipelines and it doesn't work well.
I think it could fit in @actions/io, but that would mean making it part of the public interface there.
My last comment is on this call: "Single PATCH call - Update artifact size to indicate we are done" (not retried). Personally I think we should retry this call for any failures, not only on those status codes. This is the last call to create an artifact, and it feels bad if this call fails for some infrastructure issue and all prior efforts are wasted.
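A sketch of that suggestion; isSuccessStatusCode, sleep, and getExponentialRetryTimeInMilliseconds are the existing utils discussed in this PR, but this exact function is hypothetical:

// For the finalizing PATCH, retry on any failure, thrown network errors
// included, instead of gating on specific status codes.
async function patchWithAggressiveRetry(
  method: () => Promise<IHttpClientResponse>,
  maxAttempts: number
): Promise<IHttpClientResponse | undefined> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const response = await method()
      if (isSuccessStatusCode(response.message.statusCode)) {
        return response
      }
      // any non-success status falls through to another attempt
    } catch {
      // network errors and timeouts are retried as well
    }
    await sleep(getExponentialRetryTimeInMilliseconds(attempt))
  }
  return undefined
}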
attempt++
}

if (response) {
I think we want to move this closer to the actual call, so we can log diagnostic info of every operation. Otherwise LGTM.
Line 27 in this same file. We should just log it there.
I'm going to leave it as is. This one method gets called only if we exhaust all retries, so we only display the full diagnostic info in two scenarios:
- we have exhausted all retries
- a non-retryable status code was hit

Only if the call actually fails do we display the full diagnostics. In all other cases I don't think it's necessary, as the response code will be displayed.
We are seeing higher cases of artifact upload failures in GitHub. Other projects are also seeing this, as has been reported at actions/upload-artifact#116. There is a fix that was just merged in the base library: actions/toolkit#675. So hopefully we can revert this before too long. Signed-off-by: Brennan Ashton <bashton@brennanashton.com>
* Retry all http calls for artifact upload and download
* Extra debug information
* Fix lint
* Always read response body
* PR Feedback
* Change error message if patch call fails
* Add exponential backoff when retrying
* Rework tests and add diagnostic info if exception thrown
* Fix lint
* fix lint error for real this time
* PR cleanup
* 0.5.0 @actions/artifact release
* Display diagnostic info if non-retryable code is hit
Overview
Currently most HTTP calls during artifact upload and download are retried; however, disconnects and timeouts are possible, so everything should be retried. Some documentation here regarding how we retry existing calls: https://github.com/actions/toolkit/blob/main/packages/artifact/docs/implementation-details.md
Artifact upload consists of 3 steps
Artifact download consists of 3 steps
These changes will ensure that every step along the way can be retried.
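A hypothetical call-site shape after this change (the step name, client call, and argument order are assumptions for illustration): each step wraps its HTTP request in the shared retry helper instead of invoking the client directly.

// Sketch only: wrap one upload step in the shared retry helper
const response = await retryHttpClientRequest('Create Artifact Container', async () =>
  client.post(requestUrl, data, headers)
)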
These changes should prevent artifact download & upload from failing in a few scenarios that customers are reporting:
actions/download-artifact#72
actions/upload-artifact#116
actions/upload-artifact#123
actions/upload-artifact#135
Inspiration
Original inspiration behind these changes is some similar retry-ability that was added to actions cache as part of this PR: actions/cache#306
The original retry-ability was re-worked a little bit and everything was moved over to the toolkit repo where you can find the code:
Code: https://github.com/actions/toolkit/blob/main/packages/cache/src/internal/requestUtils.ts
Tests: https://github.com/actions/toolkit/blob/main/packages/cache/__tests__/requestUtils.test.ts