
Retry all http calls for artifact upload and download #675

Merged: 13 commits merged from konradpabjan/artifact-retryability into main on Dec 18, 2020

Conversation

@konradpabjan (Contributor) commented Dec 16, 2020:

Overview

Currently, most HTTP calls during artifact upload and download are retried; however, disconnects and timeouts can hit any call, so every call should be retried. Some documentation on how existing calls are retried: https://github.com/actions/toolkit/blob/main/packages/artifact/docs/implementation-details.md

Artifact upload consists of 3 steps

  • Single POST call - Create artifact container (not retried)
  • Multiple PUT calls - Optionally gzip, then upload concurrently in chunks (retried)
  • Single PATCH call - Update artifact size to indicate we are done (not retried)

Artifact download consists of 3 steps

  • Single GET call - List available artifacts (not retried)
  • Single GET call - Get container items for a specific artifact (not retried)
  • Multiple GET calls - Concurrently download the contents of the artifacts and decompress if necessary (retried)

These changes will ensure that every step along the way can be retried.
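For context, the shape of the new helper is roughly the following. This is a minimal sketch with illustrative names and an illustrative set of retryable status codes, not the PR's exact code (see packages/artifact/src/internal/requestUtils.ts for the real implementation):

```ts
// Sketch only: status codes commonly treated as transient. The real list
// lives in the PR's utils and may differ.
function isRetryableStatusCode(code: number | undefined): boolean {
  if (!code) return false
  return [408, 429, 500, 502, 503, 504].includes(code)
}

const sleep = (ms: number): Promise<void> =>
  new Promise(resolve => setTimeout(resolve, ms))

// Generic wrapper: run one HTTP call, retry it on retryable status codes,
// give up after maxAttempts or on a non-retryable code.
async function retryHttpCall<T>(
  name: string,
  method: () => Promise<T>,
  getStatusCode: (response: T) => number | undefined,
  maxAttempts = 5,
  delayMs = 1000
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    const response = await method()
    const statusCode = getStatusCode(response)
    if (statusCode && statusCode >= 200 && statusCode < 300) {
      return response // success, no retry needed
    }
    if (attempt >= maxAttempts || !isRetryableStatusCode(statusCode)) {
      throw new Error(`${name} failed: last status code was ${statusCode}`)
    }
    await sleep(delayMs)
  }
}
```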

These changes should prevent artifact download & upload from failing in a few scenarios that customers are reporting:
actions/download-artifact#72
actions/upload-artifact#116
actions/upload-artifact#123
actions/upload-artifact#135

Inspiration

The original inspiration behind these changes is similar retryability that was added to actions/cache as part of actions/cache#306

That retryability was then reworked a bit and moved over to the toolkit repo, where you can find the code:
Code: https://github.com/actions/toolkit/blob/main/packages/cache/src/internal/requestUtils.ts
Tests: https://github.com/actions/toolkit/blob/main/packages/cache/__tests__/requestUtils.test.ts

@konradpabjan konradpabjan marked this pull request as ready for review December 17, 2020 15:10
@konradpabjan konradpabjan requested review from a team and yacaovsnc December 17, 2020 15:10
```ts
name: string,
method: () => Promise<T>,
getStatusCode: (response: T) => number | undefined,
errorMessages: Map<number, string>,
```
Contributor:

I think it would be more flexible just to pass in:

Suggested change:
```diff
-errorMessages: Map<number, string>,
+getErrorMessage: (response: T) => string,
```

Some API responses might have an error message in the body.

@konradpabjan (Author) replied Dec 17, 2020:

My original intention with this errorMessages map was so that certain HTTP calls can produce slightly different exception messages if certain response codes are encountered. For example, during artifact creation we might get a 400, which indicates an invalid artifact name. During other calls, however, a 400 might mean something totally different, so this map lets the method caller customize what users will ultimately see in the logs if something goes wrong.
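For illustration, a per-call map might look like this (hypothetical status codes and wording, not necessarily the PR's exact messages):

```ts
// Hypothetical example: messages specific to the create-artifact-container
// call. During other calls a 400 could mean something entirely different.
const createArtifactErrorMessages = new Map<number, string>([
  [400, 'Bad Request: the artifact name is not valid'],
  [403, 'Artifact storage quota has been hit; unable to upload any new artifacts']
])
```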

```ts
  `${name} - Attempt ${attempt} of ${maxAttempts} failed with error: ${errorMessage}`
)

await sleep(delay)
```
Contributor:

Should we increase the delay on each attempt? I think we should also check that delay > 0

@konradpabjan (Author) replied Dec 17, 2020:

So, the existing retryable methods (the PUT calls during upload and the GET calls during download; see the PR description) have their own retry logic, which includes exponential back-off after first checking for a retry-after header that we might send back if too many requests are being made. You can see it here:

```ts
const backOff = async (retryAfterValue?: number): Promise<void> => {
  this.uploadHttpManager.disposeAndReplaceClient(httpClientIndex)
  if (retryAfterValue) {
    core.info(
      `Backoff due to too many requests, retry #${retryCount}. Waiting for ${retryAfterValue} milliseconds before continuing the upload`
    )
    await new Promise(resolve => setTimeout(resolve, retryAfterValue))
  } else {
    const backoffTime = getExponentialRetryTimeInMilliseconds(retryCount)
    core.info(
      `Exponential backoff for retry #${retryCount}. Waiting for ${backoffTime} milliseconds before continuing the upload at offset ${start}`
    )
    await new Promise(resolve => setTimeout(resolve, backoffTime))
  }
  core.info(
    `Finished backoff for retry #${retryCount}, continuing with upload`
  )
  return
}
```

What I did in this PR is move the sleep method to a shared util file. The existing retried methods are a bit more complicated because multiple calls run concurrently, so they are not switching over to the new retryHttpClientRequest method in this PR. What I am planning on doing in a follow-up PR is:

  • Move checking for the retry-after header into util alongside computing the exponential back-off time there
  • Add exponential back-off to retryHttpClientRequest that is being added as part of this PR
  • Switch over all the HTTP calls to use the same retryHttpClientRequest method and clean up the code so there aren't 2 separate retry mechanisms

The last part would require a lot more refactoring, and I don't want to make this PR too big, so I want to do things in phases. In the past, large updates to the artifact actions took a while to get through and were harder to test, so I want to split things up into manageable chunks.

Contributor:

Regardless of what the other code is doing, I think we should still implement exponential backoff in the new code. The reason we use exponential backoff is that any fixed delay time we choose might be too short and put too much pressure on the service.

@konradpabjan (Author) replied:

Added! Most of the code is already there, so it was pretty simple.

```ts
/**
 * Returns a retry time in milliseconds that exponentially gets larger
 * depending on the amount of retries that have been attempted
 */
export function getExponentialRetryTimeInMilliseconds(
  retryCount: number
): number {
  if (retryCount < 0) {
    throw new Error('RetryCount should not be negative')
  } else if (retryCount === 0) {
    return getInitialRetryIntervalInMilliseconds()
  }
  const minTime =
    getInitialRetryIntervalInMilliseconds() * getRetryMultiplier() * retryCount
  const maxTime = minTime * getRetryMultiplier()
  // returns a random number between the minTime (inclusive) and the maxTime (exclusive)
  return Math.random() * (maxTime - minTime) + minTime
}
```
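As a rough worked example, assuming an initial interval of 3000 ms and a multiplier of 1.5 (the actual values come from config-variables.ts, so treat these numbers as illustrative):

```ts
// retryCount 0: always 3000 ms (the initial interval)
// retryCount 1: minTime = 3000 * 1.5 * 1 = 4500, maxTime = 4500 * 1.5 = 6750
//               -> random value in [4500, 6750)
// retryCount 2: minTime = 3000 * 1.5 * 2 = 9000, maxTime = 13500
//               -> random value in [9000, 13500)
// The Math.random() jitter spreads concurrent retries out so chunks
// don't all hammer the service at the same instant.
```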

```ts
  extraErrorInformation = errorMessages.get(statusCode)
}

isRetryable = isRetryableStatusCode(statusCode)
```
Contributor:

I think you can take this a step further to separate this into an HTTP layer and an HTTP-agnostic retry function:

  • Pull out getStatusCode, isSuccessStatusCode, isRetryableStatusCode from the retry function and replace it with a single function you pass in called shouldRetry
  • retryHttpClientResponse then can take the HTTP-related stuff and reduce it to a shouldRetry callback
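A minimal sketch of the suggested separation (hypothetical signatures, not code from this PR):

```ts
import * as core from '@actions/core'
import {IHttpClientResponse} from '@actions/http-client/interfaces'

// Assumed helper from this PR's requestUtils (declared here so the
// sketch stands alone).
declare function isRetryableStatusCode(code: number | undefined): boolean

// HTTP-agnostic core: knows nothing about status codes, only whether to retry.
async function retry<T>(
  name: string,
  method: () => Promise<T>,
  shouldRetry: (response: T) => boolean,
  maxAttempts: number
): Promise<T> {
  let response = await method()
  for (let attempt = 1; shouldRetry(response) && attempt < maxAttempts; attempt++) {
    core.info(`${name}: attempt ${attempt} failed, retrying`)
    response = await method()
  }
  return response
}

// Thin HTTP shim: reduces the status-code logic to a shouldRetry callback.
async function retryHttpClientResponse(
  name: string,
  method: () => Promise<IHttpClientResponse>,
  maxAttempts = 5
): Promise<IHttpClientResponse> {
  return retry(
    name,
    method,
    response => isRetryableStatusCode(response.message.statusCode),
    maxAttempts
  )
}
```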

@konradpabjan (Author) replied:

I was mostly going off of the existing retry function that cache uses. I feel like pulling out isSuccessStatusCode and getStatusCode could cause some repetition throughout the rest of the code. Overall I think the current pattern is sufficient for the calls that we need to make.

@brcrista (Contributor) replied Dec 17, 2020:

It shouldn't cause any repetition because retryHttpClientRequest will shim it. It might work but I don't think it's as clear as it could be -- separation of concerns will help with that, and make testing easier. I'd invoke the principle of "code is written once but read many times" here.

@konradpabjan (Author) replied:

Spent a bit of time trying to pull out each of the methods, but I couldn't really get it into a nice enough format without breaking too much, and I don't think it works all that well. The current getStatusCode method is actually structured the way it is so that it's easier to test each of the responses:

```ts
(response: ITestResponse) => response.statusCode,
```

My original intention was to use mostly what cache had without changing it too much, so that if there is a fix in one package, the two stay relatively the same and it's easy to follow.
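Concretely, the testing benefit looks something like this (a hypothetical test helper; ITestResponse is the test suite's own stub type, and the retry signature is the one shown at the top of this thread):

```ts
// Assumed retry signature from the top of this thread.
declare function retry<T>(
  name: string,
  method: () => Promise<T>,
  getStatusCode: (response: T) => number | undefined,
  errorMessages: Map<number, string>
): Promise<T>

// Stub response type used only by tests; no real HTTP traffic involved.
interface ITestResponse {
  statusCode: number
  result: string | null
}

// Because retry() takes getStatusCode as a parameter, a test can feed it
// canned responses and plain lambdas instead of mocking an HTTP client.
async function retryWithCannedResponses(
  responses: ITestResponse[]
): Promise<ITestResponse> {
  let attempt = 0
  return retry(
    'test-call',
    async () => responses[attempt++], // next canned response
    (response: ITestResponse) => response.statusCode, // the fragment above
    new Map<number, string>() // no custom error messages needed
  )
}
```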

```diff
@@ -0,0 +1,74 @@
+import {IHttpClientResponse} from '@actions/http-client/interfaces'
```
Contributor:

I saw https://github.com/actions/toolkit/blob/main/packages/cache/src/internal/requestUtils.ts which you linked. Would we be able to share the same basic retry logic?

Contributor:

cc @dhadka and @joshmgross to see if we can share code with the cache action

Member:

I think we'd need to move the shared code to its own package, or add this all to https://github.com/actions/http-client

@konradpabjan (Author) replied:

A separate (small) NPM package or http-client would make sense. This feels like a long-term investment, though.

Contributor:

Ok, if it needs a separate package then I agree we can hold off. We've tried sharing files among packages before in Azure Pipelines and it doesn't work well.

I think it could fit in @actions/io, but that would mean making it part of the public interface there.

Resolved (outdated) review threads on:
  • packages/artifact/__tests__/upload.test.ts (×2)
  • packages/artifact/src/internal/requestUtils.ts
@yacaovsnc (Contributor) commented:

My last comment is on this call:

> Single PATCH call - Update artifact size to indicate we are done (not retried)

Personally I think we should retry this call for any failures, not only on those status codes. This is the last call to create an artifact, and it feels bad if this call fails for some infrastructure issue and all prior efforts are wasted.

```ts
  attempt++
}

if (response) {
```
Contributor:

I think we want to move this closer to the actual call, so we can log diagnostic info for every operation. Otherwise LGTM.

Contributor:

Line 27 in this same file. We should just log it there.

@konradpabjan (Author) replied:

I'm going to leave it as is. This method gets called only when the request ultimately fails, so we display the full diagnostic info in just two scenarios:

  • we have exhausted all retries
  • a non-retryable status code was hit

Only if the call actually fails do we display the full diagnostics. In all other cases I don't think it's necessary, as the response code will be displayed.

@konradpabjan konradpabjan merged commit c861dd8 into main Dec 18, 2020
@konradpabjan konradpabjan deleted the konradpabjan/artifact-retryability branch December 18, 2020 20:40
btashton added a commit to apache/nuttx that referenced this pull request Dec 23, 2020
We are seeing a higher rate of artifact upload failures in GitHub. Other projects are also seeing this, as reported at actions/upload-artifact#116

There is a fix that was just merged in the base library:
actions/toolkit#675

So hopefully we can revert this before too long.

Signed-off-by: Brennan Ashton <bashton@brennanashton.com>
Ouss4 pushed a commit to apache/nuttx that referenced this pull request Dec 23, 2020, with the same commit message as above.
at-wat pushed a commit to at-wat/actions-toolkit that referenced this pull request Aug 31, 2023
* Retry all http calls for artifact upload and download

* Extra debug information

* Fix lint

* Always read response body

* PR Feedback

* Change error message if patch call fails

* Add exponential backoff when retrying

* Rework tests and add diagnostic info if exception thrown

* Fix lint

* fix lint error for real this time

* PR cleanup

* 0.5.0 @actions/artifact release

* Display diagnostic info if non-retryable code is hit