
Retry all http calls for artifact upload and download #675

Merged: 13 commits merged from konradpabjan/artifact-retryability into main on Dec 18, 2020

Conversation

@konradpabjan (Contributor) commented Dec 16, 2020:

Overview

Currently, most HTTP calls during artifact upload and download are retried; however, disconnects and timeouts can hit any call, so every call should be retried. Some documentation on how existing calls are retried: https://github.com/actions/toolkit/blob/main/packages/artifact/docs/implementation-details.md

Artifact upload consists of 3 steps

  • Single POST call - Create artifact container (not retried)
  • Multiple PUT calls - Optionally gzip, then upload concurrently in chunks (retried)
  • Single PATCH call - Update artifact size to indicate we are done (not retried)

Artifact download consists of 3 steps

  • Single GET call - List available artifacts (not retried)
  • Single GET call - Get container items for a specific artifact (not retried)
  • Multiple GET calls - Concurrently download the contents of the artifacts and decompress if necessary (retried)

These changes will ensure that every step along the way can be retried.
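For context, the shape of the new helper is roughly the following. This is a minimal sketch with illustrative names and an illustrative set of retryable status codes, not the PR's exact code (see packages/artifact/src/internal/requestUtils.ts for the real implementation):

```ts
// Sketch only: status codes commonly treated as transient. The real list
// lives in the PR's utils and may differ.
function isRetryableStatusCode(code: number | undefined): boolean {
  if (!code) return false
  return [408, 429, 500, 502, 503, 504].includes(code)
}

const sleep = (ms: number): Promise<void> =>
  new Promise(resolve => setTimeout(resolve, ms))

// Generic wrapper: run one HTTP call, retry it on retryable status codes,
// give up after maxAttempts or on a non-retryable code.
async function retryHttpCall<T>(
  name: string,
  method: () => Promise<T>,
  getStatusCode: (response: T) => number | undefined,
  maxAttempts = 5,
  delayMs = 1000
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    const response = await method()
    const statusCode = getStatusCode(response)
    if (statusCode && statusCode >= 200 && statusCode < 300) {
      return response // success, no retry needed
    }
    if (attempt >= maxAttempts || !isRetryableStatusCode(statusCode)) {
      throw new Error(`${name} failed: last status code was ${statusCode}`)
    }
    await sleep(delayMs)
  }
}
```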

These changes should prevent artifact download & upload from failing in a few scenarios that customers are reporting:
actions/download-artifact#72
actions/upload-artifact#116
actions/upload-artifact#123
actions/upload-artifact#135

Inspiration

The original inspiration behind these changes is similar retryability that was added to actions/cache as part of actions/cache#306

That retryability was then reworked a bit and moved over to the toolkit repo, where you can find the code:
Code: https://github.com/actions/toolkit/blob/main/packages/cache/src/internal/requestUtils.ts
Tests: https://github.com/actions/toolkit/blob/main/packages/cache/__tests__/requestUtils.test.ts

@konradpabjan konradpabjan marked this pull request as ready for review December 17, 2020 15:10
@konradpabjan konradpabjan requested review from a team and yacaovsnc December 17, 2020 15:10
```ts
name: string,
method: () => Promise<T>,
getStatusCode: (response: T) => number | undefined,
errorMessages: Map<number, string>,
```
Contributor:

I think it would be more flexible just to pass in:

Suggested change:
```diff
-errorMessages: Map<number, string>,
+getErrorMessage: (response: T) => string,
```

Some API responses might have an error message in the body.

@konradpabjan (Author) replied Dec 17, 2020:

My original intention with this errorMessages map was so that certain HTTP calls can produce slightly different exception messages if certain response codes are encountered. For example, during artifact creation we might get a 400, which indicates an invalid artifact name. During other calls, however, a 400 might mean something totally different, so this map lets the method caller customize what users will ultimately see in the logs if something goes wrong.
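For illustration, a per-call map might look like this (hypothetical status codes and wording, not necessarily the PR's exact messages):

```ts
// Hypothetical example: messages specific to the create-artifact-container
// call. During other calls a 400 could mean something entirely different.
const createArtifactErrorMessages = new Map<number, string>([
  [400, 'Bad Request: the artifact name is not valid'],
  [403, 'Artifact storage quota has been hit; unable to upload any new artifacts']
])
```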

```ts
  `${name} - Attempt ${attempt} of ${maxAttempts} failed with error: ${errorMessage}`
)

await sleep(delay)
```
Contributor:

Should we increase the delay on each attempt? I think we should also check that delay > 0

@konradpabjan (Author) replied Dec 17, 2020:

So, the existing retryable methods (the PUT calls during upload and the GET calls during download; see the PR description) have their own retry logic, which includes exponential back-off after first checking for a retry-after header that we might send back if too many requests are being made. You can see it here:

```ts
const backOff = async (retryAfterValue?: number): Promise<void> => {
  this.uploadHttpManager.disposeAndReplaceClient(httpClientIndex)
  if (retryAfterValue) {
    core.info(
      `Backoff due to too many requests, retry #${retryCount}. Waiting for ${retryAfterValue} milliseconds before continuing the upload`
    )
    await new Promise(resolve => setTimeout(resolve, retryAfterValue))
  } else {
    const backoffTime = getExponentialRetryTimeInMilliseconds(retryCount)
    core.info(
      `Exponential backoff for retry #${retryCount}. Waiting for ${backoffTime} milliseconds before continuing the upload at offset ${start}`
    )
    await new Promise(resolve => setTimeout(resolve, backoffTime))
  }
  core.info(
    `Finished backoff for retry #${retryCount}, continuing with upload`
  )
  return
}
```

What I did in this PR is move the sleep method to a shared util file. The existing retried methods are a bit more complicated because multiple calls run concurrently, so they are not switching over to the new retryHttpClientRequest method in this PR. What I am planning on doing in a follow-up PR is:

  • Move checking for the retry-after header into util alongside computing the exponential back-off time there
  • Add exponential back-off to retryHttpClientRequest that is being added as part of this PR
  • Switch over all the HTTP calls to use the same retryHttpClientRequest method and clean up the code so there aren't 2 separate retry mechanisms

The last part would require a lot more refactoring, and I don't want to make this PR too big, so I want to do things in phases. In the past, large updates to the artifact actions took a while to get through and were harder to test, so I want to split things up into manageable chunks.

Contributor:

Regardless of what the other code is doing, I think we should still implement exponential backoff in the new code. The reason we use exponential backoff is that any fixed delay time we choose might be too short and put too much pressure on the service.

@konradpabjan (Author) replied:

Added! Most of the code is already there, so it was pretty simple.

```ts
/**
 * Returns a retry time in milliseconds that exponentially gets larger
 * depending on the amount of retries that have been attempted
 */
export function getExponentialRetryTimeInMilliseconds(
  retryCount: number
): number {
  if (retryCount < 0) {
    throw new Error('RetryCount should not be negative')
  } else if (retryCount === 0) {
    return getInitialRetryIntervalInMilliseconds()
  }
  const minTime =
    getInitialRetryIntervalInMilliseconds() * getRetryMultiplier() * retryCount
  const maxTime = minTime * getRetryMultiplier()
  // returns a random number between the minTime (inclusive) and the maxTime (exclusive)
  return Math.random() * (maxTime - minTime) + minTime
}
```
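As a rough worked example, assuming an initial interval of 3000 ms and a multiplier of 1.5 (the actual values come from config-variables.ts, so treat these numbers as illustrative):

```ts
// retryCount 0: always 3000 ms (the initial interval)
// retryCount 1: minTime = 3000 * 1.5 * 1 = 4500, maxTime = 4500 * 1.5 = 6750
//               -> random value in [4500, 6750)
// retryCount 2: minTime = 3000 * 1.5 * 2 = 9000, maxTime = 13500
//               -> random value in [9000, 13500)
// The Math.random() jitter spreads concurrent retries out so chunks
// don't all hammer the service at the same instant.
```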

```ts
  extraErrorInformation = errorMessages.get(statusCode)
}

isRetryable = isRetryableStatusCode(statusCode)
```
Contributor:

I think you can take this a step further to separate this into an HTTP layer and an HTTP-agnostic retry function:

  • Pull out getStatusCode, isSuccessStatusCode, isRetryableStatusCode from the retry function and replace it with a single function you pass in called shouldRetry
  • retryHttpClientResponse then can take the HTTP-related stuff and reduce it to a shouldRetry callback
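A minimal sketch of the suggested separation (hypothetical signatures, not code from this PR):

```ts
import * as core from '@actions/core'
import {IHttpClientResponse} from '@actions/http-client/interfaces'

// Assumed helper from this PR's requestUtils (declared here so the
// sketch stands alone).
declare function isRetryableStatusCode(code: number | undefined): boolean

// HTTP-agnostic core: knows nothing about status codes, only whether to retry.
async function retry<T>(
  name: string,
  method: () => Promise<T>,
  shouldRetry: (response: T) => boolean,
  maxAttempts: number
): Promise<T> {
  let response = await method()
  for (let attempt = 1; shouldRetry(response) && attempt < maxAttempts; attempt++) {
    core.info(`${name}: attempt ${attempt} failed, retrying`)
    response = await method()
  }
  return response
}

// Thin HTTP shim: reduces the status-code logic to a shouldRetry callback.
async function retryHttpClientResponse(
  name: string,
  method: () => Promise<IHttpClientResponse>,
  maxAttempts = 5
): Promise<IHttpClientResponse> {
  return retry(
    name,
    method,
    response => isRetryableStatusCode(response.message.statusCode),
    maxAttempts
  )
}
```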

@konradpabjan (Author) replied:

I was mostly going off of the existing retry function that cache uses. I feel like pulling out isSuccessStatusCode and getStatusCode could cause some repetition throughout the rest of the code. Overall I think the current pattern is sufficient for the calls that we need to make.

@brcrista (Contributor) replied Dec 17, 2020:

It shouldn't cause any repetition because retryHttpClientRequest will shim it. It might work but I don't think it's as clear as it could be -- separation of concerns will help with that, and make testing easier. I'd invoke the principle of "code is written once but read many times" here.

@konradpabjan (Author) replied:

Spent a bit of time trying to pull out each of the methods, but I couldn't really get it into a nice enough format without breaking too much, and I don't think it works all that well. The current getStatusCode method is actually structured the way it is so that it's easier to test each of the responses:

```ts
(response: ITestResponse) => response.statusCode,
```

My original intention was to use mostly what cache had without changing it too much, so that if there is a fix in one package, the two stay relatively the same and it's easy to follow.
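Concretely, the testing benefit looks something like this (a hypothetical test helper; ITestResponse is the test suite's own stub type, and the retry signature is the one shown at the top of this thread):

```ts
// Assumed retry signature from the top of this thread.
declare function retry<T>(
  name: string,
  method: () => Promise<T>,
  getStatusCode: (response: T) => number | undefined,
  errorMessages: Map<number, string>
): Promise<T>

// Stub response type used only by tests; no real HTTP traffic involved.
interface ITestResponse {
  statusCode: number
  result: string | null
}

// Because retry() takes getStatusCode as a parameter, a test can feed it
// canned responses and plain lambdas instead of mocking an HTTP client.
async function retryWithCannedResponses(
  responses: ITestResponse[]
): Promise<ITestResponse> {
  let attempt = 0
  return retry(
    'test-call',
    async () => responses[attempt++], // next canned response
    (response: ITestResponse) => response.statusCode, // the fragment above
    new Map<number, string>() // no custom error messages needed
  )
}
```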

```diff
@@ -0,0 +1,74 @@
+import {IHttpClientResponse} from '@actions/http-client/interfaces'
```
Contributor:

I saw https://github.com/actions/toolkit/blob/main/packages/cache/src/internal/requestUtils.ts which you linked. Would we be able to share the same basic retry logic?

Contributor:

cc @dhadka and @joshmgross to see if we can share code with the cache action

Member:

I think we'd need to move the shared code to its own package, or add this all to https://github.com/actions/http-client

@konradpabjan (Author) replied:

A separate (small) NPM package or http-client would make sense. This feels like a long-term investment, though.

Contributor:

Ok, if it needs a separate package then I agree we can hold off. We've tried sharing files among packages before in Azure Pipelines and it doesn't work well.

I think it could fit in @actions/io, but that would mean making it part of the public interface there.

Resolved (outdated) review threads on:
  • packages/artifact/__tests__/upload.test.ts (×2)
  • packages/artifact/src/internal/requestUtils.ts
@yacaovsnc (Contributor) commented:

My last comment is on this call:

> Single PATCH call - Update artifact size to indicate we are done (not retried)

Personally I think we should retry this call for any failures, not only on those status codes. This is the last call to create an artifact, and it feels bad if this call fails for some infrastructure issue and all prior efforts are wasted.

```ts
  attempt++
}

if (response) {
```
Contributor:

I think we want to move this closer to the actual call, so we can log diagnostic info for every operation. Otherwise LGTM.

Contributor:

Line 27 in this same file. We should just log it there.

@konradpabjan (Author) replied:

I'm going to leave it as is. This method gets called only when the request ultimately fails, so we display the full diagnostic info in just two scenarios:

  • we have exhausted all retries
  • a non-retryable status code was hit

Only if the call actually fails do we display the full diagnostics. In all other cases I don't think it's necessary, as the response code will be displayed.

@konradpabjan konradpabjan merged commit c861dd8 into main Dec 18, 2020
@konradpabjan konradpabjan deleted the konradpabjan/artifact-retryability branch December 18, 2020 20:40
btashton added a commit to apache/nuttx that referenced this pull request Dec 23, 2020
We are seeing a higher rate of artifact upload failures in GitHub. Other projects are also seeing this, as reported at actions/upload-artifact#116

There is a fix that was just merged in the base library:
actions/toolkit#675

So hopefully we can revert this before too long.

Signed-off-by: Brennan Ashton <bashton@brennanashton.com>
Ouss4 pushed a commit to apache/nuttx that referenced this pull request Dec 23, 2020, with the same commit message as above.
at-wat pushed a commit to at-wat/actions-toolkit that referenced this pull request Aug 31, 2023
* Retry all http calls for artifact upload and download

* Extra debug information

* Fix lint

* Always read response body

* PR Feedback

* Change error message if patch call fails

* Add exponential backoff when retrying

* Rework tests and add diagnostic info if exception thrown

* Fix lint

* fix lint error for real this time

* PR cleanup

* 0.5.0 @actions/artifact release

* Display diagnostic info if non-retryable code is hit