
Don't retry uploads when the http status code response from the server is in the 400's #131

Merged: lkillgore merged 7 commits into master from users/lkillgore/fastfailforbidden on Oct 23, 2019

Conversation

lkillgore (Contributor)

We shouldn't retry if the server is sending back 'Forbidden' or other 400 errors that indicate a problem with the caller. This change short-circuits the retry logic in those instances, and stops the rest of the batch from being sent.

Example console with 2 concurrent uploads for 7 files:

Uploading artifact 'test_artifact_item-1' from 'C:\work\runner\_layout\_work\lkillgore01\lkillgore01' for run #41
Uploading 7 files
Unable to copy file to server StatusCode=Forbidden:  . Source file path: C:\work\runner\_layout\_work\lkillgore01\lkillgore01\matrix.expected.json. Target server path: test_artifact_item-1/matrix.expected.json
Detail upload trace for file that fail to upload: test_artifact_item-1/matrix.expected.json
File 'test_artifact_item-1/matrix.expected.json' will be uploaded in one chunk.
Attempt '1' for uploading chunk '1' of file 'test_artifact_item-1/matrix.expected.json'.
Generate new HttpRequest for uploading file 'test_artifact_item-1/matrix.expected.json', chunk '1' of '1'.
Start uploading file 'test_artifact_item-1/matrix.expected.json' to server, chunk '1'.
Finished upload chunk '1' of file 'test_artifact_item-1/matrix.expected.json', elapsed 215 (ms), response code 'Forbidden'.
Unable to copy file to server StatusCode=Forbidden:  . Source file path: C:\work\runner\_layout\_work\lkillgore01\lkillgore01\matrix.actual.json. Target server path: test_artifact_item-1/matrix.actual.json
Detail upload trace for file that fail to upload: test_artifact_item-1/matrix.actual.json
File 'test_artifact_item-1/matrix.actual.json' will be uploaded in one chunk.
Chunk '1' attempt '1' of file 'test_artifact_item-1/matrix.expected.json' received non-success status code Forbidden for sending request and cannot continue.
Attempt '1' for uploading chunk '1' of file 'test_artifact_item-1/matrix.actual.json'.
Generate new HttpRequest for uploading file 'test_artifact_item-1/matrix.actual.json', chunk '1' of '1'.
Start uploading file 'test_artifact_item-1/matrix.actual.json' to server, chunk '1'.
Finished upload chunk '1' of file 'test_artifact_item-1/matrix.actual.json', elapsed 216 (ms), response code 'Forbidden'.
Chunk '1' attempt '1' of file 'test_artifact_item-1/matrix.actual.json' received non-success status code Forbidden for sending request and cannot continue.
Cannot continue uploading files, so draining upload queue of 5 items.
Cannot continue uploading files, so draining upload queue of 5 items.
File upload complete.
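
For illustration, here is a minimal sketch of the retry short-circuit described above; the helper delegate and all names are hypothetical, not the merged code:

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch: a retry loop that bails out on 4xx instead of burning retries.
static async Task UploadWithFastFailAsync(
    Func<CancellationToken, Task<HttpResponseMessage>> sendChunk, // hypothetical chunk sender
    int maxRetries,
    CancellationToken token)
{
    for (int attempt = 1; attempt <= maxRetries; attempt++)
    {
        HttpResponseMessage response = await sendChunk(token);
        if (response.IsSuccessStatusCode)
        {
            return; // chunk uploaded
        }

        int status = (int)response.StatusCode;
        if (status >= 400 && status <= 499)
        {
            // A 4xx means the request itself is the problem; retrying won't
            // help, so stop immediately and let the caller drain the batch.
            throw new HttpRequestException($"Fast-fail on {response.StatusCode}; not retrying.");
        }
        // 5xx and other transient failures fall through to the next attempt.
    }
    throw new HttpRequestException("Upload failed after all retry attempts.");
}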

Lucas Killgore added 2 commits October 15, 2019 17:10
@joshmgross (Member) left a comment

I think the user experience around hitting a limit on artifact uploads could be improved here. In the current experience, users get through their build and start uploading artifacts, only to see a lot of errors saying the upload failed. Looking at the logs in the PR description, it's hard to find the Forbidden status code and any associated error message.
I think ideally we would have one simple error like "Unable to upload artifacts: size limit reached. See [link to billing page] for more info."

Could we add some sort of environment variable at job queue time that indicates that the job can upload artifacts (limit hasn't been reached)?
We could check that before trying to upload here in the plugin and just fail the step then.

I could also see something like conditionally uploading based on that value becoming a thing, something like:

- name: Try Upload Artifact
  if: env.ArtifactLimitReached != 'true'
  uses: actions/upload-artifact@v1
  with:
    name: my-artifact
    path: samples

Users could modify that value, but it would just be a convenience thing as upload itself would fail if they tried to circumvent it.

src/Runner.Plugins/Artifact/FileContainerServer.cs (outdated review thread)
@bryanmacfarlane (Member)

Regarding feedback, you definitely want a good experience when they run out of paid storage. Forbidden with no info will just generate LSIs and dissatisfaction.

One option is for the Forbidden response to carry a good message coming from the service (as Josh suggested). (That's also why you don't want to retry after the server has successfully determined it can't fulfill the request.)

Or you go with another HTTP response code in that case, but there doesn't seem to be an appropriate 4xx (429 is the closest, but that's usually for throttling and too many requests).

@lkillgore (Contributor, Author)

Looking at the logs in the PR description

To clear up some confusion, those logs are incomplete. Here's a log in context which has the user-facing error at the end:

Uploading artifact 'test_artifact_item-1' from 'C:\work\runner\_layout\_work\_temp\_github_workflow\event.json' for run #5
Uploading 1 files
Unable to copy file to server StatusCode=NotFound:  . Source file path: C:\work\runner\_layout\_work\_temp\_github_workflow\event.json. Target server path: test_artifact_item-1/event.json
Detail upload trace for file that fail to upload: test_artifact_item-1/event.json
File 'test_artifact_item-1/event.json' will be uploaded in one chunk.
Attempt '1' for uploading chunk '1' of file 'test_artifact_item-1/event.json'.
Generate new HttpRequest for uploading file 'test_artifact_item-1/event.json', chunk '1' of '1'.
Start uploading file 'test_artifact_item-1/event.json' to server, chunk '1'.
Finished upload chunk '1' of file 'test_artifact_item-1/event.json', elapsed 270 (ms), response code 'NotFound'.
Chunk '1' attempt '1' of file 'test_artifact_item-1/event.json' received non-success status code NotFound for sending request and cannot continue.
Cannot continue uploading files, so draining upload queue of 0 items.
File upload complete.
Uploaded 'C:\work\runner\_layout\_work\_temp\_github_workflow\event.json' to server
##[error]Cannot upload artifacts. You may have reached your capacity.

The error message coming from the service, which will show up in their annotations, is currently: Cannot upload artifacts. You may have reached your capacity. (I think there's room for improvement in the message here, but I don't think that's for consideration in this PR.) If there are a lot of files to upload, there will be a separate error for each of the max concurrent attempts to upload a file; my guess is a high of ~4 on our hosted agents.

As a reaction to this, I don't like seeing Uploaded 'file-path' to server when it errors like this, so I think I'm going to change that to Uploaded 'XX' bytes from 'file-path' to server.

@joshmgross (Member) left a comment

I still feel like it's worth pre-empting the upload if we know at job queue time that storage is full and the upload would fail, but that's outside the scope of this PR, which is focused on failing the upload properly in the first place (something we'll need regardless).

public bool IsFastFailResponse(HttpResponseMessage response)
{
    int statusCode = (int)response?.StatusCode;
    return statusCode >= 400 && statusCode <= 499;
}
Collaborator

Do we need to be concerned with 429 (rate limiting) and react differently to it? The server may not handle rate limiting by responding with a 429, so it may be a moot point.

Contributor (Author)

That's a good question! I'm pretty sure it doesn't respond with 429, but if you'd like, I would be happy to make the adjustment.

Collaborator

If we are sure we don't respond with a 429 (which I don't believe is that common a practice; most servers just drop connections), we can keep this; otherwise, we would probably want to retry, and we should return false here. Honestly, the second is probably safer, but either is fine with me as long as we verify the server behavior for the former!
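
If the server ever did start returning 429, one way to carve throttling out of the fast-fail range (a sketch against the excerpt above, not the merged code):

using System.Net.Http;

public bool IsFastFailResponse(HttpResponseMessage response)
{
    if (response == null)
    {
        return false;
    }

    int statusCode = (int)response.StatusCode;
    // 429 Too Many Requests signals throttling; retrying (ideally with
    // backoff) is more appropriate than failing the whole batch.
    if (statusCode == 429)
    {
        return false;
    }
    return statusCode >= 400 && statusCode <= 499;
}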

Contributor

I'd also double-check whether the FCS can potentially respond with a "409 - Conflict", which could mean the file has already been uploaded.

Contributor (Author)

@juliobbv It does return Conflict. I think that should be fine with this change, though. Or am I missing something? (I don't think we would want to retry the upload in the case that it already exists.)

@juliobbv (Contributor) commented Oct 18, 2019

I think that, for Conflict, we should handle the code so the runner doesn't retry the upload (which it's already doing), but log the trace in a way that's not seen as an "error" per se.

Contributor (Author)

Wouldn't you want the task to error out and for the message to say 'Conflict'?

Contributor (Author)

After some discussion offline with @thboop, I think I understand what you're saying now. We don't want to fast-fail on a Conflict. Rather, we want to treat that as a 'regular' error and keep attempting to upload the rest of the batch of files if needed.
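
In other words (a hypothetical classification written out for clarity; the enum and method are illustrative, not the PR's code):

using System.Net;
using System.Net.Http;

// Illustrative outcomes for a chunk-upload response.
enum UploadOutcome { Success, RetryableFailure, FailFileOnly, FailBatch }

static UploadOutcome Classify(HttpResponseMessage response)
{
    if (response.IsSuccessStatusCode)
    {
        return UploadOutcome.Success;
    }
    if (response.StatusCode == HttpStatusCode.Conflict)
    {
        // 409: the file already exists. Don't retry it, but keep
        // uploading the rest of the batch.
        return UploadOutcome.FailFileOnly;
    }

    int status = (int)response.StatusCode;
    if (status >= 400 && status <= 499)
    {
        // Other 4xx: a caller problem; drain the remaining upload queue.
        return UploadOutcome.FailBatch;
    }
    // 5xx and the rest: transient, go through the normal retry path.
    return UploadOutcome.RetryableFailure;
}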

{
    context.Output($"Cannot continue uploading files, so draining upload queue of {_fileUploadQueue.Count} items.");
    DrainUploadQueue(context);
    failedFiles.Clear();
}
Collaborator

We react to the number of failedFiles in the FileContainerService to determine whether our upload was successful or not, including logging messages like:

                    if (uploadResult.FailedFiles.Count == 0)
                    {
                        // all files have been upload succeed.
                        context.Output("File upload succeed.");
                        return uploadResult.TotalFileSizeUploaded;
                    }

Clearing these files likely has unintended side effects we may need to react to.
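
One hedged way to keep the failure signal while still clearing the retry list (an illustrative sketch, not the PR's code):

// Sketch: record the error status before clearing the retry list, so a
// fast-fail still fails the step even though nothing is left to retry.
bool batchFailed = failedFiles.Count > 0;
failedFiles.Clear(); // nothing should be retried after a fast-fail
if (batchFailed)
{
    throw new Exception("Artifact upload failed with a non-retryable status code.");
}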

Contributor (Author)

I thought I had fixed most of those. Can you call out any I missed?

Contributor (Author)

...also, I agree with your point. I'm not sure if I've covered all the cases with this change.

@thboop (Collaborator) commented Oct 18, 2019

It definitely feels like the pattern we are using here was not intended for this use case.

In the past, we looked at the number of failed files and essentially used that as an error response if its count != 0. We also made sure to retry all of the files that failed, so it served two purposes: error reporting and a list of files to retry.

Now, we don't want to retry these files, so we remove them from the response, and we sort of lose our "error" status.

We react to the number of failed files in a lot of locations:

  • SetupParallelUpload: https://github.com/actions/runner/blob/users/lkillgore/fastfailforbidden/src/Runner.Plugins/Artifact/FileContainerServer.cs#L369
  • That calls AddUploadresult, which creates one big list with all failed files: https://github.com/actions/runner/blob/users/lkillgore/fastfailforbidden/src/Runner.Plugins/Artifact/FileContainerServer.cs#L642
  • We then check that list to see whether we are complete or need to retry: https://github.com/actions/runner/blob/users/lkillgore/fastfailforbidden/src/Runner.Plugins/Artifact/FileContainerServer.cs#L239

And if we eventually fail that retry, we throw:
https://github.com/actions/runner/blob/users/lkillgore/fastfailforbidden/src/Runner.Plugins/Artifact/FileContainerServer.cs#L269

Is the intention to have this step succeed or fail if we fail to upload some artifacts due to a 400 error code? Before this PR, it looks like we would throw if the retry fails, which should fail the step; but now it looks like we are making the step succeed and creating annotations, is that correct?

@thboop (Collaborator) commented Oct 18, 2019

Which is a lot of words for just:

  • What happens currently if we fail to upload something due to 4xx errors? I think the step fails.
  • What happens after this PR? According to the code above, I think it would pass with annotations, correct?
  • If this is a new behavior, and it's the new behavior we want (should confirm with ChrisPat or Bryan), we may need to version the artifact action.

Contributor (Author)

Good point! This was failing, but for the wrong reasons. The upload was failing but would continue to process the task; the second call to attach the artifacts to the build was what failed the task. I'm changing this to throw, as happens with the other failures, but still try to attach the files to the build.

Contributor (Author)

I reworked this file quite a bit. The results:

  1. This function will fail with anything but a 'create' response from the server
  2. 'Conflict' will fail the task with an error message, but will keep trying to upload the rest of the batch of files (should there be a batch of files)
  3. 5xx will go through a set of retries
  4. Any failure will still try to attach the container to the build. This has the benefit of returning potentially improved error messages, and if a portion of the files were uploaded, they will be accessible to the build.

As far as I know, there aren't any occasions where the existing upload task's final status will differ from this change. This one will not retry uploads on 4xx failures, but all of those retries would likely have ultimately ended in failure anyway.

So, while the behavior may be different, it likely won't be detectable by the end user. My personal feelings are that this won't require versioning this task.
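
A compressed sketch of that control flow (the exception type and helper names are placeholders, not the PR's actual identifiers):

using System;
using System.Threading;
using System.Threading.Tasks;

// Placeholder for the PR's typed exception; the real name may differ.
class ArtifactUploadFailedException : Exception
{
    public ArtifactUploadFailedException(string message) : base(message) { }
}

static async Task UploadArtifactAsync(CancellationToken token)
{
    bool allFilesUploaded = false;
    try
    {
        // Hypothetical helper: parallel upload with 5xx retries, Conflict
        // failing only the individual file, and any other 4xx draining
        // the rest of the queue.
        allFilesUploaded = await TryUploadAllFilesAsync(token);
    }
    finally
    {
        // Always attach the container, so files that did upload remain
        // accessible on the build and the service can surface a better
        // error message.
        await AttachContainerToBuildAsync(token);
    }

    if (!allFilesUploaded)
    {
        throw new ArtifactUploadFailedException("One or more files failed to upload.");
    }
}

// Stubs so the sketch stands alone; the real logic lives in FileContainerServer.cs.
static Task<bool> TryUploadAllFilesAsync(CancellationToken token) => Task.FromResult(true);
static Task AttachContainerToBuildAsync(CancellationToken token) => Task.CompletedTask;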

Lucas Killgore added 2 commits October 18, 2019 15:44
@lkillgore (Contributor, Author)

lkillgore commented Oct 18, 2019

@bryanmacfarlane A couple of things to call out here:

  1. StatusCode 409 (Conflict) is special-cased in the fast-fail logic (for when the file already exists for the build): it won't stop the rest of the batch, but it will still nominally fail the task.
  2. This change includes a change for 'normal failure'. If 5 files are uploaded and one of them gets a 500 and the retry fails, this PR will now attach the 4 successful files to the build.

Do you think that this change will require versioning the plugin? (I would think not, though it's not clear to me all the factors that go into the decision on whether to bump the version or not.)

Called out specialized 'Conflict' logic.
@pjquirk (Contributor) left a comment

I'm no runner expert, but LGTM :)

@madhurig

@joshmgross @lkillgore - The question of whether we should fail the job on the server itself if no storage bytes are available was asked, and we decided to go with running the build and only blocking artifact upload (agreed, out of scope for this PR).

The changes look good; I see you are handling Conflict vs. other 400 errors separately. I think we need a dial for safe rollout of this change, and I don't think the server-side feature flag gives us enough control. I am not familiar with plugin versions. Does updating the plugin version give more control over rolling the change out to a subset of customers?

@joshmgross (Member)

The question of whether we should fail the job on the server itself if no storage bytes are available was asked

@madhurig I agree it's out of scope, but my point was more "how can we easily fail at artifact upload" instead of failing when queueing the job. I think an environment variable that users can check, and that this task can check to see if we're out of storage, would be a better user experience overall.

Does updating the plugin version give more control on rolling the change to a subset of customers?

No, I don't believe so. Only one plugin version lives on the runner unless we package multiple, like the Pipeline Artifact team did on the AzP agent. So this change will roll out with the next runner update. I believe we can do that by ring, though, so we will get a subset of customers using the new runner at first.

@lkillgore lkillgore merged commit e89148e into master Oct 23, 2019
@lkillgore lkillgore deleted the users/lkillgore/fastfailforbidden branch October 23, 2019 16:48
@lkillgore (Contributor, Author)

@madhurig @joshmgross I missed your comments before I merged. Seeing as this is limited to failure responses for uploads, I think this is a pretty safe change. It's only for FileContainerService, which doesn't have the caching issues other CDNs have, where they return 401s or 404s in that case.

Feature flags (FFs) aren't supported in the runner; the pattern is to have a server-side FF with environment variables. That seems pretty heavy to me. The easiest method is to revert and roll forward.

Alternatively, as I understand, this plugin could be versioned on the plugin side (see comment above to Bryan re: versioning the plugin). I'm not sure what that would entail. I'll do some digging.

AdamOlech pushed a commit to antmicro/runner that referenced this pull request Jan 28, 2021
Don't retry uploads when the http status code response from the server is in the 400's (actions#131)

* Don't retry uploads when the http status code
response from the server is in the 400's

* Don't retry on fast-fail

* Feedback from code review

* Always try to attach any uploaded files to the build
Don't fast-fail on 'Conflict'

* Add dispose

* Refactored upload code.
Called out specialized 'Conflict' logic.

* Added typed exception.