
Don't retry uploads when the http status code response from the server is in the 400's #131

Merged: lkillgore merged 7 commits into master from users/lkillgore/fastfailforbidden on Oct 23, 2019

Conversation

lkillgore (Contributor)

We shouldn't retry if the server is sending back 'Forbidden' or other 400 errors that indicate a problem with the caller. This change short-circuits the retry logic in those instances, and stops the rest of the batch from being sent.

Example console with 2 concurrent uploads for 7 files:

Uploading artifact 'test_artifact_item-1' from 'C:\work\runner\_layout\_work\lkillgore01\lkillgore01' for run #41
Uploading 7 files
Unable to copy file to server StatusCode=Forbidden:  . Source file path: C:\work\runner\_layout\_work\lkillgore01\lkillgore01\matrix.expected.json. Target server path: test_artifact_item-1/matrix.expected.json
Detail upload trace for file that fail to upload: test_artifact_item-1/matrix.expected.json
File 'test_artifact_item-1/matrix.expected.json' will be uploaded in one chunk.
Attempt '1' for uploading chunk '1' of file 'test_artifact_item-1/matrix.expected.json'.
Generate new HttpRequest for uploading file 'test_artifact_item-1/matrix.expected.json', chunk '1' of '1'.
Start uploading file 'test_artifact_item-1/matrix.expected.json' to server, chunk '1'.
Finished upload chunk '1' of file 'test_artifact_item-1/matrix.expected.json', elapsed 215 (ms), response code 'Forbidden'.
Unable to copy file to server StatusCode=Forbidden:  . Source file path: C:\work\runner\_layout\_work\lkillgore01\lkillgore01\matrix.actual.json. Target server path: test_artifact_item-1/matrix.actual.json
Detail upload trace for file that fail to upload: test_artifact_item-1/matrix.actual.json
File 'test_artifact_item-1/matrix.actual.json' will be uploaded in one chunk.
Chunk '1' attempt '1' of file 'test_artifact_item-1/matrix.expected.json' received non-success status code Forbidden for sending request and cannot continue.
Attempt '1' for uploading chunk '1' of file 'test_artifact_item-1/matrix.actual.json'.
Generate new HttpRequest for uploading file 'test_artifact_item-1/matrix.actual.json', chunk '1' of '1'.
Start uploading file 'test_artifact_item-1/matrix.actual.json' to server, chunk '1'.
Finished upload chunk '1' of file 'test_artifact_item-1/matrix.actual.json', elapsed 216 (ms), response code 'Forbidden'.
Chunk '1' attempt '1' of file 'test_artifact_item-1/matrix.actual.json' received non-success status code Forbidden for sending request and cannot continue.
Cannot continue uploading files, so draining upload queue of 5 items.
Cannot continue uploading files, so draining upload queue of 5 items.
File upload complete.
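
For illustration, here is a minimal sketch of the retry short-circuit described above; the helper delegate and all names are hypothetical, not the merged code:

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch: a retry loop that bails out on 4xx instead of burning retries.
static async Task UploadWithFastFailAsync(
    Func<CancellationToken, Task<HttpResponseMessage>> sendChunk, // hypothetical chunk sender
    int maxRetries,
    CancellationToken token)
{
    for (int attempt = 1; attempt <= maxRetries; attempt++)
    {
        HttpResponseMessage response = await sendChunk(token);
        if (response.IsSuccessStatusCode)
        {
            return; // chunk uploaded
        }

        int status = (int)response.StatusCode;
        if (status >= 400 && status <= 499)
        {
            // A 4xx means the request itself is the problem; retrying won't
            // help, so stop immediately and let the caller drain the batch.
            throw new HttpRequestException($"Fast-fail on {response.StatusCode}; not retrying.");
        }
        // 5xx and other transient failures fall through to the next attempt.
    }
    throw new HttpRequestException("Upload failed after all retry attempts.");
}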

Lucas Killgore added 2 commits October 15, 2019 17:10
@joshmgross (Member) left a comment

I think the user experience around hitting a limit on artifact uploads could be improved here. In the current experience, users get through their build and start uploading artifacts, only to see a lot of errors saying the upload failed. Looking at the logs in the PR description, it's hard to find the Forbidden status code and any associated error message.
I think ideally we would have one simple error like "Unable to upload artifacts: size limit reached. See [link to billing page] for more info."

Could we add some sort of environment variable at job queue time that indicates that the job can upload artifacts (limit hasn't been reached)?
We could check that before trying to upload here in the plugin and just fail the step then.

I could also see something like conditionally uploading based on that value becoming a thing, something like:

- name: Try Upload Artifact
  if: env.ArtifactLimitReached != 'true'
  uses: actions/upload-artifact@v1
  with:
    name: my-artifact
    path: samples

Users could modify that value, but it would just be a convenience thing as upload itself would fail if they tried to circumvent it.

src/Runner.Plugins/Artifact/FileContainerServer.cs (outdated review thread)
@bryanmacfarlane (Member)

Regarding feedback, you definitely want a good experience when they run out of paid storage. Forbidden with no info will just generate LSIs and dissatisfaction.

One option is for the Forbidden response to carry a good message coming from the service (as Josh suggested). (That's also why you don't want to retry after the server has successfully determined it can't fulfill the request.)

Or you go with another HTTP response code in that case, but there doesn't seem to be an appropriate 4xx (429 is the closest, but that's usually for throttling and too many requests).

@lkillgore (Contributor, Author)

Looking at the logs in the PR description

To clear up some confusion, those logs are incomplete. Here's a log in context which has the user-facing error at the end:

Uploading artifact 'test_artifact_item-1' from 'C:\work\runner\_layout\_work\_temp\_github_workflow\event.json' for run #5
Uploading 1 files
Unable to copy file to server StatusCode=NotFound:  . Source file path: C:\work\runner\_layout\_work\_temp\_github_workflow\event.json. Target server path: test_artifact_item-1/event.json
Detail upload trace for file that fail to upload: test_artifact_item-1/event.json
File 'test_artifact_item-1/event.json' will be uploaded in one chunk.
Attempt '1' for uploading chunk '1' of file 'test_artifact_item-1/event.json'.
Generate new HttpRequest for uploading file 'test_artifact_item-1/event.json', chunk '1' of '1'.
Start uploading file 'test_artifact_item-1/event.json' to server, chunk '1'.
Finished upload chunk '1' of file 'test_artifact_item-1/event.json', elapsed 270 (ms), response code 'NotFound'.
Chunk '1' attempt '1' of file 'test_artifact_item-1/event.json' received non-success status code NotFound for sending request and cannot continue.
Cannot continue uploading files, so draining upload queue of 0 items.
File upload complete.
Uploaded 'C:\work\runner\_layout\_work\_temp\_github_workflow\event.json' to server
##[error]Cannot upload artifacts. You may have reached your capacity.

The error message coming from the service, which will show up in their annotations, is currently: Cannot upload artifacts. You may have reached your capacity. (I think there's room for improvement in the message here, but I don't think that's for consideration in this PR.) If there are a lot of files to upload, there will be a separate error for each of the max concurrent attempts to upload a file; my guess is a high of ~4 on our hosted agents.

As a reaction to this, I don't like seeing Uploaded 'file-path' to server when it errors like this, so I think I'm going to change that to Uploaded 'XX' bytes from 'file-path' to server.

@joshmgross (Member) left a comment

I still feel like it's worth pre-empting the upload if we know at job queue time that storage is full and the upload would fail, but that's outside the scope of this PR, which is focused on failing the upload properly in the first place (something we'll need regardless).

public bool IsFastFailResponse(HttpResponseMessage response)
{
    int statusCode = (int)response?.StatusCode;
    return statusCode >= 400 && statusCode <= 499;
}
Collaborator

Do we need to be concerned with 429 (rate limiting) and react differently to it? The server may not handle rate limiting by responding with a 429, so it may be a moot point.

Contributor (Author)

That's a good question! I'm pretty sure it doesn't respond with 429, but if you'd like, I would be happy to make the adjustment.

Collaborator

If we are sure we don't respond with a 429 (which I don't believe is that common a practice; most servers just drop connections), we can keep this; otherwise, we would probably want to retry, and we should return false here. Honestly, the second is probably safer, but either is fine with me as long as we verify the server behavior for the former!
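
If the server ever did start returning 429, one way to carve throttling out of the fast-fail range (a sketch against the excerpt above, not the merged code):

using System.Net.Http;

public bool IsFastFailResponse(HttpResponseMessage response)
{
    if (response == null)
    {
        return false;
    }

    int statusCode = (int)response.StatusCode;
    // 429 Too Many Requests signals throttling; retrying (ideally with
    // backoff) is more appropriate than failing the whole batch.
    if (statusCode == 429)
    {
        return false;
    }
    return statusCode >= 400 && statusCode <= 499;
}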

Contributor

I'd also double-check whether the FCS can potentially respond with a "409 - Conflict", which could mean the file has already been uploaded.

Contributor (Author)

@juliobbv It does return Conflict. I think that should be fine with this change, though. Or am I missing something? (I don't think we would want to retry the upload in the case that it already exists.)

@juliobbv (Contributor) commented Oct 18, 2019

I think that, for Conflict, we should handle the code so the runner doesn't retry the upload (which it's already doing), but log the trace in a way that's not seen as an "error" per se.

Contributor (Author)

Wouldn't you want the task to error out and for the message to say 'Conflict'?

Contributor (Author)

After some discussion offline with @thboop, I think I understand what you're saying now. We don't want to fast-fail on a Conflict. Rather, we want to treat that as a 'regular' error and keep attempting to upload the rest of the batch of files if needed.
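
In other words (a hypothetical classification written out for clarity; the enum and method are illustrative, not the PR's code):

using System.Net;
using System.Net.Http;

// Illustrative outcomes for a chunk-upload response.
enum UploadOutcome { Success, RetryableFailure, FailFileOnly, FailBatch }

static UploadOutcome Classify(HttpResponseMessage response)
{
    if (response.IsSuccessStatusCode)
    {
        return UploadOutcome.Success;
    }
    if (response.StatusCode == HttpStatusCode.Conflict)
    {
        // 409: the file already exists. Don't retry it, but keep
        // uploading the rest of the batch.
        return UploadOutcome.FailFileOnly;
    }

    int status = (int)response.StatusCode;
    if (status >= 400 && status <= 499)
    {
        // Other 4xx: a caller problem; drain the remaining upload queue.
        return UploadOutcome.FailBatch;
    }
    // 5xx and the rest: transient, go through the normal retry path.
    return UploadOutcome.RetryableFailure;
}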

{
    context.Output($"Cannot continue uploading files, so draining upload queue of {_fileUploadQueue.Count} items.");
    DrainUploadQueue(context);
    failedFiles.Clear();
}
Collaborator

We react to the number of failedFiles in the FileContainerService to determine whether our upload was successful or not, including logging messages like:

                    if (uploadResult.FailedFiles.Count == 0)
                    {
                        // all files have been upload succeed.
                        context.Output("File upload succeed.");
                        return uploadResult.TotalFileSizeUploaded;
                    }

Clearing these files likely has unintended side effects we may need to react to.
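
One hedged way to keep the failure signal while still clearing the retry list (an illustrative sketch, not the PR's code):

// Sketch: record the error status before clearing the retry list, so a
// fast-fail still fails the step even though nothing is left to retry.
bool batchFailed = failedFiles.Count > 0;
failedFiles.Clear(); // nothing should be retried after a fast-fail
if (batchFailed)
{
    throw new Exception("Artifact upload failed with a non-retryable status code.");
}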

Contributor (Author)

I thought I had fixed most of those. Can you call out any I missed?

Contributor (Author)

...also, I agree with your point. I'm not sure if I've covered all the cases with this change.

@thboop (Collaborator) commented Oct 18, 2019

It definitely feels like the pattern we are using here was not intended for this use case.

In the past, we looked at the number of failed files and essentially used that as an error response if its count != 0. We also made sure to retry all of the files that failed, so it served two purposes: error reporting and a list of files to retry.

Now, we don't want to retry these files, so we remove them from the response, and we sort of lose our "error" status.

We react to the number of failed files in a lot of locations:

  • SetupParallelUpload: https://github.com/actions/runner/blob/users/lkillgore/fastfailforbidden/src/Runner.Plugins/Artifact/FileContainerServer.cs#L369
  • That calls AddUploadresult, which creates one big list with all failed files: https://github.com/actions/runner/blob/users/lkillgore/fastfailforbidden/src/Runner.Plugins/Artifact/FileContainerServer.cs#L642
  • We then check that list to see whether we are complete or need to retry: https://github.com/actions/runner/blob/users/lkillgore/fastfailforbidden/src/Runner.Plugins/Artifact/FileContainerServer.cs#L239

And if we eventually fail that retry, we throw:
https://github.com/actions/runner/blob/users/lkillgore/fastfailforbidden/src/Runner.Plugins/Artifact/FileContainerServer.cs#L269

Is the intention to have this step succeed or fail if we fail to upload some artifacts due to a 400 error code? Before this PR, it looks like we would throw if the retry fails, which should fail the step; but now it looks like we are making the step succeed and creating annotations, is that correct?

@thboop (Collaborator) commented Oct 18, 2019

Which is a lot of words for just:

  • What happens currently if we fail to upload something due to 4xx errors? I think the step fails.
  • What happens after this PR? According to the code above, I think it would pass with annotations, correct?
  • If this is a new behavior, and it's the new behavior we want (should confirm with ChrisPat or Bryan), we may need to version the artifact action.

Contributor (Author)

Good point! This was failing, but for the wrong reasons. The upload was failing but would continue to process the task; the second call to attach the artifacts to the build was what failed the task. I'm changing this to throw, as happens with the other failures, but still try to attach the files to the build.

Contributor (Author)

I reworked this file quite a bit. The results:

  1. This function will fail with anything but a 'create' response from the server
  2. 'Conflict' will fail the task with an error message, but will keep trying to upload the rest of the batch of files (should there be a batch of files)
  3. 5xx will go through a set of retries
  4. Any failure will still try to attach the container to the build. This has the benefit of returning potentially improved error messages, and if a portion of the files were uploaded, they will be accessible to the build.

As far as I know, there aren't any occasions where the existing upload task's final status will differ from this change. This one will not retry uploads on 4xx failures, but all of those retries would likely have ultimately ended in failure anyway.

So, while the behavior may be different, it likely won't be detectable by the end user. My personal feelings are that this won't require versioning this task.
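
A compressed sketch of that control flow (the exception type and helper names are placeholders, not the PR's actual identifiers):

using System;
using System.Threading;
using System.Threading.Tasks;

// Placeholder for the PR's typed exception; the real name may differ.
class ArtifactUploadFailedException : Exception
{
    public ArtifactUploadFailedException(string message) : base(message) { }
}

static async Task UploadArtifactAsync(CancellationToken token)
{
    bool allFilesUploaded = false;
    try
    {
        // Hypothetical helper: parallel upload with 5xx retries, Conflict
        // failing only the individual file, and any other 4xx draining
        // the rest of the queue.
        allFilesUploaded = await TryUploadAllFilesAsync(token);
    }
    finally
    {
        // Always attach the container, so files that did upload remain
        // accessible on the build and the service can surface a better
        // error message.
        await AttachContainerToBuildAsync(token);
    }

    if (!allFilesUploaded)
    {
        throw new ArtifactUploadFailedException("One or more files failed to upload.");
    }
}

// Stubs so the sketch stands alone; the real logic lives in FileContainerServer.cs.
static Task<bool> TryUploadAllFilesAsync(CancellationToken token) => Task.FromResult(true);
static Task AttachContainerToBuildAsync(CancellationToken token) => Task.CompletedTask;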

Lucas Killgore added 2 commits October 18, 2019 15:44
@lkillgore (Contributor, Author)

lkillgore commented Oct 18, 2019

@bryanmacfarlane A couple of things to call out here:

  1. StatusCode 409 (Conflict) is special-cased in the fast-fail logic (for when the file already exists for the build): it won't stop the rest of the batch, but it will still nominally fail the task.
  2. This change includes a change for 'normal failure'. If 5 files are uploaded and one of them gets a 500 and the retry fails, this PR will now attach the 4 successful files to the build.

Do you think that this change will require versioning the plugin? (I would think not, though it's not clear to me all the factors that go into the decision on whether to bump the version or not.)

Called out specialized 'Conflict' logic.
@pjquirk (Contributor) left a comment

I'm no runner expert, but LGTM :)

@madhurig

@joshmgross @lkillgore - The question of whether we should fail the job on the server itself if no storage bytes are available was asked, and we decided to go with running the build and only blocking artifact upload (agreed, out of scope for this PR).

The changes look good; I see you are handling Conflict vs. other 400 errors separately. I think we need a dial for safe rollout of this change, and I don't think the server-side feature flag gives us enough control. I am not familiar with plugin versions. Does updating the plugin version give more control over rolling the change out to a subset of customers?

@joshmgross (Member)

The question of whether we should fail the job on the server itself if no storage bytes are available was asked

@madhurig I agree it's out of scope, but my point was more "how can we easily fail at artifact upload" instead of failing when queueing the job. I think an environment variable that users can check, and that this task can check to see if we're out of storage, would be a better user experience overall.

Does updating the plugin version give more control on rolling the change to a subset of customers?

No, I don't believe so. Only one plugin version lives on the runner unless we package multiple, like the Pipeline Artifact team did on the AzP agent. So this change will roll out with the next runner update. I believe we can do that by ring, though, so we will get a subset of customers using the new runner at first.

@lkillgore lkillgore merged commit e89148e into master Oct 23, 2019
@lkillgore lkillgore deleted the users/lkillgore/fastfailforbidden branch October 23, 2019 16:48
@lkillgore (Contributor, Author)

@madhurig @joshmgross I missed your comments before I merged. Seeing as this is limited to failure responses for uploads, I think this is a pretty safe change. It's only for FileContainerService, which doesn't have the caching issues other CDNs have, where they return 401s or 404s in that case.

Feature flags (FFs) aren't supported in the runner; the pattern is to have a server-side FF with environment variables. That seems pretty heavy to me. The easiest method is to revert and roll forward.

Alternatively, as I understand, this plugin could be versioned on the plugin side (see comment above to Bryan re: versioning the plugin). I'm not sure what that would entail. I'll do some digging.

AdamOlech pushed a commit to antmicro/runner that referenced this pull request Jan 28, 2021
Don't retry uploads when the http status code response from the server is in the 400's (actions#131)

* Don't retry uploads when the http status code
response from the server is in the 400's

* Don't retry on fast-fail

* Feedback from code review

* Always try to attach any uploaded files to the build
Don't fast-fail on 'Conflict'

* Add dispose

* Refactored upload code.
Called out specialized 'Conflict' logic.

* Added typed exception.