Don't retry uploads when the http status code response from the server is in the 400's #131
Conversation
I think the user experience around hitting a limit on artifact uploads could be improved here. In the current experience, they'll get through their build and start uploading artifacts only to see a lot of errors that upload failed. Looking at the logs in the PR description, it's hard to find the forbidden status code and any associated error message.
I think ideally we would surface one simple error like "Unable to upload artifacts: size limit reached. See [link to billing page] for more info."
Could we add some sort of environment variable at job queue time that indicates that the job can upload artifacts (limit hasn't been reached)?
We could check that before trying to upload here in the plugin and just fail the step then.
I could also see something like conditionally uploading based on that value become a thing, something like:
- name: Try Upload Artifact
  if: env.ArtifactLimitReached != 'true'
  uses: actions/upload-artifact@v1
  with:
    name: my-artifact
    path: samples
Users could modify that value, but it would just be a convenience thing as upload itself would fail if they tried to circumvent it.
Regarding feedback, you definitely want a good experience when they run out of paid storage. Forbidden with no info will just generate live-site incidents and dissatisfaction. Options: the Forbidden response should carry a good message (as Josh suggested) coming from the service (also why you don't want to retry after the server has successfully determined it can't fulfill the request), or you go with another HTTP response code in that case, but there doesn't seem to be an appropriate 4xx (429 is the closest, but that's usually for throttling and too many requests).
To clear up some confusion, those logs are incomplete. Here's a log in context which has the user-facing error at the end:
The error message coming from the service, which will show up in their annotations, is currently: As a reaction to this, I don't like seeing
I still feel like it's worth pre-empting the upload if we know at job queue time that storage is full and the upload would fail, but that's outside the scope of this PR, which is focused on failing the upload in the first place (something we'll need regardless).
public bool IsFastFailResponse(HttpResponseMessage response)
{
    int statusCode = (int)response?.StatusCode;
    return statusCode >= 400 && statusCode <= 499;
}
Do we need to be concerned with 429 (rate limiting) and react differently to it? The server may not handle rate limiting by responding with a 429, so it may be a moot point.
That's a good question! I'm pretty sure it doesn't respond with 429, but if you'd like, I would be happy to make the adjustment.
If we are sure we don't respond with a 429 (which I don't believe is that common a practice; most servers just drop connections), we can keep this. Otherwise we probably want to retry, and we should return false here. Honestly, the second option is probably safer, but either is fine with me as long as we verify the server behavior for the former!
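The runner itself is written in C#, but the classification being debated here is easy to sketch. This Python version (function name is hypothetical) carves 429 out of the fast-fail range so throttled requests would still be retried:

```python
def is_fast_fail_response(status_code: int) -> bool:
    """Return True when a retry cannot help: any 4xx except 429.

    429 (Too Many Requests) signals throttling, so retrying after a
    backoff may succeed; every other 4xx indicates a caller problem
    that a retry will not fix.
    """
    if status_code == 429:
        return False
    return 400 <= status_code <= 499
```

This is the "return false in here" variant described above; if the service is verified never to send 429, the simple range check in the diff is equivalent in practice.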
I'd also double check if the FCS can potentially respond with a "409 - Conflict", which could mean the file has already been uploaded.
I think that, for Conflict, we should handle the code so the runner doesn't retry the upload (which it already does), but log the trace in a way that isn't seen as an "error" per se.
Wouldn't you want the task to error out and for the message to say 'Conflict'?
After some discussion offline with @thboop, I think I understand what you're saying now. We don't want to fast-fail on a Conflict. Rather, we want to treat that as a 'regular' error and keep attempting to upload the rest of the batch of files if needed.
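Treating Conflict this way amounts to a three-way decision for each failed response. A minimal Python sketch of that decision table (hypothetical names, not the runner's actual C#):

```python
def classify_upload_failure(status_code: int) -> str:
    """Hypothetical decision table for a failed file upload.

    - 409 Conflict: the file already exists; record a per-file failure
      but keep uploading the rest of the batch (no fast-fail).
    - any other 4xx: fast-fail and drain the remaining upload queue,
      since retries cannot fix a caller-side problem.
    - 5xx: transient server error, worth retrying.
    """
    if status_code == 409:
        return "fail-file-continue-batch"
    if 400 <= status_code <= 499:
        return "fast-fail-drain-queue"
    return "retry"
```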
{
    context.Output($"Cannot continue uploading files, so draining upload queue of {_fileUploadQueue.Count} items.");
    DrainUploadQueue(context);
    failedFiles.Clear();
We react to the number of failedFiles in the FileContainerService to determine whether our upload was successful, including logging messages like:
if (uploadResult.FailedFiles.Count == 0)
{
    // all files have been uploaded successfully
    context.Output("File upload succeed.");
    return uploadResult.TotalFileSizeUploaded;
}
Clearing these files likely has unintended side effects we may need to react to.
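To illustrate the side effect: the success check keys off the failed-file count, so clearing the list erases the error signal. A minimal Python model of the interaction (hypothetical, not the actual C#):

```python
def upload_succeeded(failed_files: list[str]) -> bool:
    # Model of the FileContainerService check above: the only "error"
    # signal is the count of failed files.
    return len(failed_files) == 0

# Files drained from the queue after a fast-fail:
failed = ["a.txt", "b.txt"]
assert not upload_succeeded(failed)

failed.clear()  # the change under review
# The upload now reports success despite the drained queue:
assert upload_succeeded(failed)
```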
I thought I had fixed most of those. Can you call out any I missed?
...also, I agree with your point. I'm not sure if I've covered all the cases with this change.
It definitely feels like the pattern we are using here was not intended for this use case.
In the past, we looked at the number of failed files and used it essentially as an error response if its count != 0. We also made sure to retry all of the files that failed, so it served two purposes: error reporting and a list of files to retry.
Now we don't want to retry these files, so we remove them from the response, and we sort of lose our "error" status.
We react to the number of files in a lot of locations:
- SetupParallelUpload: https://github.com/actions/runner/blob/users/lkillgore/fastfailforbidden/src/Runner.Plugins/Artifact/FileContainerServer.cs#L369
  which calls AddUploadresult, which creates one big list of all failed files:
  https://github.com/actions/runner/blob/users/lkillgore/fastfailforbidden/src/Runner.Plugins/Artifact/FileContainerServer.cs#L642
- We then check that list to see whether we are complete or need to retry:
  https://github.com/actions/runner/blob/users/lkillgore/fastfailforbidden/src/Runner.Plugins/Artifact/FileContainerServer.cs#L239
- And if we eventually fail that retry, we throw:
  https://github.com/actions/runner/blob/users/lkillgore/fastfailforbidden/src/Runner.Plugins/Artifact/FileContainerServer.cs#L269
Is the intention to have this step succeed or fail if we fail to upload some artifacts due to a 400 error code? Before this PR, it looks like we would throw if the retry failed, which should fail the step; now it looks like we make the step succeed and create annotations. Is that correct?
Which is a lot of words for just:
- What happens currently if we fail to upload something due to 4xx errors? I think the step fails.
- What happens after this PR? According to the code above, I think it would pass with annotations, correct?
- If this is new behavior, and it's the new behavior we want (we should confirm with ChrisPat or Bryan), we may need to version the artifact action.
Good point! This was failing, but for the wrong reasons. The upload was failing, but the task would continue processing; the second call, which attaches the artifacts to the build, was what failed the task. I'm changing this to throw as with the other failures, but still try to attach the files to the build.
I reworked this file quite a bit. The results:
- This function will fail with anything but a 'create' response from the server
- 'Conflict' will fail the task with an error message, but will keep trying to upload the rest of the batch of files (should there be a batch of files)
- 5xx will go through a set of retries
- Any failure will still try to attach the container to the build. This has the benefit of returning potentially improved error messages, and if a portion of files were uploaded, they will be accessible to the build.
As far as I know, there aren't any occasions where the existing upload task's final status will differ after this change. This one will not retry uploads on 4xx failures, but likely all of those retries would have ultimately ended in failure anyway.
So, while the behavior may be different, it likely won't be detectable by the end user. My personal feeling is that this won't require versioning the task.
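Putting the reworked rules together, the flow in the list above can be sketched in Python (the runner itself is C#, and all names here are hypothetical): retry 5xx, raise a typed exception on non-409 4xx, and surface 409 as a regular per-file error.

```python
class FastFailException(Exception):
    """Hypothetical typed exception for non-retryable 4xx failures."""


def upload_with_retries(do_upload, max_attempts: int = 3) -> int:
    """Sketch of the reworked flow. do_upload() returns an HTTP status.

    - 2xx: success.
    - 409 Conflict: returned to the caller as a regular per-file error;
      the rest of the batch keeps uploading.
    - other 4xx: raise immediately, no retries (fast-fail).
    - 5xx: retry up to max_attempts, then give up.
    """
    for attempt in range(1, max_attempts + 1):
        status = do_upload()
        if 200 <= status < 300:
            return status
        if status == 409:
            return status  # caller records a failure, batch continues
        if 400 <= status <= 499:
            raise FastFailException(f"fatal client error {status}")
        # 5xx: fall through and retry
    raise FastFailException(f"upload failed after {max_attempts} attempts")
```

In this sketch a transient 500 followed by a 201 succeeds on the second attempt, while a 403 aborts without retrying, matching the "5xx will go through a set of retries" and fast-fail bullets above.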
Don't fast-fail on 'Conflict'
@bryanmacfarlane A couple of things to call out here:
Do you think that this change will require versioning the plugin? (I would think not, though it's not clear to me what factors go into the decision on whether to bump the version.)
Called out specialized 'Conflict' logic.
I'm no runner expert, but LGTM :)
@joshmgross @lkillgore - The question of whether we should fail the job on the server itself when no storage bytes are available was asked, and we decided to run the build and only block artifact upload (agreed, out of scope for this PR). The changes look good; I see you are handling Conflict vs. other 400 errors separately. I think we need a dial for safe rollout of this change, and I don't think the server-side feature flag gives us enough control. I am not familiar with plugin versions; does updating the plugin version give more control for rolling the change out to a subset of customers?
@madhurig I agree it's out of scope, but my point was more "how can we easily fail at artifact upload" instead of failing when queueing the job. I think an
No, I don't believe so. Only one plugin version lives on the runner, unless we package multiple like the Pipeline Artifact team did on the AzP agent. So this change will roll out with the next runner update. I believe we can do that by ring, though, so a subset of customers will get the new runner at first.
@madhurig @joshmgross I missed your comments before I merged. Seeing as this is limited to failure responses for uploads, I think it's a pretty safe change. It only affects FileContainerService, which doesn't have the caching issues other CDNs have, where they return 401s or 404s in that case. Feature flags aren't supported in the runner; the pattern is to have a server-side feature flag with environment variables. Alternatively, as I understand it, this plugin could be versioned on the plugin side (see comment above to Bryan re: versioning the plugin). I'm not sure what that would entail; I'll do some digging.
…r is in the 400's (actions#131)
* Don't retry uploads when the http status code response from the server is in the 400's
* Don't retry on fast-fail
* Feedback from code review
* Always try to attach any uploaded files to the build. Don't fast-fail on 'Conflict'
* Add dispose
* Refactored upload code. Called out specialized 'Conflict' logic.
* Added typed exception.
We shouldn't retry if the server is sending back 'Forbidden' or other 400 errors that indicate a problem with the caller. This change short-circuits the retry logic in those instances, and stops the rest of the batch from being sent.
Example console with 2 concurrent uploads for 7 files: