TaskCluster sometimes fails due to network issues #21529
https://status.mozilla.org/ reports that everything is green now. If your PR failed due to this issue, you should be able to close and re-open the PR to trigger the checks to re-run.
I think we have a more general problem with the network. Anecdotally, I'm seeing more failures due to network errors recently (though I don't have any real stats). I think it makes sense to implement a generic retry mechanism for all the downloads.
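(For illustration only: a minimal sketch of what such a generic retry layer could look like, using requests with urllib3's Retry. The function name and parameters are assumptions, not the actual wpt code.)

```python
# Hypothetical sketch of a shared, retrying download session; not the
# real wpt helper. The parameters (5 retries, exponential backoff) are
# illustrative assumptions.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session(total=5, backoff_factor=2):
    """Build a requests.Session whose requests are retried on connection
    errors and 5xx responses, with exponential backoff between tries."""
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=[500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

# Usage: route every download through one session so the policy applies
# everywhere, e.g.
# session = make_retrying_session()
# session.get("https://hg.mozilla.org/mozilla-central/archive/tip.zip/testing/profiles/")
```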
^ there is some data, @Hexcles. I just triaged 11 blocked Chromium exports, and 8 of them had red tasks due to network failures while downloading Chrome. Will see how many of the 8 fail again on retry...
(I'm going to stop linking PRs to this issue because it just creates useless spam, but generally note that this is causing big pain for me sheriffing Chromium exports atm). For a retry mechanism, we already have
For the record: I ran a
Brian did some further investigation and was able to reproduce quite frequently -- https://community-tc.services.mozilla.com/tasks/groups/UyRo436cTXC-Zez_lMgzPQ (the green tasks are the reproductions). So, that's pretty common! Some random googling revealed this article. It's a slow start, but it gets to some details about conntrack labeling packets as INVALID and not translating them, and that leading to RSTs. It doesn't quite match what we're seeing (the author saw RSTs from the client, while we see them from the server; and the author's network was set up to forward un-translated packets from client to server, whereas in our case such a packet would be dropped). However,
so there are invalid packets. Those might be just random internet scanning noise, but they might be evidence of something real here. A few thoughts on how to follow up:
Retry 5 times and increase the initial wait time to 2 secs. Another attempt to work around #21529.
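(A rough illustration of what that schedule amounts to, assuming the wait doubles after each failed attempt -- this is a sketch, not the actual wpt patch.)

```python
# Rough illustration of "retry 5 times, 2-second initial wait"; the
# doubling backoff is an assumption, not a quote of the actual change.
import time
import requests

def fetch_with_retries(url, attempts=5, initial_wait=2):
    wait = initial_wait
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=60)
            response.raise_for_status()
            return response.content
        except requests.RequestException:
            if attempt == attempts:
                raise
            # Sleeps 2, 4, 8, 16 seconds between the five attempts.
            time.sleep(wait)
            wait *= 2
```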
Just to update: we've reproduced this on a host where we captured traces inside the container and on the host as well. We're looking into them and will update more tomorrow. Sorry again that this is causing trouble. I hope the increase in retries and wait time will help for the time being!
So what's our takeaway here? IIUC, this sounds somewhat similar to moby/libnetwork#1090, where spurious retransmits out of the TCP window cannot be NAT'ed into the container and the host considers them invalid, hence sending a RST. Shall we try one of the workarounds on the host?
Hey @Hexcles, yeah, I think it is time to just try one of the workarounds you mentioned. I'm in the process of baking an image now with the offloading settings you mentioned disabled. If it helps in my testing, you can try using it ASAP. P.S. For my own learning for next time, how did you figure out that TCP segmentation offloading was to blame? Is there a troubleshooting doc you found it in?
Ah, now that I've read the bug you linked to tonight, I see what they suggest doing there. I'll bake an image with that and send it your way tonight or tomorrow!
@imbstack oh, I'm sorry that wasn't clear. I was referring to the workarounds in moby/libnetwork#1090, which seem to directly tackle the symptom that we are seeing -- either accepting the spurious retransmits into the Docker container (hopefully it will handle them better) or dropping them completely on the host to stop the host from sending RSTs.
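(For reference, a sketch of what those two host-side options usually look like, assuming they are the workarounds from moby/libnetwork#1090; both must be run as root on the Docker host, not inside the container, and this is not necessarily the exact change that was deployed.)

```python
# Sketch of the two host-side workarounds commonly suggested in
# moby/libnetwork#1090 (an assumption, not a record of the deployed fix);
# both need root on the Docker host.
import subprocess

# Option 1: accept out-of-window (spurious) retransmits so conntrack
# keeps translating them into the container instead of marking them
# INVALID.
subprocess.run(
    ["sysctl", "-w", "net.netfilter.nf_conntrack_tcp_be_liberal=1"],
    check=True,
)

# Option 2: drop packets that conntrack marks INVALID so the host never
# answers them with a RST.
subprocess.run(
    ["iptables", "-I", "INPUT", "-m", "conntrack",
     "--ctstate", "INVALID", "-j", "DROP"],
    check=True,
)
```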
Tried this out on
Thanks again for all of the help!
Ok, looks like this will be pushed out Monday morning instead. I'll update here when that happens.
Images with this fix are deployed now. My testing of them so far in our workers seems to indicate the workaround fixes things. Please let me know if this either breaks something else or doesn't fix your initial issue!
Should the new images have led to a fix without any extra work on our part? https://community-tc.services.mozilla.com/tasks/QpzzvdbIQlSWamTGOyX4kg is a recent failure due to network issues, 10 hours ago.
It should. That error came from a different place with a different exception (
Is the connection broken issue happening as frequently as the reset issues from before? Also, did those errors seem to reduce in frequency this week? We don't have any empirical data to show any success other than our direct testing before release.
Anecdotally, I think it's a lot better this week. We are trying to chase down the root cause internally, too. I'll follow up here once we have an update. Thanks again, everyone!
Tentatively switching to
@Hexcles, can you summarize the outcome of the internal investigations (if appropriate), and then we can close this out?
Unfortunately, the internal investigation has somewhat stalled. The current consensus is that something is wrong with Docker's iptables configuration w.r.t. NAT into the containers. The "workaround" we applied should have been the default configuration. This workaround is being applied on an ad-hoc basis in various places.
Since there is nothing actionable on our side, I'm closing this issue.
There is an ongoing Mozilla infrastructure issue that is causing network failures in TaskCluster runs when attempting to fetch Firefox testing profiles:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='hg.mozilla.org', port=443): Max retries exceeded with url: /mozilla-central/archive/tip.zip/testing/profiles/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f5d88c94f50>: Failed to establish a new connection: [Errno 110] Connection timed out',))
This is just a tracking issue to link to from blocked PRs; I will attempt to post updates here, but see https://status.mozilla.org/ for the latest information on the outage.