
TaskCluster sometimes fails due to network issues #21529

Closed
stephenmcgruer opened this issue Jan 31, 2020 · 61 comments

@stephenmcgruer
Contributor

There is an ongoing Mozilla infrastructure issue that is causing network failures in TaskCluster runs when attempting to fetch Firefox testing profiles:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='hg.mozilla.org', port=443): Max retries exceeded with url: /mozilla-central/archive/tip.zip/testing/profiles/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f5d88c94f50>: Failed to establish a new connection: [Errno 110] Connection timed out',))

This is just a tracking issue to link to from blocked PRs; I will attempt to post updates here, but see https://status.mozilla.org/ for the latest information on the outage.

@stephenmcgruer
Contributor Author

https://status.mozilla.org/ reports that everything is green now. If your PR failed due to this issue, you should be able to close and re-open the PR to trigger the checks to re-run.

@Hexcles
Member

Hexcles commented Jan 31, 2020

I also saw some 5xx errors from this URL earlier this week. This has been happening relatively often.

I think the testing profile fetch needs to be treated the same way as #21194 (which would also be more technically correct), with a retry mechanism. @jgraham
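
For illustration, here is a minimal sketch of that kind of retry using requests with urllib3's built-in Retry (the retry count and backoff factor here are placeholders, not necessarily what #21194 actually uses):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Sketch only: retry connection errors and 5xx responses with
# exponential backoff before giving up.
session = requests.Session()
retries = Retry(total=5, backoff_factor=2,
                status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
resp = session.get("https://hg.mozilla.org/mozilla-central/"
                   "archive/tip.zip/testing/profiles/")
resp.raise_for_status()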

@Hexcles Hexcles reopened this Jan 31, 2020
@Hexcles Hexcles assigned jgraham and unassigned stephenmcgruer Jan 31, 2020
@Hexcles Hexcles changed the title TaskCluster runs are failing due to network issues in Mozilla infrastructure TaskCluster sometimes fails due to network issues Apr 9, 2020
@Hexcles
Member

Hexcles commented Apr 9, 2020

I think we have a more general problem with networking. Anecdotally, I'm seeing more failures due to network errors recently (though I don't have any real stats). I think it makes sense to implement a generic retry mechanism for all the downloads.
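
Something along these lines, perhaps (a hand-rolled sketch, not actual wpt code; download_func, max_retries, and initial_wait are all placeholder names):

import time

def retry_download(download_func, max_retries=5, initial_wait=2):
    # Call download_func, retrying with exponential backoff on
    # network errors (sketch only).
    for attempt in range(max_retries):
        try:
            return download_func()
        except OSError as e:
            if attempt == max_retries - 1:
                raise
            wait = initial_wait * (2 ** attempt)
            print("Download failed (%s); retrying in %ds" % (e, wait))
            time.sleep(wait)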

@stephenmcgruer
Contributor Author

^ There is some data, @Hexcles. I just triaged 11 blocked Chromium exports, and 8 of them had red tasks due to network failures while downloading Chrome. We'll see how many of the 8 fail again on retry...

@stephenmcgruer
Contributor Author

(I'm going to stop linking PRs to this issue because it just creates useless spam, but generally note that this is causing big pain for me sheriffing Chromium exports atm).

For a retry mechanism, we already have download_url_to_descriptor in run_tc.py; that might just work for Chrome too.

@stephenmcgruer
Contributor Author

For the record: I ran a urllib2.urlopen(...); resp.read() loop on my machine for a while and couldn't get dl.google.com to flake, so it seems related to the TaskCluster infrastructure somehow.
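
(Roughly this, for the record -- Python 2, since that's what the tooling used at the time, and the URL here is a stand-in for the actual Chrome download URL I hit:)

import urllib2

url = "https://dl.google.com/"  # stand-in for the real download URL
for i in range(1000):
    try:
        resp = urllib2.urlopen(url, timeout=60)
        resp.read()
    except Exception as e:
        print "Attempt %d failed: %s" % (i, e)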

@djmitche
Contributor

Brian did some further investigation and was able to reproduce quite frequently -- https://community-tc.services.mozilla.com/tasks/groups/UyRo436cTXC-Zez_lMgzPQ (the green tasks are the reproductions). So, that's pretty common!

Some random googling revealed this article. It's a slow start, but it gets to some details about conntrack labeling packets as INVALID and not translating them, and that leading to RSTs. It doesn't quite match what we're seeing (the author saw RSTs from the client, while we see them from the server; and the author's network was set up to forward un-translated packets from client to server, whereas in our case such a packet would be dropped).

However,

bstack@wpt-testing:~$ sudo conntrack -S
cpu=0           found=0 invalid=338 ignore=152363 insert=0 insert_failed=0 drop=0 early_drop=0 error=2 search_restart=127 

so there are invalid packets. Those might be just random internet scanning noise, but they might be evidence of something real here.

A few thoughts on how to follow up:

  • reproduce this in a situation where we can capture tcpdump on both sides of the FORWARDING chain
  • watch conntrack events when this reproduces
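
For the second bullet, a sketch of what watching for teardown events could look like (assuming the conntrack CLI is installed on the host; the filter on dl.google.com's address is purely illustrative):

import subprocess

# Stream conntrack events for TCP flows and print teardowns involving
# dl.google.com's address while the failure is being reproduced.
proc = subprocess.Popen(["conntrack", "-E", "-p", "tcp"],
                        stdout=subprocess.PIPE)
for line in proc.stdout:
    if b"74.125.195.136" in line and b"DESTROY" in line:
        print(line.decode().rstrip())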

Hexcles added a commit that referenced this issue Apr 22, 2020
Retry 5 times and increase the initial wait time to 2 secs.

Another attempt to work around #21529.
Hexcles added a commit that referenced this issue Apr 22, 2020
Retry 5 times and increase the initial wait time to 2 secs.

Another attempt to work around #21529 to get some Chrome Stable runs
(#22879).
@imbstack

Just to update: we've reproduced this on a host where we captured traces both inside the container and on the host. We're looking into them and will update more tomorrow. Sorry again that this is causing trouble. I hope the increase in retries and wait time will help for the time being!

Hexcles added a commit that referenced this issue Apr 23, 2020
Retry 5 times and increase the initial wait time to 2 secs.

Another attempt to work around #21529 to get some Chrome Stable runs
(#22879).
@djmitche
Contributor

Well, there are a bunch of things wrong there. Top three:

First, toward the end of the download, the sequence numbers pushed from dl.google.com break into two streams:
[Screenshot from 2020-04-23: plot of packets from dl.google.com over time, with the sequence numbers splitting into two parallel streams]
(x axis is seconds, y axis is sequence number of the first octet in each packet from dl.google.com). Looking at the packets, each is answered with a DUP-ACK, which is an ACK containing a SACK option for a sequence range earlier than the ack value. Basically it's a way of saying "thanks, but you sent me that a long time ago". So the lower line here is all duplicate segments.
I'm guessing that a bunch of packets got delayed 2-3ms and retransmitted, and when the delayed packets finally arrived they were duplicates.

We actually got four copies of each packet, for the most part. For example, here's a data segment:

22:26:19.413470  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 40900: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 35299270:35340102, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320825], length 40832 
22:26:19.413483 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 40900: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35299270:35340102, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320825], length 40832 
22:26:19.413487 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 40900: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35299270:35340102, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320825], length 40832 
22:26:19.413490 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 40898: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35299270:35340102, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320825], length 40832 

(74.x is dl.google.com, 10.138.x is the host IP, and 172.17.x is the container IP). So from the first to the second packet, we see the SNAT reverse translation occurring. The remaining packets are all copies of one another as they pass through docker0, veth, etc. In the below I've separated distinct packets with a newline, and labeled each.

Second, the kernel failed to do its SNAT thing a few times at the very end, eventually killing the connection:

A26
22:26:19.414064  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 26820: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 35556934:35583686, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 26752
22:26:19.414073 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 26820: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35556934:35583686, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 26752
22:26:19.414075 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 26820: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35556934:35583686, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 26752
22:26:19.414076 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 26818: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35556934:35583686, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 26752

Low-seq segment L18 (not forwarded)
22:26:19.414077  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 2884: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 34355910:34358726, ack 789, win 265, options [nop,nop,TS val 2940395383 ecr 3057320821], length 2816

RST for that
22:26:19.414091 Out 42:01:0a:8a:00:0c ethertype IPv4 (0x0800), length 56: 10.138.0.12.60790 > 74.125.195.136.https: Flags [R], seq 821696165, win 0, length 0

A27
22:26:19.414157  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 53572: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 35583686:35637190, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 53504
22:26:19.414174 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 53572: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35583686:35637190, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 53504
22:26:19.414177 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 53572: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35583686:35637190, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 53504
22:26:19.414180 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 53570: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35583686:35637190, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 53504

Low-seq segment L19 (not forwarded)
22:26:19.414182  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 1476: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 34358726:34360134, ack 789, win 265, options [nop,nop,TS val 2940395383 ecr 3057320821], length 1408

RST for that
22:26:19.414186 Out 42:01:0a:8a:00:0c ethertype IPv4 (0x0800), length 56: 10.138.0.12.60790 > 74.125.195.136.https: Flags [R], seq 821696165, win 0, length 0

A28
22:26:19.414188  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 5700: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 35637190:35642822, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 5632
22:26:19.414192 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 5700: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35637190:35642822, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 5632
22:26:19.414193 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 5700: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35637190:35642822, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 5632
22:26:19.414194 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 5698: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35637190:35642822, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 5632

Low-seq segment L20 (not forwarded)
22:26:19.414194  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 1476: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 34360134:34361542, ack 789, win 265, options [nop,nop,TS val 2940395383 ecr 3057320821], length 1408

RST for that
22:26:19.414198 Out 42:01:0a:8a:00:0c ethertype IPv4 (0x0800), length 56: 10.138.0.12.60790 > 74.125.195.136.https: Flags [R], seq 821696165, win 0, length 0

A29
22:26:19.414248  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 21188: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 35642822:35663942, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 21120
22:26:19.414256 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 21188: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35642822:35663942, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 21120
22:26:19.414257 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 21188: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35642822:35663942, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 21120
22:26:19.414258 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 21186: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35642822:35663942, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 21120

A30 (only seen inside docker container)
22:26:19.414311 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 56386: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35663942:35720262, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 56320

A31 (only seen inside docker container)
22:26:19.414354 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 8514: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35720262:35728710, ack 789, win 265, options [nop,nop,TS val 2940395388 ecr 3057320826], length 8448

A32 (only seen inside docker container)
22:26:19.414406 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 2882: 74.125.195.136.https > 172.17.0.2.60790: Flags [.], seq 35728710:35731526, ack 789, win 265, options [nop,nop,TS val 2940395389 ecr 3057320826], length 2816

Low-seq segment L21 (not forwarded)
22:26:19.414565  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 5700: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 34361542:34367174, ack 789, win 265, options [nop,nop,TS val 2940395383 ecr 3057320821], length 5632

RST for that
22:26:19.414572 Out 42:01:0a:8a:00:0c ethertype IPv4 (0x0800), length 56: 10.138.0.12.60790 > 74.125.195.136.https: Flags [R], seq 821696165, win 0, length 0

Low-seq segment L22 (not forwarded)                                                                                                                                                                                                                                                                                            
22:26:19.414742  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 1476: 74.125.195.136.https > 10.138.0.12.60790: Flags [.], seq 34367174:34368582, ack 789, win 265, options [nop,nop,TS val 2940395383 ecr 3057320821], length 1408

RST for that
22:26:19.414746 Out 42:01:0a:8a:00:0c ethertype IPv4 (0x0800), length 56: 10.138.0.12.60790 > 74.125.195.136.https: Flags [R], seq 821696165, win 0, length 0

ACK for A32
22:26:19.416605 02:42:ac:11:00:02 > 02:42:4b:3a:62:d8, ethertype IPv4 (0x0800), length 66: 172.17.0.2.60790 > 74.125.195.136.https: Flags [.], ack 35731526, win 1408, options [nop,nop,TS val 3057320829 ecr 2940395388], length 0
22:26:19.416608  In 02:42:ac:11:00:02 ethertype IPv4 (0x0800), length 68: 172.17.0.2.60790 > 74.125.195.136.https: Flags [.], ack 35731526, win 1408, options [nop,nop,TS val 3057320829 ecr 2940395388], length 0
22:26:19.416608   P 02:42:ac:11:00:02 ethertype IPv4 (0x0800), length 68: 172.17.0.2.60790 > 74.125.195.136.https: Flags [.], ack 35731526, win 1408, options [nop,nop,TS val 3057320829 ecr 2940395388], length 0
22:26:19.416626 Out 42:01:0a:8a:00:0c ethertype IPv4 (0x0800), length 68: 10.138.0.12.60790 > 74.125.195.136.https: Flags [.], ack 35731526, win 1408, options [nop,nop,TS val 3057320829 ecr 2940395388], length 0

RST replying to ACK
22:26:19.416715  In 42:01:0a:8a:00:01 ethertype IPv4 (0x0800), length 56: 74.125.195.136.https > 10.138.0.12.60790: Flags [R], seq 3900322453, win 0, length 0
22:26:19.416732 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 56: 74.125.195.136.https > 172.17.0.2.60790: Flags [R], seq 3900322453, win 0, length 0
22:26:19.416734 Out 02:42:4b:3a:62:d8 ethertype IPv4 (0x0800), length 56: 74.125.195.136.https > 172.17.0.2.60790: Flags [R], seq 3900322453, win 0, length 0
22:26:19.416735 02:42:4b:3a:62:d8 > 02:42:ac:11:00:02, ethertype IPv4 (0x0800), length 54: 74.125.195.136.https > 172.17.0.2.60790: Flags [R], seq 3900322453, win 0, length 0

Here the Axx packets carry "normal" data segments and the Lxx packets carry data segments with low sequence numbers; basically, Axx is the high line of dots in the graph, and Lxx is the low line. Up until now the kernel has been happily forwarding both kinds of packets to the container, so L17 was forwarded fine, as was A26. But L18 is not; instead the kernel replies to dl.google.com with an RST. This repeats through L22, with A27-A29 being forwarded fine in the interim. Finally, there's an ACK for A32 sent to dl.google.com, and dl.google.com sensibly replies with an RST, since we've already sent it a bunch of RSTs. This final RST from dl.google.com is what causes the ECONNRESET that we see in the socket API.

Third, a few packets arrived in the docker container without being captured by the tcpdump outside the container. Just before the RST was received, A30-A32 are only seen in the internal dump. So clearly something weird happened that caused pcap to miss those packets.

Long story short, there's a network hiccup that should be well within TCP's ability to recover, but the kernel doesn't handle it well and eventually gets confused and shuts things down.

@Hexcles
Member

Hexcles commented Apr 24, 2020

So what's our takeaway here?

IIUC, this sounds somewhat similar to moby/libnetwork#1090, where spurious retransmits that fall outside the TCP window cannot be NAT'ed into the container, and the host considers them invalid, hence sending RSTs.

Shall we try one of the workarounds on the host?

@imbstack

Hey @Hexcles,

Yeah, I think it is time to just try one of the workarounds you mentioned. I'm in the process of baking an image now with the offloading settings you mentioned disabled. If it helps in my testing, you can try using it ASAP.
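
(For reference, disabling the offloads amounts to something like the following on the host -- a sketch, where "eth0" is a placeholder for the worker's actual interface, and exactly which of tso/gso/gro we turn off is still to be confirmed:)

import subprocess

# Sketch: disable TCP segmentation offload and related offloads on the
# host NIC. Run as root; "eth0" is a placeholder.
for feature in ("tso", "gso", "gro"):
    subprocess.check_call(["ethtool", "-K", "eth0", feature, "off"])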

p.s. for my own learning for next time, how did you figure out that TCP segmentation offloading was to blame? Is there a troubleshooting doc you found it in?

@imbstack

Ah, now that I've read the bug you linked to, I see what they suggest doing there. I'll bake an image with that and send it your way tonight or tomorrow!

@Hexcles
Member

Hexcles commented Apr 24, 2020

@imbstack oh, I'm sorry that wasn't clear. I was referring to the workarounds in moby/libnetwork#1090, which seem to directly tackle the symptom that we are seeing -- either accepting the spurious retransmits into the Docker container (hopefully it will handle them better) or dropping them completely on the host to stop the host from sending RSTs.
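
Concretely, the two workarounds discussed in that issue amount to roughly the following on the host (a sketch; you'd pick one or the other, and both need root):

import subprocess

# Option 1: tell conntrack to tolerate out-of-window packets so they
# can still be NAT'ed into the container.
subprocess.check_call(
    ["sysctl", "-w", "net.netfilter.nf_conntrack_tcp_be_liberal=1"])

# Option 2: drop packets conntrack marks INVALID before the host's TCP
# stack can answer them with an RST.
subprocess.check_call(
    ["iptables", "-I", "INPUT", "-m", "conntrack",
     "--ctstate", "INVALID", "-j", "DROP"])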

@imbstack

Tried this out on projects/taskcluster-imaging/global/images/docker-worker-gcp-community-googlecompute-2020-04-24t05-52-00z and it appears to work. I'll update this bug after we land the patch and have an official image with the fix (should happen first thing tomorrow), but if you want to try that image out, it should be functionally the same as the one we release.

Thanks again for all of the help!

@imbstack

Ok, looks like this will be pushed out Monday morning instead. I'll update here when that happens.

@imbstack

Images with this fix are deployed now. My testing of them so far in our workers seems to indicate the workaround fixes things. Please let me know if this either breaks something else or doesn't fix your initial issue!

@foolip
Member

foolip commented Apr 30, 2020

Should the new images have led to a fix without any extra work on our part?

https://community-tc.services.mozilla.com/tasks/QpzzvdbIQlSWamTGOyX4kg is a recent failure due to network issues, 10 hours ago.

@Hexcles
Member

Hexcles commented Apr 30, 2020

It should. That error came from a different place with a different exception (Connection broken), but it's possible that it was also a connection reset.

@imbstack

Is the connection broken issue happening as frequently as the reset issues from before? Also, did those errors seem to reduce in frequency this week? We don't have any empirical data showing success other than our direct testing before release.

@Hexcles
Member

Hexcles commented Apr 30, 2020 via email

@stephenmcgruer
Contributor Author

Tentatively switching to roadmap since we seem to be in a better place re connection resets.

@stephenmcgruer
Contributor Author

@Hexcles can you summarize the outcome of the internal investigations (if appropriate), and then we can close this out?

@Hexcles
Member

Hexcles commented Jul 7, 2020

Unfortunately, the internal investigation has somewhat stalled.

The current consensus is that something is wrong with Docker's iptables configuration w.r.t. NAT into the containers. The "workaround" we applied should have been the default configuration. This workaround is being applied on an ad-hoc basis in various places.

@Hexcles
Member

Hexcles commented Jul 7, 2020

Since there is nothing actionable on our side, I'm closing this issue.

@Hexcles Hexcles closed this as completed Jul 7, 2020