
Resolve network outage in the CI lab #1800

Closed
fweikert opened this issue Nov 17, 2023 · 14 comments · Fixed by #1842

@fweikert
Member

Currently we've lost all of our Macs due to networking issues in the CI lab. Since it's late Friday in the EU, we're unlikely to see a resolution before Monday.

@fweikert fweikert added the P0 label Nov 17, 2023
@fweikert fweikert self-assigned this Nov 17, 2023
fweikert added a commit to fweikert/continuous-integration that referenced this issue Nov 17, 2023
1. Skip all MacOS builds
2. Show emergency banner

bazelbuild#1800
@fweikert fweikert added the bug label Nov 17, 2023
fweikert added a commit that referenced this issue Nov 17, 2023
1. Skip all MacOS builds
2. Show emergency banner

#1800
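The commit above boils down to two pieces: drop every task that targets macOS and put a notice step at the front of the generated pipeline. A minimal sketch of that idea follows, assuming a bazelci-style list of task dicts keyed by platform; the helper names, banner text, and pipeline layout are hypothetical, not the actual code in this repository.

```python
# Hypothetical sketch of the emergency change: skip every macOS task and
# surface a banner step while the lab network is down. Illustrative only.

EMERGENCY_BANNER = (
    ":warning: macOS workers are offline due to a network outage in the CI lab; "
    "macOS tasks are temporarily skipped (see #1800)."
)

def skip_macos_tasks(tasks):
    """Drop every task that targets a macOS platform."""
    return [t for t in tasks if t.get("platform") != "macos"]

def build_pipeline(tasks):
    """Assemble a Buildkite-style pipeline with the banner as the first step."""
    steps = [{"label": ":rotating_light: notice", "command": f"echo '{EMERGENCY_BANNER}'"}]
    steps += [
        {"label": t["name"], "command": t["command"]}
        for t in skip_macos_tasks(tasks)
    ]
    return {"steps": steps}

if __name__ == "__main__":
    tasks = [
        {"name": "ubuntu2004", "platform": "linux", "command": "bazel test //..."},
        {"name": "macos", "platform": "macos", "command": "bazel test //..."},
    ]
    print(build_pipeline(tasks))  # only the Linux task plus the banner survive
```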
@fweikert
Member Author

Update: Most Macs are slowly coming back online. Due to the long duration of the outage we've accumulated a significant backlog of jobs :(

fweikert added a commit to fweikert/continuous-integration that referenced this issue Nov 20, 2023
Due to the outage we went without MacOS coverage for three days, which means that there is a significant backlog.
This change enables MacOS jobs for high-priority jobs in order to help us clear the backlog. Hopefully we can enable MacOS for all jobs soon.

bazelbuild#1800
fweikert added a commit that referenced this issue Nov 20, 2023
Due to the outage we went without MacOS coverage for three days, which
means that there is a significant backlog. This change enables MacOS
jobs for high-priority jobs in order to help us clear the backlog.
Hopefully we can enable MacOS for all jobs soon.

#1800
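As a rough illustration of the partial re-enable described in that commit, macOS tasks could be gated on an allowlist of pipelines; the pipeline names and the helper below are assumptions for the sketch, not the real CI configuration.

```python
# Hypothetical allowlist-based gate for the partial re-enable: macOS tasks run
# again, but only for a small set of high-priority pipelines.

HIGH_PRIORITY_PIPELINES = {"bazel-bazel", "rules_apple", "rules_swift"}  # assumed names

def keep_task(pipeline_name, task):
    """Keep non-macOS tasks everywhere; keep macOS tasks only for allowlisted pipelines."""
    if task.get("platform") != "macos":
        return True
    return pipeline_name in HIGH_PRIORITY_PIPELINES

# macOS tasks survive for an allowlisted pipeline but are still skipped elsewhere.
print(keep_task("rules_apple", {"platform": "macos"}))  # True
print(keep_task("rules_go", {"platform": "macos"}))     # False
```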
@fweikert
Member Author

Clarification: CI simply skips macOS tasks for most of the pipelines, since we thought this would be less disruptive than failing. As a result, there's a chance that any change being merged now will cause macOS breakages later. However, we hope that full macOS testing will be possible by the end of the week; then we can find breakages via the post-submit pipelines.

@meteorcloudy
Member

/cc @keith @BalestraPatrick Sorry for not pinging you earlier. This has already affected rules_python: some commits have been merged without actually going through CI testing.

[screenshot]

@keith
Member

keith commented Nov 28, 2023

Ah yeah. In our case, instead of filtering out the jobs, we'd probably prefer they fail, but that might be unique to us since we disproportionately care about Apple support. We have branch protection on the overall job, but since that ended up being green we didn't notice.

@meteorcloudy
Member

Yeah, for now you'll have to assume you don't have any CI coverage at all for the Apple rules. I'm hoping we can get at least postsubmit working today. Will update here.

@BalestraPatrick
Member

@meteorcloudy @fweikert Any news regarding getting the macOS CI back to work? We'd like to merge a few PRs and cut new releases.

@meteorcloudy
Member

Unfortunately, the issue is still ongoing; we are still working on fixing it.

@keith
Member

keith commented Dec 18, 2023

Any news?

@fweikert
Member Author

New network equipment was installed in the lab today. I hope that we have good news later this week.

@brandjon
Member

I'm seeing a DNS failure on macOS in presubmit. Is that known/expected, or a separate issue?

$ git --git-dir /usr/local/var/bazelbuild/https---bazel-googlesource-com-bazel-git fetch origin master
fatal: unable to access 'https://bazel.googlesource.com/bazel.git/': Could not resolve host: bazel.googlesource.com
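A quick way to check whether an agent is hitting the same resolution problem outside of git is to do the DNS lookup directly; the snippet below is just a diagnostic sketch, not part of the CI scripts.

```python
# Standalone DNS check for the host in the fetch failure above.
import socket

HOST = "bazel.googlesource.com"

try:
    infos = socket.getaddrinfo(HOST, 443, proto=socket.IPPROTO_TCP)
    print(f"{HOST} resolves to:", sorted({info[4][0] for info in infos}))
except socket.gaierror as err:
    # The same class of failure git reports as "Could not resolve host".
    print(f"DNS lookup for {HOST} failed: {err}")
```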

fweikert added a commit to fweikert/continuous-integration that referenced this issue Dec 20, 2023
The new network infrastructure has been installed, thus resolving the outage.

Progress towards bazelbuild#1800
@fweikert
Member Author

@brandjon Looks like a transient error while Yun was fixing the network. Can you please retry?

fweikert added a commit that referenced this issue Dec 20, 2023
The new network infrastructure has been installed, thus resolving the
outage.

Fixes #1800
@brandjon
Member

Yeah, it's a different failure mode now, instead of all shards having DNS trouble.

gRPC server failed to bind to IPv4 and IPv6 localhosts on port 0: [IPv4] Failed to bind to address /127.0.0.1:0
[IPv6] Failed to bind to address /[0:0:0:0:0:0:0:1]:0
com.google.devtools.build.lib.util.AbruptExitException: gRPC server failed to bind to IPv4 and IPv6 localhosts on port 0: [IPv4] Failed to bind to address /127.0.0.1:0
[IPv6] Failed to bind to address /[0:0:0:0:0:0:0:1]:0
	at com.google.devtools.build.lib.server.GrpcServerImpl.serve(GrpcServerImpl.java:438)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.serverMain(BlazeRuntime.java:1068)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.main(BlazeRuntime.java:771)
	at com.google.devtools.build.lib.bazel.Bazel.main(Bazel.java:95)
Caused by: java.io.IOException: Failed to bind to address /127.0.0.1:0
	at io.grpc.netty.NettyServer.start(NettyServer.java:328)
	at io.grpc.internal.ServerImpl.start(ServerImpl.java:183)
	at io.grpc.internal.ServerImpl.start(ServerImpl.java:92)
	at com.google.devtools.build.lib.server.GrpcServerImpl.serve(GrpcServerImpl.java:435)
	... 3 more
Caused by: java.net.SocketException: Operation not permitted
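The "Operation not permitted" at the bottom suggests the bind itself is being blocked on that machine rather than anything network-wide. Below is a small diagnostic sketch (not part of Bazel) that mimics the bind the server attempts before the failure above.

```python
# Try to bind an ephemeral port on the IPv4 and IPv6 loopback addresses, the
# same operation the gRPC server fails on in the stack trace above.
import socket

for family, addr in [(socket.AF_INET, "127.0.0.1"), (socket.AF_INET6, "::1")]:
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.bind((addr, 0))  # port 0 lets the OS pick an ephemeral port
            print(f"bind to {addr} succeeded, got port {sock.getsockname()[1]}")
    except OSError as err:
        print(f"bind to {addr} failed: {err}")
```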

@Wyverald
Member

I've found bk-imacpro-6 and bk-imacpro-4 especially prone to this specific failure. Maybe we should take them offline? As it is, keeping them online actually costs us more resources due to retries.

@fweikert
Copy link
Member Author

Technically bazelbuild/bazel@2c51a0c should have fixed this problem, but it's hard to tell whether that change was in the tree used for presubmit. Post-submit looks fine (minus two unrelated failures).
