azure: Investigate seemingly spurious failures on CI #61301

Closed
alexcrichton opened this issue May 29, 2019 · 21 comments
Labels
T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.

Comments

@alexcrichton alexcrichton added T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. azure-evaluation labels May 29, 2019
@pietroalbini
Member

fatal: reference is not a tree: 3b109e51cc8d9a3a833cf7e28db7911a8a0aa503
##[warning]Git checkout failed on shallow repository, this might because of git fetch with depth '2' doesn't include the checkout commit '3b109e51cc8d9a3a833cf7e28db7911a8a0aa503'. Please reference documentation (http://go.microsoft.com/fwlink/?LinkId=829603)
##[error]Git checkout failed with exit code: 128

This doesn't seem to be a spurious failure: these jobs started after bors marked the PR as failed due to other failures on Travis/AppVeyor, so the auto branch was already force-pushed to the next PR. We shouldn't see these anymore once we start gating on Azure.

@pietroalbini
Member

For the submodules weirdness I opened #61322 to investigate.

For the git checkout failures (logs) it seems the error is happening on the Azure side, and there is not much we can do on our end. Any idea on what could have caused it, or what steps we can take to investigate it @rylev @johnterickson?

Centril added a commit to Centril/rust that referenced this issue May 30, 2019
…e-cloning, r=alexcrichton

ci: display more debug information in the init_repo script

I'm *really* confused about the error message [while cloning submodules on Windows on Azure](https://dev.azure.com/rust-lang/e71b0ddf-dd27-435a-873c-e30f86eea377/_apis/build/builds/295/logs/506):

```
/usr/bin/tar: You must specify one of the '-Acdtrux', '--delete' or '--test-label' options
Try '/usr/bin/tar --help' or '/usr/bin/tar --usage' for more information.
```

It doesn't make sense for it to execute a command without any of those flags since they're clearly added:

https://github.com/rust-lang/rust/blob/81970852e172c04322cbf8ba23effabeb491c83c/src/ci/init_repo.sh#L45

So this adds `set -x` to the script to hopefully catch what command it's executing.

r? @alexcrichton
cc rust-lang#61301
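
For context on why `set -x` helps here: it makes the shell print each command after expansion and before running it, so an invocation that somehow lost its flags would show up in the log. The snippet below is only a minimal sketch of that behaviour. `run_traced` and the `printf` stand-in are hypothetical names for illustration and are not taken from `init_repo.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from init_repo.sh): run one command with xtrace
# enabled inside a subshell, so tracing does not leak into the caller.
run_traced() {
    ( set -x; "$@" )
}

trace_log=$(mktemp)
# Stand-in for the real extraction command. The trace lines are written to
# stderr, so both streams are captured in the same log file.
run_traced printf '%s\n' "extracting submodule" >"$trace_log" 2>&1
cat "$trace_log"
```

The log then contains the expanded command line next to its output, which is the information the commit hopes to capture the next time the `tar` error appears.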
@alexcrichton
Member Author

@johnterickson
Contributor

I see two trends - let me know if I'm missing another:

This looks like the auth to GitHub is not working and so git is falling back to a command prompt for creds. Hmm...

##[command]git -c http.extraheader="AUTHORIZATION: basic ***" fetch --tags --prune --progress --no-recurse-submodules --depth=2 origin
fatal: could not read Username for 'https://github.com': terminal prompts disabled
##[warning]Git fetch failed with exit code 128, back off 4.338 seconds before retry.

This sounds vaguely like a race we were seeing where we are getting a notification from GitHub before the commit has been replicated everywhere.

fatal: reference is not a tree: 3b109e51cc8d9a3a833cf7e28db7911a8a0aa503
##[warning]Git checkout failed on shallow repository, this might because of git fetch with depth '2' doesn't include the checkout commit '3b109e51cc8d9a3a833cf7e28db7911a8a0aa503'. Please reference documentation (http://go.microsoft.com/fwlink/?LinkId=829603)
##[error]Git checkout failed with exit code: 128

@pietroalbini
Member

> This sounds vaguely like a race we were seeing where we are getting a notification from GitHub before the commit has been replicated everywhere.

@johnterickson I'm pretty sure that's our fault and "expected behavior".

Every time a new build needs to start, bors force-pushes the merge commit to the auto branch, thus deleting the previous build's merge commit. If the new build is started shortly after the previous one (for example, if someone cancels the build on the bors side to let a higher-priority one start), the checkout step of the previous build might not have started yet, so it will try to get the old HEAD, causing that error.

I've also seen the error only on builds that ran for ~5 minutes before being killed by our cancelbot because a newer build started, confirming the hypothesis.
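
The race described above can be reproduced with local repositories only. This is a hedged sketch, not the actual bors or Azure setup: the temporary directories, the `auto` branch layout, and the empty commits are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: a stale job tries to check out a commit that a force-push removed.
set -eu

work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/src"
cd "$work/src"
git config user.email "ci@example.invalid"
git config user.name "ci"

# Build 1: its merge commit is pushed to the auto branch.
git commit -q --allow-empty -m "base"
git commit -q --allow-empty -m "build 1 merge commit"
old_head=$(git rev-parse HEAD)
git push -q "$work/remote.git" HEAD:refs/heads/auto

# Build 2 starts: auto is force-pushed to a different merge commit,
# so the build 1 commit is no longer reachable from any branch.
git reset -q --hard HEAD~1
git commit -q --allow-empty -m "build 2 merge commit"
git push -q --force "$work/remote.git" HEAD:refs/heads/auto

# The stale build 1 job only now fetches (shallowly) and checks out.
git init -q "$work/job"
cd "$work/job"
git fetch -q --depth=2 "file://$work/remote.git" auto
if git checkout -q "$old_head" 2>checkout.err; then
    result="checked out"
else
    result="checkout failed"
fi
cat checkout.err
echo "$result"
```

The checkout fails because the fetched history only contains what is reachable from the new tip of `auto`, which matches the explanation that these failures are expected rather than spurious.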

@pietroalbini
Member

There is another spurious failure which is pretty bad: this build failed to "prepare" but was marked as successful on the GitHub Checks.

@johnterickson
Contributor

@pietroalbini I raised that "succeeded but actually failed" build with the right team - definitely concerning and I appreciate you reporting it.

@alexcrichton
Member Author

Another two that have come up:

+ /tmp/checktools.sh ../x.py /tmp/toolstates.json linux
fatal: ambiguous argument 'HEAD^': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

@pietroalbini or @kennytm do y'all know what might be causing that?

  • Our script to enable IPv6 in Docker has suddenly started failing. I don't know why this changed, but it shouldn't be too hard to work around (testing something on try now).

@kennytm
Member

kennytm commented Jun 4, 2019

Maybe the HEAD commit did not have a parent commit? Is commit 322afaf associated with any branch during `git fetch --depth=2`?
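
The "no parent" hypothesis can be checked before a script relies on `HEAD^`. The following is a minimal sketch that assumes a throwaway repository with a single root commit. It is not the logic of `checktools.sh`, and `parent_status` is a name invented for the example.

```shell
#!/usr/bin/env bash
# Sketch: detect whether HEAD has a parent before using HEAD^.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "ci@example.invalid"
git config user.name "ci"
git commit -q --allow-empty -m "root commit"

# --verify --quiet exits non-zero without printing an error when the
# revision cannot be resolved, which is the case for the parent of a root.
if git rev-parse --verify --quiet "HEAD^" >/dev/null; then
    parent_status="has parent"
else
    parent_status="no parent"
fi
echo "$parent_status"
```

A guard like this lets a script skip or report the comparison cleanly instead of failing with the "ambiguous argument" message.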

@mati865
Contributor

mati865 commented Jun 4, 2019

For the IPv6 error, /etc/docker does not seem to exist.
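
If the missing directory is indeed the cause, one possible workaround is to create it before writing the daemon configuration. This is a sketch under assumptions: the thread does not show the real script, `DOCKER_CONF_DIR` is a hypothetical override so the example can run without root, and the address range is only a documentation prefix.

```shell
#!/usr/bin/env bash
# Sketch: make sure the Docker config directory exists before writing to it.
set -eu

# Hypothetical override for local testing; on a CI machine this would
# presumably be /etc/docker and require root.
docker_conf_dir=${DOCKER_CONF_DIR:-$(mktemp -d)/docker}

# mkdir -p succeeds whether or not the directory already exists.
mkdir -p "$docker_conf_dir"

# "ipv6" and "fixed-cidr-v6" are Docker daemon options; the CIDR value
# here is illustrative only.
cat >"$docker_conf_dir/daemon.json" <<'EOF'
{
  "ipv6": true,
  "fixed-cidr-v6": "2001:db8:1::/64"
}
EOF
echo "wrote $docker_conf_dir/daemon.json"
```

The Docker daemon would still need to be restarted for the setting to take effect, which this sketch leaves out.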

@alexcrichton
Member Author

Ah ok @kennytm that's probably it. It sounds like the script expects a specific git history and/or fetch depth, but I'm just pushing up raw commits, which probably breaks that assumption. For the failing builds I'm just making manual commits and pushing them to try. Seems like an expected failure.

@johnterickson
Contributor

For what it's worth, I had "fixed" this in my original branch in that it would swallow this error.

@pietroalbini
Member

@ethomson said on Discord the checkout issues are now fixed 🎉

> On a related note, GitHub has deployed some fixes that they think will mitigate your sporadic build failures. Could you please @ ping me if you see another beginning today?

> Yes, GitHub thinks that they've resolved the issue, but if not, I think that we can add some logic on our side as well.
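
The thread does not say what the extra logic would be. One plausible shape, offered purely as an assumption, is retrying the failing step with exponential backoff, similar to the "back off ... before retry" behaviour visible in the fetch logs. `retry` and `flaky` below are hypothetical names for illustration.

```shell
#!/usr/bin/env bash
# Sketch: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))   # double the wait after every failure
    done
}

# Demo: a command that fails twice and then succeeds.
counter_file=$(mktemp)
echo 0 >"$counter_file"
flaky() {
    n=$(($(cat "$counter_file") + 1))
    echo "$n" >"$counter_file"
    [ "$n" -ge 3 ]
}

# A zero delay keeps the demo fast; a real job would use a non-zero value.
retry 5 0 flaky && echo "succeeded after $(cat "$counter_file") attempts"
```

In a pipeline this could wrap a fetch, for example `retry 5 2 git fetch origin`, although a retry only helps with transient errors and would not fix the force-push race discussed earlier.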

@ethomson

Thanks for linking me to this issue, @pietroalbini. Looking at the topic, I'm looking at this section:

> Submodule weirdness on Windows
> https://dev.azure.com/rust-lang/rust/_build/results?buildId=295
> https://dev.azure.com/rust-lang/rust/_build/results?buildId=287

Naive question: what is the weirdness that you're describing? It's not obvious from the logs, because I don't really know what I'm looking for.

@alexcrichton
Member Author

Ah yeah sorry I should have been more descriptive there! On build 287 the x86_64-msvc-2 build fails the build due to an odd error message (presumably a submodule missing) and the logs for the "Check out submodules (windows)" step, while successful, are sort of funny. I'm not sure if this is really an Azure issue though, it may have been a case that we got a bad tarball from GitHub and swallowed the error by accident (this is what #61322 was hoping to help diagnose).

The 295 build is similar except for just a different windows builder (dist-x86_64-msvc)

I haven't seen these happen again myself, though, so it may have been just a transient issue. AFAIK most issues we've listed here have been addressed one way or another, so it may actually be time to close this!

@pietroalbini
Member

By the way, since I couldn't reproduce that failure locally I also added some extra debug info to see what's actually happening there. Didn't get a chance to see the output when the spurious failure happens yet.

@pietroalbini
Member

@ethomson

@pietroalbini :( Thanks. Let me see where the fix for that is in the deployment queue and if we expect that your account has it ...

@ethomson

@pietroalbini Update on 403's ("could not read Username") - that fix was still making its way through progressive deployment. We explicitly enabled it on your account. Please let me know if you see it again, as that means our fix wasn't actually a fix for the issue.

@steveklabnik
Member

Triage: given that we're no longer using Azure Pipelines, I imagine this issue can be closed. Anyone from @rust-lang/infra who can confirm?

@Mark-Simulacrum
Member

Indeed! Closing.

We're still using macOS on Azure, but I don't think we've seen these failures there recently at least.

8 participants