Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bazel CI: Android Testing is failing on Windows with Bazel@HEAD #6847

Closed
meteorcloudy opened this issue Dec 5, 2018 · 22 comments
Closed

Bazel CI: Android Testing is failing on Windows with Bazel@HEAD #6847

meteorcloudy opened this issue Dec 5, 2018 · 22 comments
Assignees
Labels
breakage P1 I'll work on this now. (Assignee required) type: bug

Comments

@meteorcloudy
Copy link
Member

meteorcloudy commented Dec 5, 2018

https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/649#8425be19-6d12-4ec4-934b-228d76be474c

ERROR: D:/b/ny24af4c/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/file-jni.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
  | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/file-jni.cc':
  | 'D:/b/hh7flgnl/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

Culprit Finder says d235a06 is the culprt

@laszlocsomor

@meteorcloudy meteorcloudy added type: bug P1 I'll work on this now. (Assignee required) breakage labels Dec 5, 2018
@meteorcloudy
Copy link
Member Author

FYI @buchgr

@meteorcloudy meteorcloudy added the area-Windows Windows-specific issues and feature requests label Dec 5, 2018
@laszlocsomor
Copy link
Contributor

@meteorcloudy , how do i find the commit that Bazel was built at? I cannot repro this locally with Bazel built at f2b3bec.

Also, the output root on CI is under C:/windows/system32/config/systemprofile -- that doesn't look good.

@laszlocsomor
Copy link
Contributor

Never mind, found it in the "Build Bazel (Windows)" task on the same page!

@meteorcloudy
Copy link
Member Author

Great, we explicitly set the output user root to D:/b, so it shouldn't be C:/windows/system32/config/systemprofile
https://github.com/bazelbuild/continuous-integration/blob/master/buildkite/bazelci.py#L856

@laszlocsomor
Copy link
Contributor

Oh, you're right! I was looking at the "Bazel Info" part of the log. That does not use the --output_user_root flag.

@meteorcloudy
Copy link
Member Author

I see, but it's better we use --output_user_root flag also for bazel info. I'll send a change.

@meteorcloudy
Copy link
Member Author

I tried to rerun the job in culprit finder
https://buildkite.com/bazel/culprit-finder/builds/41#b24d1754-a989-47f4-86e9-0345e771560a
This time it says everything is passing..

Looks like it's a flaky error, d235a06 is certainly not the culprit, sorry for the noise here.

@laszlocsomor
Copy link
Contributor

No worries! Thanks for looking again!

@laszlocsomor
Copy link
Contributor

Are P1 and Team-Windows adequate?

@meteorcloudy meteorcloudy removed the area-Windows Windows-specific issues and feature requests label Dec 5, 2018
@meteorcloudy
Copy link
Member Author

Still need to look into what caused it, it can be stably reproduced in downstream
https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/649#1064573d-b8f3-4b7a-be4c-cf6467da4312

@laszlocsomor
Copy link
Contributor

As a side-note, it would help repro-ing if some step (maybe "Bazel Info") printed the client environment. Is that possible?

@meteorcloudy
Copy link
Member Author

Yes, that's doable!

@meteorcloudy
Copy link
Member Author

meteorcloudy commented Dec 7, 2018

@laszlocsomor
Copy link
Contributor

Let me take a look!

@laszlocsomor laszlocsomor self-assigned this Dec 7, 2018
laszlocsomor added a commit to laszlocsomor/bazel that referenced this issue Dec 7, 2018
Add the --[no]experimental_header_validation_debug
flag.

This flag tells Bazel to print extra debugging
information when a C++ compilation action fails
header inclusion validation.

We will enable this flag on BuildKite in hopes of
catching the culprit of bazelbuild#6847.

After we find the culprit, this commit should be
reverted and the
--experimental_header_validation_debug flag from
BuildKite's configuration should be removed.

See bazelbuild#6847
@laszlocsomor
Copy link
Contributor

I noticed something in https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/679#71068d7e-b147-46c6-a7ea-dcf6ddb97425 that I didn't before:

ERROR: D:/b/fkulpsov/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/file-jni.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
  | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/file-jni.cc':
  | 'D:/b/hh7flgnl/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

The error comes from D:/b/fkulpsov/... but it complains about an inclusion from D:/b/hh7flgnl/.... I have no idea what hh7flgnl is or where it comes from -- the output base (according to the "Bazel Info" step) is D:/b/fkulpsov.

@laszlocsomor
Copy link
Contributor

laszlocsomor commented Dec 10, 2018

Hmmm, a wild guess: could this be a poisoned cache entry?

Header validation on Windows works by asking the compiler to write the list of included files (via the /showIncludes flag), then validating the list against the allowed list of files. The VC++ compiler however outputs absolute paths. and Bazel strips known prefixes from these paths (e.g. the execution root) to get relative paths.

I haven't yet figured out how, but maybe the compiler's output from an earlier build (with a different output root) was somehow written to file in the output cache, and incorrectly retrieved for header validation?

A cache issue would also explain why this issue is flaky, as pointed out by @meteorcloudy here: #6847 (comment)

@laszlocsomor
Copy link
Contributor

laszlocsomor commented Dec 10, 2018

More evidence for the cache poisoning:

In the previous build of "Android Testing" on Windows (678), the build failed with a similar error (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/678#515c01ce-c6a1-4f9e-989f-ec4aef899b24):

ERROR: D:/b/ncxufbua/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/processes-jni.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
  | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/processes-jni.cc':
  | 'D:/b/hh7flgnl/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

output base was D:/b/ncxufbua, and the missing inclusion was again from D:/b/hh7flgnl.

In the previous build before that (677) (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/677#9a4f231f-c38a-4b6a-a691-38be5ec7768b), "Android Testing" on Windows built successfully. Guess what the output base was? It was D:/b/hh7flgnl.

My theory is that build 677 inserted a cache entry that builds 678 and 679 incorrectly retrieved, and failed header validation because they tried stripping the wrong execroot prefix.

@laszlocsomor
Copy link
Contributor

@lberki told me it's quite possible that Bazel stores the action's stdout in the remote cache, because that's how Bazel can replay the action's output if it was cached.

The latest "Android Testing" build as of now is 689 (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/689#56cc7a37-3397-4d8a-b06b-ea7bb02980fe) and it's failing just like the others, only with different paths:

ERROR: D:/b/wanmbwio/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/processes-jni.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
  | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/processes-jni.cc':
  | 'D:/b/tggxeo2z/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

This can be traced back to build 682 (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/682#98aa10ac-cbc6-43de-9aea-dd9cda8ce0a4) that also failed (which I find strange), because how did it insert a cache entry then? Or maybe the output root D:/b/tggxeo2z was reused in this build and was already used in an earlier successful build that inserted this cache entry?

ERROR: D:/b/tggxeo2z/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/jni-util.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
  | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/jni-util.cc':
  | 'D:/b/ny24af4c/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

Build 681 also used D:/b/tggxeo2z output root, and failed with inclusion ostensibly from D:/b/ny24af4c:

ERROR: D:/b/tggxeo2z/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/jni-util.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
  | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/jni-util.cc':
  | 'D:/b/ny24af4c/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

Build 680 (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/680#905b6889-1c55-43a4-af02-22d19f9b4e33) used the D:/b/ny24af4c output root and failed with undeclared inclusion from D:/b/hh7flgnl that we already know from builds 678 and 679.

@laszlocsomor
Copy link
Contributor

laszlocsomor commented Dec 10, 2018

@buchgr , @meteorcloudy :
Do you know:

  • whether my theory is plausible, i.e. whether Bazel really stores the compilation action's outErr in the remote cache?
  • how we could postprocess the action's outErr to remove the execroot prefix, so the cached file will only have relative paths?
  • if it's possible to purge bad entries from the cache, or purge the whole cache that CI uses?

@meteorcloudy
Copy link
Member Author

meteorcloudy commented Dec 13, 2018

@laszlocsomor Thanks for looking into this! Your conclusion is very plausible, I suspect it's caused by cbacbb9.

@ulfjack confirmed we do cache the action's outErr in remote cache. But we haven't understand why it's not happening before.

@meteorcloudy
Copy link
Member Author

I believe I have understand the underlying problem. cbacbb9 actually revealed a header checking bug when using remote execution.
Before cbacbb9, we filter the original output during the actual execution of the compiling action, which means the store action output is already filtered. In later builds, when we reply the output from the stored file due to remote cache hit, the output is already filtered, thus we capture no header file dependencies anymore.

After cbacbb9, we store the output to file first, so no matter remote cache is hit or not, we always parse the original output. Because the output could contain absolute path (execroot) from a different machine, we got this problem.

I'm working on a fix to remote the execroot prefilx from include file paths in the output. We'll see if that works.

@meteorcloudy
Copy link
Member Author

After some investigation, I think filtering the original output before storing it to file is not possible without applying a filter to the FileOutErr class, but that's incompatible with some SpawnRunner implementations. That's why @ulfjack authorized cbacbb9. So I had to go with a suboptimal solution, please see #6931 .

luca-digrazia pushed a commit to luca-digrazia/DatasetCommitsDiffSearch that referenced this issue Sep 4, 2022
    Fixes bazelbuild/bazel#6847

    This change is for making including scanning work on Windows for builds with remote caching or remote execution enabled.

    After this change, the ShowIncludesFilter will look for the first `execroot\<workspace_name>` in the output header file paths, then it considers `C:\...\execroot\<workspace_name>` as the execroot path. Because execroot path could be different if remote cache is hit, we ignore it and only add the relative path as dependencies.

    I'm quite unwilling to make this change, because parsing `execroot\\<workspace_name>` for execroot is not guaranteed to work always. But considering the only case this could go wrong is when people use an output base that already contains `execroot\\<workspace_name>`, which I think should never happen.

    Closes #6931.

    Change-Id: Ife2cb91c75f1b5b297851400e672db2b35ff09e0
    PiperOrigin-RevId: 225553627
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breakage P1 I'll work on this now. (Assignee required) type: bug
Projects
None yet
Development

No branches or pull requests

2 participants