Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

race condition when building glibc stubs #13160

Closed
motiejus opened this issue Oct 14, 2022 · 6 comments · Fixed by #13342
Closed

race condition when building glibc stubs #13160

motiejus opened this issue Oct 14, 2022 · 6 comments · Fixed by #13342
Labels
bug Observed behavior contradicts documented or intended behavior frontend Tokenization, parsing, AstGen, Sema, and Liveness. os-linux
Milestone

Comments

@motiejus
Copy link
Contributor

Summary: multiple zig invocations can stumble upon each other when they are creating glibc stubs.

Steps to reproduce

main.c

int main() {}

Command: run zig c++ many times in parallel:

until ! sh -c 'rm -fr ~/.cache/zig /tmp/main-*; for i in `seq 100`; do echo "zig c++ -target x86_64-linux-gnu.2.28 main.c -o /tmp/main-$i |& cat"; done | xargs -P100 -I{} bash -eo pipefail -c "{}"'; do date; done

And get an error message similar to this (more examples of error messages later in the issue):

/home/user/.cache/zig/o/f4abebc517abb01aedecbd5f414572c4/m.s:2740:22: error: expected a '@' in the name
.symver fabsf32, fabs

It takes me between 5 and 60 minutes to reproduce on a 96-core machine. It took a few hours on an 8-core laptop.

If we trim everything after that last fabs (i.e. where the error message says), the file is 65536 bytes long, suggesting an unfinished write by another zig.

Cursory look at glibc.zig shows no collaboration between multiple zig invocations that want to build the same stubs. That's perhaps the issue, but I didn't (yet) have a chance to dig into this deeper.

Other messages that I saw when running this test. I ran the command overnight (without the `-eo pipefail`, so it would keep going after errors). Here are a few selected ones:
ld.lld: error: /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/all.map:1: unexpected EOF
ld.lld: error: /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/all.map:1: unexpected EOF
warning(compilation): failed to write cache manifest: Unexpected
<many more warnings exactly like above>
error(compilation): /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/pthread.s:1:1: unable to build C object: FileNotFound
error: unable to build glibc shared objects: BuildingLibCObjectFailed
warning(compilation): failed to write cache manifest: Unexpected
<many more warnings exactly like above>
zig: error: clang integrated assembler command failed with exit code 135 (use -v to see invocation)
clang version 14.0.6 (git@github.com:ziglang/zig-bootstrap.git 73fba22053d7668fa2448a13b32c772f95c6e1c6)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/user
zig: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.
zig: error: clang integrated assembler command failed with exit code 135 (use -v to see invocation)
clang version 14.0.6 (git@github.com:ziglang/zig-bootstrap.git 73fba22053d7668fa2448a13b32c772f95c6e1c6)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/user
zig: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.
ld.lld: error: /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/all.map:1: unexpected EOF
ld.lld: error: /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/all.map:1: unexpected EOF
warning(compilation): failed to write cache manifest: Unexpected
zig: error: clang integrated assembler command failed with exit code 135 (use -v to see invocation)
clang version 14.0.6 (git@github.com:ziglang/zig-bootstrap.git 73fba22053d7668fa2448a13b32c772f95c6e1c6)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/user
zig: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.
warning(compilation): failed to write cache manifest: Unexpected
ld.lld: error: /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/all.map:1: unexpected EOF
ld.lld: error: /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/all.map:1: unexpected EOF
warning(compilation): failed to write cache manifest: Unexpected
<many more warnings exactly like above>
zig: error: clang integrated assembler command failed with exit code 135 (use -v to see invocation)
clang version 14.0.6 (git@github.com:ziglang/zig-bootstrap.git 73fba22053d7668fa2448a13b32c772f95c6e1c6)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/user
zig: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.
warning(compilation): failed to write cache manifest: Unexpected
warning(compilation): failed to write cache manifest: Unexpected
error(compilation): /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/c.s:1:1: unable to build C object: FileNotFound
error: unable to build glibc shared objects: BuildingLibCObjectFailed
warning(compilation): failed to write cache manifest: Unexpected
error(compilation): /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/pthread.s:1:1: unable to build C object: FileNotFound
error: unable to build glibc shared objects: BuildingLibCObjectFailed
warning(compilation): failed to write cache manifest: Unexpected
<many more warnings exactly like above>
@andrewrk
Copy link
Member

Potential cause: #9836

@andrewrk andrewrk added this to the 0.10.0 milestone Oct 14, 2022
@andrewrk andrewrk added the bug Observed behavior contradicts documented or intended behavior label Oct 14, 2022
@motiejus
Copy link
Contributor Author

Potential cause: #9836

Looking at src/glibc.zig, I think this is unrelated: to my understanding, both processes are not locking the output directory at all, thus the sequence of events is quite plausible as described above.

I am now testing this patch:

diff --git a/src/glibc.zig b/src/glibc.zig
index 3dd7565e9..44789afc3 100644
--- a/src/glibc.zig
+++ b/src/glibc.zig
@@ -687,6 +687,15 @@ pub fn buildSharedObjects(comp: *Compilation) !void {
         .handle = try comp.global_cache_directory.handle.makeOpenPath(o_sub_path, .{}),
         .path = try path.join(arena, &[_][]const u8{ comp.global_cache_directory.path.?, o_sub_path }),
     };
+
+    // Multiple zig invocations may try to write and compile the same stubs.
+    // When they do, they stumble upon each other. Lock the directory to avoid
+    // that.
+    //
+    // Note that we do not lose speed here; writing the files is the fast part,
+    // compiling the stubs is the slow part; and compiling is done in parallel.
+    try std.os.flock(o_directory.handle.fd, std.os.LOCK.EX);
+    defer std.os.flock(o_directory.handle.fd, std.os.LOCK.UN) catch unreachable;
     defer o_directory.handle.close();
 
     const ok_basename = "ok";

Am I missing something?

@andrewrk andrewrk added os-linux frontend Tokenization, parsing, AstGen, Sema, and Liveness. labels Oct 18, 2022
@andrewrk
Copy link
Member

Yes - at this point where you have inserted lines, if you look up a few lines there is this:

const hit = try cache.hit();

If it is a cache hit (hit == true) then the code holds a shared (read-only) lock. If it is a cache miss (hit == false) then the code holds an exclusive (read-write) lock. In the case of a cache miss, the logic must then populate the directory with build artifacts, and then indicate that this process is complete by calling writeManifest() which calls downgradeToSharedLock(). This allows the process to wait and not release the lock until it has finished using the artifacts it just finished creating, however still share the artifacts with other processes that are waiting for read-only locks. Unfortunately, as #9836 points out, this "downgrade from exclusive to shared" is not actually guaranteed to be atomic by POSIX. Maybe it is anyway though on every OS; I suspect that may be the case, but I haven't verified this hunch yet.

Before we go down that road, however, I noticed a more important problem: there is this "actual hit" logic:

zig/src/glibc.zig

Lines 683 to 699 in c66d3f6

// Even if we get a hit, it doesn't guarantee that we finished the job last time.
// We use the presence of an "ok" file to determine if it is a true hit.
var o_directory: Compilation.Directory = .{
.handle = try comp.global_cache_directory.handle.makeOpenPath(o_sub_path, .{}),
.path = try path.join(arena, &[_][]const u8{ comp.global_cache_directory.path.?, o_sub_path }),
};
defer o_directory.handle.close();
const ok_basename = "ok";
const actual_hit = if (hit) blk: {
o_directory.handle.access(ok_basename, .{}) catch |err| switch (err) {
error.FileNotFound => break :blk false,
else => |e| return e,
};
break :blk true;
} else false;

Problem is that if hit==false and actual_hit==true, a shared lock is held, not exclusive, meaning two processes could be writing over each other's work at the same time. I'm not sure what problem I thought this was solving however because I noticed another problem which is that it never calls writeManifest. It looks like I can delete that "actual_hit" logic in favor of properly calling writeManifest when all the glibc artifacts have been fully populated. I am investigating this now. If I create a pull request, would you be willing to try it out on your 96-core machine? I can provide a binary if that helps.

andrewrk added a commit that referenced this issue Oct 29, 2022
Before, the code for building glibc stubs used a special case of the
Cache API that did not add any file inputs, and did not use
writeManifest(). This is not really how the Cache API is designed to
work and it shows because there was a race condition.

This commit adds as an input file the abilists file that comes with
Zig's installation, which has the added benefit of making glibc stub
caching properly detect cache invalidation when the user decides to
overwrite their abilists file. This harmonizes with the rest of how Zig
works, which intentionally allows you to hack the installation files and
have it behave properly with the cache system.

Finally, because of having any file inputs, the normal API flow of the
Cache system can be used, eliminating the one place that used the Cache
API in a special way. In other words, it uses writeManifest() now and
properly obeys the cache hit/miss semantics.

closes #13160
andrewrk added a commit that referenced this issue Oct 29, 2022
Before, the code for building glibc stubs used a special case of the
Cache API that did not add any file inputs, and did not use
writeManifest(). This is not really how the Cache API is designed to
work and it shows because there was a race condition.

This commit adds as an input file the abilists file that comes with
Zig's installation, which has the added benefit of making glibc stub
caching properly detect cache invalidation when the user decides to
overwrite their abilists file. This harmonizes with the rest of how Zig
works, which intentionally allows you to hack the installation files and
have it behave properly with the cache system.

Finally, because of having any file inputs, the normal API flow of the
Cache system can be used, eliminating the one place that used the Cache
API in a special way. In other words, it uses writeManifest() now and
properly obeys the cache hit/miss semantics.

closes #13160
@andrewrk
Copy link
Member

Re-opening since @motiejus reported reproducing the problem in #13342 (comment)

@andrewrk andrewrk reopened this Oct 29, 2022
@andrewrk andrewrk modified the milestones: 0.10.0, 0.10.1 Oct 29, 2022
@motiejus
Copy link
Contributor Author

Re-opening since @motiejus reported reproducing the problem in #13342 (comment)

I can no longer reproduce it. I suspect I tested master after all. In that case, sorry about the noise.

I will leave this running for the rest of the weekend (next 24h or so) and then close the task if it does not pop up again.

@motiejus
Copy link
Contributor Author

It's been running two 96-core servers for >24 hours now without failure. Good job!

@andrewrk andrewrk modified the milestones: 0.10.1, 0.10.0 Oct 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Observed behavior contradicts documented or intended behavior frontend Tokenization, parsing, AstGen, Sema, and Liveness. os-linux
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants