lib/commonio: make lock failures more detailed #562
Conversation
This tweaks the database locking logic so that failures in the link-checking paths are more detailed. The rationale for this is that I've experienced a non-deterministic bug which seems to be coming from this logic, and I'd like to get more details about the actual failing condition.
```c
		(void) fprintf (shadow_logfd,
		                "%s: %s file stat error: %s\n",
		                shadow_progname, file, strerror (errno));
		}
		return 0;
	}

	if (sb.st_nlink != 2) {
```
I haven't yet managed to reproduce the failure in a controlled environment, but I'm suspicious of this `st_nlink` check.
In particular, I think it is theoretically possible on a multi-processor system to race the inode update and observe a stale link count of 1 here after the `link()` call succeeded.
I've seen other comments hinting at similar caching effects, e.g. https://github.com/miquels/liblockfile/blob/v1.17/lockfile.c#L319-L320.
What is this check actually meant to protect against? Would it be possible to safely drop it? Or maybe perform an `fsync()` on the file before this?
If two processes are calling `commonio_lock()` at the same time, the first will link `%s` to `%s.$pid`, causing `%s` to have link count 2; then the second will link `%s` to `%s.$pid2`, causing `%s` to have link count 3 (either `$pid` or `$pid2` could end up finding link count 3, or both).
I would be very surprised if the kernel's VFS code was keeping the link count high, causing a race, but your results are interesting. One suspect would be overlay, if you're using that.
Hmm, but here the file-linking goes in the other direction, i.e. `link(dbname.pid, dbname.lock)` will create `dbname.lock` as a new name for the existing `dbname.pid`.
At that point a second call like `link(dbname.pid2, dbname.lock)` will directly fail with `EEXIST` if the `.lock` file already exists, without even reaching this point.
I don't think the link count here could ever reach anything greater than 2. On the other hand, I think a (wrong/cached) value of 1 could be observed in some situations.
Yes, it is happening in a bit of a complex environment with bind mounts and overlays involved, but I suspect the real culprit is some kind of weakly-synchronized caching on SMP machines.
@hallyn while we aren't fully aligned on where to pinpoint the real issue and how to fix it, I think this logging-only PR should be safe to land without introducing further havoc, and it may help shed some light on the low-level behavior at play. Could you maybe review/merge this?
Yes, looks good, thanks.
This tweaks the database locking logic so that failures in the
link-checking paths are more detailed.
The rationale for this is that I've experienced a non-deterministic
bug which seems to be coming from this logic, and I'd like to get
more details about the actual failing condition.
Ref: coreos/fedora-coreos-tracker#1250