Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libct/nsenter: fix extra runc re-exec on tmpfs #3342

Merged
merged 1 commit into from
Feb 16, 2022

Conversation

kolyshkin
Copy link
Contributor

After adding some debug info to cloned_binary.c I found out that
is_self_cloned() is not working right when runc binary is on tmpfs,
resulting in one extra re-exec of runc.

With some added debug:

$ mkdir bin
$ sudo mount -t tmpfs tmp bin
$ sudo cp runc bin
$ sudo ./bin/runc --debug exec xxx true
DEBU[0000] nsexec[763590]: => is_self_cloned
DEBU[0000] nsexec[763590]: got seals 1 (want 15)
DEBU[0000] nsexec[763590]: <= is_self_cloned, is_cloned = 0
DEBU[0000] nsexec[763590]: try_bindfd: 5
DEBU[0000] nsexec[763590]: re-exec itself...
DEBU[0000] nsexec[763590]: => is_self_cloned
DEBU[0000] nsexec[763590]: got seals 1 (want 15)
DEBU[0000] nsexec[763590]: <= is_self_cloned, is_cloned = 0
DEBU[0000] nsexec[763590]: try_bindfd: -1
DEBU[0000] nsexec[763590]: fallback to make_execfd: 5
DEBU[0000] nsexec[763590]: re-exec itself...
DEBU[0000] nsexec[763590]: => is_self_cloned
DEBU[0000] nsexec[763590]: got seals 15 (want 15)
DEBU[0000] nsexec[763590]: <= is_self_cloned, is_cloned = 1

From the above, it is seen that

  • is_self_cloned returns 0,
  • try_bindfd is called and succeeds,
  • runc re-execs itself,
  • the second call to is_self_cloned returns 0 again (because GET_SEALS returns 1),
  • runc falls back to make_execfd, and re-execs again,
  • finally, the third is_self_cloned returns 1.

I guess that the code relied on the following (quoting fcntl(2)):

Currently, file seals can be applied only to a file descriptor
returned by memfd_create(2) (if the MFD_ALLOW_SEALING was employed).
On other filesystems, all fcntl() operations that operate on seals
will return EINVAL.

It looks like in case of a file on tmpfs it returns 1 (F_SEAL_SEAL).

After the fix:

DEBU[0000] nsexec[768367]: => is_self_cloned
DEBU[0000] nsexec[768367]: got seals 1 (want 15)
DEBU[0000] nsexec[768367]: no CLONED_BINARY_ENV
DEBU[0000] nsexec[768367]: <= is_self_cloned, is_cloned = 0
DEBU[0000] nsexec[768367]: try_bindfd: 5
DEBU[0000] nsexec[768367]: re-exec itself...
DEBU[0000] nsexec[768367]: => is_self_cloned
DEBU[0000] nsexec[768367]: got seals 1 (want 15)
DEBU[0000] nsexec[768367]: fstatfs says ro = 1
DEBU[0000] nsexec[768367]: fstat says nlink = 1
DEBU[0000] nsexec[768367]: <= is_self_cloned, is_cloned = 1

@kolyshkin
Copy link
Contributor Author

The debug statements added are:

diff --git a/libcontainer/nsenter/cloned_binary.c b/libcontainer/nsenter/cloned_binary.c
index 2ed26fcd..e52eb1a1 100644
--- a/libcontainer/nsenter/cloned_binary.c
+++ b/libcontainer/nsenter/cloned_binary.c
@@ -143,6 +143,8 @@ static int is_self_cloned(void)
 	struct stat statbuf = { };
 	struct statfs fsbuf = { };
 
+	write_log(DEBUG, "=> is_self_cloned");
+
 	fd = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
 	if (fd < 0) {
 		write_log(WARNING, "no read access to runc binary file: %m");
@@ -156,6 +158,7 @@ static int is_self_cloned(void)
 	 * memfd to /usr/bin/runc to allow re-use).
 	 */
 	ret = fcntl(fd, F_GET_SEALS);
+	write_log(DEBUG, "got seals %d (want %d)", ret, RUNC_MEMFD_SEALS);
 	if (ret >= 0) {
 		is_cloned = (ret == RUNC_MEMFD_SEALS);
 		goto out;
@@ -168,6 +171,7 @@ static int is_self_cloned(void)
 	 */
 	if (!getenv(CLONED_BINARY_ENV)) {
 		is_cloned = false;
+		write_log(DEBUG, "no CLONED_BINARY_ENV");
 		goto out;
 	}
 
@@ -177,8 +181,10 @@ static int is_self_cloned(void)
 	 * at least be sure that it's read-only. In addition, to make sure that
 	 * it's *our* bind-mount we check CLONED_BINARY_ENV.
 	 */
-	if (fstatfs(fd, &fsbuf) >= 0)
+	if (fstatfs(fd, &fsbuf) >= 0) {
 		is_cloned |= (fsbuf.f_flags & MS_RDONLY);
+		write_log(DEBUG, "fstatfs says ro = %d", (fsbuf.f_flags & MS_RDONLY) == MS_RDONLY);
+       }
 
 	/*
 	 * Okay, we're a tmpfile -- or we're currently running on RHEL <=7.6
@@ -188,11 +194,14 @@ static int is_self_cloned(void)
 	 * cannot fake (because unlinking requires being able to resolve the
 	 * path that you want to unlink).
 	 */
-	if (fstat(fd, &statbuf) >= 0)
-		is_cloned |= (statbuf.st_nlink == 0);
+	if (fstat(fd, &statbuf) >= 0) {
+		is_cloned = (statbuf.st_nlink == 0);
+		write_log(DEBUG, "fstat says nlink = %lu", statbuf.st_nlink);
+	}
 
 out:
 	close(fd);
+	write_log(DEBUG, "<= is_self_cloned, is_cloned = %d", is_cloned);
 	return is_cloned;
 }
 
@@ -492,6 +501,7 @@ static int clone_binary(void)
 	 * by getting a handle for a read-only bind-mount of the execfd.
 	 */
 	execfd = try_bindfd();
+	write_log(DEBUG, "try_bindfd: %d", execfd);
 	if (execfd >= 0)
 		return execfd;
 
@@ -500,6 +510,7 @@ static int clone_binary(void)
 	 * can seal the contents.
 	 */
 	execfd = make_execfd(&fdtype);
+	write_log(DEBUG, "fallback to make_execfd: %d", execfd);
 	if (execfd < 0 || fdtype == EFD_NONE)
 		return -ENOTRECOVERABLE;
 
@@ -532,6 +543,7 @@ static int clone_binary(void)
 error_binfd:
 	close(binfd);
 error:
+	write_log(WARNING, "clone_binary failed (EIO)");
 	close(execfd);
 	return -EIO;
 }
@@ -559,6 +571,7 @@ int ensure_cloned_binary(void)
 	if (putenv(CLONED_BINARY_ENV "=1"))
 		goto error;
 
+	write_log(DEBUG, "re-exec itself...");
 	fexecve(execfd, argv, environ);
 error:
 	close(execfd);

(plus the ability to use write_log from cloned_binary).

@cyphar
Copy link
Member

cyphar commented Jan 20, 2022

F_SEAL_SEAL is an interesting thing to return. I guess it's because the memfd plumbing is linked to shmem/tmpfs?

Patch LGTM but we should probably send a patch for man-pages to mention this behaviour (and check if it was intentional).

cyphar
cyphar previously approved these changes Jan 20, 2022
Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kolyshkin
Copy link
Contributor Author

Patch LGTM but we should probably send a patch for man-pages to mention this behaviour (and check if it was intentional).

Yes, I was going to do that.

@kolyshkin
Copy link
Contributor Author

https://lore.kernel.org/linux-man/20220122005251.1441343-1-kolyshkin@gmail.com/

mrunalp
mrunalp previously approved these changes Jan 26, 2022
@kolyshkin
Copy link
Contributor Author

The upstream discussion concludes that this (returning F_SEAL_SEAL on tmpfs/hugetlbfs) is the intended behavior.

@kolyshkin kolyshkin dismissed stale reviews from mrunalp and cyphar via 52824b3 January 27, 2022 16:34
@kolyshkin kolyshkin requested review from mrunalp and cyphar January 27, 2022 16:35
After adding some debug info to cloned_binary.c I found out that
is_self_cloned() is not working right when runc binary is on tmpfs,
resulting in one extra re-exec of runc.

With some added debug:

	$ mkdir bin
	$ sudo mount -t tmpfs tmp bin
	$ sudo cp runc bin
	$ sudo ./bin/runc --debug exec xxx true
	DEBU[0000] nsexec[763590]: => is_self_cloned
	DEBU[0000] nsexec[763590]: got seals 1 (want 15)
	DEBU[0000] nsexec[763590]: <= is_self_cloned, is_cloned = 0
	DEBU[0000] nsexec[763590]: try_bindfd: 5
	DEBU[0000] nsexec[763590]: re-exec itself...
	DEBU[0000] nsexec[763590]: => is_self_cloned
	DEBU[0000] nsexec[763590]: got seals 1 (want 15)
	DEBU[0000] nsexec[763590]: <= is_self_cloned, is_cloned = 0
	DEBU[0000] nsexec[763590]: try_bindfd: -1
	DEBU[0000] nsexec[763590]: fallback to make_execfd: 5
	DEBU[0000] nsexec[763590]: re-exec itself...
	DEBU[0000] nsexec[763590]: => is_self_cloned
	DEBU[0000] nsexec[763590]: got seals 15 (want 15)
	DEBU[0000] nsexec[763590]: <= is_self_cloned, is_cloned = 1

From the above, it is seen that
 - `is_self_cloned` returns 0,
 - `try_bindfd` is called and succeeds,
 - runc re-execs itself,
 - the second call to `is_self_cloned` returns 0 again (because GET_SEALS returns 1),
 - runc falls back to `make_execfd`, and re-execs again,
 - finally, the third `is_self_cloned` returns 1.

I guess that the code relied on the following (quoting fcntl(2)):

> Currently, file seals can be applied only to a file descriptor
> returned by memfd_create(2) (if the MFD_ALLOW_SEALING was employed).
> On other filesystems, all fcntl() operations that operate on seals
> will return EINVAL.

It looks like in case of a file on tmpfs it returns 1 (F_SEAL_SEAL).

With the fix:

	DEBU[0000] nsexec[768367]: => is_self_cloned
	DEBU[0000] nsexec[768367]: got seals 1 (want 15)
	DEBU[0000] nsexec[768367]: no CLONED_BINARY_ENV
	DEBU[0000] nsexec[768367]: <= is_self_cloned, is_cloned = 0
	DEBU[0000] nsexec[768367]: try_bindfd: 5
	DEBU[0000] nsexec[768367]: re-exec itself...
	DEBU[0000] nsexec[768367]: => is_self_cloned
	DEBU[0000] nsexec[768367]: got seals 1 (want 15)
	DEBU[0000] nsexec[768367]: fstatfs says ro = 1
	DEBU[0000] nsexec[768367]: fstat says nlink = 1
	DEBU[0000] nsexec[768367]: <= is_self_cloned, is_cloned = 1

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor Author

@cyphar PTAL (I've simplified the patch a bit)

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

@kolyshkin
Copy link
Contributor Author

@opencontainers/runc-maintainers PTAL

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, but I don't trust myself handling too much C code, so if there's risk involved, it'd be good to have another reviewer.

@dqminh
Copy link
Contributor

dqminh commented Feb 16, 2022

LGTM

@dqminh dqminh merged commit c648753 into opencontainers:main Feb 16, 2022
@cyphar cyphar mentioned this pull request Mar 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants