-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libct/nsenter: fix extra runc re-exec on tmpfs #3342
Conversation
fc3332d
to
77b72a4
Compare
The debug statements added are: diff --git a/libcontainer/nsenter/cloned_binary.c b/libcontainer/nsenter/cloned_binary.c
index 2ed26fcd..e52eb1a1 100644
--- a/libcontainer/nsenter/cloned_binary.c
+++ b/libcontainer/nsenter/cloned_binary.c
@@ -143,6 +143,8 @@ static int is_self_cloned(void)
struct stat statbuf = { };
struct statfs fsbuf = { };
+ write_log(DEBUG, "=> is_self_cloned");
+
fd = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
if (fd < 0) {
write_log(WARNING, "no read access to runc binary file: %m");
@@ -156,6 +158,7 @@ static int is_self_cloned(void)
* memfd to /usr/bin/runc to allow re-use).
*/
ret = fcntl(fd, F_GET_SEALS);
+ write_log(DEBUG, "got seals %d (want %d)", ret, RUNC_MEMFD_SEALS);
if (ret >= 0) {
is_cloned = (ret == RUNC_MEMFD_SEALS);
goto out;
@@ -168,6 +171,7 @@ static int is_self_cloned(void)
*/
if (!getenv(CLONED_BINARY_ENV)) {
is_cloned = false;
+ write_log(DEBUG, "no CLONED_BINARY_ENV");
goto out;
}
@@ -177,8 +181,10 @@ static int is_self_cloned(void)
* at least be sure that it's read-only. In addition, to make sure that
* it's *our* bind-mount we check CLONED_BINARY_ENV.
*/
- if (fstatfs(fd, &fsbuf) >= 0)
+ if (fstatfs(fd, &fsbuf) >= 0) {
is_cloned |= (fsbuf.f_flags & MS_RDONLY);
+ write_log(DEBUG, "fstatfs says ro = %d", (fsbuf.f_flags & MS_RDONLY) == MS_RDONLY);
+ }
/*
* Okay, we're a tmpfile -- or we're currently running on RHEL <=7.6
@@ -188,11 +194,14 @@ static int is_self_cloned(void)
* cannot fake (because unlinking requires being able to resolve the
* path that you want to unlink).
*/
- if (fstat(fd, &statbuf) >= 0)
- is_cloned |= (statbuf.st_nlink == 0);
+ if (fstat(fd, &statbuf) >= 0) {
+ is_cloned = (statbuf.st_nlink == 0);
+ write_log(DEBUG, "fstat says nlink = %lu", statbuf.st_nlink);
+ }
out:
close(fd);
+ write_log(DEBUG, "<= is_self_cloned, is_cloned = %d", is_cloned);
return is_cloned;
}
@@ -492,6 +501,7 @@ static int clone_binary(void)
* by getting a handle for a read-only bind-mount of the execfd.
*/
execfd = try_bindfd();
+ write_log(DEBUG, "try_bindfd: %d", execfd);
if (execfd >= 0)
return execfd;
@@ -500,6 +510,7 @@ static int clone_binary(void)
* can seal the contents.
*/
execfd = make_execfd(&fdtype);
+ write_log(DEBUG, "fallback to make_execfd: %d", execfd);
if (execfd < 0 || fdtype == EFD_NONE)
return -ENOTRECOVERABLE;
@@ -532,6 +543,7 @@ static int clone_binary(void)
error_binfd:
close(binfd);
error:
+ write_log(WARNING, "clone_binary failed (EIO)");
close(execfd);
return -EIO;
}
@@ -559,6 +571,7 @@ int ensure_cloned_binary(void)
if (putenv(CLONED_BINARY_ENV "=1"))
goto error;
+ write_log(DEBUG, "re-exec itself...");
fexecve(execfd, argv, environ);
error:
close(execfd); (plus the ability to use write_log from cloned_binary). |
F_SEAL_SEAL is an interesting thing to return. I guess it's because the memfd plumbing is linked to shmem/tmpfs? Patch LGTM but we should probably send a patch for man-pages to mention this behaviour (and check if it was intentional). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Yes, I was going to do that. |
The upstream discussion concludes that this (returning |
77b72a4
to
52824b3
Compare
After adding some debug info to cloned_binary.c I found out that is_self_cloned() is not working right when runc binary is on tmpfs, resulting in one extra re-exec of runc. With some added debug: $ mkdir bin $ sudo mount -t tmpfs tmp bin $ sudo cp runc bin $ sudo ./bin/runc --debug exec xxx true DEBU[0000] nsexec[763590]: => is_self_cloned DEBU[0000] nsexec[763590]: got seals 1 (want 15) DEBU[0000] nsexec[763590]: <= is_self_cloned, is_cloned = 0 DEBU[0000] nsexec[763590]: try_bindfd: 5 DEBU[0000] nsexec[763590]: re-exec itself... DEBU[0000] nsexec[763590]: => is_self_cloned DEBU[0000] nsexec[763590]: got seals 1 (want 15) DEBU[0000] nsexec[763590]: <= is_self_cloned, is_cloned = 0 DEBU[0000] nsexec[763590]: try_bindfd: -1 DEBU[0000] nsexec[763590]: fallback to make_execfd: 5 DEBU[0000] nsexec[763590]: re-exec itself... DEBU[0000] nsexec[763590]: => is_self_cloned DEBU[0000] nsexec[763590]: got seals 15 (want 15) DEBU[0000] nsexec[763590]: <= is_self_cloned, is_cloned = 1 From the above, it is seen that - `is_self_cloned` returns 0, - `try_bindfd` is called and succeeds, - runc re-execs itself, - the second call to `is_self_cloned` returns 0 again (because GET_SEALS returns 1), - runc falls back to `make_execfd`, and re-execs again, - finally, the third `is_self_cloned` returns 1. I guess that the code relied on the following (quoting fcntl(2)): > Currently, file seals can be applied only to a file descriptor > returned by memfd_create(2) (if the MFD_ALLOW_SEALING was employed). > On other filesystems, all fcntl() operations that operate on seals > will return EINVAL. It looks like in case of a file on tmpfs it returns 1 (F_SEAL_SEAL). With the fix: DEBU[0000] nsexec[768367]: => is_self_cloned DEBU[0000] nsexec[768367]: got seals 1 (want 15) DEBU[0000] nsexec[768367]: no CLONED_BINARY_ENV DEBU[0000] nsexec[768367]: <= is_self_cloned, is_cloned = 0 DEBU[0000] nsexec[768367]: try_bindfd: 5 DEBU[0000] nsexec[768367]: re-exec itself... DEBU[0000] nsexec[768367]: => is_self_cloned DEBU[0000] nsexec[768367]: got seals 1 (want 15) DEBU[0000] nsexec[768367]: fstatfs says ro = 1 DEBU[0000] nsexec[768367]: fstat says nlink = 1 DEBU[0000] nsexec[768367]: <= is_self_cloned, is_cloned = 1 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
52824b3
to
18e2862
Compare
@cyphar PTAL (I've simplified the patch a bit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks.
@opencontainers/runc-maintainers PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, but I don't trust myself handling too much C code, so if there's risk involved, it'd be good to have another reviewer.
LGTM |
After adding some debug info to cloned_binary.c I found out that
is_self_cloned() is not working right when runc binary is on tmpfs,
resulting in one extra re-exec of runc.
With some added debug:
From the above, it is seen that
is_self_cloned
returns 0,try_bindfd
is called and succeeds,is_self_cloned
returns 0 again (because GET_SEALS returns 1),make_execfd
, and re-execs again,is_self_cloned
returns 1.I guess that the code relied on the following (quoting fcntl(2)):
It looks like in case of a file on tmpfs it returns 1 (F_SEAL_SEAL).
After the fix: