SIGSEGV (signal 11 error) with Go 1.22.0 and Ubuntu 20.04 / Debian 10 #2677

Closed
gregorex333 opened this issue Feb 25, 2024 · 7 comments
Labels
bug Something isn't working

@gregorex333

Version of Singularity

Singularity-CE 4.1.0
(also tried CE 4.1.1 with same result)

Describe the bug

Fails to mount tmpfs or ramfs for "build --sandbox", mainly with this error: signal number 11

"VERBOSE [U=0,P=1] wait_child() rpc server interrupted by signal number 11
FATAL [U=0,P=9662] Master() container creation failed: mount tmpfs->/usr/local/var/singularity/mnt/session error: while mounting tmpfs: can't mount tmpfs filesystem to /usr/local/var/singularity/mnt/session: read unix @->@: read: connection reset by peer
: exit status 255"

To Reproduce

Reset my dual-boot Linux hard drive back to its initial state with Ubuntu focal 20.04.6 LTS
Installed a small number of programs needed for using the image to be built from my custom definition
Full list of commands / packages is in the attached file:
full_reproduction.odt

Install Go 1.22.0
Install the dependencies listed at (https://docs.sylabs.io/guides/main/admin-guide/installation.html) for Ubuntu
Install Singularity into /usr/local with:

./mconfig &&
make -C ./builddir &&
sudo make -C ./builddir install

Run build --sandbox on my custom definition file, which uses a library image for ubuntu:20.04 as its base,
as well as on this basic test case:
sudo singularity -d build --sandbox ubuntu/ library://ubuntu
Both fail to mount with the above error

Also tried the config file settings listed below, each with and without these environment variables set:
export SINGULARITY_TMPDIR=/home/giovannini/sandbox/temp/tmp
export SINGULARITY_CACHEDIR=/home/giovannini/sandbox/temp/cache
(also tried with or without sudo -E)

(also tried with singularity.conf "mount tmp" = no or yes)
(also tried with singularity.conf "mount hostfs" = no or yes)
(also tried with singularity.conf "sessiondir max size" = 20480 or 64)

Also tried with "mount fs" set to tmpfs and ramfs.
All fail to mount at the same point.

Expected behavior

Expected to mount tmpfs/ramfs and create a sandbox folder OR the .sif from my definition

OS / Linux Distribution

Ubuntu focal 20.04.6 LTS

Installation Method

Install Singularity into /usr/local with:
./mconfig &&
make -C ./builddir &&
sudo make -C ./builddir install
using your GitHub release source archive for 4.1.0 (and previously 4.1.1, before resetting my OS)

Additional context

Output of mount and cat /proc/self/mountinfo,
plus the build config, is in the attached file:
mount_build_info.odt

DEBUG
Example Debug when attempting to sandbox my definition file.
Same point of failure for the other basic test case.
with or without these variables set
export SINGULARITY_TMPDIR=/home/giovannini/sandbox/temp/tmp
export SINGULARITY_CACHEDIR=/home/giovannini/sandbox/temp/cache

(also tried with singularity.conf "mount tmp" = no or yes)
(also tried with singularity.conf "mount hostfs" = no or yes)
(also tried with singularity.conf "sessiondir max size" = 20480 or 64)
(also tried with or without sudo -E for those env vars)

Debug Log.odt

@gregorex333 gregorex333 added the bug Something isn't working label Feb 25, 2024
@dtrudg
Member

dtrudg commented Feb 26, 2024

This...

wait_child() rpc server interrupted by signal number 11

Indicates that there is a segmentation fault. Please check the output of the dmesg command and provide the error messages that are there.
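
For anyone following along, a minimal sketch of pulling the relevant lines out of the kernel log is shown here. The function name segfault_lines is a hypothetical helper, and the filter only assumes that the kernel message for a user-space fault contains the word "segfault", as it does on x86 Linux; it is not taken from the attached dmesg.odt.

```shell
#!/bin/sh
# Hypothetical helper: print only the kernel log lines that report a
# user-space segmentation fault. Reads log text on stdin so it can be
# fed from `dmesg` or from a saved log file.
segfault_lines() {
    # grep exits non-zero when nothing matches; treat that as "no output"
    # rather than as an error.
    grep -i 'segfault' || true
}

# Typical use (reading the kernel log may require root on some systems):
#   sudo dmesg | segfault_lines
```

The matching lines normally name the faulting process and the library it crashed in, which is the detail being asked for.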

I believe this is probably related to an issue reported against Go 1.22.0 - golang/go#65625 that is affecting other container runtime projects also (incus / runc).

Please try building Singularity with Go 1.21.7 from https://go.dev/dl/ instead.
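
A rough sketch of the suggested downgrade follows. go_tarball_url and rebuild_with_go are hypothetical helper names; the linux/amd64 archive, the /usr/local/go install prefix, and the use of curl are assumptions, while the build commands are the ones from the report. Adjust for your own system.

```shell
#!/bin/sh
# Hypothetical helper: build the go.dev download URL for a release archive.
# Arguments: version (e.g. 1.21.7), OS (default linux), arch (default amd64).
go_tarball_url() {
    version=$1
    os=${2:-linux}
    arch=${3:-amd64}
    printf 'https://go.dev/dl/go%s.%s-%s.tar.gz\n' "$version" "$os" "$arch"
}

# Hypothetical helper: replace the Go toolchain under /usr/local/go
# (assumed install prefix), then rebuild and reinstall Singularity.
# Must be run from the top of the Singularity source tree.
rebuild_with_go() {
    url=$(go_tarball_url "$1") || return 1
    tarball=${url##*/}                       # file name part of the URL
    curl -fsSLO "$url" || return 1           # download into the current directory
    sudo rm -rf /usr/local/go || return 1    # remove the old toolchain
    sudo tar -C /usr/local -xzf "$tarball" || return 1
    PATH=/usr/local/go/bin:$PATH
    export PATH
    # Same build steps as in the original report.
    ./mconfig &&
        make -C ./builddir &&
        sudo make -C ./builddir install
}

# Usage: rebuild_with_go 1.21.7
```

Nothing runs until rebuild_with_go is called, so the file can be sourced first to check the URL that would be fetched.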

@dtrudg
Member

dtrudg commented Feb 26, 2024

The bug itself is not in Go - it's in glibc, but this is difficult to avoid:

opencontainers/runc#4193 (comment)

@dtrudg dtrudg changed the title from: Fail to mount tempfs or ramfs for "build --sandbox" to: SIGSEGV (signal 11 error) with Go 1.22.0 and Ubuntu 20.04 / Debian 10 (Feb 26, 2024)
@gregorex333
Author

This...

wait_child() rpc server interrupted by signal number 11

Indicates that there is a segmentation fault. Please check the output of the dmesg command and provide the error messages that are there.

I believe this is probably related to an issue reported against Go 1.22.0 - golang/go#65625 that is affecting other container runtime projects also (incus / runc).

Please try building Singularity with Go 1.21.7 from https://go.dev/dl/ instead.

dmesg message
dmesg.odt

Will try building with a different version of Go.

@dtrudg
Member

dtrudg commented Feb 26, 2024

@gregorex333 - thanks. I believe the dmesg output there confirms it is the same issue.

@gregorex333
Author

The bug itself is not in Go - it's in glibc, but this is difficult to avoid:

opencontainers/runc#4193 (comment)

You seem to be correct. Changing the Go installation and reinstalling has allowed the build to complete with exit status 0.
sudo singularity -d build --sandbox ubuntu/ library://ubuntu

VERBOSE [U=0,P=283260] Full() Build complete: /home/giovannini/sandbox/ubuntu
DEBUG [U=0,P=283260] cleanUp() Cleaning up "/home/giovannini/sandbox/build-temp-1227878293/rootfs" and "/tmp/bundle-temp-2089219652"
INFO [U=0,P=283260] runBuild() Build complete: ubuntu/

dtrudg added a commit to dtrudg/singularity that referenced this issue Apr 19, 2024
Adapted from: opencontainers/runc#4247

Execution of a container using a PID namespace can fail on certain
versions of glibc when Singularity is built with Go 1.22.

This is due to Go 1.22 performing calls using pthread_self which,
from glibc 2.25, is not updated for the current TID on clone.

Fixes sylabs#2677

-----

Original runc explanation:

Since glibc 2.25, the thread-local cache of the current TID is no
longer updated in the child when calling clone(2). This results in
very unfortunate behaviour when Go does pthread calls using
pthread_self(), which has the wrong TID stored.

The "simple" solution is to forcefully overwrite this cached value.
Unfortunately (and unsurprisingly), the layout of "struct pthread"
is strictly private and can change without warning.

Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks
(with the child_tid set to the cached &PTHREAD_SELF->tid), meaning
that as long as runc is using glibc, when "runc init" is spawned
the child process will have a pointer directly to the cached value
we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on
Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS).
For older kernels we need to memory scan the TLS structure
(pthread_self() returns a pointer to the start of the structure
so we can "just" scan it for a field containing the current TID
and assume that it is the correct field).

Obviously this is all very horrific, and if you are reading this
in the future, it almost certainly has caused some horrific bug
that I did not foresee. Sorry about that. As far as I can tell,
there is no other workable solution that doesn't also depend on the
CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot
"just" do a re-exec after clone(2) for security reasons.

Fixes opencontainers/runc#4233
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
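
The commit message above distinguishes two ways of locating the cached TID, depending on the kernel. A preflight check for which path a given host would take might be sketched as follows. The function names are hypothetical, and reading the kernel configuration from a file such as /boot/config-<release> is an assumption that holds on Ubuntu and Debian but not everywhere. The actual fix makes this decision at runtime in C; this is only an illustration of the condition described.

```shell
#!/bin/sh
# Succeeds when a kernel release string such as "5.4.0-173-generic"
# is version 3.5 or later.
kernel_at_least_3_5() {
    major=${1%%.*}                 # text before the first dot
    rest=${1#*.}                   # text after the first dot
    minor=${rest%%[!0-9]*}         # leading digits of the remainder
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "${minor:-0}" -ge 5 ]; }
}

# Print "prctl" when prctl(PR_GET_TID_ADDRESS) should be usable according
# to the commit message, otherwise "tls-scan".
# $1: kernel release, $2: path to an uncompressed kernel config file.
tid_lookup_strategy() {
    if kernel_at_least_3_5 "$1" &&
        grep -q '^CONFIG_CHECKPOINT_RESTORE=y$' "$2" 2>/dev/null; then
        echo prctl
    else
        echo tls-scan
    fi
}

# Typical use on distributions that ship /boot/config-<release>:
#   tid_lookup_strategy "$(uname -r)" "/boot/config-$(uname -r)"
```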
@dtrudg
Member

dtrudg commented May 28, 2024

Just noting that this will hopefully be solved in Go 1.22.4 by a backport of https://go-review.googlesource.com/c/go/+/587919

@dtrudg
Member

dtrudg commented Jun 11, 2024

This is addressed by Go 1.22.4 - tested and confirmed.
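
To check whether an installed toolchain falls in the problem range, something like the sketch below could be used. The function name go_version_affected is hypothetical. The thread only confirms the failure with Go 1.22.0 and the fix in Go 1.22.4; treating 1.22.1 through 1.22.3 as affected is an inference from the fix being a backport into 1.22.4.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the given Go version string (with or
# without the "go" prefix, e.g. "go1.22.0" or "1.22.3") is one of the
# releases from 1.22.0 to 1.22.3.
go_version_affected() {
    v=${1#go}                      # strip an optional leading "go"
    case $v in
        1.22.0|1.22.1|1.22.2|1.22.3) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use: the third field of `go version` output is the version,
# e.g. "go version go1.22.0 linux/amd64".
#   if go_version_affected "$(go version | awk '{print $3}')"; then
#       echo "rebuild Singularity with Go 1.22.4 or later, or with Go 1.21.x"
#   fi
```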

@dtrudg dtrudg closed this as completed Jun 11, 2024