Unknown segfault error 4 in Talos init process #8628

Closed
DmitriyMV opened this issue Apr 22, 2024 · 6 comments · Fixed by #8638

@DmitriyMV
Member

Bug Report

Reproducible with google.golang.org/grpc 1.63.0

Description

[   47.664021] init[1056]: segfault at 40 ip 0000000000858352 sp 000000c0010a0e68 error 4 in init[400000+2409000] likely on CPU 0 (core 0, socket 0)
[   47.666604] Code: 0d 8b b0 a6 04 48 83 c4 30 5d c3 e8 c8 a0 c1 ff e9 a3 fc ff ff cc cc cc 49 3b 66 10 0f 86 b4 00 00 00 55 48 89 e5 48 83 ec 58 <48> 83 78 40 00 74 61 44 0f 11 7c 24 28 44 0f 11 7c 24 38 48 8b 48
[   47.669882] init[1056]: segfault at 40 ip 0000000000858352 sp 000000c0010a0f58 error 4 in init[400000+2409000] likely on CPU 0 (core 0, socket 0)
[   47.672282] Code: 0d 8b b0 a6 04 48 83 c4 30 5d c3 e8 c8 a0 c1 ff e9 a3 fc ff ff cc cc cc 49 3b 66 10 0f 86 b4 00 00 00 55 48 89 e5 48 83 ec 58 <48> 83 78 40 00 74 61 44 0f 11 7c 24 28 44 0f 11 7c 24 38 48 8b 48
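
For readers unfamiliar with the "error 4" part of these kernel messages: on x86 the number is the page-fault error code, and its low three bits say what kind of access faulted. The sketch below decodes only those three bits; `decodePageFault` is a hypothetical helper written for illustration, not something that exists in Talos or the kernel.

```go
package main

import "fmt"

// decodePageFault describes the low three bits of an x86 page-fault
// error code, as printed in "segfault at ... error N" kernel lines.
// Hypothetical helper for illustration; higher bits are ignored.
func decodePageFault(code uint) string {
	cause := "page not present" // bit 0 clear
	if code&1 != 0 {
		cause = "protection violation"
	}

	access := "read" // bit 1 clear
	if code&2 != 0 {
		access = "write"
	}

	mode := "kernel mode" // bit 2 clear
	if code&4 != 0 {
		mode = "user mode"
	}

	return fmt.Sprintf("%s, %s, %s", mode, access, cause)
}

func main() {
	fmt.Println(decodePageFault(4)) // user mode, read, page not present
}
```

So "error 4" at address 0x40 is a user-mode read of an unmapped page very close to zero, which is what a nil-pointer field access typically looks like.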
@DmitriyMV DmitriyMV self-assigned this Apr 22, 2024
@DmitriyMV
Member Author

DmitriyMV commented Apr 22, 2024

Ok I'm a little bit stuck here. I'm trying to get the original binary:

buildbox-am:1.22.2 > make initramfs WITH_DEBUG=true

buildbox-am:1.22.2 > unxz -k initramfs-amd64.xz

buildbox-am:1.22.2 > ls
initramfs-amd64  initramfs-amd64.xz  signing_key.x509  talosctl-linux-amd64*  vmlinuz-amd64

buildbox-am:1.22.2 > cpio -i -F initramfs-amd64
176138 blocks

buildbox-am:1.22.2 > ls
init*  initramfs-amd64  initramfs-amd64.xz  rootfs.sqsh  signing_key.x509  talosctl-linux-amd64*  vmlinuz-amd64

buildbox-am:1.22.2 > unsquashfs -d unsqsh rootfs.sqsh
Parallel unsquashfs: Using 16 processors
534 inodes (5076 blocks) to write

[========================================================================================================================================================/] 5076/5076 100%

created 411 files
created 172 directories
created 123 symlinks
created 0 devices
created 0 fifos
created 0 sockets
buildbox-am:1.22.2 > file ./unsqsh/sbin/init
./unsqsh/sbin/init: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=JvWuT5KlgY8tfTBWFiaA/a8XXu9_4_UdkdmYO0Hgd/8r1W5er2ZsleI4dyphLy/IJg0khWPfyIgfmUHbOjC, stripped

Why is it stripped?

@frezbo
Member

frezbo commented Apr 22, 2024

I believe our WITH_DEBUG is broken for talos builds

@DmitriyMV
Member Author

DmitriyMV commented Apr 22, 2024

Looks correct to me:

...

#184 [base 10/11] RUN --mount=type=cache,target=/.cache go list all >/dev/null
#184 DONE 0.2s

#185 [base 11/11] WORKDIR /src
#185 DONE 0.0s

#186 [machined-build-amd64 1/3] WORKDIR /src/internal/app/machined
#186 DONE 0.0s

#187 [init-build-amd64 1/3] WORKDIR /src/internal/app/init
#187 DONE 0.0s

#188 [init-build-amd64 2/3] RUN --mount=type=cache,target=/.cache GOOS=linux GOARCH=amd64 GOAMD64=v1 go build -tags tcell_minimal,grpcnotrace,sidero.debug -ldflags "" -o /init
#188 DONE 2.4s

#189 [machined-build-amd64 2/3] RUN --mount=type=cache,target=/.cache GOOS=linux GOARCH=amd64 GOAMD64=v2 go build -tags tcell_minimal,grpcnotrace,sidero.debug -ldflags "" -o /machined
#189 ...

#190 [init-build-amd64 3/3] RUN chmod +x /init
#190 DONE 0.1s

#189 [machined-build-amd64 2/3] RUN --mount=type=cache,target=/.cache GOOS=linux GOARCH=amd64 GOAMD64=v2 go build -tags tcell_minimal,grpcnotrace,sidero.debug -ldflags "" -o /machined
#189 DONE 8.2s

#191 [machined-build-amd64 3/3] RUN chmod +x /machined
#191 DONE 0.3s

#192 [rootfs-base-amd64 18/36] COPY --link --from=pkg-xfsprogs-amd64 / /rootfs
#192 CACHED

#193 [rootfs-base-amd64  1/36] COPY --link --from=pkg-fhs / /rootfs
#193 CACHED

...

#212 [modules-amd64 1/1] COPY --from=depmod-amd64 /build/lib/modules /lib/modules
#212 CACHED

#213 [rootfs-base-amd64 17/36] COPY --link --from=pkg-runc-amd64 / /rootfs
#213 CACHED

#214 [rootfs-base-amd64 10/36] COPY --link --from=pkg-libpopt-amd64 / /rootfs
#214 CACHED

#215 [rootfs-base-amd64 11/36] COPY --link --from=pkg-liburcu-amd64 / /rootfs
#215 CACHED

#216 [rootfs-base-amd64 23/36] COPY --link --from=pkg-kmod-amd64 /usr/bin/kmod /rootfs/sbin/modprobe
#216 CACHED

#217 [rootfs-base-amd64 14/36] COPY --link --from=pkg-lvm2-amd64 / /rootfs
#217 CACHED

#218 [rootfs-base-amd64 12/36] COPY --link --from=pkg-openssl-amd64 / /rootfs
#218 CACHED

#219 [rootfs-base-amd64 22/36] COPY --link --from=pkg-kmod-amd64 /usr/lib/libkmod.* /rootfs/lib/
#219 CACHED

#220 [rootfs-base-amd64  5/36] COPY --link --from=pkg-dosfstools-amd64 / /rootfs
#220 CACHED

#197 [rootfs-base-amd64 24/36] COPY --link --from=modules-amd64 /lib/modules /rootfs/lib/modules
#197 CACHED

#221 [rootfs-base-amd64 25/36] COPY --link --from=machined-build-amd64 /machined /rootfs/sbin/init
#221 merging 0.1s done
#221 DONE 0.1s

#222 [rootfs-base-amd64 26/36] RUN <<END (# the orderly_poweroff call by the kernel will call '/sbin/poweroff'...)
#222 DONE 0.6s

#223 [rootfs-base-amd64 27/36] COPY ./hack/cleanup.sh /toolchain/bin/cleanup.sh
#223 DONE 0.0s

#224 [rootfs-base-amd64 28/36] RUN <<END (cleanup.sh /rootfs...)
#224 0.225 strip: /rootfs/lib/modules/6.6.26-talos/modules.softdep: file format not recognized
#224 0.533 strip: /rootfs/sbin/lvm_import_vdo: file format not recognized
#224 0.901 strip: /rootfs/sbin/iptables-apply: file format not recognized
#224 0.902 strip: /rootfs/sbin/fsadm: file format not recognized
#224 0.902 strip: /rootfs/sbin/fsck.xfs: file format not recognized
#224 0.904 strip: /rootfs/sbin/lvmdump: file format not recognized
#224 0.907 strip: /rootfs/sbin/blkdeactivate: file format not recognized
#224 2.278 strip: /rootfs/usr/bin/c_rehash: file format not recognized
#224 2.299 strip: /rootfs/usr/sbin/xfs_info: file format not recognized
#224 2.300 strip: /rootfs/usr/sbin/xfs_freeze: file format not recognized
#224 2.303 strip: /rootfs/usr/sbin/xfs_mkfile: file format not recognized
#224 2.305 strip: /rootfs/usr/sbin/xfs_admin: file format not recognized
#224 2.306 strip: /rootfs/usr/sbin/xfs_scrub_all: file format not recognized
#224 2.309 strip: /rootfs/usr/sbin/xfs_ncheck: file format not recognized
#224 2.310 strip: /rootfs/usr/sbin/xfs_metadump: file format not recognized
#224 2.322 strip: /rootfs/usr/sbin/xfs_bmap: file format not recognized
#224 2.376 mkdir: created directory '/rootfs/boot'
#224 2.376 mkdir: created directory '/rootfs/boot/EFI'
#224 2.376 mkdir: created directory '/rootfs/etc/cri'
#224 2.376 mkdir: created directory '/rootfs/etc/cri/conf.d'
#224 2.376 mkdir: created directory '/rootfs/etc/cri/conf.d/hosts'
#224 2.376 mkdir: created directory '/rootfs/lib/firmware'
#224 2.376 mkdir: created directory '/rootfs/usr/local/share'
#224 2.376 mkdir: created directory '/rootfs/usr/share/zoneinfo'
#224 2.376 mkdir: created directory '/rootfs/usr/share/zoneinfo/Etc'
#224 2.376 mkdir: created directory '/rootfs/mnt'
#224 2.376 mkdir: created directory '/rootfs/system'
#224 2.376 mkdir: created directory '/rootfs/.extra'
#224 2.377 mkdir: created directory '/rootfs/etc/kubernetes'
#224 2.377 mkdir: created directory '/rootfs/etc/kubernetes/manifests'
#224 2.377 mkdir: created directory '/rootfs/etc/cni'
#224 2.377 mkdir: created directory '/rootfs/etc/cni/net.d'
#224 2.377 mkdir: created directory '/rootfs/usr/libexec/kubernetes'
#224 2.377 mkdir: created directory '/rootfs//usr/local/lib/kubelet'
#224 2.377 mkdir: created directory '/rootfs//usr/local/lib/kubelet/credentialproviders'
#224 2.378 mkdir: created directory '/rootfs/opt/containerd'
#224 2.378 mkdir: created directory '/rootfs/opt/containerd/bin'
#224 2.378 mkdir: created directory '/rootfs/opt/containerd/lib'
#224 DONE 2.4s

#225 [rootfs-base-amd64 29/36] COPY --chmod=0644 hack/zoneinfo/Etc/UTC /rootfs/usr/share/zoneinfo/Etc/UTC
#225 DONE 0.0s

#226 [rootfs-base-amd64 30/36] COPY --chmod=0644 hack/nfsmount.conf /rootfs/etc/nfsmount.conf
#226 DONE 0.1s

#227 [rootfs-base-amd64 31/36] COPY --chmod=0644 hack/containerd.toml /rootfs/etc/containerd/config.toml
#227 DONE 0.0s

#228 [rootfs-base-amd64 32/36] COPY --chmod=0644 hack/cri-containerd.toml /rootfs/etc/cri/containerd.toml
#228 DONE 0.0s

#229 [rootfs-base-amd64 33/36] COPY --chmod=0644 hack/cri-plugin.part /rootfs/etc/cri/conf.d/00-base.part
#229 DONE 0.0s

#230 [rootfs-base-amd64 34/36] COPY --chmod=0644 hack/udevd/80-net-name-slot.rules /rootfs/usr/lib/udev/rules.d/
#230 DONE 0.0s

#231 [rootfs-base-amd64 35/36] COPY --chmod=0644 hack/lvm.conf /rootfs/etc/lvm/lvm.conf
#231 DONE 0.0s

#232 [rootfs-base-amd64 36/36] RUN <<END (ln -s /usr/share/zoneinfo/Etc/UTC /rootfs/etc/localtime...)
#232 DONE 0.0s

#233 [rootfs-squashfs-amd64 1/2] RUN find /rootfs -print0     | xargs -0r touch --no-dereference --date="@1713774245"
#233 DONE 1.8s

#234 [rootfs-squashfs-amd64 2/2] RUN mksquashfs /rootfs /rootfs.sqsh -all-root -noappend -comp xz -Xdict-size 100% -no-progress
#234 0.031 Parallel mksquashfs: Using 16 processors
#234 14.10 Creating 4.0 filesystem on /rootfs.sqsh, block size 131072.
#234 14.10
#234 14.10 Exportable Squashfs 4.0 filesystem, xz compressed, data block size 131072
#234 14.10 	compressed data, compressed metadata, compressed fragments,
#234 14.10 	compressed xattrs, compressed ids
#234 14.10 	duplicates are removed
#234 14.10 Filesystem size 65295.17 Kbytes (63.76 Mbytes)
#234 14.10 	10.95% of uncompressed filesystem size (596516.90 Kbytes)
#234 14.10 Inode table size 12988 bytes (12.68 Kbytes)
#234 14.10 	31.19% of uncompressed inode table size (41640 bytes)
#234 14.10 Directory table size 6964 bytes (6.80 Kbytes)
#234 14.10 	45.86% of uncompressed directory table size (15186 bytes)
#234 14.10 Number of duplicate files found 17
#234 14.10 Number of inodes 706
#234 14.10 Number of files 411
#234 14.10 Number of fragments 81
#234 14.10 Number of symbolic links 123
#234 14.10 Number of device nodes 0
#234 14.10 Number of fifo nodes 0
#234 14.10 Number of socket nodes 0
#234 14.10 Number of directories 172
#234 14.10 Number of hard-links 0
#234 14.10 Number of ids (unique uids + gids) 1
#234 14.10 Number of uids 1
#234 14.10 	unknown (0)
#234 14.10 Number of gids 1
#234 14.10 	unknown (0)
#234 DONE 14.2s

#235 [squashfs-amd64 1/1] COPY --from=rootfs-squashfs-amd64 /rootfs.sqsh /
#235 DONE 0.0s

#236 [initramfs-archive-amd64 1/5] WORKDIR /initramfs
#236 CACHED

#237 [initramfs-archive-amd64 2/5] COPY --from=squashfs-amd64 /rootfs.sqsh .
#237 DONE 0.3s

#238 [initramfs-archive-amd64 3/5] COPY --from=init-build-amd64 /init .
#238 DONE 0.1s

#239 [initramfs-archive-amd64 4/5] RUN find . -print0     | xargs -0r touch --no-dereference --date="@1713774245"
#239 DONE 0.3s

#240 [initramfs-archive-amd64 5/5] RUN set -o pipefail     && find . 2>/dev/null     | LC_ALL=c sort     | cpio --reproducible -H newc -o     | xz -v -C crc32 -0 -e -T 0 -z     > /initramfs.xz
#240 1.995 176138 blocks
#240 2.236 (stdin): 73.0 MiB / 86.0 MiB = 0.849, 0:02
#240 DONE 2.4s

#241 [initramfs 1/1] COPY --from=initramfs-archive /initramfs.xz /initramfs-amd64.xz
#241 DONE 0.0s

#242 exporting to client directory
#242 copying files 46.21MB 0.1s
#242 copying files 76.61MB 0.2s done
#242 DONE 0.2s

@DmitriyMV
Member Author

-ldflags are "" so the binary should not be stripped.

@DmitriyMV
Member Author

I'm now certain this happens because Go code panics somewhere (I confirmed that while investigating #8626), but I don't understand why it doesn't print any stack traces. I have another way to investigate this, but for now I'm trying the easy path :)

@DmitriyMV
Member Author

Okay, we found the root of the problem. To disable stripping, you also have to remove the cleanup.sh step from the Dockerfile:

diff --git a/Dockerfile b/Dockerfile
@@ -595,9 +595,7 @@
 END
 # NB: We run the cleanup step before creating extra directories, files, and
 # symlinks to avoid accidentally cleaning them up.
-COPY ./hack/cleanup.sh /toolchain/bin/cleanup.sh
 RUN <<END
-    cleanup.sh /rootfs
     mkdir -pv /rootfs/{boot/EFI,etc/cri/conf.d/hosts,lib/firmware,usr/local/share,usr/share/zoneinfo/Etc,mnt,system,opt,.extra}
     mkdir -pv /rootfs/{etc/kubernetes/manifests,etc/cni/net.d,usr/libexec/kubernetes,/usr/local/lib/kubelet/credentialproviders}
     mkdir -pv /rootfs/opt/{containerd/bin,containerd/lib}
@@ -659,9 +657,7 @@
 END
 # NB: We run the cleanup step before creating extra directories, files, and
 # symlinks to avoid accidentally cleaning them up.
-COPY ./hack/cleanup.sh /toolchain/bin/cleanup.sh
 RUN <<END
-    cleanup.sh /rootfs
     mkdir -pv /rootfs/{boot/EFI,etc/cri/conf.d/hosts,lib/firmware,usr/local/share,usr/share/zoneinfo/Etc,mnt,system,opt,.extra}
     mkdir -pv /rootfs/{etc/kubernetes/manifests,etc/cni/net.d,usr/libexec/kubernetes,/usr/local/lib/kubelet/credentialproviders}
     mkdir -pv /rootfs/opt/{containerd/bin,containerd/lib}

Then go tool objdump -S will work properly.

Currently init segfaults here:

[    6.193878] [talos] adjusting time (slew) by 69.277557ms via 162.159.200.1, state TIME_OK, status STA_PLL | STA_NANO {"component": "controller-runtime", "controller": "time.SyncController"}
[    6.198209] init[1032]: segfault at 40 ip 00000000008cf4f2 sp 000000c000e23500 error 4 in init[400000+245f000] likely on CPU 0 (core 0, socket 0)
[    6.199606] Code: 0d 03 4a aa 04 48 83 c4 30 5d c3 e8 48 87 ba ff e9 a3 fc ff ff cc cc cc 49 3b 66 10 0f 86 b4 00 00 00 55 48 89 e5 48 83 ec 58 <48> 83 78 40 00 74 61 44 0f 11 7c 24 28 44 0f 11 7c 24 38 48 8b 48
[    6.200379] init[1032]: segfault at 40 ip 00000000008cf4f2 sp 000000c000e48500 error 4 in init[400000+245f000] likely on CPU 0 (core 0, socket 0)
[    6.200903] Code: 0d 03 4a aa 04 48 83 c4 30 5d c3 e8 48 87 ba ff e9 a3 fc ff ff cc cc cc 49 3b 66 10 0f 86 b4 00 00 00 55 48 89 e5 48 83 ec 58 <48> 83 78 40 00 74 61 44 0f 11 7c 24 28 44 0f 11 7c 24 38 48 8b 48
[    6.201660] init[1032]: segfault at 40 ip 00000000008cf4f2 sp 000000c000e232f0 error 4 in init[400000+245f000] likely on CPU 0 (core 0, socket 0)
[    6.202213] Code: 0d 03 4a aa 04 48 83 c4 30 5d c3 e8 48 87 ba ff e9 a3 fc ff ff cc cc cc 49 3b 66 10 0f 86 b4 00 00 00 55 48 89 e5 48 83 ec 58 <48> 83 78 40 00 74 61 44 0f 11 7c 24 28 44 0f 11 7c 24 38 48 8b 48

Looking at the objdump output we find the culprit:

TEXT google.golang.org/grpc/internal/channelz.(*Channel).String(SB) /.cache/mod/google.golang.org/grpc@v1.63.0/internal/channelz/channel.go
  0x8cf4e0		493b6610		CMPQ SP, 0x10(R14)							
  0x8cf4e4		0f86b4000000		JBE 0x8cf59e								
  0x8cf4ea		55			PUSHQ BP								
  0x8cf4eb		4889e5			MOVQ SP, BP								
  0x8cf4ee		4883ec58		SUBQ $0x58, SP								
  0x8cf4f2		4883784000		CMPQ 0x40(AX), $0x0	// segfaults here						
  0x8cf4f7		7461			JE 0x8cf55a			

Or, in Go code:

func (c *Channel) String() string {
	if c.Parent == nil { // segfaults here if c is nil and we have a panic
		return fmt.Sprintf("Channel #%d", c.ID)
	}
	return fmt.Sprintf("%s Channel #%d", c.Parent, c.ID)
}

The reason we don't see stack traces or anything else is that the fmt.*Print* family of functions recovers panics raised inside String methods and, for a nil receiver, simply prints <nil>. So there is no way to know that such panics even exist unless there are tests for the String output, someone attaches a debugger, or the Go process runs as PID 1.
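
A minimal sketch of that fmt behaviour, using a hypothetical `node` type shaped like the String method above (it is not the grpc code). Formatting a nil `*node` through fmt yields `<nil>` with no visible panic, while calling String directly does panic.

```go
package main

import "fmt"

// node is a hypothetical type mirroring the shape of Channel.String.
type node struct {
	parent *node
	id     int64
}

func (n *node) String() string {
	if n.parent == nil { // nil-pointer dereference when n itself is nil
		return fmt.Sprintf("node #%d", n.id)
	}
	return fmt.Sprintf("%s node #%d", n.parent, n.id)
}

// viaFmt formats n through fmt, which recovers the panic from String.
func viaFmt(n *node) string {
	return fmt.Sprintf("%v", n)
}

// directCallPanics reports whether calling n.String() directly panics.
func directCallPanics(n *node) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	_ = n.String()
	return false
}

func main() {
	var n *node
	fmt.Println(viaFmt(n))           // <nil>
	fmt.Println(directCallPanics(n)) // true
}
```

The nil dereference still happens inside the process in the first call; fmt just recovers from it, which is why nothing shows up in the program's own output.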

They fixed it in grpc/grpc-go#7101, which is included in github.com/grpc/grpc-go v1.63.2, though I guess they never knew that the problem even existed.

DmitriyMV added a commit to DmitriyMV/talos that referenced this issue Apr 23, 2024
Update other modules while we are at it.

Closes siderolabs#8628

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 23, 2024