
aws: kola podman.base tests failing with kernel warning percpu_ref_switch_to_atomic_rcu+0x12f/0x140 #507

Closed
dustymabe opened this issue May 29, 2020 · 19 comments


@dustymabe
Member

This started failing in the past few days. For example from the 32.20200528.20.1 build:

--- FAIL: podman.base (79.93s)
    --- PASS: podman.base/info (3.91s)
    --- PASS: podman.base/resources (11.83s)
            cluster.go:141: Getting image source signatures
            cluster.go:141: Copying blob sha256:8f046f2c85310a09e3aff5883c38690bdee467b6f610e2b88619b2ade716adb1
            cluster.go:141: Copying config sha256:2f32e674bac1629ec7d0f94a17e996ef4b40b1f39d6ecd94e9782b46847c1508
            cluster.go:141: Writing manifest to image destination
            cluster.go:141: Storing signatures
            cluster.go:141: Your kernel does not support Block I/O weight or the cgroup is not mounted. Weight discarded.
        harness.go:777: Found kernel warning (lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0x12f/0x140) on machine i-0720085490c11279d console

The same tests pass on QEMU and GCP. We'll need to dig in to see what the issue is.

@lucab
Contributor

lucab commented May 29, 2020

This happens on shutdown. The kernel is complaining about the systemd manager (PID 1) and cgroup-bpf:

[ 58.533377] percpu ref (cgroup_bpf_release_fn) <= 0 (-8) after switching to atomic
[ 58.533403] WARNING: CPU: 0 PID: 1 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0x12f/0x140

The full warning is:

[   58.530005] ------------[ cut here ]------------
[   58.533377] percpu ref (cgroup_bpf_release_fn) <= 0 (-8) after switching to atomic
[   58.533403] WARNING: CPU: 0 PID: 1 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0x12f/0x140
[   58.551375] Modules linked in: overlay vfat fat intel_rapl_msr intel_rapl_common snd_pcsp sb_edac snd_pcm cirrus snd_timer intel_rapl_perf drm_kms_helper snd soundcore i2c_piix4 br_netfilter bridge stp llc drm ip_tables xfs rfkill dm_multipath crct10dif_pclmul crc32_pclmul xen_blkfront ghash_clmulni_intel ixgbevf serio_raw ata_generic pata_acpi btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c crc32c_intel zstd_decompress fuse
[   58.599868] CPU: 0 PID: 1 Comm: systemd Tainted: G        W         5.6.14-300.fc32.x86_64 #1
[   58.605298] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[   58.609543] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x12f/0x140
[   58.614511] Code: eb 99 80 3d 95 6c 30 01 00 0f 85 4d ff ff ff 48 8b 55 d8 48 8b 75 e8 48 c7 c7 90 e4 3a 9a c6 05 79 6c 30 01 01 e8 18 07 bd ff <0f> 0b e9 2b ff ff ff 0f 0b eb a2 cc cc cc cc cc cc 8d 8c 16 ef be
[   58.627858] RSP: 0018:ffffba9d80003ec0 EFLAGS: 00010282
[   58.631684] RAX: 0000000000000046 RBX: 7ffffffffffffff7 RCX: 0000000000000000
[   58.636654] RDX: 0000000000000046 RSI: ffffffff9b27a486 RDI: 0000000000000246
[   58.641111] RBP: ffffa08ac3a67920 R08: 0000000da0dc74ed R09: 0000000000000046
[   58.645463] R10: 0000000080000000 R11: 000000000001d499 R12: 00003a12b8035a70
[   58.649917] R13: 0000000000000000 R14: ffffa08ac6a0cd00 R15: ffffa08ac7a2bc50
[   58.654215] FS:  00007facdef7f040(0000) GS:ffffa08ac7a00000(0000) knlGS:0000000000000000
[   58.659039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.662720] CR2: 00007f85037c0b60 CR3: 000000020626a003 CR4: 00000000001606f0
[   58.667504] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   58.674616] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   58.682125] Call Trace:
[   58.684607]  <IRQ>
[   58.686758]  rcu_do_batch+0x191/0x3e0
[   58.690032]  rcu_core+0x18a/0x2e0
[   58.693096]  __do_softirq+0xe9/0x2dc
[   58.696335]  irq_exit+0xcf/0x110
[   58.699276]  xen_evtchn_do_upcall+0x2c/0x40
[   58.704779]  xen_hvm_callback_vector+0xf/0x20
[   58.708568]  </IRQ>
[   58.710764] RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
[   58.715411] Code: e0 4c 39 f0 76 d8 80 4d 00 04 eb 92 cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 e8 b6 3b 70 ff 66 90 48 89 f7 57 9d <0f> 1f 44 00 00 c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 07
[   58.730911] RSP: 0018:ffffba9d80023cb8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff0c
[   58.736892] RAX: 0000000000000001 RBX: ffffffff9b3f08c0 RCX: 0000000000000000
[   58.743285] RDX: 00000000000003f9 RSI: 0000000000000286 RDI: 0000000000000286
[   58.749545] RBP: 0000000000000000 R08: 0000000000000ca3 R09: ffffa08aab1eea10
[   58.755949] R10: 0000000000000011 R11: 0000000000000000 R12: 0000000000000011
[   58.761757] R13: 0000000000000000 R14: ffffa08aab1e9811 R15: ffffa08ac25ec000
[   58.767412]  uart_write+0x110/0x1c0
[   58.770724]  n_tty_write+0x135/0x4b0
[   58.774130]  ? init_wait_entry+0x30/0x30
[   58.777018]  tty_write+0x1a0/0x2e0
[   58.779548]  ? n_tty_receive_char_lnext+0x190/0x190
[   58.782939]  do_iter_write+0x151/0x190
[   58.785610]  vfs_writev+0x7a/0xc0
[   58.788030]  do_writev+0x57/0xf0
[   58.794368]  do_syscall_64+0x5b/0xf0
[   58.796975]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   58.800449] RIP: 0033:0x7face00c558d
[   58.803048] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 5a eb f8 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 14 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 8e eb f8 ff 48
[   58.818028] RSP: 002b:00007fffbae385a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
[   58.860655] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007face00c558d
[   58.868900] RDX: 0000000000000005 RSI: 00007fffbae385f0 RDI: 000000000000003a
[   58.877329] RBP: 000000000000003a R08: 0000000000000000 R09: 000000000000002b
[   58.882363] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000004
[   58.921586] R13: 00007fffbae385f0 R14: 0000000000000003 R15: 000056218ca69b60
[   58.937784] ---[ end trace 02fb46d457195408 ]---

@lucab
Contributor

lucab commented May 29, 2020

The kernel stack trace goes through some Xen-specific call sites, so it seems plausible that CI only hits this on AWS and not on other platforms.

/cc @davdunc

@lucab
Contributor

lucab commented Jun 3, 2020

Forwarded to Fedora bugzilla at https://bugzilla.redhat.com/show_bug.cgi?id=1843546.

@cgwalters
Member

Note also that the reason these Xen calls show up here is that kola defaults to m4 rather than the more modern instance types - xref openshift/installer#2919

Hopefully most people using FCOS are on m5 or better... and we really need our CI to do that too.
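
For anyone re-running these tests by hand, the instance type can be overridden per kola invocation rather than waiting on the default change. A minimal sketch, assuming kola's AWS options still use the --aws-ami and --aws-type flag names (double-check with kola run --help) and with a placeholder AMI ID:

$ kola run -p aws --aws-ami <ami-id> --aws-type m5.large podman.base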

cgwalters added a commit to cgwalters/coreos-assembler that referenced this issue Jun 3, 2020
m4 is old and shouldn't be the default; this came up as
part of coreos/fedora-coreos-tracker#507

@dustymabe
Member Author

dustymabe commented Jun 4, 2020

I went back through our CI tests for our next-devel stream to see when this started failing and narrowed it down to a single update.

The diff in that update is:

$ rpm-ostree --repo=./ db diff 08ce3c2834f96f5ea4afda38c361e4d2e3833e4a1177f077adf15c382fa30a84 ca0b2917364f24c19d54dbfee2c880214ef3ce346da0ce6bab0ee5f9ecd16782
ostree diff commit from: 08ce3c2834f96f5ea4afda38c361e4d2e3833e4a1177f077adf15c382fa30a84
ostree diff commit to:   ca0b2917364f24c19d54dbfee2c880214ef3ce346da0ce6bab0ee5f9ecd16782
Upgraded:
  NetworkManager 1:1.22.10-1.fc32 -> 1:1.22.12-1.fc32
  NetworkManager-libnm 1:1.22.10-1.fc32 -> 1:1.22.12-1.fc32
  NetworkManager-team 1:1.22.10-1.fc32 -> 1:1.22.12-1.fc32
  NetworkManager-tui 1:1.22.10-1.fc32 -> 1:1.22.12-1.fc32
  afterburn 4.3.3-1.fc32 -> 4.4.0-1.fc32
  afterburn-dracut 4.3.3-1.fc32 -> 4.4.0-1.fc32
  cups-libs 1:2.3.1-9.fc32 -> 1:2.3.3-2.fc32
  glib2 2.64.2-2.fc32 -> 2.64.3-1.fc32
  gnutls 3.6.13-2.fc32 -> 3.6.13-3.fc32
  grub2-common 1:2.04-16.fc32 -> 1:2.04-18.fc32
  grub2-efi-x64 1:2.04-16.fc32 -> 1:2.04-18.fc32
  grub2-pc 1:2.04-16.fc32 -> 1:2.04-18.fc32
  grub2-pc-modules 1:2.04-16.fc32 -> 1:2.04-18.fc32
  grub2-tools 1:2.04-16.fc32 -> 1:2.04-18.fc32
  grub2-tools-minimal 1:2.04-16.fc32 -> 1:2.04-18.fc32
  kernel 5.6.13-300.fc32 -> 5.6.14-300.fc32
  kernel-core 5.6.13-300.fc32 -> 5.6.14-300.fc32
  kernel-modules 5.6.13-300.fc32 -> 5.6.14-300.fc32
  libarchive 3.4.2-1.fc32 -> 3.4.3-1.fc32
  libblkid 2.35.1-8.fc32 -> 2.35.2-1.fc32
  libfdisk 2.35.1-8.fc32 -> 2.35.2-1.fc32
  libipa_hbac 2.2.3-13.fc32 -> 2.3.0-1.fc32
  libldb 2.1.2-1.fc32 -> 2.1.3-1.fc32
  libmount 2.35.1-8.fc32 -> 2.35.2-1.fc32
  libsmartcols 2.35.1-8.fc32 -> 2.35.2-1.fc32
  libsmbclient 2:4.12.2-0.fc32.1 -> 2:4.12.3-0.fc32.1
  libsss_certmap 2.2.3-13.fc32 -> 2.3.0-1.fc32
  libsss_idmap 2.2.3-13.fc32 -> 2.3.0-1.fc32
  libsss_nss_idmap 2.2.3-13.fc32 -> 2.3.0-1.fc32
  libsss_sudo 2.2.3-13.fc32 -> 2.3.0-1.fc32
  libuuid 2.35.1-8.fc32 -> 2.35.2-1.fc32
  libwbclient 2:4.12.2-0.fc32.1 -> 2:4.12.3-0.fc32.1
  libzstd 1.4.4-2.fc32 -> 1.4.5-1.fc32
  linux-firmware 20200421-107.fc32 -> 20200519-108.fc32
  linux-firmware-whence 20200421-107.fc32 -> 20200519-108.fc32
  mdadm 4.1-4.fc32 -> 4.1-5.fc32
  microcode_ctl 2:2.1-36.fc32 -> 2:2.1-37.fc32
  moby-engine 19.03.8-1.ce.gitafacb8b.fc32 -> 19.03.8-2.ce.gitafacb8b.fc32
  samba-client-libs 2:4.12.2-0.fc32.1 -> 2:4.12.3-0.fc32.1
  samba-common 2:4.12.2-0.fc32.1 -> 2:4.12.3-0.fc32.1
  samba-common-libs 2:4.12.2-0.fc32.1 -> 2:4.12.3-0.fc32.1
  selinux-policy 3.14.5-38.fc32 -> 3.14.5-39.fc32
  selinux-policy-targeted 3.14.5-38.fc32 -> 3.14.5-39.fc32
  sssd 2.2.3-13.fc32 -> 2.3.0-1.fc32
  sssd-ad 2.2.3-13.fc32 -> 2.3.0-1.fc32
  sssd-client 2.2.3-13.fc32 -> 2.3.0-1.fc32
  sssd-common 2.2.3-13.fc32 -> 2.3.0-1.fc32
  sssd-common-pac 2.2.3-13.fc32 -> 2.3.0-1.fc32
  sssd-ipa 2.2.3-13.fc32 -> 2.3.0-1.fc32
  sssd-krb5 2.2.3-13.fc32 -> 2.3.0-1.fc32
  sssd-krb5-common 2.2.3-13.fc32 -> 2.3.0-1.fc32
  sssd-ldap 2.2.3-13.fc32 -> 2.3.0-1.fc32
  util-linux 2.35.1-8.fc32 -> 2.35.2-1.fc32
  vim-minimal 2:8.2.789-1.fc32 -> 2:8.2.806-1.fc32
  zincati 0.0.10-2.fc32 -> 0.0.11-1.fc32
Added:
  firewalld-filesystem-0.8.2-3.fc32.noarch

So the kernel 5.6.13-300.fc32 -> 5.6.14-300.fc32 update seems to be the most notable change. The microcode_ctl 2:2.1-36.fc32 -> 2:2.1-37.fc32 update may also be relevant.

Added an update to the BZ with this info: https://bugzilla.redhat.com/show_bug.cgi?id=1843546#c1
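
As a rough way to confirm the kernel (rather than microcode_ctl or anything else in the diff) is the trigger, one could override just the kernel packages back to 5.6.13 on an affected instance and re-run the test. This is only a sketch and assumes the older RPMs are fetched locally via koji:

$ koji download-build --arch=x86_64 kernel-5.6.13-300.fc32
$ sudo rpm-ostree override replace ./kernel-5.6.13-300.fc32.x86_64.rpm \
    ./kernel-core-5.6.13-300.fc32.x86_64.rpm \
    ./kernel-modules-5.6.13-300.fc32.x86_64.rpm
$ sudo systemctl reboot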

@dustymabe
Member Author

I was testing a switch to m5.large instances and happened to see an instance of this (or what looks like the same thing) from the podman.network-single test:

--- FAIL: podman.network-single (385.12s)  
        cluster.go:141: Getting image source signatures
        cluster.go:141: Copying blob sha256:646048ea2a8c1921113b94d45ea1c6fb6dce5876c17c4c60170ce9efc5b199f7                                        
        cluster.go:141: Copying config sha256:3687ac1b92ec846f2d315471ef149d2f450470896af8c8d9381f01daf715ecc1                                                                                                                                                                                           
        cluster.go:141: Writing manifest to image destination
        cluster.go:141: Storing signatures
        Found kernel warning (lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0x12f/0x140) on machine i-020068edfa1b6c28e console
[  223.819773] ------------[ cut here ]------------
[  223.825700] percpu ref (cgroup_bpf_release_fn) <= 0 (0) after switching to atomic
[  223.825718] WARNING: CPU: 0 PID: 10 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0x12f/0x140
[  223.848342] Modules linked in: ip6table_nat ip6_tables xt_conntrack iptable_filter xt_MASQUERADE xt_comment iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth overlay vfat fat intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm snd_pcsp snd_pcm ppdev snd_timer snd intel_rapl_perf soundcore parport_pc i2c_piix4 parport br_netfilter bridge stp llc drm ip_tables xfs rfkill dm_multipath nvme crct10dif_pclmul crc32_pclmul nvme_core ena ghash_clmulni_intel serio_raw btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c crc32c_intel zstd_decompress fuse
[  223.899807] CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.6.18-300.fc32.x86_64 #1
[  223.910605] Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
[  223.917189] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x12f/0x140
[  223.923661] Code: eb 99 80 3d 34 6f 30 01 00 0f 85 4d ff ff ff 48 8b 55 d8 48 8b 75 e8 48 c7 c7 f0 e8 3a af c6 05 18 6f 30 01 01 e8 d8 09 bd ff <0f> 0b e9 2b ff ff ff 0f 0b eb a2 cc cc cc cc cc cc 8d 8c 16 ef be
[  223.942909] RSP: 0018:ffffac6b00057dd0 EFLAGS: 00010286
[  223.948992] RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000
[  223.955900] RDX: 0000000000000045 RSI: ffffffffb027a485 RDI: 0000000000000246
[  223.962817] RBP: ffff9e996175b920 R08: 000000341d0d270e R09: 0000000000000045
[  223.969755] R10: 0000000080000000 R11: 000000000001d4ac R12: 00002dd19a840d28
[  223.976616] R13: 0000000000000000 R14: ffff9e996462c000 R15: ffff9e996542bc50
[  223.983462] FS:  0000000000000000(0000) GS:ffff9e9965400000(0000) knlGS:0000000000000000
[  223.994476] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  224.000745] CR2: 00007f64a79c0480 CR3: 0000000223e3e003 CR4: 00000000007606f0
[  224.007640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  224.014554] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  224.021484] PKRU: 55555554
[  224.026507] Call Trace:
[  224.031444]  rcu_do_batch+0x191/0x3e0
[  224.036900]  rcu_core+0x18a/0x2e0
[  224.042185]  __do_softirq+0xe9/0x2dc
[  224.047599]  run_ksoftirqd+0x26/0x40
[  224.053023]  smpboot_thread_fn+0xc5/0x160
[  224.058650]  ? smpboot_register_percpu_thread+0xf0/0xf0
[  224.064859]  kthread+0x115/0x140
[  224.070178]  ? __kthread_bind_mask+0x60/0x60
[  224.075887]  ret_from_fork+0x35/0x40
[  224.081352] ---[ end trace 4facec8a11082f2c ]---

@jlebon
Member

jlebon commented Jun 22, 2020

Another possibly related instance of this is:

--- FAIL: fcos.upgrade.basic (277.59s)
    --- PASS: fcos.upgrade.basic/setup (40.92s)
    --- PASS: fcos.upgrade.basic/upgrade-from-previous (83.02s)
    --- PASS: fcos.upgrade.basic/upgrade-from-current (66.90s)
        harness.go:850: Found kernel warning (drivers/xen/events/events_base.c:1111 unbind_from_irqhandler+0x4e/0x60

Though according to Dusty, that doesn't happen on m5.large instances either.

@cgwalters
Member

Doing a web search turns up a few possible causes; e.g. this commit for io_uring. But I don't think anything in default FCOS uses io_uring yet (though it'd be interesting to find out).

Also this bug. Something interesting about that bug is that it mentions NVMe, which we also use on AWS but not on other cloud platforms AFAIK (e.g. GCP uses virtio).

dustymabe pushed a commit to dustymabe/coreos-assembler that referenced this issue Jun 22, 2020
m4 is old and shouldn't be the default; this came up as
part of coreos/fedora-coreos-tracker#507
@dustymabe
Member Author

It seems like this message is the one we're getting now from our CI runs (still using m4 instance type).

        harness.go:850: Found kernel warning (drivers/xen/events/events_base.c:1111 unbind_from_irqhandler+0x4e/0x60
) on machine i-08896b2b0ca58cb16 console

@masneyb

masneyb commented Jul 14, 2020

Doing a web search turns up a few possible causes; e.g. this commit for io_uring

Actually, this stack trace is related to the code in the kernel that manages BPF programs attached to cgroups: it references cgroup_bpf_release_fn() from kernel/bpf/cgroup.c. I'm still looking at what changed between 5.6.13 and 5.6.14 in fc32.
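
For reference, one way to narrow down what changed in that area between the two kernels is to query the stable tree directly. A sketch, assuming a local clone of the linux stable repository with the linux-5.6.y branch:

$ git clone --branch linux-5.6.y https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
$ cd linux
$ git log --oneline v5.6.13..v5.6.14 -- kernel/bpf/cgroup.c kernel/cgroup/ lib/percpu-refcount.c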

@masneyb

masneyb commented Jul 15, 2020

Before I try reverting some patches in 5.6.14, I need a way to reliably reproduce this issue. On my server this happens during system shutdown, but not reliably: since June 20th I have seen 5 of these errors in my system log files.

percpu ref (cgroup_bpf_release_fn) <= 0 (-1) after switching to atomic
percpu ref (cgroup_bpf_release_fn) <= 0 (-12) after switching to atomic
percpu ref (cgroup_bpf_release_fn) <= 0 (-55) after switching to atomic
percpu ref (cgroup_bpf_release_fn) <= 0 (-8) after switching to atomic
percpu ref (cgroup_bpf_release_fn) <= 0 (-10) after switching to atomic

My server has a fixed, low-volume workload running 10 containers that are started with podman by systemd. Last evening the error happened on reboot when the server had been up for only 30 minutes; I let it run overnight, rebooted it this morning, and there was no error on restart.

I'm not that familiar with this part of the kernel; however, I believe the way to trigger this is with a BPF program that attaches using any of the BPF_CGROUP_* attachment types listed here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/bpf.h#n194. Perhaps this is triggered by podman or crun polling container networking stats through BPF?

I am able to run arbitrary programs (like execsnoop) from the bcc-tools package on FCOS inside a container with these commands:

sudo podman run --privileged -v /lib/modules:/lib/modules:ro -ti fedora:32
# Inside the container
dnf install -y bcc-tools kmod xz
/usr/share/bcc/tools/execsnoop

@cgwalters
Member

@masneyb I want to emphasize - thank you so much for diving in and looking at this.

As far as BPF + cgroups goes, I think it's much more likely to be systemd doing this by default for its services. See e.g.:

$ grep -r IPAddress /usr/lib/systemd/system
/usr/lib/systemd/system/systemd-hostnamed.service:IPAddressDeny=any
/usr/lib/systemd/system/systemd-timedated.service:IPAddressDeny=any
/usr/lib/systemd/system/systemd-localed.service:IPAddressDeny=any
/usr/lib/systemd/system/systemd-journald.service:IPAddressDeny=any
/usr/lib/systemd/system/systemd-logind.service:IPAddressDeny=any
/usr/lib/systemd/system/systemd-homed.service:IPAddressDeny=any
/usr/lib/systemd/system/systemd-coredump@.service:IPAddressDeny=any
/usr/lib/systemd/system/systemd-udevd.service:IPAddressDeny=any
/usr/lib/systemd/system/systemd-userdbd.service:IPAddressDeny=any
/usr/lib/systemd/system/systemd-journald@.service:IPAddressDeny=any
/usr/lib/systemd/system/systemd-portabled.service:IPAddressDeny=any
/usr/lib/systemd/system/systemd-machined.service:IPAddressDeny=any
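
On a running host, the per-unit cgroup BPF programs that result from these settings (plus anything podman or crun might attach) can be listed with bpftool. A minimal sketch, assuming the bpftool package is installed:

# Walk the cgroup hierarchy and show attached ingress/egress BPF programs,
# including those created by the systemd units above.
$ sudo bpftool cgroup tree /sys/fs/cgroup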

@masneyb

masneyb commented Jul 17, 2020

Two days ago I tried to reproduce this on my FCOS box with the failed test case listed at https://github.com/coreos/coreos-assembler/blob/master/mantle/kola/tests/podman/podman.go#L249-L284, since that's what's failing in CI. I still couldn't reliably reproduce the crash, even after rebooting after running that failed test. (Disclaimer: I copied and pasted those commands into a bash script and didn't run the Go test suite.)

@cgwalters: I verified that systemd uses BPF_CGROUP_INET_EGRESS and BPF_CGROUP_INET_INGRESS in https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c.

If someone is able to come up with a reliable way to trigger this error, then I'll be happy to dig into this further.

@masneyb

masneyb commented Aug 12, 2020

This issue may be fixed by the upstream commit torvalds/linux@94886c8.

@dustymabe
Member Author

kernel-5.7.15-200.fc32.x86_64 just landed in testing-devel 32.20200818.20.0

jenkins run

We're still seeing the unbind_from_irqhandler+0x4e/0x60 error but we're definitely not seeing the percpu_ref_switch_to_atomic_rcu+0x12f/0x140 error (though that may have been the case for a while now as noted in #507 (comment)).

Here is the full stack trace:

[    1.105786] ------------[ cut here ]------------
[    1.106753] WARNING: CPU: 0 PID: 1 at drivers/xen/events/events_base.c:1112 unbind_from_irqhandler+0x4e/0x60
[    1.106753] Modules linked in:
[    1.106753] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.7.15-200.fc32.x86_64 #1
[    1.106753] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[    1.106753] RIP: 0010:unbind_from_irqhandler+0x4e/0x60
[    1.106753] Code: e6 89 ef e8 24 0c af ff 48 c7 c7 80 0e 97 b6 e8 38 1e 42 00 89 ef e8 51 fc ff ff 48 c7 c7 80 0e 97 b6 5d 41 5c e9 32 10 42 00 <0f> 0b 5d 41 5c c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00
[    1.106753] RSP: 0000:ffffb284c0023d60 EFLAGS: 00010246
[    1.106753] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[    1.106753] RDX: 0000000000000028 RSI: 00000000ffffffff RDI: 00000000ffffffff
[    1.106753] RBP: 00000000ffffffff R08: ffffffffb685e660 R09: 0000000000000000
[    1.106753] R10: 0000000000000000 R11: ffffffffb685e668 R12: 0000000000000000
[    1.106753] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[    1.106753] FS:  0000000000000000(0000) GS:ffff9d8747a00000(0000) knlGS:0000000000000000
[    1.106753] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.106753] CR2: ffff9d874ffff000 CR3: 00000000b580a001 CR4: 00000000001606f0
[    1.106753] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    1.106753] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    1.106753] Call Trace:
[    1.106753]  xen_uninit_lock_cpu+0x28/0x62
[    1.106753]  xen_hvm_cpu_die+0x21/0x30
[    1.106753]  takedown_cpu+0x9c/0xe0
[    1.106753]  ? trace_suspend_resume+0x60/0x60
[    1.106753]  cpuhp_invoke_callback+0x9a/0x530
[    1.106753]  _cpu_up+0x11a/0x130
[    1.106753]  cpu_up+0x7e/0xc0
[    1.106753]  bringup_nonboot_cpus+0x48/0x50
[    1.106753]  smp_init+0x26/0x79
[    1.106753]  kernel_init_freeable+0xea/0x229
[    1.106753]  ? rest_init+0xaa/0xaa
[    1.106753]  kernel_init+0xa/0x106
[    1.106753]  ret_from_fork+0x35/0x40
[    1.106753] ---[ end trace 251e144bc968609c ]---

@masneyb should we open a new bug for unbind_from_irqhandler+0x4e/0x60 or do we think it's the same issue?

@masneyb

masneyb commented Aug 19, 2020

@dustymabe : That's a separate issue in the suspend/resume path and will need a separate BZ.

It's good to hear that the cgroup_bpf_release_fn() error no longer occurs on AWS. My home server is on the stable channel running 5.7.10-201.fc32.x86_64 and I still see the cgroup_bpf_release_fn() error come through periodically; the last time was six days ago. I'm pretty busy at work for the next two weeks, but after that I'll try the new kernel there to see whether the cgroup_bpf_release_fn() BZ can be closed.

dustymabe changed the title from "aws: kola podman.base tests failing with kernel warning" to "aws: kola podman.base tests failing with kernel warning percpu_ref_switch_to_atomic_rcu+0x12f/0x140" on Aug 19, 2020
@dustymabe
Member Author

opened #606 and BZ#1870209 for the new kernel issue.

Assuming we don't see the percpu_ref_switch_to_atomic_rcu+0x12f/0x140 issue any more, I'll close this bug out.

@dustymabe
Member Author

Not seeing this any longer.

dustymabe pushed a commit to dustymabe/coreos-assembler that referenced this issue Oct 21, 2020
m4 is old and shouldn't be the default; this came up as
part of coreos/fedora-coreos-tracker#507
openshift-merge-robot pushed a commit to coreos/coreos-assembler that referenced this issue Oct 21, 2020
m4 is old and shouldn't be the default; this came up as
part of coreos/fedora-coreos-tracker#507