Add ore to container #1

Merged
merged 1 commit into from
May 25, 2018

jlebon
Member

@jlebon jlebon commented May 24, 2018

This builds and installs mantle's `ore` binary as well as some other
useful utilities like `jq` and `awscli`. Then we should be able to fully
migrate the cloud pipeline.
@cgwalters
Member

Sounds good to me, thanks!

@cgwalters cgwalters merged commit 2ca0cfc into coreos:master May 25, 2018
jlebon referenced this pull request in jlebon/os May 25, 2018
This should now be in the latest coreos-assembler container. See
https://github.com/cgwalters/coreos-assembler/pull/1.
@jlebon jlebon deleted the pr/add-ore branch July 6, 2020 20:31
dustymabe added a commit that referenced this pull request May 3, 2023
Recently we saw a test with many soft lockup messages like:

```
[ 4159.779792] watchdog: BUG: soft lockup - CPU#0 stuck for 883s! [kworker/u2:0:10]
[ 4159.780488] Modules linked in:
[ 4159.780787] CPU: 0 PID: 10 Comm: kworker/u2:0 Tainted: G             L    -------  ---  6.4.0-0.rc0.20230502git865fdb08197e.11.fc39.x86_64 #1
[ 4159.780787] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
[ 4159.780787] Workqueue: ftrace_check_wq ftrace_check_work_func
[ 4159.780787] RIP: 0010:get_symbol_pos+0x5d/0x140
[ 4159.780787] Code: 63 0c 8d 08 d0 6e a0 85 c9 79 0a 48 f7 d1 48 03 0d 50 c2 5a 01 48 39 cf 48 0f 42 f0 48 0f 43 d0 48 89 f0 48 29 d0 48 83 f8 01 <77> ca 48 85 d2 75 08 eb 4b 48 83 ea 01 74 45 8d 42 ff 48 98 48 63
[ 4159.780787] RSP: 0018:ffffb1970005bb80 EFLAGS: 00000202
[ 4159.780787] RAX: 0000000000000032 RBX: ffffffff9f22edb4 RCX: ffffffff9f22ed00
[ 4159.780787] RDX: 0000000000004436 RSI: 0000000000004468 RDI: ffffffff9f22edb4
[ 4159.780787] RBP: ffffb1970005bbce R08: 0000000000000000 R09: ffffb1970005bbc0
[ 4159.780787] R10: 0000000000000000 R11: 00000000000320a7 R12: 0000000000000000
[ 4159.780787] R13: 0000000000000000 R14: ffffb1970005bbc0 R15: 0000000000000000
[ 4159.780787] FS:  0000000000000000(0000) GS:ffff9fa07ec00000(0000) knlGS:0000000000000000
[ 4159.780787] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4159.780787] CR2: ffff9fa04fe01000 CR3: 000000000e022000 CR4: 0000000000350ef0
[ 4159.780787] Call Trace:
[ 4159.780787]  <TASK>
[ 4159.780787]  kallsyms_lookup_buildid+0x4d/0x130
[ 4159.780787]  test_for_valid_rec+0x64/0xb0
[ 4159.780787]  ftrace_check_work_func+0x3b/0x60
[ 4159.780787]  process_one_work+0x1c7/0x3d0
[ 4159.780787]  worker_thread+0x51/0x390
[ 4159.780787]  ? __pfx_worker_thread+0x10/0x10
[ 4159.780787]  kthread+0xf7/0x130
[ 4159.780787]  ? __pfx_kthread+0x10/0x10
[ 4159.780787]  ret_from_fork+0x2c/0x50
[ 4159.780787]  </TASK>
```

Let's try to detect and report this.
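As a rough illustration of what such detection could look like, here is a minimal Python sketch that scans console output for the soft lockup line shown above. The names `SOFT_LOCKUP_RE` and `find_soft_lockups` are hypothetical and this is not the check that the referencing commit actually added; it only assumes the message format visible in the log.

```python
import re
from typing import Dict, List

# Matches kernel console lines such as:
#   watchdog: BUG: soft lockup - CPU#0 stuck for 883s! [kworker/u2:0:10]
# (hypothetical pattern, derived only from the log excerpt above)
SOFT_LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<task>[^\]]+)\]"
)


def find_soft_lockups(console_log: str) -> List[Dict[str, object]]:
    """Return one record per soft lockup message found in the console log."""
    hits = []
    for line in console_log.splitlines():
        match = SOFT_LOCKUP_RE.search(line)
        if match:
            hits.append({
                "cpu": int(match["cpu"]),
                "seconds": int(match["secs"]),
                "task": match["task"],
            })
    return hits
```

A test harness could fail or flag a run whenever `find_soft_lockups` returns a non-empty list for the captured console output.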
jlebon added a commit to jlebon/coreos-assembler that referenced this pull request Sep 12, 2023
For some reason, if we try to SSH right after detaching the primary
block device, the OS will sometimes crash with:

```
[  100.662358] watchdog: watchdog0: watchdog did not stop!
[  100.969436] watchdog: watchdog0: watchdog did not stop!
[  100.998017] BUG: Unable to handle kernel data access at 0x5deadbeef0000100
[  100.998158] Faulting instruction address: 0xc000000000f219d4
[  100.998264] Oops: Kernel access of bad area, sig: 11 [#1]
[  100.998348] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  100.998454] Modules linked in: rfkill crct10dif_vpmsum binfmt_misc raid1 xfs zram virtio_net net_failover vmx_crypto crc32c_vpmsum pseries_wdt virtio_console failover virtio_blk scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse
[  100.998822] CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.4.15-200.fc38.ppc64le #1
[  100.998947] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1203 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
[  100.999107] NIP:  c000000000f219d4 LR: c000000000f219c8 CTR: c0000000001c60e0
[  100.999229] REGS: c0000000085a3860 TRAP: 0380   Not tainted  (6.4.15-200.fc38.ppc64le)
[  100.999352] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44442404  XER: 00000092
[  100.999511] CFAR: c000000000f2191c IRQMASK: 0
[  100.999511] GPR00: c000000000f219c8 c0000000085a3b00 c000000001eea800 c000000002c88ac8
[  100.999511] GPR04: c000000002c88ac8 0000000000000001 0000000000000001 fffffffffffe0000
[  100.999511] GPR08: 0000000000000001 0000000000000001 5deadbeef0000100 0000000000002000
[  100.999511] GPR12: 0000000000000000 c000000002ca0000 0000000000000000 0000000000000000
[  100.999511] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  100.999511] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  100.999511] GPR24: 0000000000000002 0000000000000000 c000000002c888e0 c00000000294e6e8
[  100.999511] GPR28: c000000002c88ac8 5deadbeeeffffd28 c00000001f2f1218 c0000000b8d9d000
[  101.000532] NIP [c000000000f219d4] md_notify_reboot+0x154/0x250
[  101.000643] LR [c000000000f219c8] md_notify_reboot+0x148/0x250
[  101.000748] Call Trace:
[  101.000793] [c0000000085a3b00] [c000000000f219a0] md_notify_reboot+0x120/0x250 (unreliable)
[  101.000922] [c0000000085a3b60] [c000000000199e30] notifier_call_chain+0xc0/0x1b0
[  101.001049] [c0000000085a3bc0] [c00000000019a114] blocking_notifier_call_chain+0x64/0xa0
[  101.001176] [c0000000085a3c00] [c00000000019dbb8] kernel_restart+0x38/0xe0
[  101.001282] [c0000000085a3c70] [c00000000019dfbc] __do_sys_reboot+0x12c/0x2c0
[  101.001409] [c0000000085a3dd0] [c000000000030f34] system_call_exception+0x174/0x320
[  101.001537] [c0000000085a3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
[  101.001681] --- interrupt: 3000 at 0x7fffb995aa88
[  101.001767] NIP:  00007fffb995aa88 LR: 0000000000000000 CTR: 0000000000000000
[  101.001888] REGS: c0000000085a3e80 TRAP: 3000   Not tainted  (6.4.15-200.fc38.ppc64le)
[  101.002010] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48442403  XER: 00000000
[  101.002167] IRQMASK: 0
[  101.002167] GPR00: 0000000000000058 00007fffd8caa5c0 00007fffb9a76f00 fffffffffee1dead
[  101.002167] GPR04: 0000000028121969 0000000001234567 0000000000003a5d 0000000000000020
[  101.002167] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  101.002167] GPR12: 0000000000000000 00007fffba163680 0000000000000000 0000000000000040
[  101.002167] GPR16: 0000000000000138 0000000000000001 000000010b9213c8 000000010b921a90
[  101.002167] GPR20: 0000000000000000 000000010b9212e0 0000000000000010 000000010b921318
[  101.002167] GPR24: 000000010b9212d0 00007fffd8caa6e0 00007fffd8caa6f8 00007fffd8caa6d8
[  101.002167] GPR28: 00007fffd8caada8 00007fffd8caa6e8 00007fffd8caa6c8 0000000000000000
[  101.003150] NIP [00007fffb995aa88] 0x7fffb995aa88
[  101.003234] LR [0000000000000000] 0x0
[  101.003299] --- interrupt: 3000
[  101.003363] Code: 7f84e378 48423c31 60000000 2c030000 4182000c 7fe3fb78 4bff859d 7f83e378 48465645 60000000 39000001 395d03d8 <e93d03d8> 7fbfeb78 7c2ad800 3929fc28
[  101.003604] ---[ end trace 0000000000000000 ]---
[  101.014145] pstore: backend (nvram) writing error (-1)
[  101.014255]
[  102.014301] note: systemd-shutdow[1] exited with irqs disabled
[  102.014503] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
```

The `md_notify_reboot` frames implicate the MD code in the kernel.
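To make that reading of the trace concrete, the following Python sketch pulls the faulting symbol out of the kernel `NIP [...]` line of a ppc64le oops like the one above. The helper name `faulting_symbol` and the regular expression are assumptions made for illustration, based only on the line format in this log.

```python
import re
from typing import Optional

# Matches ppc64le oops lines such as:
#   NIP [c000000000f219d4] md_notify_reboot+0x154/0x250
# The userspace line "NIP [00007fffb995aa88] 0x7fffb995aa88" does not match,
# because a symbol name must start with a letter or underscore.
NIP_RE = re.compile(
    r"NIP \[[0-9a-f]+\] (?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def faulting_symbol(oops_log: str) -> Optional[str]:
    """Return the symbol on the first symbolized NIP line, or None."""
    match = NIP_RE.search(oops_log)
    return match["symbol"] if match else None
```

For the log above this yields `md_notify_reboot`, which is the function name that points at the MD subsystem.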
jlebon added a commit to jlebon/coreos-assembler that referenced this pull request Sep 12, 2023
For some reason, if we try to reboot too quickly after detaching the
primary block device, the OS will sometimes crash with:

```
[  100.662358] watchdog: watchdog0: watchdog did not stop!
[  100.969436] watchdog: watchdog0: watchdog did not stop!
[  100.998017] BUG: Unable to handle kernel data access at 0x5deadbeef0000100
[  100.998158] Faulting instruction address: 0xc000000000f219d4
[  100.998264] Oops: Kernel access of bad area, sig: 11 [#1]
[  100.998348] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  100.998454] Modules linked in: rfkill crct10dif_vpmsum binfmt_misc raid1 xfs zram virtio_net net_failover vmx_crypto crc32c_vpmsum pseries_wdt virtio_console failover virtio_blk scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse
[  100.998822] CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.4.15-200.fc38.ppc64le #1
[  100.998947] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1203 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
[  100.999107] NIP:  c000000000f219d4 LR: c000000000f219c8 CTR: c0000000001c60e0
[  100.999229] REGS: c0000000085a3860 TRAP: 0380   Not tainted  (6.4.15-200.fc38.ppc64le)
[  100.999352] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44442404  XER: 00000092
[  100.999511] CFAR: c000000000f2191c IRQMASK: 0
[  100.999511] GPR00: c000000000f219c8 c0000000085a3b00 c000000001eea800 c000000002c88ac8
[  100.999511] GPR04: c000000002c88ac8 0000000000000001 0000000000000001 fffffffffffe0000
[  100.999511] GPR08: 0000000000000001 0000000000000001 5deadbeef0000100 0000000000002000
[  100.999511] GPR12: 0000000000000000 c000000002ca0000 0000000000000000 0000000000000000
[  100.999511] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  100.999511] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  100.999511] GPR24: 0000000000000002 0000000000000000 c000000002c888e0 c00000000294e6e8
[  100.999511] GPR28: c000000002c88ac8 5deadbeeeffffd28 c00000001f2f1218 c0000000b8d9d000
[  101.000532] NIP [c000000000f219d4] md_notify_reboot+0x154/0x250
[  101.000643] LR [c000000000f219c8] md_notify_reboot+0x148/0x250
[  101.000748] Call Trace:
[  101.000793] [c0000000085a3b00] [c000000000f219a0] md_notify_reboot+0x120/0x250 (unreliable)
[  101.000922] [c0000000085a3b60] [c000000000199e30] notifier_call_chain+0xc0/0x1b0
[  101.001049] [c0000000085a3bc0] [c00000000019a114] blocking_notifier_call_chain+0x64/0xa0
[  101.001176] [c0000000085a3c00] [c00000000019dbb8] kernel_restart+0x38/0xe0
[  101.001282] [c0000000085a3c70] [c00000000019dfbc] __do_sys_reboot+0x12c/0x2c0
[  101.001409] [c0000000085a3dd0] [c000000000030f34] system_call_exception+0x174/0x320
[  101.001537] [c0000000085a3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
[  101.001681] --- interrupt: 3000 at 0x7fffb995aa88
[  101.001767] NIP:  00007fffb995aa88 LR: 0000000000000000 CTR: 0000000000000000
[  101.001888] REGS: c0000000085a3e80 TRAP: 3000   Not tainted  (6.4.15-200.fc38.ppc64le)
[  101.002010] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48442403  XER: 00000000
[  101.002167] IRQMASK: 0
[  101.002167] GPR00: 0000000000000058 00007fffd8caa5c0 00007fffb9a76f00 fffffffffee1dead
[  101.002167] GPR04: 0000000028121969 0000000001234567 0000000000003a5d 0000000000000020
[  101.002167] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  101.002167] GPR12: 0000000000000000 00007fffba163680 0000000000000000 0000000000000040
[  101.002167] GPR16: 0000000000000138 0000000000000001 000000010b9213c8 000000010b921a90
[  101.002167] GPR20: 0000000000000000 000000010b9212e0 0000000000000010 000000010b921318
[  101.002167] GPR24: 000000010b9212d0 00007fffd8caa6e0 00007fffd8caa6f8 00007fffd8caa6d8
[  101.002167] GPR28: 00007fffd8caada8 00007fffd8caa6e8 00007fffd8caa6c8 0000000000000000
[  101.003150] NIP [00007fffb995aa88] 0x7fffb995aa88
[  101.003234] LR [0000000000000000] 0x0
[  101.003299] --- interrupt: 3000
[  101.003363] Code: 7f84e378 48423c31 60000000 2c030000 4182000c 7fe3fb78 4bff859d 7f83e378 48465645 60000000 39000001 395d03d8 <e93d03d8> 7fbfeb78 7c2ad800 3929fc28
[  101.003604] ---[ end trace 0000000000000000 ]---
[  101.014145] pstore: backend (nvram) writing error (-1)
[  101.014255]
[  102.014301] note: systemd-shutdow[1] exited with irqs disabled
[  102.014503] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
```

The `md_notify_reboot` frames implicate the MD code in the kernel.

This is likely a kernel bug that needs fixing.
jlebon added a commit to jlebon/coreos-assembler that referenced this pull request Sep 13, 2023
For some reason, if we try to reboot too quickly after detaching the
primary block device, the OS will sometimes crash with:

```
[  100.662358] watchdog: watchdog0: watchdog did not stop!
[  100.969436] watchdog: watchdog0: watchdog did not stop!
[  100.998017] BUG: Unable to handle kernel data access at 0x5deadbeef0000100
[  100.998158] Faulting instruction address: 0xc000000000f219d4
[  100.998264] Oops: Kernel access of bad area, sig: 11 [#1]
[  100.998348] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  100.998454] Modules linked in: rfkill crct10dif_vpmsum binfmt_misc raid1 xfs zram virtio_net net_failover vmx_crypto crc32c_vpmsum pseries_wdt virtio_console failover virtio_blk scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse
[  100.998822] CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.4.15-200.fc38.ppc64le #1
[  100.998947] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1203 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
[  100.999107] NIP:  c000000000f219d4 LR: c000000000f219c8 CTR: c0000000001c60e0
[  100.999229] REGS: c0000000085a3860 TRAP: 0380   Not tainted  (6.4.15-200.fc38.ppc64le)
[  100.999352] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44442404  XER: 00000092
[  100.999511] CFAR: c000000000f2191c IRQMASK: 0
[  100.999511] GPR00: c000000000f219c8 c0000000085a3b00 c000000001eea800 c000000002c88ac8
[  100.999511] GPR04: c000000002c88ac8 0000000000000001 0000000000000001 fffffffffffe0000
[  100.999511] GPR08: 0000000000000001 0000000000000001 5deadbeef0000100 0000000000002000
[  100.999511] GPR12: 0000000000000000 c000000002ca0000 0000000000000000 0000000000000000
[  100.999511] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  100.999511] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  100.999511] GPR24: 0000000000000002 0000000000000000 c000000002c888e0 c00000000294e6e8
[  100.999511] GPR28: c000000002c88ac8 5deadbeeeffffd28 c00000001f2f1218 c0000000b8d9d000
[  101.000532] NIP [c000000000f219d4] md_notify_reboot+0x154/0x250
[  101.000643] LR [c000000000f219c8] md_notify_reboot+0x148/0x250
[  101.000748] Call Trace:
[  101.000793] [c0000000085a3b00] [c000000000f219a0] md_notify_reboot+0x120/0x250 (unreliable)
[  101.000922] [c0000000085a3b60] [c000000000199e30] notifier_call_chain+0xc0/0x1b0
[  101.001049] [c0000000085a3bc0] [c00000000019a114] blocking_notifier_call_chain+0x64/0xa0
[  101.001176] [c0000000085a3c00] [c00000000019dbb8] kernel_restart+0x38/0xe0
[  101.001282] [c0000000085a3c70] [c00000000019dfbc] __do_sys_reboot+0x12c/0x2c0
[  101.001409] [c0000000085a3dd0] [c000000000030f34] system_call_exception+0x174/0x320
[  101.001537] [c0000000085a3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
[  101.001681] --- interrupt: 3000 at 0x7fffb995aa88
[  101.001767] NIP:  00007fffb995aa88 LR: 0000000000000000 CTR: 0000000000000000
[  101.001888] REGS: c0000000085a3e80 TRAP: 3000   Not tainted  (6.4.15-200.fc38.ppc64le)
[  101.002010] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48442403  XER: 00000000
[  101.002167] IRQMASK: 0
[  101.002167] GPR00: 0000000000000058 00007fffd8caa5c0 00007fffb9a76f00 fffffffffee1dead
[  101.002167] GPR04: 0000000028121969 0000000001234567 0000000000003a5d 0000000000000020
[  101.002167] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  101.002167] GPR12: 0000000000000000 00007fffba163680 0000000000000000 0000000000000040
[  101.002167] GPR16: 0000000000000138 0000000000000001 000000010b9213c8 000000010b921a90
[  101.002167] GPR20: 0000000000000000 000000010b9212e0 0000000000000010 000000010b921318
[  101.002167] GPR24: 000000010b9212d0 00007fffd8caa6e0 00007fffd8caa6f8 00007fffd8caa6d8
[  101.002167] GPR28: 00007fffd8caada8 00007fffd8caa6e8 00007fffd8caa6c8 0000000000000000
[  101.003150] NIP [00007fffb995aa88] 0x7fffb995aa88
[  101.003234] LR [0000000000000000] 0x0
[  101.003299] --- interrupt: 3000
[  101.003363] Code: 7f84e378 48423c31 60000000 2c030000 4182000c 7fe3fb78 4bff859d 7f83e378 48465645 60000000 39000001 395d03d8 <e93d03d8> 7fbfeb78 7c2ad800 3929fc28
[  101.003604] ---[ end trace 0000000000000000 ]---
[  101.014145] pstore: backend (nvram) writing error (-1)
[  101.014255]
[  102.014301] note: systemd-shutdow[1] exited with irqs disabled
[  102.014503] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
```

The `md_notify_reboot` frames implicate the MD code in the kernel.

This is likely a kernel bug that needs fixing.
jlebon added a commit to jlebon/coreos-assembler that referenced this pull request Sep 13, 2023
For some reason, if we try to reboot too quickly after detaching the
primary block device, the OS will sometimes crash with:

```
[  100.662358] watchdog: watchdog0: watchdog did not stop!
[  100.969436] watchdog: watchdog0: watchdog did not stop!
[  100.998017] BUG: Unable to handle kernel data access at 0x5deadbeef0000100
[  100.998158] Faulting instruction address: 0xc000000000f219d4
[  100.998264] Oops: Kernel access of bad area, sig: 11 [#1]
[  100.998348] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  100.998454] Modules linked in: rfkill crct10dif_vpmsum binfmt_misc raid1 xfs zram virtio_net net_failover vmx_crypto crc32c_vpmsum pseries_wdt virtio_console failover virtio_blk scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse
[  100.998822] CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.4.15-200.fc38.ppc64le #1
[  100.998947] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1203 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
[  100.999107] NIP:  c000000000f219d4 LR: c000000000f219c8 CTR: c0000000001c60e0
[  100.999229] REGS: c0000000085a3860 TRAP: 0380   Not tainted  (6.4.15-200.fc38.ppc64le)
[  100.999352] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44442404  XER: 00000092
[  100.999511] CFAR: c000000000f2191c IRQMASK: 0
[  100.999511] GPR00: c000000000f219c8 c0000000085a3b00 c000000001eea800 c000000002c88ac8
[  100.999511] GPR04: c000000002c88ac8 0000000000000001 0000000000000001 fffffffffffe0000
[  100.999511] GPR08: 0000000000000001 0000000000000001 5deadbeef0000100 0000000000002000
[  100.999511] GPR12: 0000000000000000 c000000002ca0000 0000000000000000 0000000000000000
[  100.999511] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  100.999511] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  100.999511] GPR24: 0000000000000002 0000000000000000 c000000002c888e0 c00000000294e6e8
[  100.999511] GPR28: c000000002c88ac8 5deadbeeeffffd28 c00000001f2f1218 c0000000b8d9d000
[  101.000532] NIP [c000000000f219d4] md_notify_reboot+0x154/0x250
[  101.000643] LR [c000000000f219c8] md_notify_reboot+0x148/0x250
[  101.000748] Call Trace:
[  101.000793] [c0000000085a3b00] [c000000000f219a0] md_notify_reboot+0x120/0x250 (unreliable)
[  101.000922] [c0000000085a3b60] [c000000000199e30] notifier_call_chain+0xc0/0x1b0
[  101.001049] [c0000000085a3bc0] [c00000000019a114] blocking_notifier_call_chain+0x64/0xa0
[  101.001176] [c0000000085a3c00] [c00000000019dbb8] kernel_restart+0x38/0xe0
[  101.001282] [c0000000085a3c70] [c00000000019dfbc] __do_sys_reboot+0x12c/0x2c0
[  101.001409] [c0000000085a3dd0] [c000000000030f34] system_call_exception+0x174/0x320
[  101.001537] [c0000000085a3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
[  101.001681] --- interrupt: 3000 at 0x7fffb995aa88
[  101.001767] NIP:  00007fffb995aa88 LR: 0000000000000000 CTR: 0000000000000000
[  101.001888] REGS: c0000000085a3e80 TRAP: 3000   Not tainted  (6.4.15-200.fc38.ppc64le)
[  101.002010] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48442403  XER: 00000000
[  101.002167] IRQMASK: 0
[  101.002167] GPR00: 0000000000000058 00007fffd8caa5c0 00007fffb9a76f00 fffffffffee1dead
[  101.002167] GPR04: 0000000028121969 0000000001234567 0000000000003a5d 0000000000000020
[  101.002167] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  101.002167] GPR12: 0000000000000000 00007fffba163680 0000000000000000 0000000000000040
[  101.002167] GPR16: 0000000000000138 0000000000000001 000000010b9213c8 000000010b921a90
[  101.002167] GPR20: 0000000000000000 000000010b9212e0 0000000000000010 000000010b921318
[  101.002167] GPR24: 000000010b9212d0 00007fffd8caa6e0 00007fffd8caa6f8 00007fffd8caa6d8
[  101.002167] GPR28: 00007fffd8caada8 00007fffd8caa6e8 00007fffd8caa6c8 0000000000000000
[  101.003150] NIP [00007fffb995aa88] 0x7fffb995aa88
[  101.003234] LR [0000000000000000] 0x0
[  101.003299] --- interrupt: 3000
[  101.003363] Code: 7f84e378 48423c31 60000000 2c030000 4182000c 7fe3fb78 4bff859d 7f83e378 48465645 60000000 39000001 395d03d8 <e93d03d8> 7fbfeb78 7c2ad800 3929fc28
[  101.003604] ---[ end trace 0000000000000000 ]---
[  101.014145] pstore: backend (nvram) writing error (-1)
[  101.014255]
[  102.014301] note: systemd-shutdow[1] exited with irqs disabled
[  102.014503] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
```

The `md_notify_reboot` frames implicate the MD code in the kernel.

I will file a kernel bug once I've finished gathering more information.

For now, a workaround seems to be to simply sleep for 30s before
rebooting.
jlebon added a commit to jlebon/coreos-assembler that referenced this pull request Sep 13, 2023
All our other root reprovisioning tests double the memory request on
ppc64le and aarch64 due to the larger page size. Do this for the boot
mirroring tests too.

Without this, the tests would sometimes trigger a kernel panic during
the reboot right after the primary block device detach:

```
[  100.662358] watchdog: watchdog0: watchdog did not stop!
[  100.969436] watchdog: watchdog0: watchdog did not stop!
[  100.998017] BUG: Unable to handle kernel data access at 0x5deadbeef0000100
[  100.998158] Faulting instruction address: 0xc000000000f219d4
[  100.998264] Oops: Kernel access of bad area, sig: 11 [#1]
[  100.998348] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  100.998454] Modules linked in: rfkill crct10dif_vpmsum binfmt_misc raid1 xfs zram virtio_net net_failover vmx_crypto crc32c_vpmsum pseries_wdt virtio_console failover virtio_blk scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse
[  100.998822] CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.4.15-200.fc38.ppc64le #1
[  100.998947] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1203 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
[  100.999107] NIP:  c000000000f219d4 LR: c000000000f219c8 CTR: c0000000001c60e0
[  100.999229] REGS: c0000000085a3860 TRAP: 0380   Not tainted  (6.4.15-200.fc38.ppc64le)
[  100.999352] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44442404  XER: 00000092
[  100.999511] CFAR: c000000000f2191c IRQMASK: 0
[  100.999511] GPR00: c000000000f219c8 c0000000085a3b00 c000000001eea800 c000000002c88ac8
[  100.999511] GPR04: c000000002c88ac8 0000000000000001 0000000000000001 fffffffffffe0000
[  100.999511] GPR08: 0000000000000001 0000000000000001 5deadbeef0000100 0000000000002000
[  100.999511] GPR12: 0000000000000000 c000000002ca0000 0000000000000000 0000000000000000
[  100.999511] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  100.999511] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  100.999511] GPR24: 0000000000000002 0000000000000000 c000000002c888e0 c00000000294e6e8
[  100.999511] GPR28: c000000002c88ac8 5deadbeeeffffd28 c00000001f2f1218 c0000000b8d9d000
[  101.000532] NIP [c000000000f219d4] md_notify_reboot+0x154/0x250
[  101.000643] LR [c000000000f219c8] md_notify_reboot+0x148/0x250
[  101.000748] Call Trace:
[  101.000793] [c0000000085a3b00] [c000000000f219a0] md_notify_reboot+0x120/0x250 (unreliable)
[  101.000922] [c0000000085a3b60] [c000000000199e30] notifier_call_chain+0xc0/0x1b0
[  101.001049] [c0000000085a3bc0] [c00000000019a114] blocking_notifier_call_chain+0x64/0xa0
[  101.001176] [c0000000085a3c00] [c00000000019dbb8] kernel_restart+0x38/0xe0
[  101.001282] [c0000000085a3c70] [c00000000019dfbc] __do_sys_reboot+0x12c/0x2c0
[  101.001409] [c0000000085a3dd0] [c000000000030f34] system_call_exception+0x174/0x320
[  101.001537] [c0000000085a3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
[  101.001681] --- interrupt: 3000 at 0x7fffb995aa88
[  101.001767] NIP:  00007fffb995aa88 LR: 0000000000000000 CTR: 0000000000000000
[  101.001888] REGS: c0000000085a3e80 TRAP: 3000   Not tainted  (6.4.15-200.fc38.ppc64le)
[  101.002010] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48442403  XER: 00000000
[  101.002167] IRQMASK: 0
[  101.002167] GPR00: 0000000000000058 00007fffd8caa5c0 00007fffb9a76f00 fffffffffee1dead
[  101.002167] GPR04: 0000000028121969 0000000001234567 0000000000003a5d 0000000000000020
[  101.002167] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  101.002167] GPR12: 0000000000000000 00007fffba163680 0000000000000000 0000000000000040
[  101.002167] GPR16: 0000000000000138 0000000000000001 000000010b9213c8 000000010b921a90
[  101.002167] GPR20: 0000000000000000 000000010b9212e0 0000000000000010 000000010b921318
[  101.002167] GPR24: 000000010b9212d0 00007fffd8caa6e0 00007fffd8caa6f8 00007fffd8caa6d8
[  101.002167] GPR28: 00007fffd8caada8 00007fffd8caa6e8 00007fffd8caa6c8 0000000000000000
[  101.003150] NIP [00007fffb995aa88] 0x7fffb995aa88
[  101.003234] LR [0000000000000000] 0x0
[  101.003299] --- interrupt: 3000
[  101.003363] Code: 7f84e378 48423c31 60000000 2c030000 4182000c 7fe3fb78 4bff859d 7f83e378 48465645 60000000 39000001 395d03d8 <e93d03d8> 7fbfeb78 7c2ad800 3929fc28
[  101.003604] ---[ end trace 0000000000000000 ]---
[  101.014145] pstore: backend (nvram) writing error (-1)
[  101.014255]
[  102.014301] note: systemd-shutdow[1] exited with irqs disabled
[  102.014503] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
```

Even with 8G, the panic still happens, though rarely. Rather than bumping
the memory even more, I've found that sleeping a bit before rebooting
does the trick.
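The per-architecture memory doubling described in this commit message can be sketched as follows. This is a hedged Python illustration only: the function name `memory_request_mib`, the set name `LARGE_PAGE_ARCHES`, and the base value used in the usage note are hypothetical and are not taken from the coreos-assembler test code.

```python
# Architectures for which the commit message says the memory request is
# doubled because of the larger page size (the set name is hypothetical).
LARGE_PAGE_ARCHES = frozenset({"ppc64le", "aarch64"})


def memory_request_mib(base_mib: int, arch: str) -> int:
    """Double the base memory request on large-page architectures."""
    if arch in LARGE_PAGE_ARCHES:
        return base_mib * 2
    return base_mib
```

With an assumed base of 4096 MiB, `memory_request_mib(4096, "ppc64le")` gives 8192 MiB, while the same call for `x86_64` leaves the request unchanged.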