kola: ext.config.kdump.crash test fails on AWS aarch64 instances #1187

Closed
dustymabe opened this issue May 2, 2022 · 6 comments · Fixed by coreos/fedora-coreos-config#2845

@dustymabe
Member

This is a follow on issue to #1075. This is specific to AWS aarch64 instances.

kdump on aarch64 AWS instances (in this case c6g.xlarge) gets stuck. This is somehow related to the serial console of the machine.

When setting up kdump and using sysrq to trigger a crash we notice that the crash kernel hangs and never completes. It always gets stuck at a particular point:

[   10.506150] printk: console [ttyS0] disabled
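For anyone reproducing this, the sysrq trigger described above can be sketched roughly as follows. This is a minimal sketch, not the kola test itself: `trigger_crash` is a hypothetical helper name, and the `SYSRQ_ENABLE` / `SYSRQ_TRIGGER` overrides are assumptions added only so the function can be dry-run against scratch files. It has to run as root, and it assumes kdump is already set up and the crash kernel is loaded.

```shell
# Hypothetical helper: force a kernel crash through the sysrq interface.
# The two environment overrides are illustrative; by default the real
# procfs paths are used.
trigger_crash() {
    enable="${SYSRQ_ENABLE:-/proc/sys/kernel/sysrq}"
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    # Enable all sysrq functions.
    echo 1 > "$enable"
    # 'c' forces a crash; with kdump armed the machine kexecs into the
    # crash kernel, so nothing after this line runs on a real system.
    echo c > "$trigger"
}
```

Calling `trigger_crash` with no overrides on a real instance crashes the machine immediately, so only do that on a disposable instance.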

If I then type some characters into the serial console, the system (or the console) gets unstuck, but it looks like another kexec happens in the background. That kernel eventually bails out (though I do notice this interesting stack trace before it does):

[   79.141909] irq 14: nobody cared (try booting with the "irqpoll" option)
[   79.141916] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.17.4-300.fc36.aarch64 #1
[   79.141919] Hardware name: Amazon EC2 c6g.xlarge/, BIOS 1.0 11/1/2018
[   79.141921] Call trace:
[   79.141922]  dump_backtrace+0xfc/0x134
[   79.141927]  show_stack+0x24/0x6c
[   79.141929]  dump_stack_lvl+0x64/0x80
[   79.141933]  dump_stack+0x18/0x34
[   79.141935]  __report_bad_irq+0x54/0x16c
[   79.141938]  note_interrupt+0x30c/0x40c
[   79.141942]  handle_irq_event+0xec/0x180
[   79.141944]  handle_fasteoi_irq+0xcc/0x200
[   79.141946]  generic_handle_domain_irq+0x48/0x70
[   79.141948]  gic_handle_irq+0xc0/0x140
[   79.141950]  call_on_irq_stack+0x2c/0x38
[   79.141952]  do_interrupt_handler+0x88/0x90
[   79.141955]  el1_interrupt+0x34/0x54
[   79.141959]  el1h_64_irq_handler+0x18/0x24
[   79.141961]  el1h_64_irq+0x7c/0x80
[   79.141963]  arch_cpu_idle+0x18/0x2c
[   79.141964]  default_idle_call+0x4c/0x140
[   79.141967]  cpuidle_idle_call+0x14c/0x1a0
[   79.141970]  do_idle+0xb0/0x100
[   79.141973]  cpu_startup_entry+0x30/0x8c
[   79.141976]  rest_init+0xd0/0xe0
[   79.141977]  arch_call_rest_init+0x1c/0x28
[   79.141980]  start_kernel+0x484/0x4a0
[   79.141981]  __primary_switched+0xc0/0xc8
[   79.141985] handlers:
[   79.141986] [<00000000f4a19d33>] serial8250_interrupt
[   79.141991] Disabling IRQ #14
[   79.144145] pci 0000:00:01.0: [1d0f:8250] type 00 class 0x070003
[   79.144261] pci 0000:00:01.0: reg 0x10: [mem 0x80118000-0x80118fff]
[   79.144655] pci 0000:00:01.0: BAR 0: assigned [mem 0x80000000-0x80000fff]
[   79.145089] printk: console [ttyS0] disabled
[   79.145243] 0000:00:01.0: ttyS0 at MMIO 0x80000000 (irq = 14, base_baud = 115200) is a 16550A
[   94.741159] printk: console [ttyS0] enabled
[   94.744715] pci 0000:00:04.0: [1d0f:8061] type 00 class 0x010802
[   94.746219] pci 0000:00:04.0: reg 0x10: [mem 0x80110000-0x80113fff]
[   94.749505] pci 0000:00:04.0: PME# supported from D0 D1 D2 D3hot D3cold

and then the system seems to go through a reboot (i.e. I see grub and a full boot happens). At the end of all this, there are still no files created in /var/crash.
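A quick way to check the "no files in /var/crash" condition after the machine comes back up is sketched below. `has_crash_dumps` is a hypothetical helper name, and treating "any regular file under the directory" as evidence of a dump is an assumption made for illustration; it does not validate that the file is a usable vmcore.

```shell
# Hypothetical helper: succeed if at least one regular file exists anywhere
# under the dump directory (default /var/crash), fail otherwise.
has_crash_dumps() {
    dir="${1:-/var/crash}"
    # A missing directory counts as "no dumps".
    [ -d "$dir" ] || return 1
    # Print at most one match; an empty result means nothing was written.
    [ -n "$(find "$dir" -type f | head -n 1)" ]
}
```

For example, `has_crash_dumps || echo "kdump produced no files"` run after the reboot would report the failure described here.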

Since the system initially got hung up on a message about the console, I decided to try the test after removing console=ttyS0,115200n8 from the kernel command line. In this case the test passes, but I have no idea why.
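To make the experiment concrete, here is a small sketch of what dropping that one argument from a command line string looks like. `remove_karg` is a hypothetical helper that only edits a string; it assumes arguments are separated by whitespace and contain no quoted spaces or glob characters. On a running Fedora CoreOS system the persistent equivalent would most likely be `rpm-ostree kargs --delete=console=ttyS0,115200n8` followed by a reboot, but that is an assumption about how one would apply it, not what the test does.

```shell
# Hypothetical helper: print the given kernel command line with every exact
# occurrence of one argument removed.
remove_karg() {
    cmdline="$1"
    karg="$2"
    out=""
    # Intentionally unquoted: split the command line on whitespace.
    for word in $cmdline; do
        # Skip only exact matches, so console=tty0 survives when
        # console=ttyS0,115200n8 is removed.
        [ "$word" = "$karg" ] && continue
        out="${out:+$out }$word"
    done
    printf '%s\n' "$out"
}
```

Usage: `remove_karg "$(cat /proc/cmdline)" "console=ttyS0,115200n8"` prints the current command line without the serial console argument.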

I have opened BZ#2080468 to track this issue.

@dustymabe
Member Author

It turns out removing irqpoll from KDUMP_COMMANDLINE_APPEND= in /etc/sysconfig/kdump is a workaround for this problem (see comment#2).
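As a rough sketch of that workaround (not necessarily how the eventual fedora-coreos-config change implements it), the edit to the kdump config could look like this. `strip_irqpoll` is a hypothetical helper name; it assumes GNU `sed` (for `-i`) and that `irqpoll` appears as a standalone word in the `KDUMP_COMMANDLINE_APPEND=` line.

```shell
# Hypothetical helper: remove "irqpoll" from the KDUMP_COMMANDLINE_APPEND line
# of a kdump sysconfig file, leaving every other line untouched.
strip_irqpoll() {
    conf="${1:-/etc/sysconfig/kdump}"
    # Only touch the KDUMP_COMMANDLINE_APPEND assignment; also drop one
    # trailing space so the remaining arguments stay single-spaced.
    sed -i -E '/^KDUMP_COMMANDLINE_APPEND=/ s/irqpoll ?//' "$conf"
}
```

After editing the real file, kdump presumably needs to reload the crash kernel with the new arguments (for example with `systemctl restart kdump.service`) before the next crash is triggered.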

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue May 2, 2022
The `irqpoll` karg that gets added when the crash kernel is
started causes issues on AWS aarch64 instances. Let's workaround
by removing that from `KDUMP_COMMANDLINE_APPEND` in
`/etc/sysconfig/kdump` for now.

See coreos/fedora-coreos-tracker#1187
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue May 4, 2022
The `irqpoll` karg that gets added when the crash kernel is
started causes issues on AWS aarch64 instances. Let's workaround
by removing that from `KDUMP_COMMANDLINE_APPEND` in
`/etc/sysconfig/kdump` for now.

See coreos/fedora-coreos-tracker#1187
@dustymabe
Member Author

While we wait for the issue to be fixed upstream, we'll work around it for now with: coreos/fedora-coreos-config@4c31f58

@travier
Member

travier commented Sep 7, 2023

This should be "fixed" in kexec-tools so we should remove our workaround.

dustymabe added the "jira" label (for syncing to jira) Sep 7, 2023
@dustymabe
Member Author

> This should be "fixed" in kexec-tools so we should remove our workaround.

I just realized there are a few things that will make it hard to verify this right now. Everything other than rawhide is currently affected by #1430, which causes kdump to not work at all on newer kernels. A fix for that is already in rawhide, but an SELinux issue there keeps kdump from succeeding.

So, if you remove the workaround, you can test this in AWS with a rawhide instance with SELinux enforcement disabled (i.e. enforcing=0 on the kernel command line). It should be as simple as modifying the test here to add enforcing=0.
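Before trusting a test run, it may be worth confirming that the instance really booted with that argument. The check below is a minimal sketch under assumptions: `has_karg` is a hypothetical helper, and the optional second parameter exists only so it can be pointed at a file other than /proc/cmdline.

```shell
# Hypothetical helper: succeed if the exact argument is present on the kernel
# command line (read from /proc/cmdline unless another file is given).
has_karg() {
    karg="$1"
    file="${2:-/proc/cmdline}"
    # Intentionally unquoted: split the command line on whitespace.
    for word in $(cat "$file"); do
        [ "$word" = "$karg" ] && return 0
    done
    return 1
}
```

For example, `has_karg enforcing=0 || echo "SELinux is still enforcing per the command line"` would flag an instance where the argument did not take effect.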

@dustymabe
Member Author

> I just realized there are a few things that will make it hard to verify this right now.

The issues mentioned in my previous comment are now resolved. You should now be able to test this on rawhide with the removed workaround.

HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
The `irqpoll` karg that gets added when the crash kernel is
started causes issues on AWS aarch64 instances. Let's workaround
by removing that from `KDUMP_COMMANDLINE_APPEND` in
`/etc/sysconfig/kdump` for now.

See coreos/fedora-coreos-tracker#1187
gursewak1997 added a commit to gursewak1997/fedora-coreos-config that referenced this issue Feb 8, 2024
No longer need that workaround since the issue has been fixed
Fixes: coreos/fedora-coreos-tracker#1187
dustymabe pushed a commit to coreos/fedora-coreos-config that referenced this issue Feb 9, 2024
No longer need that workaround since the issue has been fixed
Fixes: coreos/fedora-coreos-tracker#1187
aaradhak pushed a commit to aaradhak/fedora-coreos-config that referenced this issue Mar 18, 2024
No longer need that workaround since the issue has been fixed
Fixes: coreos/fedora-coreos-tracker#1187