[kdump] Fix kdump error message when a reboot is issued #7985

Merged
merged 1 commit into from
Jul 1, 2021

rajendra-dendukuri
Contributor

[ 342.439096] kdump-tools[13655]: /etc/init.d/kdump-tools: 117: /etc/default/kdump-tools: KDUMP_CMDLINE_APPEND+= panic=10 debug hpet=disable pcie_port=compat pci=nommconf sonic_platform=x86_64-accton_as7326_56x-r0: not found

Why I did it

The below error message is seen when a reboot is issued.

[ 342.439096] kdump-tools[13655]: /etc/init.d/kdump-tools: 117: /etc/default/kdump-tools: KDUMP_CMDLINE_APPEND+= panic=10 debug hpet=disable pcie_port=compat pci=nommconf sonic_platform=x86_64-accton_as7326_56x-r0: not found

How I did it

dash doesn't support the += operator for appending to a variable's value.

How to verify it

Use KDUMP_CMDLINE_APPEND="${KDUMP_CMDLINE_APPEND} " instead
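
For reference, a minimal sketch of the portable form. The shortened values are placeholders for illustration, not the full contents of /etc/default/kdump-tools.

```shell
#!/bin/sh
# dash parses NAME+="value" as a command named 'NAME+=value', which is
# why the log above ends in "not found". Plain reassignment is portable
# across dash, bash, and other POSIX shells.
KDUMP_CMDLINE_APPEND="irqpoll nr_cpus=1"   # placeholder initial value
KDUMP_CMDLINE_APPEND="${KDUMP_CMDLINE_APPEND} panic=10 debug"
echo "$KDUMP_CMDLINE_APPEND"
# prints: irqpoll nr_cpus=1 panic=10 debug
```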

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012

Description for the changelog

Fix kdump error message when a reboot is issued

A picture of a cute animal (not mandatory but encouraged)

@@ -10,7 +10,7 @@ KDUMP_CMDLINE_APPEND="irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service a
# Disable advanced pcie features
# Disable high precision event timer as on some platforms it is interfering with the kdump operation
# Pass platform identifier string as part of crash kernel command line to be used by the reboot script during kdump
-KDUMP_CMDLINE_APPEND+=" panic=10 debug hpet=disable pcie_port=compat pci=nommconf sonic_platform=__PLATFORM__"
+KDUMP_CMDLINE_APPEND="${KDUMP_CMDLINE_APPEND} panic=10 debug hpet=disable pcie_port=compat pci=nommconf sonic_platform=__PLATFORM__"
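
One way to catch similar lines in other sourced files, sketched under the assumption that a simple pattern match is enough. find_plus_equals is a hypothetical helper, not part of kdump-tools. The checkbashisms script from Debian's devscripts package does a more thorough check.

```shell
#!/bin/sh
# Hypothetical helper: list lines in a file that start with NAME+=,
# an append form that dash does not understand.
# $1: path of the file to scan. Prints matches as "lineno:line".
find_plus_equals() {
    grep -nE '^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*\+=' "$1"
}

# Example (path is the one from the error message above):
#   find_plus_equals /etc/default/kdump-tools
```

grep exits non-zero when nothing matches, so the helper can be used directly in an if condition.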
Contributor

Several questions: panic=10 is used to reboot the crash kernel on panic, right? Does that mean the device will be rebooted if the crash kernel panics? If the crash kernel panics, will the core dump file still be generated?
If the device is rebooted, will the production kernel or the crash kernel be loaded?

Contributor Author

I will try to answer all the questions.

panic=10 is used to reboot the crash kernel on panic, right?

Yes. The panic=10 mentioned here is a command line argument of the crash kernel.

Does that mean the device will be rebooted if the crash kernel panics?

Yes, if the crash kernel crashes during boot, during vmcore collection, or during its reboot.

If the crash kernel panics, will the core dump file still be generated?

It depends on the point at which the crash kernel panicked.

If the device is rebooted, will the production kernel or the crash kernel be loaded?

If the crash kernel reboots or crashes, the production kernel will be loaded.
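
To check which panic timeout a kernel was actually booted with, something like the following could be used. This is a sketch: panic_timeout_from_cmdline is a hypothetical helper, and it assumes the arguments are separated by whitespace, as in /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical helper: print the N from the last panic=N argument in a
# kernel command line string, or print nothing if there is none.
# $1: the command line, e.g. the contents of /proc/cmdline.
panic_timeout_from_cmdline() {
    value=""
    # Unquoted on purpose: split the command line into separate arguments.
    for arg in $1; do
        case "$arg" in
            panic=*) value="${arg#panic=}" ;;
        esac
    done
    if [ -n "$value" ]; then
        echo "$value"
    fi
}

# On a running system:
#   panic_timeout_from_cmdline "$(cat /proc/cmdline)"
panic_timeout_from_cmdline "irqpoll nr_cpus=1 panic=10 debug"
# prints: 10
```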

Contributor

I checked that the change to append extra arguments to KDUMP_CMDLINE_APPEND works.

Contributor

Thanks so much for your answers, @rajendra-dendukuri!

Could you also share a link or docs that explain the meaning of panic=x?

Contributor Author

Yes, it is set in /etc/sysctl.conf on the filesystem. But since it is critical that the crash kernel always reboots on panic, we set it explicitly in kdump-tools.

Contributor Author

For the question:

If crash kernel was panicked, whether the core dump file will be generated?

I think the question should be reworded as follows:

If the crash kernel panics, will a core dump file of the crash kernel be generated?

I think the only purpose of the crash kernel (capture kernel) is to save the kernel core dump file and kernel log file from /proc/vmcore to local disk or a remote server.

From the kdump script kdump-tools, we can see that if the kernel core file /proc/vmcore was generated, the crash kernel tries to dump the kernel core file and kernel log file by invoking a function in another kdump script, kdump-config. Whether or not the dump commands succeed, the device is rebooted into the production kernel with the command reboot -f.

But if the crash kernel crashes during the dump operation, my thinking is that the kernel core file /proc/vmcore can still be generated, and the device can be rebooted into the production kernel if and only if the crash kernel is loaded again and has a chance to finish dumping the core file.

We may end up in a continuous loop trying to recover from a failed state. It is safer to reboot into the production kernel than to retry the crash kernel, which has already failed. For example, if there is an issue with hard disk access, the crash kernel may not be able to write to the device until a reboot has happened. The crash kernel is kexec'ed, so there is a chance that it may not be able to bring the system to a reliable state. Kdump is best effort.
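
The flow discussed here can be sketched roughly as below. This is not the actual kdump-tools script: capture_then_reboot and its arguments are hypothetical, and the save and reboot commands are passed in so the sketch can be run safely with echo instead of a real reboot.

```shell
#!/bin/sh
# Hypothetical sketch of the capture-kernel flow described above.
# $1: path to the vmcore file (/proc/vmcore in the crash kernel)
# $2: command that saves the dump
# $3: command that reboots the device (reboot -f in the discussion)
capture_then_reboot() {
    vmcore=$1
    save_cmd=$2
    reboot_cmd=$3
    if [ -e "$vmcore" ]; then
        # Best effort: a failed dump must not block the reboot.
        $save_cmd || echo "dump failed, rebooting anyway"
        # Reboot whether or not the dump succeeded.
        $reboot_cmd
    else
        echo "no vmcore, normal boot"
    fi
}

# Dry run with stand-in commands: 'false' simulates a failed dump.
demo_vmcore=$(mktemp)
capture_then_reboot "$demo_vmcore" false "echo reboot -f"
rm -f "$demo_vmcore"
```

A panic inside the capture kernel itself is outside this flow; that case is what the panic=10 argument covers.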

Contributor

Agreed.

Contributor

Yes it is set on /etc/sysctl.conf on the filesystem. But since it is critical that crash kernel should always reboot on panic, we set it explicitly in kdump-tools.

Currently we reuse the production kernel as the crash kernel, right?

Contributor Author

Yes. The production kernel is used as the crash kernel.

@lguohan lguohan merged commit f4b0c8f into sonic-net:master Jul 1, 2021
qiluo-msft pushed a commit that referenced this pull request Jul 7, 2021
dash doesn't support += operation to append to a variable's value. Use KDUMP_CMDLINE_APPEND="${KDUMP_CMDLINE_APPEND} " instead

The below error message is seen when a reboot is issued.

[ 342.439096] kdump-tools[13655]: /etc/init.d/kdump-tools: 117: /etc/default/kdump-tools: KDUMP_CMDLINE_APPEND+= panic=10 debug hpet=disable pcie_port=compat pci=nommconf sonic_platform=x86_64-accton_as7326_56x-r0: not found
carl-nokia pushed a commit to carl-nokia/sonic-buildimage that referenced this pull request Aug 7, 2021
praveen-li pushed a commit to praveen-li/sonic-buildimage that referenced this pull request Feb 15, 2022