This repository has been archived by the owner on Oct 11, 2023. It is now read-only.

a reboot should occur after a kernel panic #1785

Closed
wfleurant opened this issue Apr 13, 2017 · 12 comments

Comments

@wfleurant
Contributor

RancherOS Version: (ros os version)
All versions. This is hardware and kernel related.

Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.)
Baremetal

Background
Some platforms (take HP/ASUS Chromebox for example) are susceptible to pre-init kernel panics when the "Verified Boot" process (UEFI) is replaced with a "Legacy Boot" process.

Issue
RancherOS does not soft-reset after a kernel panic. The system should reboot automatically when a kernel panic occurs.

wfleurant added a commit to wfleurant/os that referenced this issue Apr 13, 2017
@SvenDowideit
Contributor

oh horrible!

I'm not totally sure that having my servers flapping due to a kernel bug is the right answer as a default though :/

@wfleurant
Contributor Author

Devices freezing on a kernel panic during early init seem to have one global fix, the one above. These Chromebox reboot tests had an unacceptable rate of failure, so this was (at the time, grr) the only fix. Yeah, the rate of failure is quite horrible.

In the meantime, I found an especially useful parameter: ros os upgrade --append panic=10

Sven, I can't stop myself from arguing about this becoming a default for the entire OS. Rebooting on any non-oops kernel panics or unfed watchdog timers is really a basic standard in the embedded Linux industry. Kubernetes will trigger a reboot after a kernel panic on oops. Regardless, we should let Rancher OS recover from this unrecoverable state. Forget about a case where infinite flapping could occur. Even if the environment is on-metal or in-datacenter the bar should still be set to prevent the need to power cycle for a hung operating system. </rant>

Maybe we could change these kernel configuration values when the kernel hands off to system-docker?

[rancher@rancher ~]$ sudo ros os upgrade --append panic=10
Upgrading to rancher/os:v1.0.0
Continue [y/N]: y
Pulling os-upgrade (rancher/os:v1.0.0)...
v1.0.0: Pulling from rancher/os
627beaf3eaaf: Pull complete 
8cd4da28feed: Pull complete 
....
[rancher@rancher ~]$ dmesg | head -n2
[    0.000000] Linux version 4.9.21-rancher (root@e8db98f7c931) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #1 SMP Mon Apr 10 02:30:28 UTC 2017
[    0.000000] Command line: BOOT_IMAGE=../vmlinuz-4.9.21-rancher printk.devkmsg=on rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait console=tty0 panic=10 initrd=../initrd-v1.0.0

Hmm.. It looks like the kernel parameter panic=10 (set by Grub) will also set panic_on_oops=1

for x in $(find /proc/sys/kernel | grep panic );  do echo $x $(cat $x); done
/proc/sys/kernel/hardlockup_panic 0
/proc/sys/kernel/hung_task_panic 0
/proc/sys/kernel/panic 5
/proc/sys/kernel/panic_on_io_nmi 0
/proc/sys/kernel/panic_on_oops 1
/proc/sys/kernel/panic_on_rcu_stall 0
/proc/sys/kernel/panic_on_unrecovered_nmi 0
/proc/sys/kernel/panic_on_warn 0
/proc/sys/kernel/softlockup_panic 0
/proc/sys/kernel/unknown_nmi_panic 0
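
The values listed above can also be changed at runtime through /proc/sys instead of the kernel command line, which is one way the "set it later in boot" idea could work. A minimal sketch, assuming a root shell: the function name set_panic_reboot and its optional second argument (an alternative sysctl root, handy for a dry run on a scratch directory) are illustrative and not part of ros. Where in the boot sequence this would be called is not settled in this thread.

```shell
# Write the reboot-on-panic settings under a sysctl root directory.
# $1: seconds to wait before rebooting after a panic (default: 10)
# $2: sysctl root (default: /proc/sys); hypothetical override for dry runs
set_panic_reboot() {
    timeout="${1:-10}"
    root="${2:-/proc/sys}"
    # kernel.panic: reboot this many seconds after a panic (0 means hang forever)
    printf '%s\n' "$timeout" > "$root/kernel/panic" || return 1
    # kernel.panic_on_oops: treat an oops as a full panic
    printf '1\n' > "$root/kernel/panic_on_oops" || return 1
}
```

Calling set_panic_reboot 10 as root would apply the settings to the running kernel. They do not survive a reboot, so this complements the panic=10 boot parameter rather than replacing it.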

@SvenDowideit
Contributor

I like the idea that we set reboot on oops after we've gotten the system to a reasonably sane place. For example, if you create a VM with not enough memory, RancherOS now panics - it used to carry on, and you'd have something that looks like it's an OK system, but with some random system service not existing, or not running - and I'd hate to flap on that.

mucho 💯 to setting auto-reboot, though I'm thinking perhaps at the point where user-docker is up?

@tohizma

tohizma commented Mar 23, 2018

I can confirm that the problem still exists in v1.3.0-rc1.

How to reproduce:

  • installed on bare metal with a Core 2 Duo 2.3 GHz and 2 GB of memory (also tried on VirtualBox with 2 GB of memory)
  • running the official php+apache container (unmodified)
  • running the official mariadb container (unmodified)
  • manual install of WordPress as a test load (not in a dedicated container, just run from the php+apache container)
  • hit one WordPress post with Paessler Webserver Stress Tool 8, simulating 1000 users with 5 clicks each and a random delay of 0-20 seconds between clicks

RancherOS dies with a kernel panic and stays stuck in that state until someone manually presses reset or CTRL+ALT+DEL.
The problem seems to come from running out of memory, with nothing the kernel can do about it.

An automatic reboot would be great for recovering without having to touch a bare-metal machine located somewhere else.

Thank you

@niusmallnan niusmallnan reopened this Mar 23, 2018
@wfleurant
Contributor Author

Please confirm the kernel parameters with dmesg | head | grep vmlinuz

I'm looking for a line similar to this:

[    0.000000] Command line: BOOT_IMAGE=../vmlinuz-4.9.40-rancher printk.devkmsg=on rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait console=tty0 panic=5 initrd=../initrd-v1.0.4
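
The same check can be scripted rather than read by eye. A minimal sketch: the function name panic_timeout_from_cmdline is made up for this post, and it assumes the command line is either passed as the first argument or readable from /proc/cmdline. It prints the panic= value, or returns non-zero when the parameter is missing.

```shell
# Print the value of the panic= parameter from a kernel command line.
# $1: command line string (default: contents of /proc/cmdline)
# Returns 1 if no panic= parameter is present.
panic_timeout_from_cmdline() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    # Rely on word splitting: kernel parameters are whitespace separated.
    for param in $cmdline; do
        case "$param" in
            panic=*)
                # Strip the "panic=" prefix and print the remainder.
                printf '%s\n' "${param#panic=}"
                return 0
                ;;
        esac
    done
    return 1
}
```

Run without arguments, it inspects the currently booted kernel. Only parameters that start with panic= match, so names such as softlockup_panic= are ignored.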

@tohizma

tohizma commented Mar 24, 2018

[ 0.000000] Command line: BOOT_IMAGE=../vmlinuz-4.15.9-rancher printk.devkmsg=on rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait console=tty0 rancher.autologin=tty1 initrd=../initrd-v1.3.0-rc1

@wfleurant
Contributor Author

Judging from d263be4, the change went into v1.3.0-rc1 (which also matches your version), hmm. I downloaded the ISO, mounted it, and looked at global.cfg:

cd /dev/shm/ && mkdir cdrom
wget https://github.com/rancher/os/releases/download/v1.3.0-rc1/rancheros.iso
sudo mount -t iso9660 -o loop rancheros.iso cdrom

The file global.cfg has the panic=10 parameter (wrapped across multiple lines for this post):

APPEND rancher.autologin=tty1 rancher.autologin=ttyS0 \
rancher.autologin=ttyS1 rancher.autologin=ttyS1 console=tty1 \
console=ttyS0 console=ttyS1 printk.devkmsg=on panic=10 

Last question: how did you install RancherOS to disk? Did you upgrade from a previous version, or was this a fresh installation?

If it was a fresh installation, we need to figure out why this specific parameter was left out. It is grouped with the other (seemingly unique) parameters in global.cfg, where rancher.autologin=ttyS1 is repeated twice. Other boot options are found in linux-current.cfg, so I guess this means it is being included via ./isolinux/isolinux.cfg.
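
To see which boot config files actually carry the parameter, the mounted ISO (or an installed disk's boot directory) can be searched. A rough sketch: the function name list_panic_configs is hypothetical, the default directory cdrom is simply the mount point used above, and it assumes the relevant files end in .cfg.

```shell
# List the .cfg files under a directory that contain a panic= boot parameter.
# $1: directory to search (default: cdrom, the ISO mount point used above)
list_panic_configs() {
    dir="${1:-cdrom}"
    # Match panic=<number> only at line start or after whitespace,
    # so names such as softlockup_panic= are not reported.
    find "$dir" -type f -name '*.cfg' \
        -exec grep -lE '(^|[[:space:]])panic=-?[0-9]+' {} +
}
```

Comparing the output for the ISO mount point against the same search on an installed system's boot directory would show which files lose the parameter.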

@tohizma

tohizma commented Mar 25, 2018

I did a fresh install: I burned the ISO to a USB disk with Rufus 2.18 (choosing ISO mode rather than dd mode),
booted the ISO, and installed to a blank /dev/sda with a cloud-config.yml containing nothing but an SSH key, a static IP, and name servers.

Then I ran 'sudo ros config syslinux' to add autologin, without touching any other parameters.

I will try another Rufus burning mode. Maybe Rufus is tampering with the kernel, because it complained that vmlinuz was missing during the first burn.

I'll confirm soon.

@tohizma

tohizma commented Mar 25, 2018

Confirming: I re-downloaded v1.3.0-rc1 and tested several times on VirtualBox.
When running live from the ISO, the panic=10 parameter is present:

Image: https://pasteboard.co/Hduq5rl.png

But when installed on the hard disk, the panic parameter is gone:

https://pasteboard.co/HduqgPN.png

(We're seeing a duplicated autologin because I chose autologin from the RancherOS boot screen.)

Here is the command line when I choose the default on the boot screen; there is still nothing about panic parameters:

[ 0.000000] Command line: BOOT_IMAGE=../vmlinuz-4.15.9-rancher printk.devkmsg=on rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait console=tty0 initrd=../initrd-v1.3.0-rc1

Maybe the parameters are not transferred from the ISO to the hard disk during installation?

@niusmallnan
Contributor

niusmallnan commented Mar 26, 2018

Fixed by fe5d2dd

@kingsd041 can you help me confirm this?

@kingsd041
Contributor

@niusmallnan
Tested with fe5d2dd
When I installed to the hard disk, I could see the panic=10 parameter.

[root@rancher ~]# ros -v
version 8a0d617 from os image rancher/os:8a0d617

[root@rancher ~]# dmesg | head | grep vmlinuz
[    0.000000] Command line: BOOT_IMAGE=../vmlinuz-4.15.9-rancher printk.devkmsg=on rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait panic=10 console=tty0  initrd=../initrd-8a0d617

@kingsd041
Contributor

Close it.
If you still have doubts, please reopen it and let me know.
