Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[unplanned-warm-reboot] Add crash-monitor to handle process crash #959

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

hzheng5
Copy link

@hzheng5 hzheng5 commented Jun 24, 2020

crash-monitor monitors the critical processes inside dockers,
and warm restart docker if the process crashes

Signed-off-by: Haiyang Zheng haiyang.z@alibaba-inc.com

- What I did

use crash-monitor to monitor critical processes inside docker, and warm-restart docker once the process crashes.

- How I did it

  1. crash-monitor subscribes to PROC_EVENT_TABLE in sysmonDB
    (TODO: there is another PR which will write the PROC_EVENT_TABLE once a process is crashed.)
  2. upon receiving proc crash event, check if process is alive or dead
    then try to warm restart the docker if the monitored process is dead if the retry limit has not been reached yet

- How to verify it
killed bgpd process inside bgp docker, and verify it's been warm-restarted without affecting traffic

/home/admin# sudo tail -F /var/log/syslog | grep 'monitor\|bgp\|fpmsync'
Jun 15 00:21:19.987084 ASW-C2-4-C11-A.NA61 INFO bgp.sh[9377]: 2019-06-15 00:21:19,984 INFO exited: bgpd (terminated by SIGKILL; not expected)
Jun 15 00:21:20.994875 ASW-C2-4-C11-A.NA61 NOTICE crash-monitor: process_event_handler PROC_STATE_ERROR|bgp:bgpd, data:{'from_state': 'RUNNING', 'timestamp': '2019 Jun 15 00:21:20.991626', 'state': 'UNEXPECTED_EXITED'}
Jun 15 00:21:20.995684 ASW-C2-4-C11-A.NA61 NOTICE crash-monitor: Monitoring critical_proc ['zebra', 'bgpd', 'fpmsyncd']
Jun 15 00:21:21.046472 ASW-C2-4-C11-A.NA61 CRIT bgp-monitor: Process bgpd is not running. Reset all peer status to down, and shutdown actor ports.
Jun 15 00:21:21.047513 ASW-C2-4-C11-A.NA61 NOTICE bgp-monitor: Current status of monitors BGP: {u'10.0.0.14': 'down', u'2000:0:0:500::2': 'down', u'2000::2': 'down', u'10.0.0.10': 'down', u'10.0.0.6': 'down', u'10.0.0.18': 'down', u'2000:0:0:200::2': 'down', u'2000:0:0:600::2': 'down', u'2000:0:0:300::2': 'down', u'2000:0:0:100::2': 'down', u'2000:0:0:700::2': 'down', u'10.0.0.26': 'down', u'10.0.0.30': 'down', u'2000:0:0:400::2': 'down', u'10.0.0.22': 'down', u'10.0.0.2': 'down'}
Jun 15 00:21:21.061910 ASW-C2-4-C11-A.NA61 CRIT crash-monitor: Process ['PROC_STATE_ERROR', 'bgp:bgpd'] is not running. auto-restart bgp
Jun 15 00:21:21.068717 ASW-C2-4-C11-A.NA61 INFO crash-monitor: Cleaned ipv4 and ipv6 eoiu marker flags
Jun 15 00:21:21.069627 ASW-C2-4-C11-A.NA61 NOTICE crash-monitor: runcmd: config warm_restart enable bgp
Jun 15 00:21:21.099738 ASW-C2-4-C11-A.NA61 CRIT bgp-monitor: Process bgpd is not running while CRASH_MONITOR is enabled, skip action.
Jun 15 00:21:21.623680 ASW-C2-4-C11-A.NA61 NOTICE crash-monitor: runcmd: docker exec -i bgp pkill -9 zebra
Jun 15 00:21:21.872214 ASW-C2-4-C11-A.na61 INFO supervisord: fpmsyncd Connection lost, reconnecting...
Jun 15 00:21:21.872214 ASW-C2-4-C11-A.na61 INFO supervisord: fpmsyncd Waiting for fpm-client connection...
Jun 15 00:21:21.875002 ASW-C2-4-C11-A.NA61 INFO bgp.sh[9377]: 2019-06-15 00:21:21,871 INFO exited: zebra (terminated by SIGKILL; not expected)
Jun 15 00:21:21.956280 ASW-C2-4-C11-A.NA61 NOTICE crash-monitor: runcmd: docker exec -i bgp pkill -9 bgpd
Jun 15 00:21:22.325249 ASW-C2-4-C11-A.NA61 ERR crash-monitor: Failed to run: docker exec -i bgp pkill -9 bgpd
Jun 15 00:21:22.327821 ASW-C2-4-C11-A.NA61 NOTICE crash-monitor: runcmd: docker exec -i database redis-cli  -n 6 -p 6379 hdel 'WARM_RESTART_TABLE|bgp' state
Jun 15 00:21:22.693761 ASW-C2-4-C11-A.NA61 NOTICE crash-monitor: runcmd: systemctl restart bgp
Jun 15 00:21:22.709346 ASW-C2-4-C11-A.NA61 NOTICE crash-monitor: process_event_handler PROC_STATE_ERROR|host:bgp.service, data:{'from_state': 'active', 'timestamp': '2019 Jun 15 00:21:22.703935', 'state': 'deactivating'}
Jun 15 00:21:22.710229 ASW-C2-4-C11-A.NA61 NOTICE crash-monitor: Monitoring critical_proc ['zebra', 'bgpd', 'fpmsyncd']
Jun 15 00:21:22.829794 ASW-C2-4-C11-A.NA61 INFO bgp.sh[9377]: 2019-06-15 00:21:22,828 WARN received SIGTERM indicating exit request
Jun 15 00:21:22.830368 ASW-C2-4-C11-A.NA61 INFO bgp.sh[9377]: 2019-06-15 00:21:22,828 INFO waiting for bgp-watchdog, fpmsyncd, bgpcfgd, rsyslogd to die
Jun 15 00:21:22.836550 ASW-C2-4-C11-A.NA61 NOTICE crash-monitor: process_event_handler PROC_STATE_ERROR|bgp:zebra, data:{'from_state': 'RUNNING', 'timestamp': '2019 Jun 15 00:21:22.832523', 'state': 'UNEXPECTED_EXITED'}
Jun 15 00:21:22.837250 ASW-C2-4-C11-A.NA61 NOTICE crash-monitor: Monitoring critical_proc ['zebra', 'bgpd', 'fpmsyncd']
Jun 15 00:21:22.853139 ASW-C2-4-C11-A.NA61 INFO bgp.sh[9377]: 2019-06-15 00:21:22,836 INFO stopped: fpmsyncd (terminated by SIGTERM)
Jun 15 00:21:22.853572 ASW-C2-4-C11-A.NA61 INFO bgp.sh[9377]: 2019-06-15 00:21:22,841 INFO stopped: rsyslogd (exit status 0)
Jun 15 00:21:22.854127 ASW-C2-4-C11-A.NA61 INFO bgp.sh[9377]: 2019-06-15 00:21:22,843 INFO stopped: bgpcfgd (terminated by SIGTERM)
Jun 15 00:21:22.854526 ASW-C2-4-C11-A.NA61 INFO bgp.sh[9377]: 2019-06-15 00:21:22,846 INFO stopped: bgp-watchdog (terminated by SIGTERM)
Jun 15 00:21:22.878935 ASW-C2-4-C11-A.NA61 WARNING crash-monitor: Process ['PROC_STATE_ERROR', 'bgp:zebra'] is not running, but auto-restart bgp already scheduled
Jun 15 00:21:23.257547 ASW-C2-4-C11-A.NA61 INFO bgp.sh[5643]: bgp
Jun 15 00:21:24.246584 ASW-C2-4-C11-A.NA61 INFO bgp.sh[5704]: Starting existing bgp container with HWSKU bfnmodel HOSTNAME ASW-C2-4-C11-A.NA61
Jun 15 00:21:24.873996 ASW-C2-4-C11-A.NA61 INFO bgp.sh[5704]: bgp
Jun 15 00:21:26.273504 ASW-C2-4-C11-A.NA61 INFO bgp.sh[5704]: Current default time zone: 'Etc/UTC'
Jun 15 00:21:26.280900 ASW-C2-4-C11-A.NA61 INFO bgp.sh[5704]: Local time is now:      Sat Jun 15 00:21:26 UTC 2019.
Jun 15 00:21:26.281642 ASW-C2-4-C11-A.NA61 INFO bgp.sh[5704]: Universal Time is now:  Sat Jun 15 00:21:26 UTC 2019.

- Previous command output (if the output of a command-line utility has changed)

- New command output (if the output of a command-line utility has changed)

@lgtm-com
Copy link

lgtm-com bot commented Jun 24, 2020

This pull request introduces 4 alerts when merging bab00b7cb27a68bddbc380f669efc4046cfb9ab3 into f99f7e1 - view on LGTM.com

new alerts:

  • 3 for Unused import
  • 1 for Implicit string concatenation in a list

@yxieca yxieca requested review from jleveque and yxieca July 1, 2020 18:09
Comment on lines 50 to 54
log_file = "/var/log/monitor-crash.log"
log = logging.getLogger('MONITOR_CRASH')
log.setLevel(logging.DEBUG)
# Create file log handler and set level to debug
fh = logging.handlers.RotatingFileHandler(log_file, maxBytes=(MAX_LOG_BYTES / 2), backupCount=1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use logger and a dedicated file which you rotate manually rather than just using the syslog library?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Joe,

thanks for help reviewing the change.

This logger is for our internal use purpose. Will change it to syslog.

Thanks,
Haiyang

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use logger and a dedicated file which you rotate manually rather than just using the syslog library?

Have updated the change by using syslog lib instead of logger.

Thanks,
Haiyang

crash-monitor monitors the critical processes inside dockers,
and warm restart docker if the process crashes

Signed-off-by: Haiyang Zheng <haiyang.z@alibaba-inc.com>
@hzheng5 hzheng5 force-pushed the unplanned-warm-reboot branch from bab00b7 to f39e5db Compare July 9, 2020 09:08
@lgtm-com
Copy link

lgtm-com bot commented Jul 9, 2020

This pull request introduces 4 alerts when merging f39e5db into 9715244 - view on LGTM.com

new alerts:

  • 3 for Unused import
  • 1 for Implicit string concatenation in a list

Copy link
Contributor

@jleveque jleveque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature seems as though it may conflict with our current container auto-restart feature. Would it be possible to review the existing solution and see if you can integrate as warm-restart option?

@hzheng5
Copy link
Author

hzheng5 commented Jul 30, 2020

This feature seems as though it may conflict with our current container auto-restart feature. Would it be possible to review the existing solution and see if you can integrate as warm-restart option?

Hi Joe,

Thanks for the info. Sure, will check the existing auto-restart feature first.

Thanks,
Haiyang

stepanblyschak pushed a commit to stepanblyschak/sonic-utilities that referenced this pull request Apr 18, 2022
These changes are included in this PR:

07e1f79 [syncd] Add workaround for warm boot new objects (sonic-net#959)
50fd353 Fix the option missing in kernel config issue (sonic-net#956)
e77503c [syncd] Comparison logic workaround for empty buffer profile (sonic-net#906) (sonic-net#941)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants