From 15c53ce9514225076c5712a55708cc243cae422c Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Tue, 18 Feb 2020 15:55:05 -0800
Subject: [PATCH 01/57] [monitoring] Add a document to provide the details about the monitoring the running status of critical process and resource usage.

Signed-off-by: Yong Zhao
---
 .../monitoring_containers.md                  | 405 ++++++++++++++++++
 1 file changed, 405 insertions(+)
 create mode 100644 doc/monitoring_containers/monitoring_containers.md

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
new file mode 100644
index 0000000000..37858787c2
--- /dev/null
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -0,0 +1,405 @@
+# Monitoring and Auto-mitigating the Unhealthy of Containers in SONiC
+
+# High Level Design Document
+#### Rev 0.1
+
+# Table of Contents
+* [List of Tables](#list-of-tables)
+* [List of Figures](#list-of-figures)
+* [Revision](#revision)
+* [About this Manual](#about-this-manual)
+* [Scope](#scope)
+* [Defintions/Abbreviation](#definitionsabbreviation)
+* [1 Overview](#1-overview)
+    - [1.1 Use Cases](#11-use-cases)
+        - [1.1.1 A flexible "drop filter"](#111-a-flexible-"drop-filter")
+        - [1.1.2 A helpful debugging tool](#112-a-helpful-debugging-tool)
+        - [1.1.3 More sophisticated monitoring schemes](#113-more-sophisticated-monitoring-schemes)
+* [2 Requirements](#2-requirements)
+    - [2.1 Functional Requirements](#21-functional-requirements)
+    - [2.2 Configuration and Management Requirements](#22-configuration-and-management-requirements)
+    - [2.3 Scalability Requirements](#23-scalability-requirements)
+    - [2.4 Supported Debug Counters](#24-supported-debug-counters)
+* [3 Design](#3-design)
+    - [3.1 CLI (and usage example)](#31-cli-and-usage-example)
+        - [3.1.1 Displaying available counter capabilities](#311-displaying-available-counter-capabilities)
+        - [3.1.2 Displaying current counter configuration](#312-displaying-current-counter-configuration)
+        - [3.1.3 Displaying the current counts](#313-displaying-the-current-counts)
+        - [3.1.4 Clearing the counts](#314-clearing-the-counts)
+        - [3.1.5 Configuring counters from the CLI](#315-configuring-counters-from-the-CLI)
+    - [3.2 Config DB](#32-config-db)
+        - [3.2.1 DEBUG_COUNTER Table](#321-debug_counter-table)
+        - [3.2.2 PACKET_DROP_COUNTER_REASON Table](#322-packet_drop_counter_reason-table)
+    - [3.3 State DB](#33-state-db)
+        - [3.3.1 DEBUG_COUNTER_CAPABILITIES Table](#331-debug-counter-capabilities-table)
+        - [3.3.2 SAI APIs](#332-sai-apis)
+    - [3.4 Counters DB](#34-counters-db)
+    - [3.5 SWSS](#35-swss)
+        - [3.5.1 SAI APIs](#351-sai-apis)
+    - [3.6 syncd](#34-syncd)
+* [4 Flows](#4-flows)
+    - [4.1 General Flow](#41-general-flow)
+* [5 Warm Reboot Support](#5-warm-reboot-support)
+* [6 Unit Tests](#6-unit-tests)
+* [7 Platform Support](#7-platform-support)
+    - [7.1 Known Limitations](#7.1-known-limitations)
+* [8 Open Questions](#8-open-questions)
+* [9 Acknowledgements](#9-acknowledgements)
+* [10 References](#10-references)
+
+# List of Tables
+* [Table 1: Abbreviations](#definitionsabbreviation)
+
+# List of Figures
+* [Figure 1: General Flow](#41-general-flow)
+
+# Revision
+| Rev | Date     | Author                 | Change Description        |
+|:---:|:--------:|:----------------------:|---------------------------|
+| 0.1 | 02/18/20 | Yong Zhao, Joe Leveque | Initial version           |
+
+# About this Manual
+This document provides the design and implementation of monitoring and auto-mitigating
+the unhealthy of containers in SONiC.
+
+# Scope
+This document describes the high level design of the feature to monitor and auto-mitigate
+the unhealthy of containers.
+
+# Definitions/Abbreviation
+| Abbreviation | Description                  |
+|--------------|------------------------------|
+| Config DB    | SONiC Configuration Database |
+
+# 1 Overview
+SONiC is a collection of various switch applications which are held in docker containers
+such as BGP and SNMP. Each application usually includes several processes which are
+working together to provide the services for other modules. As such, the healthy of
+critical processes in each docker container are the key for the functionality of whole
+SONiC systems.
+
+The main purpose of this feature includes two parts: the first part is to monitor the
+running status of each process and critical resource usage such as CPU, memory and disk
+of each docker container. The second part is to auto-mitigate the unhealthy state of docker
+container if one of its critical process crashed or exited unexpectedly.
+
+We implemented this feature by employing the existing monit and supervisord system tools.
+* we used monit system tool to detect whether a process is running or not and whether
+  the resource usage of a docker container is beyond the pre-defined threshold.
+* we used the mechanism of event listener in supervisord to auto-restart a docker container
+  if one of its critical processes exited unexpectedly. We also added a knob to make
+  this auto-restart feature dynamically configurable.
+
+## 1.1 Use Cases
+There are a couple of potential use cases for these drop counters.
+
+### 1.1.1 A flexible "drop filter"
+One potential use case is to use the drop counters to create a filter of sorts for the standard STAT_IF_IN/OUT_DISCARDS counters. Say, for example:
+- Packets X, Y, and Z exist in our system
+- Our switches should drop X, Y, and Z when they receive them
+
+We can configure a drop counter (call it "EXPECTED_DROPS", for example) that counts X, Y, and Z. If STAT_IF_IN_DISCARDS = EXPECTED_DROPS, then we know our switch is healthy and that everything is working as intended. If the counts don't match up, then there may be a problem.
+
+### 1.1.2 A helpful debugging tool
+Another potential use case is to configure the counters on the fly in order to help debug packet loss issues. For example, if we're consistently experiencing packet loss in your system, we might try:
+- Creating a counter that tracks L2_ANY and a counter that tracks L3_ANY
+- L2_ANY is incrementing, so we delete these two counters and create MAC_COUNTER that tracks MAC-related reasons (SMAC_EQUALS_DMAC, DMAC_RESERVED, etc.), VLAN_COUNTER that tracks VLAN related reasons, (INGRESS_VLAN_FILTER, VLAN_TAG_NOT_ALLOWED), and OTHER_COUNTER that tracks everything else (EXCEEDS_L2_MTU, FDB_UC_DISCARD, etc.)
+- OTHER_COUNTER is incrementing, so we delete the previous counters and create a counter that tracks the individual reasons from OTHER_COUNTER
+- We discover that the EXCEEDS_L2_MTU counter is increasing. There might be an MTU mismatch somewhere in our system!
+
+### 1.1.3 More sophisticated monitoring schemes
+Some have suggested other deployment schemes to try to sample the specific types of packet drops that are occurring in their system. Some of these ideas include:
+- Periodically (e.g. every 30s) cycling through different sets of drop counters on a given device
+- "Striping" drop counters across different devices in the system (e.g. these 3 switches are tracking VLAN drops, these 3 switches are tracking ACL drops, etc.)
+- An automatic version of [1.1.2](#112-a-helpful-debugging-tool) that adapts the drop counter configuration based on which counters are incrementing
+
+# 2 Requirements
+
+## 2.1 Functional Requirements
+1. CONFIG_DB can be configured to create debug counters
+2. STATE_DB can be queried for debug counter capabilities
+3. Users can access drop counter information via a CLI tool
+    1. Users can see what capabilities are available to them
+        1. Types of counters (i.e. port-level and/or switch-level)
+        2. Number of counters
+        3. Supported drop reasons
+    2. Users can see what types of drops each configured counter contains
+    3. Users can add and remove drop reasons from each counter
+    4. Users can read the current value of each counter
+    5. Users can assign aliases to counters
+    6. Users can clear counters
+
+## 2.2 Configuration and Management Requirements
+Configuration of the drop counters can be done via:
+* config_db.json
+* CLI
+
+## 2.3 Scalability Requirements
+Users must be able to use all debug counters and drop reasons provided by the underlying hardware.
+
+Interacting with debug counters will not interfere with existing hardware counters (e.g. portstat). Likewise, interacting with existing hardware counters will not interfere with debug counter behavior.
+
+## 2.4 Supported Debug Counters
+* PORT_INGRESS_DROPS: port-level ingress drop counters
+* PORT_EGRESS_DROPS: port-level egress drop counters
+* SWITCH_INGRESS_DROPS: switch-level ingress drop counters
+* SWITCH_EGRESS_DROPS: switch-level egress drop counters
+
+# 3 Design
+
+## 3.1 CLI (and usage example)
+The CLI tool will provide the following functionality:
+* See available drop counter capabilities: `show dropcounters capabilities`
+* See drop counter config: `show dropcounters configuration`
+* Show drop counts: `show dropcounters counts`
+* Clear drop counters: `sonic-clear dropcounters`
+* Initialize a new drop counter: `config dropcounters install`
+* Add drop reasons to a drop counter: `config dropcounters add_reasons`
+* Remove drop reasons from a drop counter: `config dropcounters remove_reasons`
+* Delete a drop counter: `config dropcounters delete`
+
+### 3.1.1 Displaying available counter capabilities
+```
+admin@sonic:~$ show dropcounters capabilities
+Counter Type            Total
+--------------------  -------
+PORT_INGRESS_DROPS          3
+SWITCH_EGRESS_DROPS         2
+
+PORT_INGRESS_DROPS:
+    L2_ANY
+    SMAC_MULTICAST
+    SMAC_EQUALS_DMAC
+    INGRESS_VLAN_FILTER
+    EXCEEDS_L2_MTU
+    SIP_CLASS_E
+    SIP_LINK_LOCAL
+    DIP_LINK_LOCAL
+    UNRESOLVED_NEXT_HOP
+    DECAP_ERROR
+
+SWITCH_EGRESS_DROPS:
+    L2_ANY
+    L3_ANY
+    A_CUSTOM_REASON
+```
+
+### 3.1.2 Displaying current counter configuration
+```
+admin@sonic:~$ show dropcounters configuration
+Counter   Alias     Group  Type                Reasons              Description
+--------  --------  -----  ------------------  -------------------  --------------
+DEBUG_0   RX_LEGIT  LEGIT  PORT_INGRESS_DROPS  SMAC_EQUALS_DMAC     Legitimate port-level RX pipeline drops
+                                               INGRESS_VLAN_FILTER
+DEBUG_1   TX_LEGIT  None   SWITCH_EGRESS_DROPS EGRESS_VLAN_FILTER   Legitimate switch-level TX pipeline drops
+
+admin@sonic:~$ show dropcounters configuration -g LEGIT
+Counter   Alias     Group  Type                Reasons              Description
+--------  --------  -----  ------------------  -------------------  --------------
+DEBUG_0   RX_LEGIT  LEGIT  PORT_INGRESS_DROPS  SMAC_EQUALS_DMAC     Legitimate port-level RX pipeline drops
+                                               INGRESS_VLAN_FILTER
+```
+
+### 3.1.3 Displaying the current counts
+
+```
+admin@sonic:~$ show dropcounters counts
+    IFACE    STATE    RX_ERR    RX_DROPS    TX_ERR    TX_DROPS   RX_LEGIT
+---------  -------  --------  ----------  --------  ----------  ---------
+Ethernet0        U        10         100         0           0         20
+Ethernet4        U         0        1000         0           0        100
+Ethernet8        U       100          10         0           0          0
+
+DEVICE  TX_LEGIT
+------  --------
+sonic       1000
+
+admin@sonic:~$ show dropcounters counts -g LEGIT
+    IFACE    STATE    RX_ERR    RX_DROPS    TX_ERR    TX_DROPS   RX_LEGIT
+---------  -------  --------  ----------  --------  ----------  ---------
+Ethernet0        U        10         100         0           0         20
+Ethernet4        U         0        1000         0           0        100
+Ethernet8        U       100          10         0           0          0
+
+admin@sonic:~$ show dropcounters counts -t SWITCH_EGRESS_DROPS
+DEVICE  TX_LEGIT
+------  --------
+sonic       1000
+```
+
+### 3.1.4 Clearing the counts
+```
+admin@sonic:~$ sonic-clear dropcounters
+Cleared drop counters
+```
+
+### 3.1.5 Configuring counters from the CLI
+```
+admin@sonic:~$ sudo config dropcounters install DEBUG_2 PORT_INGRESS_DROPS [EXCEEDS_L2_MTU,DECAP_ERROR] -d "More port ingress drops" -g BAD -a BAD_DROPS
+admin@sonic:~$ sudo config dropcounters add_reasons DEBUG_2 [SIP_CLASS_E]
+admin@sonic:~$ sudo config dropcounters remove_reasons DEBUG_2 [SIP_CLASS_E]
+admin@sonic:~$ sudo config dropcounters delete DEBUG_2
+```
+
+## 3.2 Config DB
+Two new tables will be added to Config DB:
+* DEBUG_COUNTER to store general debug counter metadata
+* DEBUG_COUNTER_DROP_REASON to store drop reasons for debug counters that have been configured to track packet drops
+
+### 3.2.1 DEBUG_COUNTER Table
+Example:
+```
+{
+    "DEBUG_COUNTER": {
+        "DEBUG_0": {
+            "alias": "PORT_RX_LEGIT",
+            "type": "PORT_INGRESS_DROPS",
+            "desc": "Legitimate port-level RX pipeline drops",
+            "group": "LEGIT"
+        },
+        "DEBUG_1": {
+            "alias": "PORT_TX_LEGIT",
+            "type": "PORT_EGRESS_DROPS",
+            "desc": "Legitimate port-level TX pipeline drops"
+            "group": "LEGIT"
+        },
+        "DEBUG_2": {
+            "alias": "SWITCH_RX_LEGIT",
+            "type": "SWITCH_INGRESS_DROPS",
+            "desc": "Legitimate switch-level RX pipeline drops"
+            "group": "LEGIT"
+        }
+    }
+}
+```
+
+### 3.2.2 DEBUG_COUNTER_DROP_REASON Table
+Example:
+```
+{
+    "DEBUG_COUNTER_DROP_REASON": {
+        "DEBUG_0|SMAC_EQUALS_DMAC": {},
+        "DEBUG_0|INGRESS_VLAN_FILTER": {},
+        "DEBUG_1|EGRESS_VLAN_FILTER": {},
+        "DEBUG_2|TTL": {},
+    }
+}
+```
+
+## 3.3 State DB
+State DB will store information about:
+* What types of drop counters are available on this device
+* How many drop counters are available on this device
+* What drop reasons are supported by this device
+
+### 3.3.1 DEBUG_COUNTER_CAPABILITIES Table
+Example:
+```
+{
+    "DEBUG_COUNTER_CAPABILITIES": {
+        "SWITCH_INGRESS_DROPS": {
+            "count": "3",
+            "reasons": "[L2_ANY, L3_ANY, SMAC_EQUALS_DMAC]"
+        },
+        "SWITCH_EGRESS_DROPS": {
+            "count": "3",
+            "reasons": "[L2_ANY, L3_ANY]"
+        }
+    }
+}
+```
+
+This information will be populated by the orchestrator (described later) on startup.
+
+### 3.3.2 SAI APIs
+We will use the following SAI APIs to get this information:
+* `sai_query_attribute_enum_values_capability` to query support for different types of counters
+* `sai_object_type_get_availability` to query the amount of available debug counters
+
+## 3.4 Counters DB
+The contents of the drop counters will be added to Counters DB by flex counters.
+
+Additionally, we will add a mapping from debug counter names to the appropriate port or switch stat index called COUNTERS_DEBUG_NAME_PORT_STAT_MAP and COUNTERS_DEBUG_NAME_SWITCH_STAT_MAP respectively.
+
+## 3.5 SWSS
+A new orchestrator will be created to handle debug counter creation and configuration. Specifically, this orchestrator will support:
+* Creating a new counter
+* Deleting existing counters
+* Adding drop reasons to an existing counter
+* Removing a drop reason from a counter
+
+### 3.5.1 SAI APIs
+This orchestrator will interact with the following SAI Debug Counter APIs:
+* `sai_create_debug_counter_fn` to create/configure new drop counters.
+* `sai_remove_debug_counter_fn` to delete/free up drop counters that are no longer being used.
+* `sai_get_debug_counter_attribute_fn` to gather information about counters that have been configured (e.g. index, drop reasons, etc.).
+* `sai_set_debug_counter_attribute_fn` to re-configure drop reasons for counters that have already been created.
+
+## 3.6 syncd
+Flex counter will be extended to support switch-level SAI counters.
+
+# 4 Flows
+## 4.1 General Flow
+![alt text](./drop_counters_general_flow.png)
+The overall workflow is shown above in figure 1.
+
+(1) Users configure drop counters using the CLI. Configurations are stored in the DEBUG_COUNTER Config DB table.
+
+(2) The debug counts orchagent subscribes to the Config DB table. Once the configuration changes, the orchagent uses the debug SAI API to configure the drop counters.
+
+(3) The debug counts orchagent publishes counter configurations to Flex Counter DB.
+
+(4) Syncd subscribes to Flex Counter DB and sets up flex counters. Flex counters periodically query ASIC counters and publishes data to Counters DB.
+
+(5) CLI uses counters DB to satisfy CLI requests.
+
+(6) (not shown) CLI uses State DB to display hardware capabilities (e.g. how many counters are available, supported drop reasons, etc.)
+
+# 5 Warm Reboot Support
+On resource-constrained platforms, debug counters can be deleted prior to warm reboot and re-installed when orchagent starts back up. This is intended to conserve hardware resources during the warm reboot. This behavior has not been added to SONiC at this time, but can be if the need arises.
+
+# 6 Unit Tests
+This feature comes with a full set of virtual switch tests in SWSS.
+```
+=============================================================================================== test session starts ===============================================================================================
+platform linux2 -- Python 2.7.15+, pytest-3.3.0, py-1.8.0, pluggy-0.6.0 -- /usr/bin/python2
+cachedir: .cache
+rootdir: /home/daall/dev/sonic-swss/tests, inifile:
+collected 14 items
+
+test_drop_counters.py::TestDropCounters::test_deviceCapabilitiesTablePopulated remove extra link dummy
+PASSED [ 7%]
+test_drop_counters.py::TestDropCounters::test_flexCounterGroupInitialized PASSED [ 14%]
+test_drop_counters.py::TestDropCounters::test_createAndRemoveDropCounterBasic PASSED [ 21%]
+test_drop_counters.py::TestDropCounters::test_createAndRemoveDropCounterReversed PASSED [ 28%]
+test_drop_counters.py::TestDropCounters::test_createCounterWithInvalidCounterType PASSED [ 35%]
+test_drop_counters.py::TestDropCounters::test_createCounterWithInvalidDropReason PASSED [ 42%]
+test_drop_counters.py::TestDropCounters::test_addReasonToInitializedCounter PASSED [ 50%]
+test_drop_counters.py::TestDropCounters::test_removeReasonFromInitializedCounter PASSED [ 57%]
+test_drop_counters.py::TestDropCounters::test_addDropReasonMultipleTimes PASSED [ 64%]
+test_drop_counters.py::TestDropCounters::test_addInvalidDropReason PASSED [ 71%]
+test_drop_counters.py::TestDropCounters::test_removeDropReasonMultipleTimes PASSED [ 78%]
+test_drop_counters.py::TestDropCounters::test_removeNonexistentDropReason PASSED [ 85%]
+test_drop_counters.py::TestDropCounters::test_removeInvalidDropReason PASSED [ 92%]
+test_drop_counters.py::TestDropCounters::test_createAndDeleteMultipleCounters PASSED [100%]
+
+=========================================================================================== 14 passed in 113.65 seconds ===========================================================================================
+```
+
+A separate test plan will be uploaded and review by the community. This will consist of system tests written in pytest that will send traffic to the device and verify that the drop counters are updated correctly.
+
+# 7 Platform Support
+In order to make this feature platform independent, we rely on SAI query APIs (described above) to check for what counter types and drop reasons are supported on a given device. As a result, drop counters are only available on platforms that support both the SAI drop counter API as well as the query APIs, in order to preserve safety.
+
+# 7.1 Known Limitations
+* BRCM SAI:
+    - ACL_ANY, DIP_LINK_LOCAL, SIP_LINK_LOCAL, and L3_EGRESS_LINK_OWN are all based on the same underlying counter in hardware, so enabling any one of these reasons on a drop counter will (implicitly) enable all of them.
+
+# 8 Open Questions
+- How common of an operation is configuring a drop counter? Is this something that will usually only be done on startup, or something people will be updating frequently?
+
+# 9 Acknowledgements
+I'd like to thank the community for all their help designing and reviewing this new feature! Special thanks to Wenda, Ying, Prince, Guohan, Joe, Qi, Renuka, and the team at Microsoft, Madhu and the team at Aviz, Ben, Vissu, Salil, and the team at Broadcom, Itai, Matty, Liat, Marian, and the team at Mellanox, and finally Ravi, Tony, and the team at Innovium.
+
+# 10 References
+[1] [SAI Debug Counter Proposal](https://github.com/itaibaz/SAI/blob/a612dd21257cccca02cfc6dab90745a56d0993be/doc/SAI-Proposal-Debug-Counters.md)

From 689c5a7918ff6818397bd2a373b8d8e90c925cc8 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Tue, 18 Feb 2020 16:13:24 -0800
Subject: [PATCH 02/57] [Monitoring] Add an item in the section of overview.

Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index 37858787c2..63136936f2 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -87,8 +87,8 @@ We implemented this feature by employing the existing monit and supervisord syst
 * we used monit system tool to detect whether a process is running or not and whether
   the resource usage of a docker container is beyond the pre-defined threshold.
 * we used the mechanism of event listener in supervisord to auto-restart a docker container
-  if one of its critical processes exited unexpectedly. We also added a knob to make
-  this auto-restart feature dynamically configurable.
+  if one of its critical processes exited unexpectedly.
+* We also added a knob to make this auto-restart feature dynamically configurable.
 
 ## 1.1 Use Cases
 There are a couple of potential use cases for these drop counters.

From 2b31fef1389275bc663d11ab672545e5578ecb38 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Tue, 18 Feb 2020 17:02:37 -0800
Subject: [PATCH 03/57] [Moniting] Add functional requirements.

Signed-off-by: Yong Zhao
---
 .../monitoring_containers.md                  | 62 ++++++-------------
 1 file changed, 20 insertions(+), 42 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index 63136936f2..155d132fb0 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -70,13 +70,14 @@ the unhealthy of containers.
 | Abbreviation | Description                  |
 |--------------|------------------------------|
 | Config DB    | SONiC Configuration Database |
+| CLI          | Command Line Interface       |
 
 # 1 Overview
 SONiC is a collection of various switch applications which are held in docker containers
 such as BGP and SNMP. Each application usually includes several processes which are
 working together to provide the services for other modules. As such, the healthy of
-critical processes in each docker container are the key for the functionality of whole
-SONiC systems.
+critical processes in each docker container are the key for the intended functionalities of
+SONiC switch.
 
 The main purpose of this feature includes two parts: the first part is to monitor the
 running status of each process and critical resource usage such as CPU, memory and disk
@@ -88,46 +89,23 @@ We implemented this feature by employing the existing monit and supervisord syst
   the resource usage of a docker container is beyond the pre-defined threshold.
 * we used the mechanism of event listener in supervisord to auto-restart a docker container
   if one of its critical processes exited unexpectedly.
-* We also added a knob to make this auto-restart feature dynamically configurable.
-
-## 1.1 Use Cases
-There are a couple of potential use cases for these drop counters.
-
-### 1.1.1 A flexible "drop filter"
-One potential use case is to use the drop counters to create a filter of sorts for the standard STAT_IF_IN/OUT_DISCARDS counters. Say, for example:
-- Packets X, Y, and Z exist in our system
-- Our switches should drop X, Y, and Z when they receive them
-
-We can configure a drop counter (call it "EXPECTED_DROPS", for example) that counts X, Y, and Z. If STAT_IF_IN_DISCARDS = EXPECTED_DROPS, then we know our switch is healthy and that everything is working as intended. If the counts don't match up, then there may be a problem.
-
-### 1.1.2 A helpful debugging tool
-Another potential use case is to configure the counters on the fly in order to help debug packet loss issues. For example, if we're consistently experiencing packet loss in your system, we might try:
-- Creating a counter that tracks L2_ANY and a counter that tracks L3_ANY
-- L2_ANY is incrementing, so we delete these two counters and create MAC_COUNTER that tracks MAC-related reasons (SMAC_EQUALS_DMAC, DMAC_RESERVED, etc.), VLAN_COUNTER that tracks VLAN related reasons, (INGRESS_VLAN_FILTER, VLAN_TAG_NOT_ALLOWED), and OTHER_COUNTER that tracks everything else (EXCEEDS_L2_MTU, FDB_UC_DISCARD, etc.)
-- OTHER_COUNTER is incrementing, so we delete the previous counters and create a counter that tracks the individual reasons from OTHER_COUNTER
-- We discover that the EXCEEDS_L2_MTU counter is increasing. There might be an MTU mismatch somewhere in our system!
-
-### 1.1.3 More sophisticated monitoring schemes
-Some have suggested other deployment schemes to try to sample the specific types of packet drops that are occurring in their system. Some of these ideas include:
-- Periodically (e.g. every 30s) cycling through different sets of drop counters on a given device
-- "Striping" drop counters across different devices in the system (e.g. these 3 switches are tracking VLAN drops, these 3 switches are tracking ACL drops, etc.)
-- An automatic version of [1.1.2](#112-a-helpful-debugging-tool) that adapts the drop counter configuration based on which counters are incrementing
-
-# 2 Requirements
-
-## 2.1 Functional Requirements
-1. CONFIG_DB can be configured to create debug counters
-2. STATE_DB can be queried for debug counter capabilities
-3. Users can access drop counter information via a CLI tool
-    1. Users can see what capabilities are available to them
-        1. Types of counters (i.e. port-level and/or switch-level)
-        2. Number of counters
-        3. Supported drop reasons
-    2. Users can see what types of drops each configured counter contains
-    3. Users can add and remove drop reasons from each counter
-    4. Users can read the current value of each counter
-    5. Users can assign aliases to counters
-    6. Users can clear counters
+* we also added a knob to make this auto-restart feature dynamically configurable.
+
+## 1.1 Requirements
+
+### 1.1.1 Functional Requirements
+1. The monit must provide the ability to generate an alert when a critical process is not
+   running.
+2. The monit must provide the ability to generate an alert when the resource usage of
+   a docker contaier is larger than the pre-defined threshold.
+3. The event listener in supervisord must receive the signal when a critical process in
+   a docker container crashed or exited unexpectedly and then restart this docker
+   container.
+4. CONFIG_DB can be configured to enable/disable this auto-restart feature for each docker
+   container..
+5. Users can access this auto-restart information via a CLI tool
+    1. Users can see current auto-restart status for docker containers.
+    1. Users can change auto-restart status for a specific docker container.
 
 ## 2.2 Configuration and Management Requirements
 Configuration of the drop counters can be done via:

From 6a2c01a4114cf65f1794aa849b5db0517f0a43c4 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Tue, 18 Feb 2020 18:14:27 -0800
Subject: [PATCH 04/57] [Monitoring] Add section of design overview.

Signed-off-by: Yong Zhao
---
 .../monitoring_containers.md                  | 42 +++++++++----------
 1 file changed, 20 insertions(+), 22 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index 155d132fb0..7920d36804 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -1,4 +1,4 @@
-# Monitoring and Auto-mitigating the Unhealthy of Containers in SONiC
+# Monitoring and Auto-Mitigating the Unhealthy of Containers in SONiC
 
 # High Level Design Document
 #### Rev 0.1
@@ -54,17 +54,17 @@
 * [Figure 1: General Flow](#41-general-flow)
 
 # Revision
-| Rev | Date     | Author                 | Change Description        |
-|:---:|:--------:|:----------------------:|---------------------------|
-| 0.1 | 02/18/20 | Yong Zhao, Joe Leveque | Initial version           |
+| Rev | Date       | Author                 | Change Description        |
+|:---:|:----------:|:----------------------:|---------------------------|
+| 0.1 | 02/18/2020 | Yong Zhao, Joe Leveque | Initial version           |
 
 # About this Manual
 This document provides the design and implementation of monitoring and auto-mitigating
-the unhealthy of containers in SONiC.
+the unhealthy of docker containers in SONiC.
 
 # Scope
 This document describes the high level design of the feature to monitor and auto-mitigate
-the unhealthy of containers.
+the unhealthy of docker containers.
 
 # Definitions/Abbreviation
 | Abbreviation | Description                  |
@@ -72,7 +72,7 @@ the unhealthy of containers.
 | Config DB    | SONiC Configuration Database |
 | CLI          | Command Line Interface       |
 
-# 1 Overview
+# 1 Feature Overview
 SONiC is a collection of various switch applications which are held in docker containers
 such as BGP and SNMP. Each application usually includes several processes which are
 working together to provide the services for other modules. As such, the healthy of
@@ -105,27 +105,25 @@ We implemented this feature by employing the existing monit and supervisord syst
    container..
 5. Users can access this auto-restart information via a CLI tool
     1. Users can see current auto-restart status for docker containers.
-    1. Users can change auto-restart status for a specific docker container.
+    2. Users can change auto-restart status for a specific docker container.
 
-## 2.2 Configuration and Management Requirements
-Configuration of the drop counters can be done via:
-* config_db.json
+### 1.1.2 Configuration and Management Requirements
+Configuration of the auto-restart feature can be done via:
+* init_cfg.json
 * CLI
 
-## 2.3 Scalability Requirements
-Users must be able to use all debug counters and drop reasons provided by the underlying hardware.
+### 1.1.3 Scalability Requirements
 
-Interacting with debug counters will not interfere with existing hardware counters (e.g. portstat). Likewise, interacting with existing hardware counters will not interfere with debug counter behavior.
+# 2 Design
 
-## 2.4 Supported Debug Counters
-* PORT_INGRESS_DROPS: port-level ingress drop counters
-* PORT_EGRESS_DROPS: port-level egress drop counters
-* SWITCH_INGRESS_DROPS: switch-level ingress drop counters
-* SWITCH_EGRESS_DROPS: switch-level egress drop counters
+## 2.1 Basic Approach
+Monitoring the running status of critical processes and resource usage of docker containers
+are heavily depended on the monit system tool. Since monit already provided the mechanism
+to check whether a process is running or not, it will be easy to integrate this to monitor the
+critical processes in SONiC. Currently we only used monit to monitor the memory usage of each
+docker container,
 
-# 3 Design
-
-## 3.1 CLI (and usage example)
+## 2.1 CLI (and usage example)
 The CLI tool will provide the following functionality:
 * See available drop counter capabilities: `show dropcounters capabilities`
 * See drop counter config: `show dropcounters configuration`

From ac56da8c3567f6552f99a09889e362e8a055a5d5 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Tue, 18 Feb 2020 18:19:51 -0800
Subject: [PATCH 05/57] [Monitoring] add section of design overview.

Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index 7920d36804..3f9f6198f4 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -120,8 +120,9 @@ Configuration of the auto-restart feature can be done via:
 Monitoring the running status of critical processes and resource usage of docker containers
 are heavily depended on the monit system tool. Since monit already provided the mechanism
 to check whether a process is running or not, it will be easy to integrate this to monitor the
-critical processes in SONiC. Currently we only used monit to monitor the memory usage of each
-docker container,
+critical processes in SONiC. However, monit only provided the mechanism to monitor the resource
+usage per process level not container level. As such, monitoring the resource usage of docker
+container will be a challenging problem.
 
 ## 2.1 CLI (and usage example)
 The CLI tool will provide the following functionality:

From e294a9c79ec8d127258efe7f2ae224409fef97b8 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Tue, 18 Feb 2020 18:22:16 -0800
Subject: [PATCH 06/57] [Monitoring] Add section of design overview.
Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 3f9f6198f4..0ac86c6d76 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -119,8 +119,8 @@ Configuration of the auto-restart feature can be done via: ## 2.1 Basic Approach Monitoring the running status of critical processes and resource usage of docker containers are heavily depended on the monit system tool. Since monit already provided the mechanism -to check whether a process is running or not, it will be easy to integrate this to monitor the -critical processes in SONiC. However, monit only provided the mechanism to monitor the resource +to check whether a process is running or not, it will be easy to integrate this into monitoring +the critical processes in SONiC. However, monit only presented the method to monitor the resource usage per process level not container level. As such, monitoring the resource usage of docker container will be a challenging problem. From 954688270a87bab834cf2a1b42773c02650c1144 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Wed, 19 Feb 2020 09:32:55 -0800 Subject: [PATCH 07/57] [Monitoring] Add introduction for auto-restart feature in overview. Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 0ac86c6d76..ee7cd04ed7 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -81,15 +81,17 @@ SONiC switch. 
The main purpose of this feature includes two parts: the first part is to monitor the running status of each process and critical resource usage such as CPU, memory and disk -of each docker container. The second part is to auto-mitigate the unhealthy state of docker +of each docker container. The second part is to auto-mitigate the unhealthy of docker container if one of its critical process crashed or exited unexpectedly. We implemented this feature by employing the existing monit and supervisord system tools. * we used monit system tool to detect whether a process is running or not and whether the resource usage of a docker container is beyond the pre-defined threshold. -* we used the mechanism of event listener in supervisord to auto-restart a docker container +* we leveraged the mechanism of event listener in supervisord to auto-restart a docker container if one of its critical processes exited unexpectedly. * we also added a knob to make this auto-restart feature dynamically configurable. + Specifically users can run CLI to configure this feature residing in Config_DB as + enabled/disabled state. ## 1.1 Requirements From 8f157ec1045306fde9d2cbf0860c360d4f1a0fe4 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Wed, 19 Feb 2020 10:01:22 -0800 Subject: [PATCH 08/57] [Monitoring] Add the section of basic approach. Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index ee7cd04ed7..7ca94c68da 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -105,7 +105,7 @@ We implemented this feature by employing the existing monit and supervisord syst container. 4. CONFIG_DB can be configured to enable/disable this auto-restart feature for each docker container.. -5. 
Users can access this auto-restart information via a CLI tool +5. Users can access this auto-restart information via the CLI utility 1. Users can see current auto-restart status for docker containers. 2. Users can change auto-restart status for a specific docker container. @@ -121,10 +121,12 @@ Configuration of the auto-restart feature can be done via: ## 2.1 Basic Approach Monitoring the running status of critical processes and resource usage of docker containers are heavily depended on the monit system tool. Since monit already provided the mechanism -to check whether a process is running or not, it will be easy to integrate this into monitoring -the critical processes in SONiC. However, monit only presented the method to monitor the resource -usage per process level not container level. As such, monitoring the resource usage of docker -container will be a challenging problem. +to check whether a process is running or not, it will be straightforward to integrate this into monitoring +the critical processes in SONiC. However, monit only gives the method to monitor the resource +usage per process level not container level. As such, monitoring the resource usage of a docker +container will be an interesting and challenging problem. In our design, we adopted the way +that monit will check the exit code of a script which reads the resource usage of docker +containers, compares it with threshold and then return different value. ## 2.1 CLI (and usage example) The CLI tool will provide the following functionality: From 752dad0f1a51c99e8ab5e36f3c5e93f0f91fa39d Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Wed, 19 Feb 2020 10:24:29 -0800 Subject: [PATCH 09/57] [Monitoring] Add paragraph in section of basic approach. 
Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 7ca94c68da..2a8a15c41b 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -125,8 +125,15 @@ to check whether a process is running or not, it will be straightforward to inte the critical processes in SONiC. However, monit only gives the method to monitor the resource usage per process level not container level. As such, monitoring the resource usage of a docker container will be an interesting and challenging problem. In our design, we adopted the way -that monit will check the exit code of a script which reads the resource usage of docker -containers, compares it with threshold and then return different value. +that monit will check the returned value of a script which reads the resource usage of docker +container, compares it with pre-defined threshold and then exited. The value 0 signified that +the resource usage is less than threshold and non-zero means we should send an alert since +current usage is larger than threshold. + +The second part in this feature is docker containers can be automatically shut down and +restarted if one of critical processes running in the container exits unexpectedly. Restarting +the entire container ensures that configuration is reloaded and all processes in the container +get restarted, thus increasing the likelihood of entering a healthy state. ## 2.1 CLI (and usage example) The CLI tool will provide the following functionality: From 38d6cab08b96b9759c547885b88b43a190c85292 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Wed, 19 Feb 2020 10:47:53 -0800 Subject: [PATCH 10/57] [Monitoring] Add description in the section of feature overview. 
Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 2a8a15c41b..167bbcbc90 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -76,8 +76,11 @@ the unhealthy of docker containers. SONiC is a collection of various switch applications which are held in docker containers such as BGP and SNMP. Each application usually includes several processes which are working together to provide the services for other modules. As such, the healthy of -critical processes in each docker container are the key for the intended functionalities of -SONiC switch. +critical processes in each docker container are the key not only for this docker +container working correctly but also for the intended functionalities of whole SONiC switch. +On the other hand, profiling the resource usages and performance of each docker +container are also important for us to understand whether it is in healthy state +and more importantly to provide us with deep insight about networking traffic. The main purpose of this feature includes two parts: the first part is to monitor the running status of each process and critical resource usage such as CPU, memory and disk @@ -116,9 +119,9 @@ Configuration of the auto-restart feature can be done via: ### 1.1.3 Scalability Requirements -# 2 Design +## 1.2 Design -## 2.1 Basic Approach +### 1.2.1 Basic Approach Monitoring the running status of critical processes and resource usage of docker containers are heavily depended on the monit system tool. 
Since monit already provided the mechanism to check whether a process is running or not, it will be straightforward to integrate this into monitoring @@ -135,6 +138,8 @@ restarted if one of critical processes running in the container exits unexpected the entire container ensures that configuration is reloaded and all processes in the container get restarted, thus increasing the likelihood of entering a healthy state. +# 2 Functionality + ## 2.1 CLI (and usage example) The CLI tool will provide the following functionality: * See available drop counter capabilities: `show dropcounters capabilities` From df371882e0ef083acaacdaf2cde57aa4d61917a1 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Wed, 19 Feb 2020 11:12:26 -0800 Subject: [PATCH 11/57] [Monitoring] Delete some extra blank lines. Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 167bbcbc90..52032e68e7 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -79,13 +79,16 @@ working together to provide the services for other modules. As such, the healthy critical processes in each docker container are the key not only for this docker container working correctly but also for the intended functionalities of whole SONiC switch. On the other hand, profiling the resource usages and performance of each docker -container are also important for us to understand whether it is in healthy state -and more importantly to provide us with deep insight about networking traffic. +container are also important for us to understand whether this container is in healthy state +or not and to provide us with deep insight about networking traffic. 
The main purpose of this feature includes two parts: the first part is to monitor the running status of each process and critical resource usage such as CPU, memory and disk -of each docker container. The second part is to auto-mitigate the unhealthy of docker -container if one of its critical process crashed or exited unexpectedly. +of each docker container. +The second part in this feature is docker containers can be automatically shut down and +restarted if one of critical processes running in the container exits unexpectedly. Restarting +the entire container ensures that configuration is reloaded and all processes in the container +get restarted, thus increasing the likelihood of entering a healthy state. We implemented this feature by employing the existing monit and supervisord system tools. * we used monit system tool to detect whether a process is running or not and whether @@ -133,10 +136,6 @@ container, compares it with pre-defined threshold and then exited. The value 0 s the resource usage is less than threshold and non-zero means we should send an alert since current usage is larger than threshold. -The second part in this feature is docker containers can be automatically shut down and -restarted if one of critical processes running in the container exits unexpectedly. Restarting -the entire container ensures that configuration is reloaded and all processes in the container -get restarted, thus increasing the likelihood of entering a healthy state. # 2 Functionality From 6d04987572a62d0b2c8027c217fab0076d0e952c Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Wed, 19 Feb 2020 11:16:57 -0800 Subject: [PATCH 12/57] [Monitoring] Reword in the feature overview. 
Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 52032e68e7..fd05b97cf5 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -80,12 +80,12 @@ critical processes in each docker container are the key not only for this docker container working correctly but also for the intended functionalities of whole SONiC switch. On the other hand, profiling the resource usages and performance of each docker container are also important for us to understand whether this container is in healthy state -or not and to provide us with deep insight about networking traffic. +or not and furtherly to provide us with deep insight about networking traffic. The main purpose of this feature includes two parts: the first part is to monitor the running status of each process and critical resource usage such as CPU, memory and disk of each docker container. -The second part in this feature is docker containers can be automatically shut down and +The second part is docker containers can be automatically shut down and restarted if one of critical processes running in the container exits unexpectedly. Restarting the entire container ensures that configuration is reloaded and all processes in the container get restarted, thus increasing the likelihood of entering a healthy state. From 9724d9ead61670430e25254fae7dfeaf1cd2444c Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Wed, 19 Feb 2020 11:54:31 -0800 Subject: [PATCH 13/57] [Monitoring] Add a section of use cases. 
Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 32 +++++++++++++++---- 1 file changed, 25 insertions(+), 7 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index fd05b97cf5..9b67e9ada8 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -117,8 +117,8 @@ We implemented this feature by employing the existing monit and supervisord syst ### 1.1.2 Configuration and Management Requirements Configuration of the auto-restart feature can be done via: -* init_cfg.json -* CLI +1. init_cfg.json +2. CLI ### 1.1.3 Scalability Requirements @@ -132,12 +132,31 @@ the critical processes in SONiC. However, monit only gives the method to monitor usage per process level not container level. As such, monitoring the resource usage of a docker container will be an interesting and challenging problem. In our design, we adopted the way that monit will check the returned value of a script which reads the resource usage of docker -container, compares it with pre-defined threshold and then exited. The value 0 signified that -the resource usage is less than threshold and non-zero means we should send an alert since -current usage is larger than threshold. +container, compares it with pre-defined threshold and then exited. +We employed the mechanism of event listener in supervisord to achieve auto-restarting of docker +container. Currently supervisord will monitor the running status of each process in SONiC +docker containers. If one critical process exited unexpectedly, supervisord will catch such signal +and send it to event listener. Then event listener will kill the process supervisord and +the entire docker container will be shut down and restarted. # 2 Functionality +## 2.1 Target Deployment Use Cases +This feature is used to perform the following functions: +1. 
Monit will write an alert message into syslog if one of the critical processes exited unexpectedly. +2. Monit will write an alert message into syslog if the usage of memory is larger than the + pre-defined threshold for a docker container. +3. A docker container will auto-restart if one of its critical processes crashed or exited + unexpectedly. + +## 2.2 Functional Description + + +### 2.2.1 Monitoring Critical Processes +The value 0 signified that +the resource usage is less than threshold and non-zero means we should send an alert since +current usage is larger than threshold. + ## 2.1 CLI (and usage example) The CLI tool will provide the following functionality: @@ -168,8 +187,7 @@ PORT_INGRESS_DROPS: SIP_LINK_LOCAL DIP_LINK_LOCAL UNRESOLVED_NEXT_HOP - DECAP_ERROR - + DECAP_ERROR SWITCH_EGRESS_DROPS: L2_ANY L3_ANY From fe179998d60b2c7317607d8706fca416db61d1cb Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Wed, 19 Feb 2020 13:28:44 -0800 Subject: [PATCH 14/57] [Monitoring] Add section of Monitoring Critical Processes. Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 32 +++++++++++++++++-- 1 file changed, 29 insertions(+), 3 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 9b67e9ada8..54b3091507 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -91,11 +91,11 @@ the entire container ensures that configuration is reloaded and all processes in get restarted, thus increasing the likelihood of entering a healthy state. We implemented this feature by employing the existing monit and supervisord system tools. -* we used monit system tool to detect whether a process is running or not and whether +1. We used monit system tool to detect whether a process is running or not and whether the resource usage of a docker container is beyond the pre-defined threshold.
-* we leveraged the mechanism of event listener in supervisord to auto-restart a docker container +2. We leveraged the mechanism of event listener in supervisord to auto-restart a docker container if one of its critical processes exited unexpectedly. -* we also added a knob to make this auto-restart feature dynamically configurable. +3. We also added a knob to make this auto-restart feature dynamically configurable. Specifically users can run CLI to configure this feature residing in Config_DB as enabled/disabled state. @@ -153,6 +153,32 @@ This feature is used to perform the following functions: ### 2.2.1 Monitoring Critical Processes +Monit has implemented the mechanism to monitor whether a process is running or not. In detail, +monit will periodically read the configuration file trying to match the target process in +the process tree. + +Below is an example of monit configuration file to check the critical processes in lldp +container. + +*/etc/monit/conf.d/monit_lldp* +```bash +############################################################################### +# Monit configuration file for lldp container +# Process list: +# lldpd +# lldp_syncd +# lldpmgrd +############################################################################### +check process lldp_monitor matching "lldpd: " + if does not exist for 5 times within 5 cycles then alert +check process lldp_syncd matching "python2 -m lldp_syncd" + if does not exist for 5 times within 5 cycles then alert +check process lldpmgrd matching "python /usr/bin/lldpmgrd" + if does not exist for 5 times within 5 cycles then alert +``` + +### 2.2.2 Monitoring Critical Resource Usage + The value 0 signified that the resource usage is less than threshold and non-zero means we should send an alert since current usage is larger than threshold.
From 5d3bdfae9b543fb85429d14a78714090d84081ad Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Wed, 19 Feb 2020 15:07:21 -0800 Subject: [PATCH 15/57] [Moniting] Add a section about monitoring the critical process. Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 54b3091507..1510a3e977 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -154,10 +154,10 @@ This feature is used to perform the following functions: ### 2.2.1 Monitoring Critical Processes Monit has implemented the mechanism to monitor whether a process is running or not. In detail, -monit will periodically read the configuration file trying to match the target process in -the process tree. +monit will periodically read the target processes from configuration file and tries to match +those process with the processes tree in Linux kernel. -Below is an example of monit configuration file to check the critical processes in lldp +Below is an example of monit configuration file to monitor the critical processes in lldp container. */etc/monit/conf.d/monit_lldp* @@ -178,6 +178,10 @@ check process lldpmgrd matching "python /usr/bin/lldpmgrd" ``` ### 2.2.2 Monitoring Critical Resource Usage +Similar to monitoring the critical processes, we can employ monit to monitor the resource usage +such as CPU, memory and disk for each process. Unfortunately monit is unable to do the resource monitoring +in the container level. Thus we developed a new method to achieve this base on monit. 
+ The value 0 signified that the resource usage is less than threshold and non-zero means we should send an alert since From c948aa28f8eb2d946ba2190a2df1c7cd7b77955e Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Wed, 19 Feb 2020 15:47:55 -0800 Subject: [PATCH 16/57] [Monitoring] Add a section of monitoring critical resources. Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 39 +++++++++++-------- 1 file changed, 23 insertions(+), 16 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 1510a3e977..1b7f541c0a 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -90,8 +90,8 @@ restarted if one of critical processes running in the container exits unexpected the entire container ensures that configuration is reloaded and all processes in the container get restarted, thus increasing the likelihood of entering a healthy state. -We implemented this feature by employing the existing monit and supervisord system tools. -1. We used monit system tool to detect whether a process is running or not and whether +We implemented this feature by employing the existing Monit and supervisord system tools. +1. We used Monit system tool to detect whether a process is running or not and whether the resource usage of a docker container is beyond the pre-defined threshold. 2. We leveraged the mechanism of event listener in supervisord to auto-restart a docker container if one of its critical processes exited unexpectedly. @@ -102,9 +102,9 @@ We implemented this feature by employing the existing monit and supervisord syst ## 1.1 Requirements ### 1.1.1 Functional Requirements -1. The monit must provide the ability to generate an alert when a critical process is not +1. The Monit must provide the ability to generate an alert when a critical process is not running. -2. 
The monit must provide the ability to generate an alert when the resource usage of +2. The Monit must provide the ability to generate an alert when the resource usage of a docker contaier is larger than the pre-defined threshold. 3. The event listener in supervisord must receive the signal when a critical process in a docker container crashed or exited unexpectedly and then restart this docker @@ -126,12 +126,12 @@ Configuration of the auto-restart feature can be done via: ### 1.2.1 Basic Approach Monitoring the running status of critical processes and resource usage of docker containers -are heavily depended on the monit system tool. Since monit already provided the mechanism +are heavily depended on the Monit system tool. Since Monit already provided the mechanism to check whether a process is running or not, it will be straightforward to integrate this into monitoring -the critical processes in SONiC. However, monit only gives the method to monitor the resource +the critical processes in SONiC. However, Monit only gives the method to monitor the resource usage per process level not container level. As such, monitoring the resource usage of a docker container will be an interesting and challenging problem. In our design, we adopted the way -that monit will check the returned value of a script which reads the resource usage of docker +that Monit will check the returned value of a script which reads the resource usage of docker container, compares it with pre-defined threshold and then exited. We employed the mechanism of event listener in supervisord to achieve auto-restarting of docker @@ -154,10 +154,10 @@ This feature is used to perform the following functions: ### 2.2.1 Monitoring Critical Processes Monit has implemented the mechanism to monitor whether a process is running or not. 
In detail, -monit will periodically read the target processes from configuration file and tries to match +Monit will periodically read the target processes from configuration file and tries to match those process with the processes tree in Linux kernel. -Below is an example of monit configuration file to monitor the critical processes in lldp +Below is an example of Monit configuration file to monitor the critical processes in lldp container. */etc/monit/conf.d/monit_lldp* @@ -178,15 +178,22 @@ check process lldpmgrd matching "python /usr/bin/lldpmgrd" ``` ### 2.2.2 Monitoring Critical Resource Usage -Similar to monitoring the critical processes, we can employ monit to monitor the resource usage -such as CPU, memory and disk for each process. Unfortunately monit is unable to do the resource monitoring -in the container level. Thus we developed a new method to achieve this base on monit. - - -The value 0 signified that -the resource usage is less than threshold and non-zero means we should send an alert since +Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage +such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring +in the container level. Thus we developed a new method to achieve such monitoring base on Monit. +Specifically Monit will monitor a script and check its exit status. This script +will correspondingly read the resource usage of docker containers, compare it with +pre-defined threshold and then return a value. The value 0 signified that +the resource usage is less than threshold and non-zero means Monit will send an alert since current usage is larger than threshold. +Below is an example of Monit configuration file for lldp container to pass the pre-defined +threshold (bytes) to the script and check it exiting value. 
+ +```bash +check program memory_checker with path "/usr/bin/memory_checker lldp 104857600" + if status != 0 then alert +``` ## 2.1 CLI (and usage example) The CLI tool will provide the following functionality: From c5c01914fd7df1113acc62a04e58670ffe353cbc Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Thu, 20 Feb 2020 13:50:13 -0800 Subject: [PATCH 17/57] [Monitoring] Add a section of auto-restart docker container. Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 20 +++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 1b7f541c0a..b861ecfe49 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -180,7 +180,7 @@ check process lldpmgrd matching "python /usr/bin/lldpmgrd" ### 2.2.2 Monitoring Critical Resource Usage Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring -in the container level. Thus we developed a new method to achieve such monitoring base on Monit. +in the container level. Thus we propose a new design to achieve such monitoring based on Monit. Specifically Monit will monitor a script and check its exit status. This script will correspondingly read the resource usage of docker containers, compare it with pre-defined threshold and then return a value. The value 0 signified that @@ -188,13 +188,29 @@ the resource usage is less than threshold and non-zero means Monit will send an current usage is larger than threshold. Below is an example of Monit configuration file for lldp container to pass the pre-defined -threshold (bytes) to the script and check it exiting value. +threshold (bytes) to the script and check the exiting value. 
```bash check program memory_checker with path "/usr/bin/memory_checker lldp 104857600" if status != 0 then alert ``` +### 2.2.3 Auto-restart Docker Container +The design principle behind this auto-restart feature is that docker containers can be automatically shut down and +restarted if one of critical processes running in the container exits unexpectedly. Restarting +the entire container ensures that configuration is reloaded and all processes in the container +get restarted, thus increasing the likelihood of entering a healthy state. + +Currently SONiC used supervisord system tool to manage the processes in each +docker container. Actually auto-restarting docker container is based on the process +monitoring/notification framework provided by supervisord. Specifically +if the state of process changes for example from running to exited, +an event notification `PROCESS_STATE_EXITED` will be emitted by supervisord. +This event will be received event listener. After that event listener will +terminate supervisord and the container will be stopped and restarted +if the exited process is critical one. + + ## 2.1 CLI (and usage example) The CLI tool will provide the following functionality: * See available drop counter capabilities: `show dropcounters capabilities` From 402387434323f797d345475dc653940abe7cefe2 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Thu, 20 Feb 2020 14:43:30 -0800 Subject: [PATCH 18/57] [Monitoring] Correct the hyper-link.
Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 360 +++++------------- 1 file changed, 85 insertions(+), 275 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index b861ecfe49..0ac5da6388 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -10,42 +10,21 @@ * [About this Manual](#about-this-manual) * [Scope](#scope) * [Defintions/Abbreviation](#definitionsabbreviation) -* [1 Overview](#1-overview) - - [1.1 Use Cases](#11-use-cases) - - [1.1.1 A flexible "drop filter"](#111-a-flexible-"drop-filter") - - [1.1.2 A helpful debugging tool](#112-a-helpful-debugging-tool) - - [1.1.3 More sophisticated monitoring schemes](#113-more-sophisticated-monitoring-schemes) -* [2 Requirements](#2-requirements) - - [2.1 Functional Requirements](#21-functional-requirements) - - [2.2 Configuration and Management Requirements](#22-configuration-and-management-requirements) - - [2.3 Scalability Requirements](#23-scalability-requirements) - - [2.4 Supported Debug Counters](#24-supported-debug-counters) -* [3 Design](#3-design) - - [3.1 CLI (and usage example)](#31-cli-and-usage-example) - - [3.1.1 Displaying available counter capabilities](#311-displaying-available-counter-capabilities) - - [3.1.2 Displaying current counter configuration](#312-displaying-current-counter-configuration) - - [3.1.3 Displaying the current counts](#313-displaying-the-current-counts) +* [1 Feature Overview](#1-feature-overview) + - [1.1 Requirements](#11-requirements) + - [1.1.1 Functional Requirements](#111-functional-requirements) + - [1.1.2 Configuration and Management Requirements](#112-configuration-and-management-requirements) + - [1.1.3 Scalability Requirements](#113-scalability-requirements) + - [1.2 Design](#12-design) + - [1.2.1 Basic Approach](#121-basic-approach) +* [2 Functionality](#2-functionality) + - [2.1 Target Deployment Use 
Cases](#21-target-deployment-use-cases) + - [2.2 Functional Description](#22-functional-description) + - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes) + - [2.2.2 Monitoring Critical Resource Usage](#222-monitoring-critical-resource-usage) + - [2.2.3 Auto-restart Docker Container](#223-autorestart-docker-container) - [3.1.4 Clearing the counts](#314-clearing-the-counts) - [3.1.5 Configuring counters from the CLI](#315-configuring-counters-from-the-CLI) - - [3.2 Config DB](#32-config-db) - - [3.2.1 DEBUG_COUNTER Table](#321-debug_counter-table) - - [3.2.2 PACKET_DROP_COUNTER_REASON Table](#322-packet_drop_counter_reason-table) - - [3.3 State DB](#33-state-db) - - [3.3.1 DEBUG_COUNTER_CAPABILITIES Table](#331-debug-counter-capabilities-table) - - [3.3.2 SAI APIs](#332-sai-apis) - - [3.4 Counters DB](#34-counters-db) - - [3.5 SWSS](#35-swss) - - [3.5.1 SAI APIs](#351-sai-apis) - - [3.6 syncd](#34-syncd) -* [4 Flows](#4-flows) - - [4.1 General Flow](#41-general-flow) -* [5 Warm Reboot Support](#5-warm-reboot-support) -* [6 Unit Tests](#6-unit-tests) -* [7 Platform Support](#7-platform-support) - - [7.1 Known Limitations](#7.1-known-limitations) -* [8 Open Questions](#8-open-questions) -* [9 Acknowledgements](#9-acknowledgements) -* [10 References](#10-references) # List of Tables * [Table 1: Abbreviations](#definitionsabbreviation) @@ -121,6 +100,7 @@ Configuration of the auto-restart feature can be done via: 2. CLI ### 1.1.3 Scalability Requirements +`Place holder` ## 1.2 Design @@ -196,273 +176,103 @@ check program memory_checker with path "/usr/bin/memory_checker lldp 104857600" ``` ### 2.2.3 Auto-restart Docker Container -The design principle behind this auto-restart feature is that docker containers can be automatically shut down and +The design principle behind this auto-restart feature is docker containers can be automatically shut down and restarted if one of critical processes running in the container exits unexpectedly. 
Restarting the entire container ensures that configuration is reloaded and all processes in the container get restarted, thus increasing the likelihood of entering a healthy state. Currently SONiC used superviord system tool to manage the processes in each docker container. Actually auto-restarting docker container is based on the process -monitoring/notification framework provided by supervisord. Specifically +monitoring/notification framework. Specifically if the state of process changes for example from running to exited, an event notification `PROCESS_STATE_STOPPED` will be emitted by supervisord. -This event will be received event listener. After that event listener will -terminate supervisord and the container will be stopped and restarted -if the exited process is critical one. +This event will be received by event listener. If the exited process is critical +one, then the event listener will terminate supervisord and the container will be stopped +and restarted. +We also introduced a knob which can enable or disable this auto-restart feature +dynamically according to the requirement of users. In detail, we created a table +named `CONTAINER_FEATURE` in Config_DB and this table includes the status of +auto-restart feature for each docker container. Users can easily use CLI to +see and configure the corresponding status. 
-## 2.1 CLI (and usage example) -The CLI tool will provide the following functionality: -* See available drop counter capabilities: `show dropcounters capabilities` -* See drop counter config: `show dropcounters configuration` -* Show drop counts: `show dropcounters counts` -* Clear drop counters: `sonic-clear dropcounters` -* Initialize a new drop counter: `config dropcounters install` -* Add drop reasons to a drop counter: `config dropcounters add_reasons` -* Remove drop reasons from a drop counter: `config dropcounters remove_reasons` -* Delete a drop counter: `config dropcounters delete` - -### 3.1.1 Displaying available counter capabilities -``` -admin@sonic:~$ show dropcounters capabilities -Counter Type Total --------------------- ------- -PORT_INGRESS_DROPS 3 -SWITCH_EGRESS_DROPS 2 - -PORT_INGRESS_DROPS: - L2_ANY - SMAC_MULTICAST - SMAC_EQUALS_DMAC - INGRESS_VLAN_FILTER - EXCEEDS_L2_MTU - SIP_CLASS_E - SIP_LINK_LOCAL - DIP_LINK_LOCAL - UNRESOLVED_NEXT_HOP - DECAP_ERROR -SWITCH_EGRESS_DROPS: - L2_ANY - L3_ANY - A_CUSTOM_REASON -``` - -### 3.1.2 Displaying current counter configuration -``` -admin@sonic:~$ show dropcounters configuration -Counter Alias Group Type Reasons Description --------- -------- ----- ------------------ ------------------- -------------- -DEBUG_0 RX_LEGIT LEGIT PORT_INGRESS_DROPS SMAC_EQUALS_DMAC Legitimate port-level RX pipeline drops - INGRESS_VLAN_FILTER -DEBUG_1 TX_LEGIT None SWITCH_EGRESS_DROPS EGRESS_VLAN_FILTER Legitimate switch-level TX pipeline drops - -admin@sonic:~$ show dropcounters configuration -g LEGIT -Counter Alias Group Type Reasons Description --------- -------- ----- ------------------ ------------------- -------------- -DEBUG_0 RX_LEGIT LEGIT PORT_INGRESS_DROPS SMAC_EQUALS_DMAC Legitimate port-level RX pipeline drops - INGRESS_VLAN_FILTER -``` -### 3.1.3 Displaying the current counts - -``` -admin@sonic:~$ show dropcounters counts - IFACE STATE RX_ERR RX_DROPS TX_ERR TX_DROPS RX_LEGIT ---------- ------- -------- 
---------- -------- ---------- --------- -Ethernet0 U 10 100 0 0 20 -Ethernet4 U 0 1000 0 0 100 -Ethernet8 U 100 10 0 0 0 - -DEVICE TX_LEGIT ------- -------- -sonic 1000 - -admin@sonic:~$ show dropcounters counts -g LEGIT - IFACE STATE RX_ERR RX_DROPS TX_ERR TX_DROPS RX_LEGIT ---------- ------- -------- ---------- -------- ---------- --------- -Ethernet0 U 10 100 0 0 20 -Ethernet4 U 0 1000 0 0 100 -Ethernet8 U 100 10 0 0 0 - -admin@sonic:~$ show dropcounters counts -t SWITCH_EGRESS_DROPS -DEVICE TX_LEGIT ------- -------- -sonic 1000 -``` +#### 2.2.3.1 CLI (and usage example) +The CLI tool will provide the following functionality: +1. Show current status of auto-restart feature for docker containers. +2. Configure the status of a specific docker container. -### 3.1.4 Clearing the counts +##### 2.2.3.1.1 Show the status of auto-restart ``` -admin@sonic:~$ sonic-clear dropcounters -Cleared drop counters +admin@sonic:~$ show container feature autorestart +Container Name Status +-------------------- -------- +database disabled +lldp disabled +radv disabled +pmon disabled +sflow enabled +snmp enabled +telemetry enabled +bgp disabled +dhcp_relay disabled +rest-api enabled +teamd disabled +syncd enabled +swss disabled ``` -### 3.1.5 Configuring counters from the CLI +##### 2.2.3.1.2 Configure the status of auto-restart ``` -admin@sonic:~$ sudo config dropcounters install DEBUG_2 PORT_INGRESS_DROPS [EXCEEDS_L2_MTU,DECAP_ERROR] -d "More port ingress drops" -g BAD -a BAD_DROPS -admin@sonic:~$ sudo config dropcounters add_reasons DEBUG_2 [SIP_CLASS_E] -admin@sonic:~$ sudo config dropcounters remove_reasons DEBUG_2 [SIP_CLASS_E] -admin@sonic:~$ sudo config dropcounters delete DEBUG_2 +admin@sonic:~$ sudo config container feature autorestart database enabled ``` -## 3.2 Config DB -Two new tables will be added to Config DB: -* DEBUG_COUNTER to store general debug counter metadata -* DEBUG_COUNTER_DROP_REASON to store drop reasons for debug counters that have been configured to 
track packet drops ### 3.2.1 DEBUG_COUNTER Table Example: ``` { - "DEBUG_COUNTER": { - "DEBUG_0": { - "alias": "PORT_RX_LEGIT", - "type": "PORT_INGRESS_DROPS", - "desc": "Legitimate port-level RX pipeline drops", - "group": "LEGIT" + "CONTAINER_FEATURE": { + "database": { + "auto_restart": "enabled", }, - "DEBUG_1": { - "alias": "PORT_TX_LEGIT", - "type": "PORT_EGRESS_DROPS", - "desc": "Legitimate port-level TX pipeline drops" - "group": "LEGIT" + "lldp": { + "auto_restart": "disabled", }, - "DEBUG_2": { - "alias": "SWITCH_RX_LEGIT", - "type": "SWITCH_INGRESS_DROPS", - "desc": "Legitimate switch-level RX pipeline drops" - "group": "LEGIT" - } - } -} -``` - -### 3.2.2 DEBUG_COUNTER_DROP_REASON Table -Example: -``` -{ - "DEBUG_COUNTER_DROP_REASON": { - "DEBUG_0|SMAC_EQUALS_DMAC": {}, - "DEBUG_0|INGRESS_VLAN_FILTER": {}, - "DEBUG_1|EGRESS_VLAN_FILTER": {}, - "DEBUG_2|TTL": {}, - } -} -``` - -## 3.3 State DB -State DB will store information about: -* What types of drop counters are available on this device -* How many drop counters are available on this device -* What drop reasons are supported by this device - -### 3.3.1 DEBUG_COUNTER_CAPABILITIES Table -Example: -``` -{ - "DEBUG_COUNTER_CAPABILITIES": { - "SWITCH_INGRESS_DROPS": { - "count": "3", - "reasons": "[L2_ANY, L3_ANY, SMAC_EQUALS_DMAC]" + "radv": { + "auto_restart": "disabled", + }, + "pmon": { + "auto_restart": "disabled", + }, + "sflow": { + "auto_restart": "enabled", + }, + "snmp": { + "auto_restart": "enabled", + }, + "telemetry": { + "auto_restart": "enabled", + }, + "bgp": { + "auto_restart": "disabled", }, - "SWITCH_EGRESS_DROPS": { - "count": "3", - "reasons": "[L2_ANY, L3_ANY]" - } + "dhcp_relay": { + "auto_restart": "disabled", + }, + "rest-api": { + "auto_restart": "enabled", + }, + "teamd": { + "auto_restart": "disabled", + }, + "syncd": { + "auto_restart": "enabled", + }, + "swss": { + "auto_restart": "disabled", + }, + } } ``` - -This information will be populated by the orchestrator (described 
later) on startup. - -### 3.3.2 SAI APIs -We will use the following SAI APIs to get this information: -* `sai_query_attribute_enum_values_capability` to query support for different types of counters -* `sai_object_type_get_availability` to query the amount of available debug counters - -## 3.4 Counters DB -The contents of the drop counters will be added to Counters DB by flex counters. - -Additionally, we will add a mapping from debug counter names to the appropriate port or switch stat index called COUNTERS_DEBUG_NAME_PORT_STAT_MAP and COUNTERS_DEBUG_NAME_SWITCH_STAT_MAP respectively. - -## 3.5 SWSS -A new orchestrator will be created to handle debug counter creation and configuration. Specifically, this orchestrator will support: -* Creating a new counter -* Deleting existing counters -* Adding drop reasons to an existing counter -* Removing a drop reason from a counter - -### 3.5.1 SAI APIs -This orchestrator will interact with the following SAI Debug Counter APIs: -* `sai_create_debug_counter_fn` to create/configure new drop counters. -* `sai_remove_debug_counter_fn` to delete/free up drop counters that are no longer being used. -* `sai_get_debug_counter_attribute_fn` to gather information about counters that have been configured (e.g. index, drop reasons, etc.). -* `sai_set_debug_counter_attribute_fn` to re-configure drop reasons for counters that have already been created. - -## 3.6 syncd -Flex counter will be extended to support switch-level SAI counters. - -# 4 Flows -## 4.1 General Flow -![alt text](./drop_counters_general_flow.png) -The overall workflow is shown above in figure 1. - -(1) Users configure drop counters using the CLI. Configurations are stored in the DEBUG_COUNTER Config DB table. - -(2) The debug counts orchagent subscribes to the Config DB table. Once the configuration changes, the orchagent uses the debug SAI API to configure the drop counters. - -(3) The debug counts orchagent publishes counter configurations to Flex Counter DB. 
- -(4) Syncd subscribes to Flex Counter DB and sets up flex counters. Flex counters periodically query ASIC counters and publishes data to Counters DB. - -(5) CLI uses counters DB to satisfy CLI requests. - -(6) (not shown) CLI uses State DB to display hardware capabilities (e.g. how many counters are available, supported drop reasons, etc.) - -# 5 Warm Reboot Support -On resource-constrained platforms, debug counters can be deleted prior to warm reboot and re-installed when orchagent starts back up. This is intended to conserve hardware resources during the warm reboot. This behavior has not been added to SONiC at this time, but can be if the need arises. - -# 6 Unit Tests -This feature comes with a full set of virtual switch tests in SWSS. -``` -=============================================================================================== test session starts =============================================================================================== -platform linux2 -- Python 2.7.15+, pytest-3.3.0, py-1.8.0, pluggy-0.6.0 -- /usr/bin/python2 -cachedir: .cache -rootdir: /home/daall/dev/sonic-swss/tests, inifile: -collected 14 items - -test_drop_counters.py::TestDropCounters::test_deviceCapabilitiesTablePopulated remove extra link dummy -PASSED [ 7%] -test_drop_counters.py::TestDropCounters::test_flexCounterGroupInitialized PASSED [ 14%] -test_drop_counters.py::TestDropCounters::test_createAndRemoveDropCounterBasic PASSED [ 21%] -test_drop_counters.py::TestDropCounters::test_createAndRemoveDropCounterReversed PASSED [ 28%] -test_drop_counters.py::TestDropCounters::test_createCounterWithInvalidCounterType PASSED [ 35%] -test_drop_counters.py::TestDropCounters::test_createCounterWithInvalidDropReason PASSED [ 42%] -test_drop_counters.py::TestDropCounters::test_addReasonToInitializedCounter PASSED [ 50%] -test_drop_counters.py::TestDropCounters::test_removeReasonFromInitializedCounter PASSED [ 57%] 
-test_drop_counters.py::TestDropCounters::test_addDropReasonMultipleTimes PASSED [ 64%] -test_drop_counters.py::TestDropCounters::test_addInvalidDropReason PASSED [ 71%] -test_drop_counters.py::TestDropCounters::test_removeDropReasonMultipleTimes PASSED [ 78%] -test_drop_counters.py::TestDropCounters::test_removeNonexistentDropReason PASSED [ 85%] -test_drop_counters.py::TestDropCounters::test_removeInvalidDropReason PASSED [ 92%] -test_drop_counters.py::TestDropCounters::test_createAndDeleteMultipleCounters PASSED [100%] - -=========================================================================================== 14 passed in 113.65 seconds =========================================================================================== -``` - -A separate test plan will be uploaded and review by the community. This will consist of system tests written in pytest that will send traffic to the device and verify that the drop counters are updated correctly. - -# 7 Platform Support -In order to make this feature platform independent, we rely on SAI query APIs (described above) to check for what counter types and drop reasons are supported on a given device. As a result, drop counters are only available on platforms that support both the SAI drop counter API as well as the query APIs, in order to preserve safety. - -# 7.1 Known Limitations -* BRCM SAI: - - ACL_ANY, DIP_LINK_LOCAL, SIP_LINK_LOCAL, and L3_EGRESS_LINK_OWN are all based on the same underlying counter in hardware, so enabling any one of these reasons on a drop counter will (implicitly) enable all of them. - -# 8 Open Questions -- How common of an operation is configuring a drop counter? Is this something that will usually only be done on startup, or something people will be updating frequently? - -# 9 Acknowledgements -I'd like to thank the community for all their help designing and reviewing this new feature! 
Special thanks to Wenda, Ying, Prince, Guohan, Joe, Qi, Renuka, and the team at Microsoft, Madhu and the team at Aviz, Ben, Vissu, Salil, and the team at Broadcom, Itai, Matty, Liat, Marian, and the team at Mellanox, and finally Ravi, Tony, and the team at Innovium. - -# 10 References -[1] [SAI Debug Counter Proposal](https://github.com/itaibaz/SAI/blob/a612dd21257cccca02cfc6dab90745a56d0993be/doc/SAI-Proposal-Debug-Counters.md) From 7a846125d64f8d72c1e5da336e94112717622fd3 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Thu, 20 Feb 2020 15:37:06 -0800 Subject: [PATCH 19/57] [Monitoring] Correct the typo in the hyper-link. Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 0ac5da6388..07f180cff8 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -5,7 +5,6 @@ # Table of Contents * [List of Tables](#list-of-tables) -* [List of Figures](#list-of-figures) * [Revision](#revision) * [About this Manual](#about-this-manual) * [Scope](#scope) @@ -23,15 +22,14 @@ - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes) - [2.2.2 Monitoring Critical Resource Usage](#222-monitoring-critical-resource-usage) - [2.2.3 Auto-restart Docker Container](#223-autorestart-docker-container) - - [3.1.4 Clearing the counts](#314-clearing-the-counts) - - [3.1.5 Configuring counters from the CLI](#315-configuring-counters-from-the-CLI) + - [2.2.3.1 CLI (and usage example)](#2231-cli-and-usage-example) + - [2.2.3.1.1 Show the Status of Auto-restart](#22311-show-the-status-of-autorestart) + - [2.2.3.1.2 Configure the Status of Auto-restart](#22312-configure-the-status-of-autorestart) + - [2.2.3.1.3 CONTAINER_FEATURE Table](#22313-container-feature-table) # List of Tables * [Table 1: 
Abbreviations](#definitionsabbreviation)
 
-# List of Figures
-* [Figure 1: General Flow](#41-general-flow)
-
 # Revision
 | Rev | Date | Author | Change Description |
 |:---:|:----------:|:----------------------:|---------------------------|
@@ -202,7 +200,7 @@ The CLI tool will provide the following functionality:
 1. Show current status of auto-restart feature for docker containers.
 2. Configure the status of a specific docker container.
 
-##### 2.2.3.1.1 Show the status of auto-restart
+##### 2.2.3.1.1 Show the Status of Auto-restart
 ```
 admin@sonic:~$ show container feature autorestart
 Container Name Status
@@ -222,13 +220,13 @@ syncd enabled
 swss disabled
 ```
 
-##### 2.2.3.1.2 Configure the status of auto-restart
+##### 2.2.3.1.2 Configure the Status of Auto-restart
 ```
 admin@sonic:~$ sudo config container feature autorestart database enabled
 ```
 
 
-### 3.2.1 DEBUG_COUNTER Table
+##### 2.2.3.1.3 CONTAINER_FEATURE Table
 Example:
 ```
 {
From 9941852b648508771710642793b78410acf61b7b Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Thu, 20 Feb 2020 15:40:54 -0800
Subject: [PATCH 20/57] [Monitoring] Correct a typo in the hyper-link.
Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index 07f180cff8..ce9b519dcc 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -21,10 +21,10 @@
     - [2.2 Functional Description](#22-functional-description)
         - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes)
         - [2.2.2 Monitoring Critical Resource Usage](#222-monitoring-critical-resource-usage)
-        - [2.2.3 Auto-restart Docker Container](#223-autorestart-docker-container)
+        - [2.2.3 Auto-restart Docker Container](#223-auto-restart-docker-container)
             - [2.2.3.1 CLI (and usage example)](#2231-cli-and-usage-example)
-                - [2.2.3.1.1 Show the Status of Auto-restart](#22311-show-the-status-of-autorestart)
-                - [2.2.3.1.2 Configure the Status of Auto-restart](#22312-configure-the-status-of-autorestart)
+                - [2.2.3.1.1 Show the Status of Auto-restart](#22311-show-the-status-of-auto-restart)
+                - [2.2.3.1.2 Configure the Status of Auto-restart](#22312-configure-the-status-of-auto-restart)
                 - [2.2.3.1.3 CONTAINER_FEATURE Table](#22313-container-feature-table)
 
 # List of Tables
From 58c1f79aeb5b8b6b8553891f983943538756bf8c Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Thu, 20 Feb 2020 15:44:59 -0800
Subject: [PATCH 21/57] [Monitoring] Add a hyper-link for container feature table.
Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index ce9b519dcc..f8ba137afa 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -25,7 +25,7 @@
             - [2.2.3.1 CLI (and usage example)](#2231-cli-and-usage-example)
                 - [2.2.3.1.1 Show the Status of Auto-restart](#22311-show-the-status-of-auto-restart)
                 - [2.2.3.1.2 Configure the Status of Auto-restart](#22312-configure-the-status-of-auto-restart)
-                - [2.2.3.1.3 CONTAINER_FEATURE Table](#22313-container-feature-table)
+                - [2.2.3.1.3 CONTAINER_FEATURE Table](#22313-container_feature-table)
 
 # List of Tables
 * [Table 1: Abbreviations](#definitionsabbreviation)
From da03448fe743ef5c3b402cf168f7de7c19a34551 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Thu, 20 Feb 2020 15:53:45 -0800
Subject: [PATCH 22/57] [Monitoring] Reword the sentence in the section of feature overview.

Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index f8ba137afa..0621377346 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -53,8 +53,8 @@ the unhealthy of docker containers.
 SONiC is a collection of various switch applications which are held in docker containers
 such as BGP and SNMP. Each application usually includes several processes which are
 working together to provide the services for other modules. As such, the healthy of
-critical processes in each docker container are the key not only for this docker
-container working correctly but also for the intended functionalities of whole SONiC switch.
+critical processes in each docker container are the key not only for the docker
+container working correctly but also for the intended functionalities of entire SONiC switch.
 On the other hand, profiling the resource usages and performance of each docker
 container are also important for us to understand whether this container is in healthy state
 or not and furtherly to provide us with deep insight about networking traffic.
@@ -74,7 +74,7 @@ We implemented this feature by employing the existing Monit and supervisord syst
    if one of its critical processes exited unexpectedly.
 3. We also added a knob to make this auto-restart feature dynamically configurable.
    Specifically users can run CLI to configure this feature residing in Config_DB as
-   enabled/disabled state.
+   enabled/disabled status.
 
 ## 1.1 Requirements
 
@@ -90,7 +90,7 @@ We implemented this feature by employing the existing Monit and supervisord syst
    container..
 5. Users can access this auto-restart information via the CLI utility
     1. Users can see current auto-restart status for docker containers.
-    2. Users can change auto-restart status for a specific docker container.
+    2. Users can configure auto-restart status for a specific docker container.
 
 ### 1.1.2 Configuration and Management Requirements
 Configuration of the auto-restart feature can be done via:
@@ -169,7 +169,7 @@ Below is an example of Monit configuration file for lldp container to pass the p
 threshold (bytes) to the script and check the exiting value.
 
 ```bash
-check program memory_checker with path "/usr/bin/memory_checker lldp 104857600"
+check program container_memory_lldp with path "/usr/bin/memory_checker lldp 104857600"
     if status != 0 then alert
 ```
 
From 9884fc29d43bae85b7843fad736a1adfd75608ee Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Thu, 20 Feb 2020 17:21:28 -0800
Subject: [PATCH 23/57] [Monitoring] Reword the sentences in the section of auto-restart feature.
Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index 0621377346..0dd70bf4db 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -1,4 +1,4 @@
-# Monitoring and Auto-Mitigating the Unhealthy of Containers in SONiC
+# Monitoring and Auto-Mitigating the Unhealthy of Docker Containers in SONiC
 
 # High Level Design Document
 #### Rev 0.1
@@ -36,11 +36,11 @@
 | 0.1 | 02/18/2020 | Yong Zhao, Joe Leveque | Initial version |
 
 # About this Manual
-This document provides the design and implementation of monitoring and auto-mitigating
+This document presents the design and implementation of feature to monitor and auto-mitigate
 the unhealthy of docker containers in SONiC.
 
 # Scope
-This document describes the high level design of the feature to monitor and auto-mitigate
+This document describes the high level design of feature to monitor and auto-mitigate
 the unhealthy of docker containers.
 
 # Definitions/Abbreviation
@@ -51,8 +51,8 @@ the unhealthy of docker containers.
 
 # 1 Feature Overview
 SONiC is a collection of various switch applications which are held in docker containers
-such as BGP and SNMP. Each application usually includes several processes which are
-working together to provide the services for other modules. As such, the healthy of
+such as BGP container and SNMP container. Each application usually includes several processes which are
+working together to provide and receive the services from other modules. As such, the healthy of
 critical processes in each docker container are the key not only for the docker
 container working correctly but also for the intended functionalities of entire SONiC switch.
 On the other hand, profiling the resource usages and performance of each docker
@@ -185,14 +185,14 @@ monitoring/notification framework. Specifically
 if the state of process changes for example from running to exited,
 an event notification `PROCESS_STATE_STOPPED` will be emitted by supervisord.
 This event will be received by event listener. If the exited process is critical
-one, then the event listener will terminate supervisord and the container will be stopped
+one, then the event listener will terminate supervisord and the container will be shut down
 and restarted.
 
 We also introduced a knob which can enable or disable this auto-restart feature
 dynamically according to the requirement of users. In detail, we created a table
 named `CONTAINER_FEATURE` in Config_DB and this table includes the status of
 auto-restart feature for each docker container. Users can easily use CLI to
-see and configure the corresponding status.
+check and configure the corresponding docker container status.
 
 
 #### 2.2.3.1 CLI (and usage example)
From e0f0d96e8ed16a6009283221644db2277caa8dc0 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Sun, 23 Feb 2020 17:05:34 -0800
Subject: [PATCH 24/57] [Doc-Monitoring] Reword the title and the section of feature overview.
Signed-off-by: Yong Zhao
---
 .../monitoring_containers.md | 23 ++++++++-----------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index 0dd70bf4db..63d26db001 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -1,4 +1,4 @@
-# Monitoring and Auto-Mitigating the Unhealthy of Docker Containers in SONiC
+# Monitoring and Auto-Mitigating Unhealthy Containers in SONiC
 
 # High Level Design Document
 #### Rev 0.1
@@ -6,7 +6,6 @@
 # Table of Contents
 * [List of Tables](#list-of-tables)
 * [Revision](#revision)
-* [About this Manual](#about-this-manual)
 * [Scope](#scope)
 * [Defintions/Abbreviation](#definitionsabbreviation)
 * [1 Feature Overview](#1-feature-overview)
@@ -35,10 +34,6 @@
 |:---:|:----------:|:----------------------:|---------------------------|
 | 0.1 | 02/18/2020 | Yong Zhao, Joe Leveque | Initial version |
 
-# About this Manual
-This document presents the design and implementation of feature to monitor and auto-mitigate
-the unhealthy of docker containers in SONiC.
-
 # Scope
 This document describes the high level design of feature to monitor and auto-mitigate
 the unhealthy of docker containers.
@@ -52,22 +47,22 @@ the unhealthy of docker containers.
 # 1 Feature Overview
 SONiC is a collection of various switch applications which are held in docker containers
 such as BGP container and SNMP container. Each application usually includes several processes which are
-working together to provide and receive the services from other modules. As such, the healthy of
-critical processes in each docker container are the key not only for the docker
+working together to provide and receive the services from other modules.
As such, the health of
+critical processes in each docker container is imperitive not only for the docker
 container working correctly but also for the intended functionalities of entire SONiC switch.
-On the other hand, profiling the resource usages and performance of each docker
-container are also important for us to understand whether this container is in healthy state
-or not and furtherly to provide us with deep insight about networking traffic.
 
-The main purpose of this feature includes two parts: the first part is to monitor the
+## 1.1 Monitoring
+This feature is to monitor the
 running status of each process and critical resource usage such as CPU, memory and disk
 of each docker container.
-The second part is docker containers can be automatically shut down and
+
+## 1.2 Auto-Mitigating
+This feature is docker containers can be automatically shut down and
 restarted if one of critical processes running in the container exits unexpectedly. Restarting
 the entire container ensures that configuration is reloaded and all processes in the container
 get restarted, thus increasing the likelihood of entering a healthy state.
 
-We implemented this feature by employing the existing Monit and supervisord system tools.
+We implemented these two feature by employing the existing Monit and supervisord system tools.
 1. We used Monit system tool to detect whether a process is running or not and whether
    the resource usage of a docker container is beyond the pre-defined threshold.
 2. We leveraged the mechanism of event listener in supervisord to auto-restart a docker container
From 1dc3a96fa5d8cf97f5ed46fcca64c99b32a33815 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Sun, 23 Feb 2020 17:41:52 -0800
Subject: [PATCH 25/57] [Doc-monitoring] Reworded the sentences and fixed the typo.
Signed-off-by: Yong Zhao
---
 .../monitoring_containers.md | 56 +++++++++----------
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index 63d26db001..6a3e6fa2df 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -9,12 +9,14 @@
 * [Scope](#scope)
 * [Defintions/Abbreviation](#definitionsabbreviation)
 * [1 Feature Overview](#1-feature-overview)
-    - [1.1 Requirements](#11-requirements)
-        - [1.1.1 Functional Requirements](#111-functional-requirements)
-        - [1.1.2 Configuration and Management Requirements](#112-configuration-and-management-requirements)
-        - [1.1.3 Scalability Requirements](#113-scalability-requirements)
-    - [1.2 Design](#12-design)
-        - [1.2.1 Basic Approach](#121-basic-approach)
+    - [1.1 Monitoring](#11-monitoring)
+    - [1.2 Auto-mitigating](#12-auto-mitigating)
+    - [1.3 Requirements](#13-requirements)
+        - [1.3.1 Functional Requirements](#131-functional-requirements)
+        - [1.3.2 Configuration and Management Requirements](#132-configuration-and-management-requirements)
+        - [1.3.3 Scalability Requirements](#133-scalability-requirements)
+    - [1.4 Design](#12-design)
+        - [1.4.1 Basic Approach](#141-basic-approach)
 * [2 Functionality](#2-functionality)
     - [2.1 Target Deployment Use Cases](#21-target-deployment-use-cases)
     - [2.2 Functional Description](#22-functional-description)
@@ -52,28 +54,26 @@ critical processes in each docker container is imperitive not only for the docke
 container working correctly but also for the intended functionalities of entire SONiC switch.
 
 ## 1.1 Monitoring
-This feature is to monitor the
-running status of each process and critical resource usage such as CPU, memory and disk
-of each docker container.
+This feature is used to monitor the running status of each process and critical resource +usage such as CPU, memory and disk of each docker container. + +We used Monit system tool to detect whether a process is running or not and whether +the resource usage of a docker container is beyond the pre-defined threshold. ## 1.2 Auto-Mitigating -This feature is docker containers can be automatically shut down and +This feature demonstrated docker containers can be automatically shut down and restarted if one of critical processes running in the container exits unexpectedly. Restarting the entire container ensures that configuration is reloaded and all processes in the container get restarted, thus increasing the likelihood of entering a healthy state. -We implemented these two feature by employing the existing Monit and supervisord system tools. -1. We used Monit system tool to detect whether a process is running or not and whether - the resource usage of a docker container is beyond the pre-defined threshold. -2. We leveraged the mechanism of event listener in supervisord to auto-restart a docker container - if one of its critical processes exited unexpectedly. -3. We also added a knob to make this auto-restart feature dynamically configurable. - Specifically users can run CLI to configure this feature residing in Config_DB as - enabled/disabled status. +We leveraged the mechanism of event listener in supervisord to auto-restart a docker container +if one of its critical processes exited unexpectedly. We also added a configuration option to make this +auto-restart feature dynamically configurable. Specifically users can run CLI to configure this +feature residing in Config_DB as enabled/disabled status. -## 1.1 Requirements +## 1.3 Requirements -### 1.1.1 Functional Requirements +### 1.3.1 Functional Requirements 1. The Monit must provide the ability to generate an alert when a critical process is not running. 2. 
The Monit must provide the ability to generate an alert when the resource usage of @@ -87,17 +87,17 @@ We implemented these two feature by employing the existing Monit and supervisord 1. Users can see current auto-restart status for docker containers. 2. Users can configure auto-restart status for a specific docker container. -### 1.1.2 Configuration and Management Requirements +### 1.3.2 Configuration and Management Requirements Configuration of the auto-restart feature can be done via: 1. init_cfg.json 2. CLI -### 1.1.3 Scalability Requirements +### 1.3.3 Scalability Requirements `Place holder` -## 1.2 Design +## 1.4 Design -### 1.2.1 Basic Approach +### 1.4.1 Basic Approach Monitoring the running status of critical processes and resource usage of docker containers are heavily depended on the Monit system tool. Since Monit already provided the mechanism to check whether a process is running or not, it will be straightforward to integrate this into monitoring @@ -116,7 +116,8 @@ the entire docker container will be shut down and restarted. # 2 Functionality ## 2.1 Target Deployment Use Cases This feature is used to perform the following functions: -1. Monit will write an alert message into syslog if one if critical process exited unexpectedly. +1. Monit will write an alert message into syslog if one if critical process has not been + alive for 5 minutes. 2. Monit will write an alert message into syslog if the usage of memory is larger than the pre-defined threshold for a docker container. 3. A docker container will auto-restart if one of its critical processes crashed or exited @@ -174,7 +175,7 @@ restarted if one of critical processes running in the container exits unexpected the entire container ensures that configuration is reloaded and all processes in the container get restarted, thus increasing the likelihood of entering a healthy state. 
-Currently SONiC used superviord system tool to manage the processes in each +Currently SONiC used supervisord system tool to manage the processes in each docker container. Actually auto-restarting docker container is based on the process monitoring/notification framework. Specifically if the state of process changes for example from running to exited, @@ -183,7 +184,7 @@ This event will be received by event listener. If the exited process is critical one, then the event listener will terminate supervisord and the container will be shut down and restarted. -We also introduced a knob which can enable or disable this auto-restart feature +We also introduced a configuration option which can enable or disable this auto-restart feature dynamically according to the requirement of users. In detail, we created a table named `CONTAINER_FEATURE` in Config_DB and this table includes the status of auto-restart feature for each docker container. Users can easily use CLI to @@ -265,7 +266,6 @@ Example: "swss": { "auto_restart": "disabled", }, - } } ``` From 0124b94b1a09f418957e02c54c3c83611f6ab1c5 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Sun, 23 Feb 2020 18:22:43 -0800 Subject: [PATCH 26/57] [Doc-monitoring] Reword and correct the typos. 
Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 22 +++++++++---------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 6a3e6fa2df..c80b3e3ba5 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -15,7 +15,7 @@ - [1.3.1 Functional Requirements](#131-functional-requirements) - [1.3.2 Configuration and Management Requirements](#132-configuration-and-management-requirements) - [1.3.3 Scalability Requirements](#133-scalability-requirements) - - [1.4 Design](#12-design) + - [1.4 Design](#14-design) - [1.4.1 Basic Approach](#141-basic-approach) * [2 Functionality](#2-functionality) - [2.1 Target Deployment Use Cases](#21-target-deployment-use-cases) @@ -34,11 +34,11 @@ # Revision | Rev | Date | Author | Change Description | |:---:|:----------:|:----------------------:|---------------------------| -| 0.1 | 02/18/2020 | Yong Zhao, Joe Leveque | Initial version | +| 0.1 | 02/18/2020 | Yong Zhao, Joe Leveque | Initial version | # Scope -This document describes the high level design of feature to monitor and auto-mitigate -the unhealthy of docker containers. +This document describes the high level design of features to monitor and auto-mitigate +the unhealthy containers in SONiC. # Definitions/Abbreviation | Abbreviation | Description | @@ -50,21 +50,21 @@ the unhealthy of docker containers. SONiC is a collection of various switch applications which are held in docker containers such as BGP container and SNMP container. Each application usually includes several processes which are working together to provide and receive the services from other modules. 
As such, the health of -critical processes in each docker container is imperitive not only for the docker +critical processes in each docker container is imperative not only for the docker container working correctly but also for the intended functionalities of entire SONiC switch. ## 1.1 Monitoring -This feature is used to monitor the running status of each process and critical resource +This feature is used to monitor the running status of critical processes and critical resource usage such as CPU, memory and disk of each docker container. -We used Monit system tool to detect whether a process is running or not and whether +We used Monit system tool to detect whether a critical process is running or not and whether the resource usage of a docker container is beyond the pre-defined threshold. ## 1.2 Auto-Mitigating -This feature demonstrated docker containers can be automatically shut down and -restarted if one of critical processes running in the container exits unexpectedly. Restarting -the entire container ensures that configuration is reloaded and all processes in the container -get restarted, thus increasing the likelihood of entering a healthy state. +This feature demonstrated docker container can be automatically shut down and +restarted if one of critical processes running in docker container exits unexpectedly. Restarting +the entire docker container ensures that configuration is reloaded and all processes in +docker container get restarted, thus increasing the likelihood of entering a healthy state. We leveraged the mechanism of event listener in supervisord to auto-restart a docker container if one of its critical processes exited unexpectedly. We also added a configuration option to make this From 07743446fdad0a62df2564dc164c41cf0fb65f11 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Sun, 23 Feb 2020 18:28:14 -0800 Subject: [PATCH 27/57] [Doc-monitoring] Revised the functional requirement. 
Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index c80b3e3ba5..49cfe55940 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -74,8 +74,8 @@ feature residing in Config_DB as enabled/disabled status.
 ## 1.3 Requirements

 ### 1.3.1 Functional Requirements
-1. The Monit must provide the ability to generate an alert when a critical process is not
- running.
+1. The Monit must provide the ability to generate an alert when a critical process has not
+ been alive for 5 minutes.
 2. The Monit must provide the ability to generate an alert when the resource usage of
 a docker contaier is larger than the pre-defined threshold.
 3. The event listener in supervisord must receive the signal when a critical process in
@@ -83,7 +83,7 @@ feature residing in Config_DB as enabled/disabled status.
 container.
 4. CONFIG_DB can be configured to enable/disable this auto-restart feature for each docker
 container..
-5. Users can access this auto-restart information via the CLI utility
+5. Users can access the status of auto-restart feature via the CLI utility
 1. Users can see current auto-restart status for docker containers.
 2. Users can configure auto-restart status for a specific docker container.

From a852c358ae456ad80154ec29e4a8f1dd68b29fc1 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Sun, 23 Feb 2020 18:36:06 -0800
Subject: [PATCH 28/57] [Doc-monitoring] Reword the basic approach.
Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 49cfe55940..2615e29c60 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -107,15 +107,15 @@ container will be an interesting and challenging problem. In our design, we adop that Monit will check the returned value of a script which reads the resource usage of docker container, compares it with pre-defined threshold and then exited. -We employed the mechanism of event listener in supervisord to achieve auto-restarting of docker -container. Currently supervisord will monitor the running status of each process in SONiC +We employed the mechanism of event listener in supervisord to achieve auto-restarting docker +containers. Currently supervisord will monitor the running status of critical processes in SONiC docker containers. If one critical process exited unexpectedly, supervisord will catch such signal and send it to event listener. Then event listener will kill the process supervisord and the entire docker container will be shut down and restarted. # 2 Functionality ## 2.1 Target Deployment Use Cases -This feature is used to perform the following functions: +These two features are used to perform the following functions: 1. Monit will write an alert message into syslog if one if critical process has not been alive for 5 minutes. 2. Monit will write an alert message into syslog if the usage of memory is larger than the @@ -128,7 +128,7 @@ This feature is used to perform the following functions: ### 2.2.1 Monitoring Critical Processes Monit has implemented the mechanism to monitor whether a process is running or not. 
In detail, -Monit will periodically read the target processes from configuration file and tries to match +Monit will periodically read the target processes from configuration file and try to match those process with the processes tree in Linux kernel. Below is an example of Monit configuration file to monitor the critical processes in lldp From a5d094b1885afa8809c50f550b2be754a163128e Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Mon, 24 Feb 2020 14:11:09 -0800 Subject: [PATCH 29/57] [Doc-monitoring] Reworded basic approach and fix the typos. Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 51 +++++++++++-------- 1 file changed, 29 insertions(+), 22 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 2615e29c60..6829800993 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -14,7 +14,6 @@ - [1.3 Requirements](#13-requirements) - [1.3.1 Functional Requirements](#131-functional-requirements) - [1.3.2 Configuration and Management Requirements](#132-configuration-and-management-requirements) - - [1.3.3 Scalability Requirements](#133-scalability-requirements) - [1.4 Design](#14-design) - [1.4.1 Basic Approach](#141-basic-approach) * [2 Functionality](#2-functionality) @@ -74,10 +73,10 @@ feature residing in Config_DB as enabled/disabled status. ## 1.3 Requirements ### 1.3.1 Functional Requirements -1. The Monit must provide the ability to generate an alert when a critical process has not +1. Monit must provide the ability to generate an alert when a critical process has not been alive for 5 minutes. -2. The Monit must provide the ability to generate an alert when the resource usage of - a docker contaier is larger than the pre-defined threshold. +2. Monit must provide the ability to generate an alert when the resource usage of + a docker container is larger than the pre-defined threshold. 3. 
The event listener in supervisord must receive the signal when a critical process in a docker container crashed or exited unexpectedly and then restart this docker container. @@ -90,28 +89,33 @@ feature residing in Config_DB as enabled/disabled status. ### 1.3.2 Configuration and Management Requirements Configuration of the auto-restart feature can be done via: 1. init_cfg.json -2. CLI - -### 1.3.3 Scalability Requirements -`Place holder` +2. config_db.json +3. CLI ## 1.4 Design ### 1.4.1 Basic Approach Monitoring the running status of critical processes and resource usage of docker containers -are heavily depended on the Monit system tool. Since Monit already provided the mechanism +are depended on the Monit system tool. Since Monit already provided the mechanism to check whether a process is running or not, it will be straightforward to integrate this into monitoring the critical processes in SONiC. However, Monit only gives the method to monitor the resource usage per process level not container level. As such, monitoring the resource usage of a docker -container will be an interesting and challenging problem. In our design, we adopted the way -that Monit will check the returned value of a script which reads the resource usage of docker -container, compares it with pre-defined threshold and then exited. - -We employed the mechanism of event listener in supervisord to achieve auto-restarting docker -containers. Currently supervisord will monitor the running status of critical processes in SONiC -docker containers. If one critical process exited unexpectedly, supervisord will catch such signal -and send it to event listener. Then event listener will kill the process supervisord and -the entire docker container will be shut down and restarted. +container is not as straightforward. In our design, we propose to utilize the mechanism with +which Monit can spawn a process and check the return value of the process. 
We will have Monit +launch a script which reads the resource usage of the container and compares the resource usage +with a configured threshold value for that container. If the current resource usage is less than +the configured threshold value, the script will return 0 and Monit will not log a message. +However, if the resource usage exceeds the threshold, the script will return a non-zero value +and Monit will log an alert message to the syslog. + +We employed event listener's mechanism in supervisord to achieve auto-restarting docker +containers. We configure our event listener to listen for process exit events. When a supervised +process exits, supervisord will pass the event to our custom event listener. The event listener +determines if the process is a critical process and whether it exited unexpectedly. If both of +these conditions are true, the event listener will kill the supervisord process. Since supervisord +runs as PID 1 inside the containers, when supervisotd exits, the container will stop. When the +container stops, the systemd service which manages the container will also stop, but it is +configured to automatically restart the service, thus it will restart the container. # 2 Functionality ## 2.1 Target Deployment Use Cases @@ -127,7 +131,7 @@ These two features are used to perform the following functions: ### 2.2.1 Monitoring Critical Processes -Monit has implemented the mechanism to monitor whether a process is running or not. In detail, +Monit natively implements a mechanism to monitor whether a process is running or not. In detail, Monit will periodically read the target processes from configuration file and try to match those process with the processes tree in Linux kernel. @@ -180,9 +184,12 @@ docker container. Actually auto-restarting docker container is based on the proc monitoring/notification framework. 
Specifically if the state of process changes for example from running to exited, an event notification `PROCESS_STATE_STOPPED` will be emitted by supervisord. -This event will be received by event listener. If the exited process is critical -one, then the event listener will terminate supervisord and the container will be shut down -and restarted. +This event will be received by event listener. The event listener determines if the process is +critical process and whether it exited unexpectedly. If both of +these conditions are true, the event listener will kill the supervisord process. Since supervisord +runs as PID 1 inside the containers, when supervisotd exits, the container will stop. When the +container stops, the systemd service which manages the container will also stop, but it is +configured to automatically restart the service, thus it will restart the container. We also introduced a configuration option which can enable or disable this auto-restart feature dynamically according to the requirement of users. In detail, we created a table From 93826e414ba0a018772cf4c1c851a656a5d0d691 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Mon, 24 Feb 2020 14:52:37 -0800 Subject: [PATCH 30/57] [Doc-monitoring] Correct the typo of supervisord. Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 6829800993..1f3287e656 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -113,7 +113,7 @@ containers. We configure our event listener to listen for process exit events. W process exits, supervisord will pass the event to our custom event listener. The event listener determines if the process is a critical process and whether it exited unexpectedly. 
If both of
 these conditions are true, the event listener will kill the supervisord process. Since supervisord
-runs as PID 1 inside the containers, when supervisotd exits, the container will stop. When the
+runs as PID 1 inside the containers, when supervisord exits, the container will stop. When the
 container stops, the systemd service which manages the container will also stop, but it is
 configured to automatically restart the service, thus it will restart the container.

From 0e84f8726e4931cb1ebec01020372b321dbb8849 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Mon, 24 Feb 2020 15:00:50 -0800
Subject: [PATCH 31/57] [Doc-monitoring] When a process changes from running to exited, the event type should be PROCESS_STATE_EXITED.

Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index 1f3287e656..f4bc3784a1 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -183,7 +183,7 @@ Currently SONiC used supervisord system tool to manage the processes in each
 docker container. Actually auto-restarting docker container is based on the process
 monitoring/notification framework.
 Specifically if the state of process changes for example from running to exited,
-an event notification `PROCESS_STATE_STOPPED` will be emitted by supervisord.
+an event notification `PROCESS_STATE_EXITED` will be emitted by supervisord.
 This event will be received by event listener. The event listener determines if the process is
 critical process and whether it exited unexpectedly. If both of
 these conditions are true, the event listener will kill the supervisord process.
Since supervisord

From 965fc6147b0e90ddefa4ac6b9f9a892ff6b5fb29 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Mon, 24 Feb 2020 15:57:10 -0800
Subject: [PATCH 32/57] [Doc-monitoring] Reword the mechanism of event listener to 'event listener' mechanism.

Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index f4bc3784a1..c66d4b944b 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -108,7 +108,7 @@ the configured threshold value, the script will return 0 and Monit will not log
 However, if the resource usage exceeds the threshold, the script will return a non-zero value
 and Monit will log an alert message to the syslog.

-We employed event listener's mechanism in supervisord to achieve auto-restarting docker
+We employed 'event listener' mechanism in supervisord to achieve auto-restarting docker
 containers. We configure our event listener to listen for process exit events. When a supervised
 process exits, supervisord will pass the event to our custom event listener. The event listener
 determines if the process is a critical process and whether it exited unexpectedly. If both of

From 5c69e6e814d27026d18d639b2c90eb24c7f2deca Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Mon, 24 Feb 2020 16:05:22 -0800
Subject: [PATCH 33/57] [Doc-monitoring] Correct a typo and remove the init_cfg.json in line 90 since the status of auto-restart feature in init_cfg.json is fixed and we should not change the content in this file.
Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index c66d4b944b..cc924ddd25 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -87,10 +87,10 @@ feature residing in Config_DB as enabled/disabled status.
 2. Users can configure auto-restart status for a specific docker container.

 ### 1.3.2 Configuration and Management Requirements
+Via the init_cfg.json file, these container features are disabled by default.
 Configuration of the auto-restart feature can be done via:
-1. init_cfg.json
-2. config_db.json
-3. CLI
+1. config_db.json
+2. CLI

 ## 1.4 Design

@@ -187,7 +187,7 @@ an event notification `PROCESS_STATE_EXITED` will be emitted by supervisord.
 This event will be received by event listener. The event listener determines if the process is
 critical process and whether it exited unexpectedly. If both of
 these conditions are true, the event listener will kill the supervisord process. Since supervisord
-runs as PID 1 inside the containers, when supervisotd exits, the container will stop. When the
+runs as PID 1 inside the containers, when supervisord exits, the container will stop. When the
 container stops, the systemd service which manages the container will also stop, but it is
 configured to automatically restart the service, thus it will restart the container.

From a040c34daf9af768f58154a0ba80cdd62c192979 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Mon, 24 Feb 2020 16:12:40 -0800
Subject: [PATCH 34/57] [Doc-monitoring] Reword the gives to provides in line 101.
Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index cc924ddd25..ce3862e991 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -98,8 +98,8 @@ Configuration of the auto-restart feature can be done via:
 Monitoring the running status of critical processes and resource usage of docker containers
 are depended on the Monit system tool. Since Monit already provided the mechanism
 to check whether a process is running or not, it will be straightforward to integrate this into monitoring
-the critical processes in SONiC. However, Monit only gives the method to monitor the resource
-usage per process level not container level. As such, monitoring the resource usage of a docker
+the critical processes in SONiC. However, Monit only provides the method to monitor the resource
+usage on a per-process level not a per-container level. As such, monitoring the resource usage of a docker
 container is not as straightforward. In our design, we propose to utilize the mechanism with
 which Monit can spawn a process and check the return value of the process. We will have Monit
 launch a script which reads the resource usage of the container and compares the resource usage

From a28459aecee56bc5e2941963fa37e0e94500d4cf Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Mon, 24 Feb 2020 16:36:53 -0800
Subject: [PATCH 35/57] [Doc-monitoring] Reword the sentence "we emplyed 'event listener' mechanism" to "we employed the 'event listener' mechanism".
Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index ce3862e991..be8e7542f4 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -98,7 +98,7 @@ Configuration of the auto-restart feature can be done via:
 Monitoring the running status of critical processes and resource usage of docker containers
 are depended on the Monit system tool. Since Monit already provided the mechanism
 to check whether a process is running or not, it will be straightforward to integrate this into monitoring
-the critical processes in SONiC. However, Monit only provides the method to monitor the resource
+the critical processes in SONiC. However, Monit only provides a method to monitor the resource
 usage on a per-process level not a per-container level. As such, monitoring the resource usage of a docker
 container is not as straightforward. In our design, we propose to utilize the mechanism with
 which Monit can spawn a process and check the return value of the process. We will have Monit
@@ -108,7 +108,7 @@ the configured threshold value, the script will return 0 and Monit will not log
 However, if the resource usage exceeds the threshold, the script will return a non-zero value
 and Monit will log an alert message to the syslog.

-We employed 'event listener' mechanism in supervisord to achieve auto-restarting docker
+We employed the 'event listener' mechanism in supervisord to achieve auto-restarting docker
 containers. We configure our event listener to listen for process exit events. When a supervised
 process exits, supervisord will pass the event to our custom event listener. The event listener
 determines if the process is a critical process and whether it exited unexpectedly.
If both of

From 8b270be4fde92cce64c00dcf58bcd6f1f14da02f Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Mon, 24 Feb 2020 16:41:38 -0800
Subject: [PATCH 36/57] [Doc-monitoring] Reword the line 68 to we leveraged the 'event listener' mechanism ...

Signed-off-by: Yong Zhao
---
 doc/monitoring_containers/monitoring_containers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md
index be8e7542f4..2d6972724d 100644
--- a/doc/monitoring_containers/monitoring_containers.md
+++ b/doc/monitoring_containers/monitoring_containers.md
@@ -65,7 +65,7 @@ restarted if one of critical processes running in docker container exits unexpec
 the entire docker container ensures that configuration is reloaded and all processes in
 docker container get restarted, thus increasing the likelihood of entering a healthy state.

-We leveraged the mechanism of event listener in supervisord to auto-restart a docker container
+We leveraged the 'event listener' mechanism in supervisord to auto-restart a docker container
 if one of its critical processes exited unexpectedly. We also added a configuration option to make this
 auto-restart feature dynamically configurable. Specifically users can run CLI to configure this
 feature residing in Config_DB as enabled/disabled status.

From 7710e1b0c4b151686cb6d3d75680f8c894c1daf2 Mon Sep 17 00:00:00 2001
From: Yong Zhao
Date: Mon, 24 Feb 2020 16:59:25 -0800
Subject: [PATCH 37/57] [Doc-monitoring] Add the proposed section for memory, cpu and disk alert.
Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 41 ++++++++++++++++++- 1 file changed, 40 insertions(+), 1 deletion(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 2d6972724d..15709ff021 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -88,7 +88,7 @@ feature residing in Config_DB as enabled/disabled status. ### 1.3.2 Configuration and Management Requirements Via the init_cfg.json file, these container features are disabled by default. -Configuration of the auto-restart feature can be done via: +Configuration of these features can be done via: 1. config_db.json 2. CLI @@ -236,42 +236,81 @@ Example: "CONTAINER_FEATURE": { "database": { "auto_restart": "enabled", + "high_mem_alert": "", + "high_cpu_alert": "", + "high_disk_alert": "" }, "lldp": { "auto_restart": "disabled", + "high_mem_alert": "", + "high_cpu_alert": "", + "high_disk_alert": "" }, "radv": { "auto_restart": "disabled", + "high_mem_alert": "", + "high_cpu_alert": "", + "high_disk_alert": "" }, "pmon": { "auto_restart": "disabled", + "high_mem_alert": "", + "high_cpu_alert": "", + "high_disk_alert": "" }, "sflow": { "auto_restart": "enabled", + "high_mem_alert": "", + "high_cpu_alert": "", + "high_disk_alert": "" }, "snmp": { "auto_restart": "enabled", + "high_mem_alert": "", + "high_cpu_alert": "", + "high_disk_alert": "" }, "telemetry": { "auto_restart": "enabled", + "high_mem_alert": "", + "high_cpu_alert": "", + "high_disk_alert": "" }, "bgp": { "auto_restart": "disabled", + "high_mem_alert": "", + "high_cpu_alert": "", + "high_disk_alert": "" }, "dhcp_relay": { "auto_restart": "disabled", + "high_mem_alert": "", + "high_cpu_alert": "", + "high_disk_alert": "" }, "rest-api": { "auto_restart": "enabled", + "high_mem_alert": "", + "high_cpu_alert": "", + "high_disk_alert": "" }, "teamd": { "auto_restart": "disabled", + "high_mem_alert": "", 
+ "high_cpu_alert": "", + "high_disk_alert": "" }, "syncd": { "auto_restart": "enabled", + "high_mem_alert": "", + "high_cpu_alert": "", + "high_disk_alert": "" }, "swss": { "auto_restart": "disabled", + "high_mem_alert": "", + "high_cpu_alert": "", + "high_disk_alert": "" }, } } From 6d73a9dba943ab11745f478b4d63975f4aa8a67b Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Mon, 24 Feb 2020 17:26:17 -0800 Subject: [PATCH 38/57] [Doc-monitoring] Add a section for the new proposal resource alerting. Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 15709ff021..7c5deb1824 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -25,7 +25,8 @@ - [2.2.3.1 CLI (and usage example)](#2231-cli-and-usage-example) - [2.2.3.1.1 Show the Status of Auto-restart](#22311-show-the-status-of-auto-restart) - [2.2.3.1.2 Configure the Status of Auto-restart](#22312-configure-the-status-of-auto-restart) - - [2.2.3.1.3 CONTAINER_FEATURE Table](#22313-container_feature-table) + - [2.2.4 Resource Alerting](#224-resource-alerting) + - [2.2.5 CONTAINER_FEATURE Table](#225-container_feature-table) # List of Tables * [Table 1: Abbreviations](#definitionsabbreviation) @@ -96,7 +97,7 @@ Configuration of these features can be done via: ### 1.4.1 Basic Approach Monitoring the running status of critical processes and resource usage of docker containers -are depended on the Monit system tool. Since Monit already provided the mechanism +depends on the Monit system tool. Since Monit natively provides a mechanism to check whether a process is running or not, it will be straightforward to integrate this into monitoring the critical processes in SONiC. 
However, Monit only provides a method to monitor the resource usage on a per-process level not a per-container level. As such, monitoring the resource usage of a docker @@ -228,8 +229,10 @@ swss disabled admin@sonic:~$ sudo config container feature autorestart database enabled ``` +### 2.2.4 Resource Alerting -##### 2.2.3.1.3 CONTAINER_FEATURE Table + +### 2.2.5 CONTAINER_FEATURE Table Example: ``` { From d4c4fd48b51d91ce083b88b4a282a9796fb12e35 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Mon, 24 Feb 2020 18:11:45 -0800 Subject: [PATCH 39/57] [Doc-monitoring] Place the value of memory threshold in section 2.5. Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 98 +++++++++---------- 1 file changed, 47 insertions(+), 51 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 7c5deb1824..254aefb3f5 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -20,12 +20,11 @@ - [2.1 Target Deployment Use Cases](#21-target-deployment-use-cases) - [2.2 Functional Description](#22-functional-description) - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes) - - [2.2.2 Monitoring Critical Resource Usage](#222-monitoring-critical-resource-usage) - - [2.2.3 Auto-restart Docker Container](#223-auto-restart-docker-container) - - [2.2.3.1 CLI (and usage example)](#2231-cli-and-usage-example) - - [2.2.3.1.1 Show the Status of Auto-restart](#22311-show-the-status-of-auto-restart) - - [2.2.3.1.2 Configure the Status of Auto-restart](#22312-configure-the-status-of-auto-restart) - - [2.2.4 Resource Alerting](#224-resource-alerting) + - [2.2.2 Auto-restart Docker Container](#222-auto-restart-docker-container) + - [2.2.3 Monitoring Critical Resource Usage](#223-monitoring-critical-resource-usage) + - [2.2.4 CLI (and usage example)](#2231-cli-and-usage-example) + - [2.2.4.1 Show the Status of 
Auto-restart](#2241-show-the-status-of-auto-restart) + - [2.2.4.2 Configure the Status of Auto-restart](#2242-configure-the-status-of-auto-restart) - [2.2.5 CONTAINER_FEATURE Table](#225-container_feature-table) # List of Tables @@ -199,12 +198,12 @@ auto-restart feature for each docker container. Users can easily use CLI to check and configure the corresponding docker container status. -#### 2.2.3.1 CLI (and usage example) +### 2.2.4 CLI (and usage example) The CLI tool will provide the following functionality: 1. Show current status of auto-restart feature for docker containers. 2. Configure the status of a specific docker container. -##### 2.2.3.1.1 Show the Status of Auto-restart +#### 2.2.4.1 Show the Status of Auto-restart ``` admin@sonic:~$ show container feature autorestart Container Name Status @@ -224,14 +223,11 @@ syncd enabled swss disabled ``` -##### 2.2.3.1.2 Configure the Status of Auto-restart +#### 2.2.4.2 Configure the Status of Auto-restart ``` admin@sonic:~$ sudo config container feature autorestart database enabled ``` -### 2.2.4 Resource Alerting - - ### 2.2.5 CONTAINER_FEATURE Table Example: ``` @@ -239,81 +235,81 @@ Example: "CONTAINER_FEATURE": { "database": { "auto_restart": "enabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "157286400", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, "lldp": { "auto_restart": "disabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "104857600", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, "radv": { "auto_restart": "disabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "104857600", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, "pmon": { "auto_restart": "disabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "104857600", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, "sflow": { 
"auto_restart": "enabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "0", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, "snmp": { "auto_restart": "enabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "157286400", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, "telemetry": { "auto_restart": "enabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "0", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, "bgp": { "auto_restart": "disabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "314572800", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, "dhcp_relay": { "auto_restart": "disabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "104857600", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, "rest-api": { "auto_restart": "enabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "0", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, "teamd": { "auto_restart": "disabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "104857600", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, "syncd": { "auto_restart": "enabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "629145600", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, "swss": { "auto_restart": "disabled", - "high_mem_alert": "", - "high_cpu_alert": "", - "high_disk_alert": "" + "high_mem_alert": "157286400", + "high_cpu_alert": "0", + "high_disk_alert": "0" }, } } From f36f5ef6ca93a04b92a48cd9d5f82b0012962fd9 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Mon, 24 Feb 2020 18:27:51 -0800 Subject: [PATCH 40/57] [Doc-monitoring] Reorganize the sections 2.2.3 and 2.2.4. 
Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 44 ++++++++++--------- 1 file changed, 24 insertions(+), 20 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 254aefb3f5..fcb35776f3 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -22,7 +22,7 @@ - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes) - [2.2.2 Auto-restart Docker Container](#222-auto-restart-docker-container) - [2.2.3 Monitoring Critical Resource Usage](#223-monitoring-critical-resource-usage) - - [2.2.4 CLI (and usage example)](#2231-cli-and-usage-example) + - [2.2.4 CLI (and usage example)](#224-cli-and-usage-example) - [2.2.4.1 Show the Status of Auto-restart](#2241-show-the-status-of-auto-restart) - [2.2.4.2 Configure the Status of Auto-restart](#2242-configure-the-status-of-auto-restart) - [2.2.5 CONTAINER_FEATURE Table](#225-container_feature-table) @@ -155,25 +155,7 @@ check process lldpmgrd matching "python /usr/bin/lldpmgrd" if does not exit for 5 times within 5 cycles then alert ``` -### 2.2.2 Monitoring Critical Resource Usage -Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage -such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring -in the container level. Thus we propose a new design to achieve such monitoring based on Monit. -Specifically Monit will monitor a script and check its exit status. This script -will correspondingly read the resource usage of docker containers, compare it with -pre-defined threshold and then return a value. The value 0 signified that -the resource usage is less than threshold and non-zero means Monit will send an alert since -current usage is larger than threshold. 
- -Below is an example of Monit configuration file for lldp container to pass the pre-defined -threshold (bytes) to the script and check the exiting value. - -```bash -check program container_memory_lldp with path "/usr/bin/memory_checker lldp 104857600" - if status != 0 then alert -``` - -### 2.2.3 Auto-restart Docker Container +### 2.2.2 Auto-restart Docker Container The design principle behind this auto-restart feature is docker containers can be automatically shut down and restarted if one of critical processes running in the container exits unexpectedly. Restarting the entire container ensures that configuration is reloaded and all processes in the container @@ -197,6 +179,28 @@ named `CONTAINER_FEATURE` in Config_DB and this table includes the status of auto-restart feature for each docker container. Users can easily use CLI to check and configure the corresponding docker container status. +### 2.2.3 Monitoring Critical Resource Usage +Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage +such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring +in the container level. Thus we propose a new design to achieve such monitoring based on Monit. +Specifically Monit will monitor a script and check its exit status. This script +will correspondingly read the resource usage of docker containers, compare it with +pre-defined threshold and then return a value. The value 0 signified that +the resource usage is less than threshold and non-zero means Monit will send an alert since +current usage is larger than threshold. + +Below is an example of Monit configuration file for lldp container to pass the pre-defined +threshold (bytes) to the script and check the exiting value. + +```bash +check program container_memory_lldp with path "/usr/bin/memory_checker lldp 104857600" + if status != 0 then alert +``` + +We will employ similar mechanism for CPU and disk utilization. 
Currently the threshold of +memory usage for each docker container in CONTAINER_FEATURE table are decided after +we polled the memory usage of docker containers in 1970 production boxes. The value `0` +in table represents the corresponding feature in the docker container is in `disabled` status. ### 2.2.4 CLI (and usage example) The CLI tool will provide the following functionality: From 2edc3d4a79c2107ca39f1d90b43a62908c433f45 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Mon, 24 Feb 2020 18:39:32 -0800 Subject: [PATCH 41/57] [Doc-monitoring] Reword the section 2.2.2. Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index fcb35776f3..775503f541 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -197,9 +197,10 @@ check program container_memory_lldp with path "/usr/bin/memory_checker lldp 1048 if status != 0 then alert ``` -We will employ similar mechanism for CPU and disk utilization. Currently the threshold of +We will employ similar mechanism for CPU and disk utilization. Currently the thresholds of memory usage for each docker container in CONTAINER_FEATURE table are decided after -we polled the memory usage of docker containers in 1970 production boxes. The value `0` +we polled the memory usage of docker containers in 1970 production boxes. We also intend +to use same method to obtain the thresholds of CPU and disk usage for each dock container. The value `0` in table represents the corresponding feature in the docker container is in `disabled` status. 
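A minimal sketch of what a script such as `memory_checker` might do, assuming it takes the container name and a threshold in bytes as arguments, as in the Monit example above. The function names, the use of `docker stats`, and the parsing of its `MemUsage` field are illustrative assumptions, not the actual SONiC implementation.

```python
import subprocess
import sys

# Binary unit suffixes assumed to appear in `docker stats` output.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3, "TiB": 1024 ** 4}


def parse_mem_usage(mem_usage):
    """Convert the 'used' part of a string like '12.5MiB / 7.7GiB' to bytes."""
    used = mem_usage.split("/")[0].strip()
    # Try longer suffixes first so that 'MiB' is not mistaken for 'B'.
    for suffix in ("TiB", "GiB", "MiB", "KiB", "B"):
        if used.endswith(suffix):
            return int(float(used[: -len(suffix)]) * UNITS[suffix])
    raise ValueError("unrecognized memory usage: " + mem_usage)


def exit_status(usage_bytes, threshold_bytes):
    """Return 0 when no alert is needed, 1 when usage exceeds the threshold."""
    if threshold_bytes == 0:  # a threshold of 0 means the alert is disabled
        return 0
    return 1 if usage_bytes > threshold_bytes else 0


def read_container_memory(container):
    """Ask docker for the current memory usage of one container (assumed approach)."""
    out = subprocess.check_output(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", container],
        text=True,
    )
    return parse_mem_usage(out)


# Only act as a checker when called as: <script> <container> <threshold_bytes>
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(exit_status(read_container_memory(sys.argv[1]), int(sys.argv[2])))
```

Keeping the comparison in `exit_status` separate from the docker call means the same exit-code logic could be reused for CPU and disk checks, with only the reader function changing.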
### 2.2.4 CLI (and usage example) From d041d780401affee7d593a8ddf3329efa2b18100 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Mon, 24 Feb 2020 19:39:27 -0800 Subject: [PATCH 42/57] [Doc-monitoring] Reword in the section 2.2.4 Monitoring Critical Resource Usage. Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 775503f541..19f8a2f1d4 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -197,11 +197,10 @@ check program container_memory_lldp with path "/usr/bin/memory_checker lldp 1048 if status != 0 then alert ``` -We will employ similar mechanism for CPU and disk utilization. Currently the thresholds of -memory usage for each docker container in CONTAINER_FEATURE table are decided after -we polled the memory usage of docker containers in 1970 production boxes. We also intend -to use same method to obtain the thresholds of CPU and disk usage for each dock container. The value `0` -in table represents the corresponding feature in the docker container is in `disabled` status. +We will employ similar mechanism for CPU and disk utilization. Thresholds for each resource, +per container can be determined by the operator by examining averages of resource usage in +a production environment. The value `0` in table represents the corresponding feature in +the docker container is in `disabled` status. ### 2.2.4 CLI (and usage example) The CLI tool will provide the following functionality: From f94d019546c942145bd32d2aead237b94d54a4ab Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Sat, 7 Mar 2020 16:23:08 -0800 Subject: [PATCH 43/57] [Monitoring] Add a section to describe the relationship between auto-restart and warm re-boot. 
Add a paragraph to introduce how can we use Monit to monitor multiple processes with the same command. Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 19f8a2f1d4..36e7927487 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -14,6 +14,7 @@ - [1.3 Requirements](#13-requirements) - [1.3.1 Functional Requirements](#131-functional-requirements) - [1.3.2 Configuration and Management Requirements](#132-configuration-and-management-requirements) + - [1.3.3 Warm Reboot requirements](#133-warm-reboot-requirements) - [1.4 Design](#14-design) - [1.4.1 Basic Approach](#141-basic-approach) * [2 Functionality](#2-functionality) @@ -92,6 +93,10 @@ Configuration of these features can be done via: 1. config_db.json 2. CLI +### 1.3.3 Warm Reboot Requirements +When switch reboots in the warm boot mode, auto-restart feature must ensure that systemd +service is stopped explicitly such that it will not affect warm reboot functionality. + ## 1.4 Design ### 1.4.1 Basic Approach @@ -154,6 +159,15 @@ check process lldp_syncd matching "python2 -m lldp_syncd" check process lldpmgrd matching "python /usr/bin/lldpmgrd" if does not exit for 5 times within 5 cycles then alert ``` +However, Monit is unable to monitor multiple processes executing the same command but with +different arguments. For example, in teamd container, there are multiple teamd processes +running the same command ```/usr/bin/teamd``` but using different port channel as argument. +Since there exists 1:1 mapping between a port channel and a teamd process, we employ Monit to +monitor a script which retrieves all the port channels from Config_DB and then determine +whether there exists a teamd process in Linux for each port channel. 
If succeed, that means +all teamd processes are live. Otherwise, we will know at least teamd process exited unexpectedly +and then Monit will write an alert message into syslog. Similarly we can also use this method +to solve the issue in dhcp_relay container. ### 2.2.2 Auto-restart Docker Container The design principle behind this auto-restart feature is docker containers can be automatically shut down and From 02bd31c11787949900124e6e2249d1e1da6373ce Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Sat, 7 Mar 2020 16:28:08 -0800 Subject: [PATCH 44/57] [Monitoring] Add a word "same" in the last sentence of section 2.2.1 Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 36e7927487..34b1ff3d20 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -167,7 +167,7 @@ monitor a script which retrieves all the port channels from Config_DB and then d whether there exists a teamd process in Linux for each port channel. If succeed, that means all teamd processes are live. Otherwise, we will know at least teamd process exited unexpectedly and then Monit will write an alert message into syslog. Similarly we can also use this method -to solve the issue in dhcp_relay container. +to solve the same issue in dhcp_relay container. ### 2.2.2 Auto-restart Docker Container The design principle behind this auto-restart feature is docker containers can be automatically shut down and From fa20bea252c53886fa6462bbe9c576e457297ec1 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Mon, 9 Mar 2020 00:00:31 -0700 Subject: [PATCH 45/57] [Monitrong] Reword the section 1.3.3. 
Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 34b1ff3d20..d18d58ba4b 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -14,7 +14,7 @@ - [1.3 Requirements](#13-requirements) - [1.3.1 Functional Requirements](#131-functional-requirements) - [1.3.2 Configuration and Management Requirements](#132-configuration-and-management-requirements) - - [1.3.3 Warm Reboot requirements](#133-warm-reboot-requirements) + - [1.3.3 Fast-Reboot/Warm-Reboot requirements](#133-fast-reboot-warm-reboot-requirements) - [1.4 Design](#14-design) - [1.4.1 Basic Approach](#141-basic-approach) * [2 Functionality](#2-functionality) @@ -93,9 +93,16 @@ Configuration of these features can be done via: 1. config_db.json 2. CLI -### 1.3.3 Warm Reboot Requirements -When switch reboots in the warm boot mode, auto-restart feature must ensure that systemd -service is stopped explicitly such that it will not affect warm reboot functionality. +### 1.3.3 Fast-Reboot/Warm-Reboot Requirements +During the fast-reboot/warm-reboot/warm-restart procedures in SONiC, a select number of processes +and the containers they reside in are stopped in a special manner (via a signals or similar). +In this situation, we need ensure these containers remain stopped until the fast-reboot/warm-reboot/warm-restart +procedure is complete. Therefore, in order to prevent the auto-restart mechanism from restarting +the containers prematurely, it is the responsibility of the fast-reboot/warm-reboot/warm-restart +procedure to explicitly stop the systemd service which manages the container immediately after stopping +and critical processes/container. Once the systemd service is explicitly stopped, it will not attempt +to automatically restart the container. 
+ ## 1.4 Design From e4d9a8d3fa0793ac194607f44b47358c77891cae Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Mon, 9 Mar 2020 00:05:03 -0700 Subject: [PATCH 46/57] [Monitoring] Correct a commection symbol. Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index d18d58ba4b..2e689e752c 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -14,7 +14,7 @@ - [1.3 Requirements](#13-requirements) - [1.3.1 Functional Requirements](#131-functional-requirements) - [1.3.2 Configuration and Management Requirements](#132-configuration-and-management-requirements) - - [1.3.3 Fast-Reboot/Warm-Reboot requirements](#133-fast-reboot-warm-reboot-requirements) + - [1.3.3 Fast-Reboot/Warm-Reboot requirements](#133-fast-reboot/warm-reboot-requirements) - [1.4 Design](#14-design) - [1.4.1 Basic Approach](#141-basic-approach) * [2 Functionality](#2-functionality) From 7c917c0663ff6f4b088ad7ce3e7c2b414782474a Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Mon, 9 Mar 2020 00:06:50 -0700 Subject: [PATCH 47/57] [Monitoring] Fix a error for connection symbol. 
Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 2e689e752c..0041f1b79c 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -14,7 +14,7 @@ - [1.3 Requirements](#13-requirements) - [1.3.1 Functional Requirements](#131-functional-requirements) - [1.3.2 Configuration and Management Requirements](#132-configuration-and-management-requirements) - - [1.3.3 Fast-Reboot/Warm-Reboot requirements](#133-fast-reboot/warm-reboot-requirements) + - [1.3.3 Fast-Reboot/Warm-Reboot requirements](#133-fast-rebootwarm-reboot-requirements) - [1.4 Design](#14-design) - [1.4.1 Basic Approach](#141-basic-approach) * [2 Functionality](#2-functionality) From 3fad48fd340a3193456c9312a8500d1df1ea1c9c Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Tue, 10 Mar 2020 09:02:59 -0700 Subject: [PATCH 48/57] [Monitoring] Swap the location of section 2.2.2 and section 2.2.3. 
Signed-off-by: Yong Zhao --- doc/monitoring_containers/:w | 341 ++++++++++++++++++ .../monitoring_containers.md | 53 +-- 2 files changed, 368 insertions(+), 26 deletions(-) create mode 100644 doc/monitoring_containers/:w diff --git a/doc/monitoring_containers/:w b/doc/monitoring_containers/:w new file mode 100644 index 0000000000..d18d58ba4b --- /dev/null +++ b/doc/monitoring_containers/:w @@ -0,0 +1,341 @@ +# Monitoring and Auto-Mitigating Unhealthy Containers in SONiC + +# High Level Design Document +#### Rev 0.1 + +# Table of Contents +* [List of Tables](#list-of-tables) +* [Revision](#revision) +* [Scope](#scope) +* [Definitions/Abbreviation](#definitionsabbreviation) +* [1 Feature Overview](#1-feature-overview) + - [1.1 Monitoring](#11-monitoring) + - [1.2 Auto-Mitigating](#12-auto-mitigating) + - [1.3 Requirements](#13-requirements) + - [1.3.1 Functional Requirements](#131-functional-requirements) + - [1.3.2 Configuration and Management Requirements](#132-configuration-and-management-requirements) + - [1.3.3 Fast-Reboot/Warm-Reboot Requirements](#133-fast-rebootwarm-reboot-requirements) + - [1.4 Design](#14-design) + - [1.4.1 Basic Approach](#141-basic-approach) +* [2 Functionality](#2-functionality) + - [2.1 Target Deployment Use Cases](#21-target-deployment-use-cases) + - [2.2 Functional Description](#22-functional-description) + - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes) + - [2.2.2 Auto-restart Docker Container](#222-auto-restart-docker-container) + - [2.2.3 Monitoring Critical Resource Usage](#223-monitoring-critical-resource-usage) + - [2.2.4 CLI (and usage example)](#224-cli-and-usage-example) + - [2.2.4.1 Show the Status of Auto-restart](#2241-show-the-status-of-auto-restart) + - [2.2.4.2 Configure the Status of Auto-restart](#2242-configure-the-status-of-auto-restart) + - [2.2.5 CONTAINER_FEATURE Table](#225-container_feature-table) + +# List of Tables +* [Table 1: Abbreviations](#definitionsabbreviation) + +# Revision
+| Rev | Date | Author | Change Description | +|:---:|:----------:|:----------------------:|---------------------------| +| 0.1 | 02/18/2020 | Yong Zhao, Joe Leveque | Initial version | + +# Scope +This document describes the high level design of features to monitor and auto-mitigate +unhealthy containers in SONiC. + +# Definitions/Abbreviation +| Abbreviation | Description | +|--------------|------------------------------| +| Config DB | SONiC Configuration Database | +| CLI | Command Line Interface | + +# 1 Feature Overview +SONiC is a collection of switch applications which are held in docker containers, +such as the BGP container and the SNMP container. Each application usually includes several processes which +work together to provide services to, and receive services from, other modules. As such, the health of +the critical processes in each docker container is imperative not only for that docker +container to work correctly but also for the intended functionality of the entire SONiC switch. + +## 1.1 Monitoring +This feature monitors the running status of critical processes and the critical resource +usage, such as CPU, memory and disk, of each docker container. + +We use the Monit system tool to detect whether a critical process is running and whether +the resource usage of a docker container exceeds the pre-defined threshold. + +## 1.2 Auto-Mitigating +This feature allows a docker container to be automatically shut down and +restarted if one of the critical processes running in it exits unexpectedly. Restarting +the entire docker container ensures that configuration is reloaded and all processes in +the docker container get restarted, thus increasing the likelihood of entering a healthy state. + +We leverage the 'event listener' mechanism in supervisord to auto-restart a docker container +if one of its critical processes exits unexpectedly. We also add a configuration option to make this +auto-restart feature dynamically configurable.
Specifically, users can use the CLI to set this +feature to enabled or disabled in Config_DB. + +## 1.3 Requirements + +### 1.3.1 Functional Requirements +1. Monit must provide the ability to generate an alert when a critical process has not + been alive for 5 minutes. +2. Monit must provide the ability to generate an alert when the resource usage of + a docker container is larger than the pre-defined threshold. +3. The event listener in supervisord must receive the signal when a critical process in + a docker container crashes or exits unexpectedly and then restart this docker + container. +4. Config_DB can be configured to enable/disable this auto-restart feature for each docker + container. +5. Users can access the status of the auto-restart feature via the CLI utility: + 1. Users can see the current auto-restart status for docker containers. + 2. Users can configure the auto-restart status for a specific docker container. + +### 1.3.2 Configuration and Management Requirements +Via the init_cfg.json file, these container features are disabled by default. +Configuration of these features can be done via: +1. config_db.json +2. CLI + +### 1.3.3 Fast-Reboot/Warm-Reboot Requirements +During the fast-reboot/warm-reboot/warm-restart procedures in SONiC, a select number of processes +and the containers they reside in are stopped in a special manner (via signals or similar). +In this situation, we need to ensure these containers remain stopped until the fast-reboot/warm-reboot/warm-restart +procedure is complete. Therefore, in order to prevent the auto-restart mechanism from restarting +the containers prematurely, it is the responsibility of the fast-reboot/warm-reboot/warm-restart +procedure to explicitly stop the systemd service which manages the container immediately after stopping +the critical processes/container. Once the systemd service is explicitly stopped, it will not attempt +to automatically restart the container.
+ + +## 1.4 Design + +### 1.4.1 Basic Approach +Monitoring the running status of critical processes and resource usage of docker containers +depends on the Monit system tool. Since Monit natively provides a mechanism +to check whether a process is running or not, it will be straightforward to integrate this into monitoring +the critical processes in SONiC. However, Monit only provides a method to monitor the resource +usage on a per-process level not a per-container level. As such, monitoring the resource usage of a docker +container is not as straightforward. In our design, we propose to utilize the mechanism with +which Monit can spawn a process and check the return value of the process. We will have Monit +launch a script which reads the resource usage of the container and compares the resource usage +with a configured threshold value for that container. If the current resource usage is less than +the configured threshold value, the script will return 0 and Monit will not log a message. +However, if the resource usage exceeds the threshold, the script will return a non-zero value +and Monit will log an alert message to the syslog. + +We employed the 'event listener' mechanism in supervisord to achieve auto-restarting docker +containers. We configure our event listener to listen for process exit events. When a supervised +process exits, supervisord will pass the event to our custom event listener. The event listener +determines if the process is a critical process and whether it exited unexpectedly. If both of +these conditions are true, the event listener will kill the supervisord process. Since supervisord +runs as PID 1 inside the containers, when supervisord exits, the container will stop. When the +container stops, the systemd service which manages the container will also stop, but it is +configured to automatically restart the service, thus it will restart the container. 
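The event listener flow described above can be sketched as follows. This is a minimal illustration, assuming the standard supervisord event listener protocol (`READY`, a header line, a payload of `len` characters, then `RESULT 2\nOK`); the `CRITICAL_PROCESSES` set and all function names are hypothetical and not taken from the SONiC source.

```python
import os
import signal
import sys

# Hypothetical list of critical processes for one container.
CRITICAL_PROCESSES = {"lldpd", "lldp_syncd", "lldpmgrd"}


def parse_tokens(line):
    """Parse a 'key:value key:value' line into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def should_restart(headers, payload, critical):
    """True when a critical process exited unexpectedly."""
    return (
        headers.get("eventname") == "PROCESS_STATE_EXITED"
        and payload.get("processname") in critical
        and payload.get("expected") == "0"  # "0" marks an unexpected exit
    )


def listen(stdin=sys.stdin, stdout=sys.stdout):
    """Serve supervisord events until stdin is closed."""
    while True:
        stdout.write("READY\n")  # tell supervisord we can accept an event
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:  # EOF: supervisord went away
            return
        headers = parse_tokens(header_line)
        # Payloads are ASCII, so the advertised length equals the character count.
        payload = parse_tokens(stdin.read(int(headers["len"])))
        if should_restart(headers, payload, CRITICAL_PROCESSES):
            # The listener is a child of supervisord; stopping the parent
            # stops the container, and systemd then restarts it.
            os.kill(os.getppid(), signal.SIGTERM)
        stdout.write("RESULT 2\nOK")  # acknowledge the event
        stdout.flush()
```

A listener script built this way would end by calling `listen()` and be registered in an `[eventlistener:...]` section of the supervisord configuration with `events=PROCESS_STATE_EXITED`.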
+ +# 2 Functionality +## 2.1 Target Deployment Use Cases +These two features are used to perform the following functions: +1. Monit will write an alert message into syslog if one of the critical processes has not been + alive for 5 minutes. +2. Monit will write an alert message into syslog if the memory usage of a docker container is larger than the + pre-defined threshold. +3. A docker container will auto-restart if one of its critical processes crashes or exits + unexpectedly. + +## 2.2 Functional Description + + +### 2.2.1 Monitoring Critical Processes +Monit natively implements a mechanism to monitor whether a process is running or not. In detail, +Monit will periodically read the target processes from its configuration file and try to match +those processes against the process tree in the Linux kernel. + +Below is an example of a Monit configuration file to monitor the critical processes in the lldp +container. + +*/etc/monit/conf.d/monit_lldp* +```bash +############################################################################### +# Monit configuration file for lldp container +# Process list: +# lldpd +# lldp_syncd +# lldpmgrd +############################################################################### +check process lldp_monitor matching "lldpd: " + if does not exist for 5 times within 5 cycles then alert +check process lldp_syncd matching "python2 -m lldp_syncd" + if does not exist for 5 times within 5 cycles then alert +check process lldpmgrd matching "python /usr/bin/lldpmgrd" + if does not exist for 5 times within 5 cycles then alert +``` +However, Monit is unable to monitor multiple processes executing the same command but with +different arguments. For example, in the teamd container, there are multiple teamd processes +running the same command ```/usr/bin/teamd```, each using a different port channel as its argument.
+Since there is a 1:1 mapping between a port channel and a teamd process, we employ Monit to +monitor a script which retrieves all the port channels from Config_DB and then determines +whether a teamd process exists in Linux for each port channel. If the check succeeds, +all teamd processes are alive. Otherwise, at least one teamd process has exited unexpectedly +and Monit will write an alert message into syslog. We can use the same method +to solve the same issue in the dhcp_relay container. + +### 2.2.2 Auto-restart Docker Container +The design principle behind this auto-restart feature is that a docker container can be automatically shut down and +restarted if one of the critical processes running in the container exits unexpectedly. Restarting +the entire container ensures that configuration is reloaded and all processes in the container +get restarted, thus increasing the likelihood of entering a healthy state. + +Currently SONiC uses the supervisord system tool to manage the processes in each +docker container. Auto-restarting a docker container is based on supervisord's process +monitoring/notification framework. Specifically, +if the state of a process changes, for example from running to exited, +an event notification `PROCESS_STATE_EXITED` will be emitted by supervisord. +This event will be received by the event listener. The event listener determines if the process is +a critical process and whether it exited unexpectedly. If both of +these conditions are true, the event listener will kill the supervisord process. Since supervisord +runs as PID 1 inside the containers, when supervisord exits, the container will stop. When the +container stops, the systemd service which manages the container will also stop, but it is +configured to automatically restart the service, thus it will restart the container. + +We also introduced a configuration option which can enable or disable this auto-restart feature +dynamically according to the requirements of users.
In detail, we created a table +named `CONTAINER_FEATURE` in Config_DB, which holds the status of the +auto-restart feature for each docker container. Users can use the CLI to +check and configure the auto-restart status of each docker container. + +### 2.2.3 Monitoring Critical Resource Usage +Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage +such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring +at the container level. Thus we propose a new design to achieve such monitoring based on Monit. +Specifically, Monit will run a script and check its exit status. This script +reads the resource usage of a docker container, compares it with the +pre-defined threshold and then returns a value. The value 0 signifies that +the resource usage is less than the threshold; a non-zero value means the current usage is larger than the threshold +and Monit will send an alert. + +Below is an example of a Monit configuration file for the lldp container which passes the pre-defined +threshold (bytes) to the script and checks the exit status. + +```bash +check program container_memory_lldp with path "/usr/bin/memory_checker lldp 104857600" + if status != 0 then alert +``` + +We will employ a similar mechanism for CPU and disk utilization. Thresholds for each resource, +per container, can be determined by the operator by examining averages of resource usage in +a production environment. A value of `0` in the `CONTAINER_FEATURE` table means the corresponding feature of +the docker container is in `disabled` status. + +### 2.2.4 CLI (and usage example) +The CLI tool will provide the following functionality: +1. Show the current status of the auto-restart feature for docker containers. +2. Configure the auto-restart status of a specific docker container.
+ +#### 2.2.4.1 Show the Status of Auto-restart +``` +admin@sonic:~$ show container feature autorestart +Container Name Status +-------------------- -------- +database disabled +lldp disabled +radv disabled +pmon disabled +sflow enabled +snmp enabled +telemetry enabled +bgp disabled +dhcp_relay disabled +rest-api enabled +teamd disabled +syncd enabled +swss disabled +``` + +#### 2.2.4.2 Configure the Status of Auto-restart +``` +admin@sonic:~$ sudo config container feature autorestart database enabled +``` + +### 2.2.5 CONTAINER_FEATURE Table +Example: +``` +{ + "CONTAINER_FEATURE": { + "database": { + "auto_restart": "enabled", + "high_mem_alert": "157286400", + "high_cpu_alert": "0", + "high_disk_alert": "0" + }, + "lldp": { + "auto_restart": "disabled", + "high_mem_alert": "104857600", + "high_cpu_alert": "0", + "high_disk_alert": "0" + }, + "radv": { + "auto_restart": "disabled", + "high_mem_alert": "104857600", + "high_cpu_alert": "0", + "high_disk_alert": "0" + }, + "pmon": { + "auto_restart": "disabled", + "high_mem_alert": "104857600", + "high_cpu_alert": "0", + "high_disk_alert": "0" + }, + "sflow": { + "auto_restart": "enabled", + "high_mem_alert": "0", + "high_cpu_alert": "0", + "high_disk_alert": "0" + }, + "snmp": { + "auto_restart": "enabled", + "high_mem_alert": "157286400", + "high_cpu_alert": "0", + "high_disk_alert": "0" + }, + "telemetry": { + "auto_restart": "enabled", + "high_mem_alert": "0", + "high_cpu_alert": "0", + "high_disk_alert": "0" + }, + "bgp": { + "auto_restart": "disabled", + "high_mem_alert": "314572800", + "high_cpu_alert": "0", + "high_disk_alert": "0" + }, + "dhcp_relay": { + "auto_restart": "disabled", + "high_mem_alert": "104857600", + "high_cpu_alert": "0", + "high_disk_alert": "0" + }, + "rest-api": { + "auto_restart": "enabled", + "high_mem_alert": "0", + "high_cpu_alert": "0", + "high_disk_alert": "0" + }, + "teamd": { + "auto_restart": "disabled", + "high_mem_alert": "104857600", + "high_cpu_alert": "0", + 
"high_disk_alert": "0" + }, + "syncd": { + "auto_restart": "enabled", + "high_mem_alert": "629145600", + "high_cpu_alert": "0", + "high_disk_alert": "0" + }, + "swss": { + "auto_restart": "disabled", + "high_mem_alert": "157286400", + "high_cpu_alert": "0", + "high_disk_alert": "0" + }, + } +} +``` diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 0041f1b79c..0a7a4595df 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -21,8 +21,8 @@ - [2.1 Target Deployment Use Cases](#21-target-deployment-use-cases) - [2.2 Functional Description](#22-functional-description) - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes) - - [2.2.2 Auto-restart Docker Container](#222-auto-restart-docker-container) - - [2.2.3 Monitoring Critical Resource Usage](#223-monitoring-critical-resource-usage) + - [2.2.2 Monitoring Critical Resource Usage](#222-monitoring-critical-resource-usage) + - [2.2.3 Auto-restart Docker Container](#223-auto-restart-docker-container) - [2.2.4 CLI (and usage example)](#224-cli-and-usage-example) - [2.2.4.1 Show the Status of Auto-restart](#2241-show-the-status-of-auto-restart) - [2.2.4.2 Configure the Status of Auto-restart](#2242-configure-the-status-of-auto-restart) @@ -176,7 +176,31 @@ all teamd processes are live. Otherwise, we will know at least teamd process exi and then Monit will write an alert message into syslog. Similarly we can also use this method to solve the same issue in dhcp_relay container. -### 2.2.2 Auto-restart Docker Container +### 2.2.2 Monitoring Critical Resource Usage +Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage +such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring +in the container level. Thus we propose a new design to achieve such monitoring based on Monit. 
+Specifically, Monit will run a script and check its exit status. This script +reads the resource usage of a docker container, compares it with the +pre-defined threshold and then returns a value. The value 0 signifies that +the resource usage is less than the threshold; a non-zero value means the current usage is larger than the threshold +and Monit will send an alert. + +Below is an example of a Monit configuration file for the lldp container which passes the pre-defined +threshold (bytes) to the script and checks the exit status. + +```bash +check program container_memory_lldp with path "/usr/bin/memory_checker lldp 104857600" + if status != 0 then alert +``` + +We will employ a similar mechanism for CPU and disk utilization. Thresholds for each resource, +per container, can be determined by the operator by examining averages of resource usage in +a production environment. A value of `0` in the `CONTAINER_FEATURE` table means the corresponding feature of +the docker container is in `disabled` status. + + +### 2.2.3 Auto-restart Docker Container The design principle behind this auto-restart feature is docker containers can be automatically shut down and restarted if one of critical processes running in the container exits unexpectedly. Restarting the entire container ensures that configuration is reloaded and all processes in the container @@ -200,29 +224,6 @@ named `CONTAINER_FEATURE` in Config_DB and this table includes the status of auto-restart feature for each docker container. Users can easily use CLI to check and configure the corresponding docker container status. -### 2.2.3 Monitoring Critical Resource Usage -Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage -such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring -in the container level. Thus we propose a new design to achieve such monitoring based on Monit. -Specifically Monit will monitor a script and check its exit status.
This script -will correspondingly read the resource usage of docker containers, compare it with -pre-defined threshold and then return a value. The value 0 signified that -the resource usage is less than threshold and non-zero means Monit will send an alert since -current usage is larger than threshold. - -Below is an example of Monit configuration file for lldp container to pass the pre-defined -threshold (bytes) to the script and check the exiting value. - -```bash -check program container_memory_lldp with path "/usr/bin/memory_checker lldp 104857600" - if status != 0 then alert -``` - -We will employ similar mechanism for CPU and disk utilization. Thresholds for each resource, -per container can be determined by the operator by examining averages of resource usage in -a production environment. The value `0` in table represents the corresponding feature in -the docker container is in `disabled` status. - ### 2.2.4 CLI (and usage example) The CLI tool will provide the following functionality: 1. Show current status of auto-restart feature for docker containers. From eb3043204baa1f22f92a017f810697f81a296f44 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Tue, 10 Mar 2020 09:06:47 -0700 Subject: [PATCH 49/57] [Monitoring] Correct a typo. 
Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 0a7a4595df..fddec1542a 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -22,7 +22,7 @@ - [2.2 Functional Description](#22-functional-description) - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes) - [2.2.2 Monitoring Critical Resource Usage](#222-monitoring-critical-resource-usage) - - [2.2.3 Auto-restart Docker Container](#223-auto-restart-docker-container) + - [2.2.3 Auto-restart Docker Container](#223-auto-restart-docker-container) - [2.2.4 CLI (and usage example)](#224-cli-and-usage-example) - [2.2.4.1 Show the Status of Auto-restart](#2241-show-the-status-of-auto-restart) - [2.2.4.2 Configure the Status of Auto-restart](#2242-configure-the-status-of-auto-restart) From a84bfdf8ad5d8685965b413bac93859b7dee3d33 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Tue, 10 Mar 2020 09:10:31 -0700 Subject: [PATCH 50/57] [Monitoring] Delete an extra space. 
Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index fddec1542a..0a7a4595df 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -22,7 +22,7 @@ - [2.2 Functional Description](#22-functional-description) - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes) - [2.2.2 Monitoring Critical Resource Usage](#222-monitoring-critical-resource-usage) - - [2.2.3 Auto-restart Docker Container](#223-auto-restart-docker-container) + - [2.2.3 Auto-restart Docker Container](#223-auto-restart-docker-container) - [2.2.4 CLI (and usage example)](#224-cli-and-usage-example) - [2.2.4.1 Show the Status of Auto-restart](#2241-show-the-status-of-auto-restart) - [2.2.4.2 Configure the Status of Auto-restart](#2242-configure-the-status-of-auto-restart) From 8a908c2d84f7d58cdaaea2825df13bfda9b73296 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Tue, 10 Mar 2020 09:20:40 -0700 Subject: [PATCH 51/57] [Monitoring] Delete the file which is added mistakenly. 
Signed-off-by: Yong Zhao --- doc/monitoring_containers/:w | 341 ----------------------------------- 1 file changed, 341 deletions(-) delete mode 100644 doc/monitoring_containers/:w diff --git a/doc/monitoring_containers/:w b/doc/monitoring_containers/:w deleted file mode 100644 index d18d58ba4b..0000000000 --- a/doc/monitoring_containers/:w +++ /dev/null @@ -1,341 +0,0 @@ -# Monitoring and Auto-Mitigating Unhealthy Containers in SONiC - -# High Level Design Document -#### Rev 0.1 - -# Table of Contents -* [List of Tables](#list-of-tables) -* [Revision](#revision) -* [Scope](#scope) -* [Defintions/Abbreviation](#definitionsabbreviation) -* [1 Feature Overview](#1-feature-overview) - - [1.1 Monitoring](#11-monitoring) - - [1.2 Auto-mitigating](#12-auto-mitigating) - - [1.3 Requirements](#13-requirements) - - [1.3.1 Functional Requirements](#131-functional-requirements) - - [1.3.2 Configuration and Management Requirements](#132-configuration-and-management-requirements) - - [1.3.3 Fast-Reboot/Warm-Reboot requirements](#133-fast-reboot-warm-reboot-requirements) - - [1.4 Design](#14-design) - - [1.4.1 Basic Approach](#141-basic-approach) -* [2 Functionality](#2-functionality) - - [2.1 Target Deployment Use Cases](#21-target-deployment-use-cases) - - [2.2 Functional Description](#22-functional-description) - - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes) - - [2.2.2 Auto-restart Docker Container](#222-auto-restart-docker-container) - - [2.2.3 Monitoring Critical Resource Usage](#223-monitoring-critical-resource-usage) - - [2.2.4 CLI (and usage example)](#224-cli-and-usage-example) - - [2.2.4.1 Show the Status of Auto-restart](#2241-show-the-status-of-auto-restart) - - [2.2.4.2 Configure the Status of Auto-restart](#2242-configure-the-status-of-auto-restart) - - [2.2.5 CONTAINER_FEATURE Table](#225-container_feature-table) - -# List of Tables -* [Table 1: Abbreviations](#definitionsabbreviation) - -# Revision -| Rev | Date | Author | Change 
Description | -|:---:|:----------:|:----------------------:|---------------------------| -| 0.1 | 02/18/2020 | Yong Zhao, Joe Leveque | Initial version | - -# Scope -This document describes the high level design of features to monitor and auto-mitigate -the unhealthy containers in SONiC. - -# Definitions/Abbreviation -| Abbreviation | Description | -|--------------|------------------------------| -| Config DB | SONiC Configuration Database | -| CLI | Command Line Interface | - -# 1 Feature Overview -SONiC is a collection of various switch applications which are held in docker containers -such as BGP container and SNMP container. Each application usually includes several processes which are -working together to provide and receive the services from other modules. As such, the health of -critical processes in each docker container is imperative not only for the docker -container working correctly but also for the intended functionalities of entire SONiC switch. - -## 1.1 Monitoring -This feature is used to monitor the running status of critical processes and critical resource -usage such as CPU, memory and disk of each docker container. - -We used Monit system tool to detect whether a critical process is running or not and whether -the resource usage of a docker container is beyond the pre-defined threshold. - -## 1.2 Auto-Mitigating -This feature demonstrated docker container can be automatically shut down and -restarted if one of critical processes running in docker container exits unexpectedly. Restarting -the entire docker container ensures that configuration is reloaded and all processes in -docker container get restarted, thus increasing the likelihood of entering a healthy state. - -We leveraged the 'event listener' mechanism in supervisord to auto-restart a docker container -if one of its critical processes exited unexpectedly. We also added a configuration option to make this -auto-restart feature dynamically configurable. 
Specifically users can run CLI to configure this -feature residing in Config_DB as enabled/disabled status. - -## 1.3 Requirements - -### 1.3.1 Functional Requirements -1. Monit must provide the ability to generate an alert when a critical process has not - been alive for 5 minutes. -2. Monit must provide the ability to generate an alert when the resource usage of - a docker container is larger than the pre-defined threshold. -3. The event listener in supervisord must receive the signal when a critical process in - a docker container crashed or exited unexpectedly and then restart this docker - container. -4. CONFIG_DB can be configured to enable/disable this auto-restart feature for each docker - container.. -5. Users can access the status of auto-restart feature via the CLI utility - 1. Users can see current auto-restart status for docker containers. - 2. Users can configure auto-restart status for a specific docker container. - -### 1.3.2 Configuration and Management Requirements -Via the init_cfg.json file, these container features are disabled by default. -Configuration of these features can be done via: -1. config_db.json -2. CLI - -### 1.3.3 Fast-Reboot/Warm-Reboot Requirements -During the fast-reboot/warm-reboot/warm-restart procedures in SONiC, a select number of processes -and the containers they reside in are stopped in a special manner (via a signals or similar). -In this situation, we need ensure these containers remain stopped until the fast-reboot/warm-reboot/warm-restart -procedure is complete. Therefore, in order to prevent the auto-restart mechanism from restarting -the containers prematurely, it is the responsibility of the fast-reboot/warm-reboot/warm-restart -procedure to explicitly stop the systemd service which manages the container immediately after stopping -and critical processes/container. Once the systemd service is explicitly stopped, it will not attempt -to automatically restart the container. 
- - -## 1.4 Design - -### 1.4.1 Basic Approach -Monitoring the running status of critical processes and resource usage of docker containers -depends on the Monit system tool. Since Monit natively provides a mechanism -to check whether a process is running or not, it will be straightforward to integrate this into monitoring -the critical processes in SONiC. However, Monit only provides a method to monitor the resource -usage on a per-process level not a per-container level. As such, monitoring the resource usage of a docker -container is not as straightforward. In our design, we propose to utilize the mechanism with -which Monit can spawn a process and check the return value of the process. We will have Monit -launch a script which reads the resource usage of the container and compares the resource usage -with a configured threshold value for that container. If the current resource usage is less than -the configured threshold value, the script will return 0 and Monit will not log a message. -However, if the resource usage exceeds the threshold, the script will return a non-zero value -and Monit will log an alert message to the syslog. - -We employed the 'event listener' mechanism in supervisord to achieve auto-restarting docker -containers. We configure our event listener to listen for process exit events. When a supervised -process exits, supervisord will pass the event to our custom event listener. The event listener -determines if the process is a critical process and whether it exited unexpectedly. If both of -these conditions are true, the event listener will kill the supervisord process. Since supervisord -runs as PID 1 inside the containers, when supervisord exits, the container will stop. When the -container stops, the systemd service which manages the container will also stop, but it is -configured to automatically restart the service, thus it will restart the container. 
- -# 2 Functionality -## 2.1 Target Deployment Use Cases -These two features are used to perform the following functions: -1. Monit will write an alert message into syslog if one if critical process has not been - alive for 5 minutes. -2. Monit will write an alert message into syslog if the usage of memory is larger than the - pre-defined threshold for a docker container. -3. A docker container will auto-restart if one of its critical processes crashed or exited - unexpectedly. - -## 2.2 Functional Description - - -### 2.2.1 Monitoring Critical Processes -Monit natively implements a mechanism to monitor whether a process is running or not. In detail, -Monit will periodically read the target processes from configuration file and try to match -those process with the processes tree in Linux kernel. - -Below is an example of Monit configuration file to monitor the critical processes in lldp -container. - -*/etc/monit/conf.d/monit_lldp* -```bash -############################################################################### -# Monit configuration file for lldp container -# Process list: -# lldpd -# lldp_syncd -# lldpmgrd -############################################################################### -check process lldp_monitor matching "lldpd: " - if does not exit for 5 times within 5 cycles then alert -check process lldp_syncd matching "python2 -m lldp_syncd" - if does not exit for 5 times within 5 cycles then alert -check process lldpmgrd matching "python /usr/bin/lldpmgrd" - if does not exit for 5 times within 5 cycles then alert -``` -However, Monit is unable to monitor multiple processes executing the same command but with -different arguments. For example, in teamd container, there are multiple teamd processes -running the same command ```/usr/bin/teamd``` but using different port channel as argument. 
-Since there exists 1:1 mapping between a port channel and a teamd process, we employ Monit to -monitor a script which retrieves all the port channels from Config_DB and then determine -whether there exists a teamd process in Linux for each port channel. If succeed, that means -all teamd processes are live. Otherwise, we will know at least teamd process exited unexpectedly -and then Monit will write an alert message into syslog. Similarly we can also use this method -to solve the same issue in dhcp_relay container. - -### 2.2.2 Auto-restart Docker Container -The design principle behind this auto-restart feature is docker containers can be automatically shut down and -restarted if one of critical processes running in the container exits unexpectedly. Restarting -the entire container ensures that configuration is reloaded and all processes in the container -get restarted, thus increasing the likelihood of entering a healthy state. - -Currently SONiC used supervisord system tool to manage the processes in each -docker container. Actually auto-restarting docker container is based on the process -monitoring/notification framework. Specifically -if the state of process changes for example from running to exited, -an event notification `PROCESS_STATE_EXITED` will be emitted by supervisord. -This event will be received by event listener. The event listener determines if the process is -critical process and whether it exited unexpectedly. If both of -these conditions are true, the event listener will kill the supervisord process. Since supervisord -runs as PID 1 inside the containers, when supervisord exits, the container will stop. When the -container stops, the systemd service which manages the container will also stop, but it is -configured to automatically restart the service, thus it will restart the container. - -We also introduced a configuration option which can enable or disable this auto-restart feature -dynamically according to the requirement of users. 
In detail, we created a table -named `CONTAINER_FEATURE` in Config_DB and this table includes the status of -auto-restart feature for each docker container. Users can easily use CLI to -check and configure the corresponding docker container status. - -### 2.2.3 Monitoring Critical Resource Usage -Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage -such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring -in the container level. Thus we propose a new design to achieve such monitoring based on Monit. -Specifically Monit will monitor a script and check its exit status. This script -will correspondingly read the resource usage of docker containers, compare it with -pre-defined threshold and then return a value. The value 0 signified that -the resource usage is less than threshold and non-zero means Monit will send an alert since -current usage is larger than threshold. - -Below is an example of Monit configuration file for lldp container to pass the pre-defined -threshold (bytes) to the script and check the exiting value. - -```bash -check program container_memory_lldp with path "/usr/bin/memory_checker lldp 104857600" - if status != 0 then alert -``` - -We will employ similar mechanism for CPU and disk utilization. Thresholds for each resource, -per container can be determined by the operator by examining averages of resource usage in -a production environment. The value `0` in table represents the corresponding feature in -the docker container is in `disabled` status. - -### 2.2.4 CLI (and usage example) -The CLI tool will provide the following functionality: -1. Show current status of auto-restart feature for docker containers. -2. Configure the status of a specific docker container. 
- -#### 2.2.4.1 Show the Status of Auto-restart -``` -admin@sonic:~$ show container feature autorestart -Container Name Status --------------------- -------- -database disabled -lldp disabled -radv disabled -pmon disabled -sflow enabled -snmp enabled -telemetry enabled -bgp disabled -dhcp_relay disabled -rest-api enabled -teamd disabled -syncd enabled -swss disabled -``` - -#### 2.2.4.2 Configure the Status of Auto-restart -``` -admin@sonic:~$ sudo config container feature autorestart database enabled -``` - -### 2.2.5 CONTAINER_FEATURE Table -Example: -``` -{ - "CONTAINER_FEATURE": { - "database": { - "auto_restart": "enabled", - "high_mem_alert": "157286400", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "lldp": { - "auto_restart": "disabled", - "high_mem_alert": "104857600", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "radv": { - "auto_restart": "disabled", - "high_mem_alert": "104857600", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "pmon": { - "auto_restart": "disabled", - "high_mem_alert": "104857600", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "sflow": { - "auto_restart": "enabled", - "high_mem_alert": "0", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "snmp": { - "auto_restart": "enabled", - "high_mem_alert": "157286400", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "telemetry": { - "auto_restart": "enabled", - "high_mem_alert": "0", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "bgp": { - "auto_restart": "disabled", - "high_mem_alert": "314572800", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "dhcp_relay": { - "auto_restart": "disabled", - "high_mem_alert": "104857600", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "rest-api": { - "auto_restart": "enabled", - "high_mem_alert": "0", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "teamd": { - "auto_restart": "disabled", - "high_mem_alert": "104857600", - "high_cpu_alert": "0", - 
"high_disk_alert": "0" - }, - "syncd": { - "auto_restart": "enabled", - "high_mem_alert": "629145600", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "swss": { - "auto_restart": "disabled", - "high_mem_alert": "157286400", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - } -} -``` From 7056a9fc81f35ba01eb9ce2f7022a4e2e7aace89 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Thu, 22 Jul 2021 20:24:17 +0000 Subject: [PATCH 52/57] [memory_restart] Add the description of monitoring the critical process by Supervisord and high memory restart. Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 284 ++++++++++++------ 1 file changed, 186 insertions(+), 98 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 0a7a4595df..dce56d86c6 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -10,7 +10,11 @@ * [Defintions/Abbreviation](#definitionsabbreviation) * [1 Feature Overview](#1-feature-overview) - [1.1 Monitoring](#11-monitoring) + - [1.1.1 Monitoring critical processes by Monit](#111-monitoring-critical-processes-by-monit) + - [1.1.2 Monitoring critical processes by Supervisor](#112-monitoring-critical-processes-by-supervisor) - [1.2 Auto-mitigating](#12-auto-mitigating) + - [1.2.1 Container auto-restart related to crash of critical process](#121-container-auto-restart-related-to-crash-of-critical-process) + - [1.2.2 Container restart related to high memory usage](#122-container-restart-related-to-high-memory-usage) - [1.3 Requirements](#13-requirements) - [1.3.1 Functional Requirements](#131-functional-requirements) - [1.3.2 Configuration and Management Requirements](#132-configuration-and-management-requirements) @@ -35,6 +39,7 @@ | Rev | Date | Author | Change Description | |:---:|:----------:|:----------------------:|---------------------------| | 0.1 | 02/18/2020 | Yong Zhao, Joe Leveque | 
Initial version | +| 0.2 | 07/19/2020 | Yong Zhao | Second version | # Scope This document describes the high level design of features to monitor and auto-mitigate @@ -54,13 +59,27 @@ critical processes in each docker container is imperative not only for the docke container working correctly but also for the intended functionalities of entire SONiC switch. ## 1.1 Monitoring -This feature is used to monitor the running status of critical processes and critical resource -usage such as CPU, memory and disk of each docker container. + +### 1.1.1 Monitoring critical processes by Monit +This feature is used to monitor the running status of critical processes in containers. We used Monit system tool to detect whether a critical process is running or not and whether the resource usage of a docker container is beyond the pre-defined threshold. +### 1.1.2 Monitoring critical processes by Supervisor +This feature uses the 'event listener' mechanism provided by Supervisord to +monitor the critical processes. Specifically, an 'event listener' can subscribe to 'event notifications', +which indicate that something happened to a sub-process controlled by Supervisord. + +We designed an 'event listener' which subscribes to the events 'PROCESS_STATE_EXITED' +and 'PROCESS_STATE_RUNNING'. If a critical process exits unexpectedly, Supervisor will +emit the event 'PROCESS_STATE_EXITED', which will be received by the 'event listener'. After +that, the 'event listener' will check whether an alerting message should be written into +the syslog. + ## 1.2 Auto-Mitigating + +### 1.2.1 Container auto-restart related to crash of critical process This feature demonstrated docker container can be automatically shut down and restarted if one of critical processes running in docker container exits unexpectedly. Restarting the entire docker container ensures that configuration is reloaded and all processes in @@ -71,6 +90,23 @@ if one of its critical processes exited unexpectedly.
We also added a configurat auto-restart feature dynamically configurable. Specifically users can run CLI to configure this feature residing in Config_DB as enabled/disabled status. +### 1.2.2 Container restart related to high memory usage +This feature demonstrated docker container can be shut down and restarted if memory usage +of it is continuously beyond the threshold during monitoring interval. Restarting +the entire docker container ensures that configuration is reloaded and all processes in +docker container get restarted, thus increasing the likelihood of entering a healthy state. + +We defined a threshold of memory usage for each container and Monit in background will +compare the current memory usage of a container with this threshold periodically. If memory usage +of a container is continuously beyond the threshold during monitoring interval, then it +will be restarted to avoid bringing down the device due to the out-of-memory issue. + +We also added configuration options to make this high memory restart feature and memory +threshold dynamically configurable. Specifically `show` command can be used to get the +state of this restart feature and threshold value of each container. `config` command +is leveraged to configure this restart feature residing in Config_DB as 'enabled/disabled' +status and change the threshold of each container. + ## 1.3 Requirements ### 1.3.1 Functional Requirements @@ -81,15 +117,28 @@ feature residing in Config_DB as enabled/disabled status. 3. The event listener in supervisord must receive the signal when a critical process in a docker container crashed or exited unexpectedly and then restart this docker container. -4. CONFIG_DB can be configured to enable/disable this auto-restart feature for each docker - container.. -5. Users can access the status of auto-restart feature via the CLI utility - 1. Users can see current auto-restart status for docker containers. +4. 
The event listener in supervisord must receive the signal when a critical process in + a docker container crashed or exited unexpectedly and then check whether an alerting + message should be written into syslog. +5. CONFIG_DB can be configured to enable/disable the auto-restart feature related to + process crash of each docker container. +6. CONFIG_DB can be configured to enable/disable the restart feature related to high + memory usage of each docker container. +7. CONFIG_DB can be configured to set the memory threshold of each docker container. +8. Users can access the status of auto-restart feature via the CLI utility + 1. Users can retrieve current auto-restart status of docker containers. 2. Users can configure auto-restart status for a specific docker container. +9. Users can access the status of high memory restart feature via the CLI utility + 1. Users can retrieve current high memory restart status of docker containers. + 2. Users can configure high memory restart status for a specific docker container. +10. Users can access the memory threshold of each docker container via the CLI utility + 1. Users can retrieve current memory threshold of docker containers. + 2. Users can configure memory threshold for a specific docker container. ### 1.3.2 Configuration and Management Requirements -Via the init_cfg.json file, these container features are disabled by default. -Configuration of these features can be done via: +The default state of auto-restart, high memory restart and default memory threshold of +each container can be configured in the init_cfg.json.j2 file. +Configuration of these features can be changed via: 1. config_db.json 2. CLI @@ -110,7 +159,9 @@ to automatically restart the container. Monitoring the running status of critical processes and resource usage of docker containers depends on the Monit system tool.
Since Monit natively provides a mechanism to check whether a process is running or not, it will be straightforward to integrate this into monitoring -the critical processes in SONiC. However, Monit only provides a method to monitor the resource +the critical processes in SONiC. + +However, Monit only provides a method to monitor the resource usage on a per-process level not a per-container level. As such, monitoring the resource usage of a docker container is not as straightforward. In our design, we propose to utilize the mechanism with which Monit can spawn a process and check the return value of the process. We will have Monit @@ -120,24 +171,43 @@ the configured threshold value, the script will return 0 and Monit will not log However, if the resource usage exceeds the threshold, the script will return a non-zero value and Monit will log an alert message to the syslog. -We employed the 'event listener' mechanism in supervisord to achieve auto-restarting docker -containers. We configure our event listener to listen for process exit events. When a supervised -process exits, supervisord will pass the event to our custom event listener. The event listener -determines if the process is a critical process and whether it exited unexpectedly. If both of -these conditions are true, the event listener will kill the supervisord process. Since supervisord -runs as PID 1 inside the containers, when supervisord exits, the container will stop. When the -container stops, the systemd service which manages the container will also stop, but it is -configured to automatically restart the service, thus it will restart the container. +Similar to the mechanism of monitoring resource usage of a docker container, first we have +Monit launch a monitoring script which reads the memory usage of a container and compares it with the +memory threshold. If the current memory usage is less than the threshold, the monitoring script will +return 0 and Monit will not take any action.
If the current memory usage is equal to or larger +than the threshold, the monitoring script will return exit code 3. Monit will record this exit code +and run the next round of monitoring after 1 minute. If this scenario occurs 15 times within +20 minutes, then Monit will launch a restarting script which will first check whether the state of high +memory restart was enabled or not. If high memory restart of the docker container was enabled, then +the restarting script will restart this docker container. + +We employed the 'event listener' mechanism in Supervisor to achieve critical process monitoring and +auto-restarting docker containers. We configure an 'event listener' to listen for process exit events. +When a supervised process in docker container exits, supervisord will emit this event and notify +the customized event listener. Then the event listener determines whether the process is a critical process +and whether it exited unexpectedly. If both of these conditions are true, the event listener will check whether +the auto-restart of this docker container was enabled or not. If it was disabled, then 'event listener' +will do the alerting and check whether a message should be written into syslog or not. If it was enabled, +'event listener' will kill the supervisord process. Since supervisord runs as PID 1 process inside the docker +container, the docker container will stop if supervisord process exits. Once the docker +container stops, the systemd service which manages this container will also stop. As this service is +configured to automatically restart, systemd will start it after 30 seconds and thus the corresponding +docker container will be restarted again. # 2 Functionality ## 2.1 Target Deployment Use Cases -These two features are used to perform the following functions: +These features are used to perform the following functions: 1. Monit will write an alert message into syslog if one if critical process has not been alive for 5 minutes. 2.
Monit will write an alert message into syslog if the usage of memory is larger than the pre-defined threshold for a docker container. 3. A docker container will auto-restart if one of its critical processes crashed or exited unexpectedly. +4. If auto-restart of a docker container is disabled and one of its critical processes has + not been alive for more than 1 minute, then 'event listener' will write an alerting message + into syslog periodically. +5. If memory usage of a docker container is larger than the threshold for 15 times within 20 minutes, + then it will be restarted. ## 2.2 Functional Description @@ -180,7 +250,7 @@ to solve the same issue in dhcp_relay container. Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring in the container level. Thus we propose a new design to achieve such monitoring based on Monit. -Specifically Monit will monitor a script and check its exit status. This script +Specifically Monit will launch a script and check its exit status. This script will correspondingly read the resource usage of docker containers, compare it with pre-defined threshold and then return a value. The value 0 signified that the resource usage is less than threshold and non-zero means Monit will send an alert since @@ -200,7 +270,7 @@ a production environment. The value `0` in table represents the corresponding fe the docker container is in `disabled` status. -### 2.2.3 Auto-restart Docker Container +### 2.2.3 Restart Docker Container per crash of critical process The design principle behind this auto-restart feature is docker containers can be automatically shut down and restarted if one of critical processes running in the container exits unexpectedly. 
Restarting the entire container ensures that configuration is reloaded and all processes in the container @@ -224,14 +294,44 @@ named `CONTAINER_FEATURE` in Config_DB and this table includes the status of auto-restart feature for each docker container. Users can easily use CLI to check and configure the corresponding docker container status. +### 2.2.3 Restart Docker Container per high memory usage +The design principle behind this high memory restart is docker container will be restarted +if memory usage of it is continuously larger than the threshold during a monitoring interval. +Restarting the entire container ensures that configuration is reloaded and all processes in the container +get restarted, thus increasing the likelihood of entering a healthy state and preventing the device +from going down due to the out-of-memory (OOM) issue. + +We have Monit launch a monitoring script which reads the memory usage of a container and compares it with the +memory threshold. If the current memory usage is less than the threshold, the monitoring script will +return 0 and Monit will not take any action. If the current memory usage is equal to or larger +than the threshold, the monitoring script will return exit code 3. Monit will record this exit code +and run the next round of monitoring after 1 minute. If this scenario occurs 15 times within +20 minutes, then Monit will launch a restarting script which will first check whether the state of high +memory restart was enabled or not. If high memory restart of the docker container was enabled, then +the restarting script will restart this docker container. + +We also introduced configuration options which can enable or disable this high memory restart feature +and set the memory threshold dynamically according to the requirement of users. In detail, we add two +fields 'high_mem_restart' and 'mem_threshold' in 'FEATURE' table of each container in Config_DB.
+Users can easily use CLI to retrieve and set these two configuration option of each docker container. + +```bash +check program container_memory_lldp with path "/usr/bin/memory_checker lldp" + if status == 3 for 15 times within 20 cycles then exec "/usr/bin/restart_service lldp" +` + ### 2.2.4 CLI (and usage example) The CLI tool will provide the following functionality: 1. Show current status of auto-restart feature for docker containers. -2. Configure the status of a specific docker container. +2. Show current status of high memory restart feature for docker containers. +3. Show current memory threshold of high memory restart feature for docker containers. +4. Configure the auto-restart status of a specific docker container. +5. Configure the high memory restart status of a specific docker container. +6. Configure the memory threshold of a specific docker container. -#### 2.2.4.1 Show the Status of Auto-restart +#### 2.2.4.1 Show the status of Auto-restart ``` -admin@sonic:~$ show container feature autorestart +admin@sonic:~$ show feature autorestart Container Name Status -------------------- -------- database disabled @@ -249,93 +349,81 @@ syncd enabled swss disabled ``` -#### 2.2.4.2 Configure the Status of Auto-restart +#### 2.2.4.2 Show the status of high memory restart +``` +admin@sonic:~$ show feature high_mem_restart +Container Name Status +-------------------- -------- +database disabled +lldp disabled +radv disabled +pmon disabled +sflow enabled +snmp enabled +telemetry always_enabled +bgp disabled +dhcp_relay disabled +rest-api enabled +teamd disabled +syncd enabled +swss disabled +``` + +#### 2.2.4.3 Show the memory threshold of high memory restart +``` +admin@sonic:~$ show feature high_mem_restart mem_threhsold +Container Name Memory Threshold +-------------------- ----------------- +database 157286400 +lldp 104857600 +radv 31457280 +pmon 104857600 +snmp 104857600 +telemetry 209715200 +bgp 314572800 +dhcp_relay 62914560 +teamd 73400320 +syncd 
629145600 +swss 104857600 +``` + +#### 2.2.4.4 Configure the Status of Auto-restart ``` -admin@sonic:~$ sudo config container feature autorestart database enabled +admin@sonic:~$ sudo config feature autorestart database enabled ``` -### 2.2.5 CONTAINER_FEATURE Table +#### 2.2.4.5 Configure the Status of high memory restart +``` +admin@sonic:~$ sudo config feature high_mem_restart database enabled +``` + +#### 2.2.4.5 Configure the memory threshold of high memory restart +``` +admin@sonic:~$ sudo config feature high_mem_restart database +``` + +### 2.2.5 FEATURE Table Example: ``` { "CONTAINER_FEATURE": { "database": { + "state": "always_enabled", + "has_timer": false, + "has_global_scope": true, + "has_per_asic_scope": true, "auto_restart": "enabled", - "high_mem_alert": "157286400", - "high_cpu_alert": "0", - "high_disk_alert": "0" + "high_mem_restart": "disabled", + "mem_threshold": 157286400, }, "lldp": { - "auto_restart": "disabled", - "high_mem_alert": "104857600", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "radv": { - "auto_restart": "disabled", - "high_mem_alert": "104857600", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "pmon": { - "auto_restart": "disabled", - "high_mem_alert": "104857600", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "sflow": { - "auto_restart": "enabled", - "high_mem_alert": "0", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "snmp": { - "auto_restart": "enabled", - "high_mem_alert": "157286400", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "telemetry": { - "auto_restart": "enabled", - "high_mem_alert": "0", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "bgp": { - "auto_restart": "disabled", - "high_mem_alert": "314572800", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "dhcp_relay": { - "auto_restart": "disabled", - "high_mem_alert": "104857600", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "rest-api": { + "state": "enabled", + "has_timer": 
false, + "has_global_scope": true, + "has_per_asic_scope": false, "auto_restart": "enabled", - "high_mem_alert": "0", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "teamd": { - "auto_restart": "disabled", - "high_mem_alert": "104857600", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "syncd": { - "auto_restart": "enabled", - "high_mem_alert": "629145600", - "high_cpu_alert": "0", - "high_disk_alert": "0" - }, - "swss": { - "auto_restart": "disabled", - "high_mem_alert": "157286400", - "high_cpu_alert": "0", - "high_disk_alert": "0" + "high_mem_restart": "disabled", + "mem_threshold": 104857600, }, } } From 9b3050241193dcc6f4b660084bac9435857ec0a4 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Thu, 22 Jul 2021 20:37:33 +0000 Subject: [PATCH 53/57] [memory_restart] Fix the format issue. Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 39 +++++++++++-------- 1 file changed, 22 insertions(+), 17 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index dce56d86c6..a282691bac 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -13,8 +13,8 @@ - [1.1.1 Monitoring critical processes by Monit](#111-monitoring-critical-processes-by-monit) - [1.1.2 Monitoring critical processes by Supervisor](#112-monitoring-critical-processes-by-supervisor) - [1.2 Auto-mitigating](#12-auto-mitigating) - - [1.2.1 Container auto-restart related to crash of critical process](#121-container-auto-restart-related-to-crash-of-critical-process) - - [1.2.2 Container restart related to high memory usage](#122-container-restart-related-to-high-memory-usage) + - [1.2.1 Container restart per crash of critical process](#121-container-restart-per-crash-of-critical-process) + - [1.2.2 Container restart per high memory usage](#122-container-restart-per-high-memory-usage) - [1.3 Requirements](#13-requirements) - [1.3.1 Functional 
Requirements](#131-functional-requirements) - [1.3.2 Configuration and Management Requirements](#132-configuration-and-management-requirements) @@ -26,11 +26,16 @@ - [2.2 Functional Description](#22-functional-description) - [2.2.1 Monitoring Critical Processes](#221-monitoring-critical-processes) - [2.2.2 Monitoring Critical Resource Usage](#222-monitoring-critical-resource-usage) - - [2.2.3 Auto-restart Docker Container](#223-auto-restart-docker-container) - - [2.2.4 CLI (and usage example)](#224-cli-and-usage-example) - - [2.2.4.1 Show the Status of Auto-restart](#2241-show-the-status-of-auto-restart) - - [2.2.4.2 Configure the Status of Auto-restart](#2242-configure-the-status-of-auto-restart) - - [2.2.5 CONTAINER_FEATURE Table](#225-container_feature-table) + - [2.2.3 Restarting Docker Container per Crash of Critical Process](#223-restarting-docker-container-per-crash-of-critical-process) + - [2.2.4 Restarting Docker Container per High Memory Usage](#223-restarting-docker-container-per-high-memory-usage) + - [2.2.5 CLI (and usage example)](#224-cli-and-usage-example) + - [2.2.5.1 Show the Status of Auto-restart](#2241-show-the-status-of-auto-restart) + - [2.2.5.2 Show the Status of High Memory Restart](#2241-show-the-status-of-high-memory-restart) + - [2.2.5.3 Show the Memory Threshold of High Memory Restart](#2241-show-the-memory-threshold-of-high-memory-restart) + - [2.2.5.4 Configure the Status of Auto-restart](#2242-configure-the-status-of-auto-restart) + - [2.2.5.5 Configure the Status of High Memory Restart](#2242-configure-the-status-of-high-memory-restart) + - [2.2.5.6 Configure the Memory Threshold of High Memory Restart](#2242-configure-the-memory-threshold-of-high-memory-restart) + - [2.2.6 CONTAINER_FEATURE Table](#225-container_feature-table) # List of Tables * [Table 1: Abbreviations](#definitionsabbreviation) @@ -294,7 +299,7 @@ named `CONTAINER_FEATURE` in Config_DB and this table includes the status of auto-restart feature for each docker 
container. Users can easily use CLI to check and configure the corresponding docker container status. -### 2.2.3 Restart Docker Container per high memory usage +### 2.2.4 Restart Docker Container per high memory usage The design principle behind this high memory restart is docker container will be restarted if memory usage of it is continuously larger than the threshold during a monitoring interval. Restarting the entire container ensures that configuration is reloaded and all processes in the container @@ -318,9 +323,9 @@ Users can easily use CLI to retrieve and set these two configuration option of e ```bash check program container_memory_lldp with path "/usr/bin/memory_checker lldp" if status == 3 for 15 times within 20 cycles then exec "/usr/bin/restart_service lldp" -` +``` -### 2.2.4 CLI (and usage example) +### 2.2.5 CLI (and usage example) The CLI tool will provide the following functionality: 1. Show current status of auto-restart feature for docker containers. 2. Show current status of high memory restart feature for docker containers. @@ -329,7 +334,7 @@ The CLI tool will provide the following functionality: 5. Configure the high memory restart status of a specific docker container. 6. Configure the memory threshold of a specific docker container. 
-#### 2.2.4.1 Show the status of Auto-restart +#### 2.2.5.1 Show the Status of Auto-restart ``` admin@sonic:~$ show feature autorestart Container Name Status @@ -349,7 +354,7 @@ syncd enabled swss disabled ``` -#### 2.2.4.2 Show the status of high memory restart +#### 2.2.5.2 Show the Status of High Memory Restart ``` admin@sonic:~$ show feature high_mem_restart Container Name Status @@ -369,7 +374,7 @@ syncd enabled swss disabled ``` -#### 2.2.4.3 Show the memory threshold of high memory restart +#### 2.2.5.3 Show the Memory Threshold of High Mmemory Restart ``` admin@sonic:~$ show feature high_mem_restart mem_threhsold Container Name Memory Threshold @@ -387,22 +392,22 @@ syncd 629145600 swss 104857600 ``` -#### 2.2.4.4 Configure the Status of Auto-restart +#### 2.2.5.4 Configure the Status of Auto-restart ``` admin@sonic:~$ sudo config feature autorestart database enabled ``` -#### 2.2.4.5 Configure the Status of high memory restart +#### 2.2.5.5 Configure the Status of High Memory Rrestart ``` admin@sonic:~$ sudo config feature high_mem_restart database enabled ``` -#### 2.2.4.5 Configure the memory threshold of high memory restart +#### 2.2.5.6 Configure the Memory Threshold of High Memory Rrestart ``` admin@sonic:~$ sudo config feature high_mem_restart database ``` -### 2.2.5 FEATURE Table +### 2.2.6 FEATURE Table Example: ``` { From dc80bcb180829b403d077b927cb102890f492e5f Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Thu, 22 Jul 2021 20:46:01 +0000 Subject: [PATCH 54/57] [memory_restart] Fix the format issues. 
Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 22 +++++++++---------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index a282691bac..7038e21d61 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -1,7 +1,7 @@ # Monitoring and Auto-Mitigating Unhealthy Containers in SONiC # High Level Design Document -#### Rev 0.1 +#### Rev 0.2 # Table of Contents * [List of Tables](#list-of-tables) @@ -10,11 +10,11 @@ * [Defintions/Abbreviation](#definitionsabbreviation) * [1 Feature Overview](#1-feature-overview) - [1.1 Monitoring](#11-monitoring) - - [1.1.1 Monitoring critical processes by Monit](#111-monitoring-critical-processes-by-monit) - - [1.1.2 Monitoring critical processes by Supervisor](#112-monitoring-critical-processes-by-supervisor) + - [1.1.1 Monitoring Critical Processes by Monit](#111-monitoring-critical-processes-by-monit) + - [1.1.2 Monitoring Critical Processes by Supervisor](#112-monitoring-critical-processes-by-supervisor) - [1.2 Auto-mitigating](#12-auto-mitigating) - - [1.2.1 Container restart per crash of critical process](#121-container-restart-per-crash-of-critical-process) - - [1.2.2 Container restart per high memory usage](#122-container-restart-per-high-memory-usage) + - [1.2.1 Restarting Container per Crash of Critical Process](#121-restarting-container-per-crash-of-critical-process) + - [1.2.2 Restarting Container per High Memory Usage](#122-restarting-container-per-high-memory-usage) - [1.3 Requirements](#13-requirements) - [1.3.1 Functional Requirements](#131-functional-requirements) - [1.3.2 Configuration and Management Requirements](#132-configuration-and-management-requirements) @@ -65,13 +65,13 @@ container working correctly but also for the intended functionalities of entire ## 1.1 Monitoring -### 1.1.1 Monitoring critical processes by 
Monit +### 1.1.1 Monitoring Critical Processes by Monit This feature is used to monitor the running status of critical processes in containers. We used Monit system tool to detect whether a critical process is running or not and whether the resource usage of a docker container is beyond the pre-defined threshold. -### 1.1.2 Monitoring critical processes by Supervisor +### 1.1.2 Monitoring Critical Processes by Supervisor This feature demonstrated 'event listener' provided by Supervisord can be leveraged to do the critical processes monitoring. Specifically 'event listener' can subscribe to 'event notification' which indicates that something happened related to a sub-process controlled by Supervisord. @@ -84,7 +84,7 @@ the syslog. ## 1.2 Auto-Mitigating -### 1.2.1 Container auto-restart related to crash of critical process +### 1.2.1 Restarting Docker Container per Crash of Critical Process This feature demonstrated docker container can be automatically shut down and restarted if one of critical processes running in docker container exits unexpectedly. Restarting the entire docker container ensures that configuration is reloaded and all processes in @@ -95,7 +95,7 @@ if one of its critical processes exited unexpectedly. We also added a configurat auto-restart feature dynamically configurable. Specifically users can run CLI to configure this feature residing in Config_DB as enabled/disabled status. -### 1.2.2 Container restart related to high memory usage +### 1.2.2 Restarting Docker Container per High Memory Usage This feature demonstrated docker container can be shut down and restarted if memory usage of it is continuously beyond the threshold during monitoring interval. Restarting the entire docker container ensures that configuration is reloaded and all processes in @@ -275,7 +275,7 @@ a production environment. The value `0` in table represents the corresponding fe the docker container is in `disabled` status. 
-### 2.2.3 Restart Docker Container per crash of critical process +### 2.2.3 Restarting Docker Container per Crash of Critical Process The design principle behind this auto-restart feature is docker containers can be automatically shut down and restarted if one of critical processes running in the container exits unexpectedly. Restarting the entire container ensures that configuration is reloaded and all processes in the container @@ -299,7 +299,7 @@ named `CONTAINER_FEATURE` in Config_DB and this table includes the status of auto-restart feature for each docker container. Users can easily use CLI to check and configure the corresponding docker container status. -### 2.2.4 Restart Docker Container per high memory usage +### 2.2.4 Restarting Docker Container per High Memory Usage The design principle behind this high memory restart is docker container will be restarted if memory usage of it is continuously larger than the threshold during a monitoring interval. Restarting the entire container ensures that configuration is reloaded and all processes in the container From 7ed89b7755526edcb2538812cb98e15306763d18 Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Thu, 22 Jul 2021 21:13:41 +0000 Subject: [PATCH 55/57] [memory_restart] Change the syntax of `show` and `config` commands. 
Signed-off-by: Yong Zhao --- .../monitoring_containers.md | 30 +++++++++---------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 7038e21d61..cf738cbe48 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -376,20 +376,20 @@ swss disabled #### 2.2.5.3 Show the Memory Threshold of High Mmemory Restart ``` -admin@sonic:~$ show feature high_mem_restart mem_threhsold -Container Name Memory Threshold --------------------- ----------------- -database 157286400 -lldp 104857600 -radv 31457280 -pmon 104857600 -snmp 104857600 -telemetry 209715200 -bgp 314572800 -dhcp_relay 62914560 -teamd 73400320 -syncd 629145600 -swss 104857600 +admin@sonic:~$ show feature mem_threhsold +Container Name Memory Threshold (Bytes) +-------------------- ------------------------- +database 157286400 +lldp 104857600 +radv 31457280 +pmon 104857600 +snmp 104857600 +telemetry 209715200 +bgp 314572800 +dhcp_relay 62914560 +teamd 73400320 +syncd 629145600 +swss 104857600 ``` #### 2.2.5.4 Configure the Status of Auto-restart @@ -404,7 +404,7 @@ admin@sonic:~$ sudo config feature high_mem_restart database enabled #### 2.2.5.6 Configure the Memory Threshold of High Memory Rrestart ``` -admin@sonic:~$ sudo config feature high_mem_restart database +admin@sonic:~$ sudo config feature mem_threshold database ``` ### 2.2.6 FEATURE Table From 702e4d8e46ebf81ed9eae8ee7beca3017d58fd5b Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Thu, 22 Jul 2021 21:29:13 +0000 Subject: [PATCH 56/57] [mem_restart] Fix the typos. 
Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index cf738cbe48..8b3ddfe212 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -397,12 +397,12 @@ swss 104857600 admin@sonic:~$ sudo config feature autorestart database enabled ``` -#### 2.2.5.5 Configure the Status of High Memory Rrestart +#### 2.2.5.5 Configure the Status of High Memory Restart ``` admin@sonic:~$ sudo config feature high_mem_restart database enabled ``` -#### 2.2.5.6 Configure the Memory Threshold of High Memory Rrestart +#### 2.2.5.6 Configure the Memory Threshold of High Memory Restart ``` admin@sonic:~$ sudo config feature mem_threshold database ``` From 91c5d9be73b70cc922900ab6d7d9708aa869db4b Mon Sep 17 00:00:00 2001 From: Yong Zhao Date: Thu, 22 Jul 2021 21:30:27 +0000 Subject: [PATCH 57/57] [mem_restart] Fix the typos. Signed-off-by: Yong Zhao --- doc/monitoring_containers/monitoring_containers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/monitoring_containers/monitoring_containers.md b/doc/monitoring_containers/monitoring_containers.md index 8b3ddfe212..79c5c9f6d7 100644 --- a/doc/monitoring_containers/monitoring_containers.md +++ b/doc/monitoring_containers/monitoring_containers.md @@ -411,7 +411,7 @@ admin@sonic:~$ sudo config feature mem_threshold database