diff --git a/doc/smart-switch/pmon/images/dpu-reboot-seq.svg b/doc/smart-switch/pmon/images/dpu-reboot-seq.svg new file mode 100644 index 0000000000..8182562f92 --- /dev/null +++ b/doc/smart-switch/pmon/images/dpu-reboot-seq.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/doc/smart-switch/pmon/smartswitch-pmon.md b/doc/smart-switch/pmon/smartswitch-pmon.md index 2c95cbb385..9c58b1bfd0 100644 --- a/doc/smart-switch/pmon/smartswitch-pmon.md +++ b/doc/smart-switch/pmon/smartswitch-pmon.md @@ -5,6 +5,7 @@ | 0.1 | 12/02/2023 | Ramesh Raghupathy | Initial version| | 0.2 | 01/08/2024 | Ramesh Raghupathy | Updated API, CPI sections and addressed review comments | | 0.3 | 02/26/2024 | Ramesh Raghupathy | Addressed review comments | +| 0.4 | 06/06/2024 | Ramesh Raghupathy | Added schema for DPU health-info and called out phase:1 and phase:2 activities for DPU health-info. Added key suffix to module reboot-cause to avoid key conflicts | ## Definitions / Abbreviations @@ -60,7 +61,7 @@ The picture below highlights the PMON vertical and its association with other lo * The SmartSwitch host PMON should be able to Startup, Shutdown, Restart, and Soft Reboot the entire system or the individual DPUs. The DPU_MODULE will behave like the LINE_CARD_MODULE of a modular chassis with respect to these functions. ### SmartSwitch Power up/down sequence: -* When the smartswitch device is booted, the host will boot first and leave the DPUs either up or down depending on the configuration. The DPUs will be up by default. +* When the smartswitch device is booted, the host will boot first and leave the DPUs down by default. * This section describes the cold startup, shutdown, restart and soft reboot. ### Cold Startup @@ -77,7 +78,7 @@ The picture below highlights the PMON vertical and its association with other lo #### DPU cold startup Sequence * The chassis is powered up and the host is booting up. * The switch PMON is registered with the configDB state change handler. -* DPU "admin_status: up" is set in the configDB. When not explicitly configured the default is "up". +* If the DPU's "admin_status: down" in the configDB, the DPU will remain powered down. The default setting is "down". * The switch PMON gets the admin up notification from the configDB * The switch PMON invokes the platform API to power on the DPU * DPU boots up and attaches itself to the midplane. @@ -86,12 +87,23 @@ The picture below highlights the PMON vertical and its association with other lo ### DPU startup sequence diagram
-### Configuring startup and shutdown -* The DPUs can be powered down by configuring the admin_status as shown. -* The corresponding switch configDB table is also shown +#### 2.1.1 DPUs in dark mode +* A smartswitch when configured to boot up with all the DPUs in it are powered down upon boot up is referred as DPUs in dark mode. +* In the dark mode the platform.json file shown in section "3.1.3" will not have the dictionary for the DPUS. +* The term dark mode is overloaded in some cases where the platform.json may have the dictionary but the config_db.json will have the admin_state of all DPU modules as "down". +* When platform.json has DPU information, but config DB doesn't have the DPU admin state configuration, the DPUs should be in downstate. +* Default mode on smartswitch is dark mode, unless platform.json configures it to be in light up mode. +* The DPUs would stay power down in dark mode and will not consume power. + +#### 2.1.2 Configuring startup and shutdown +* The user can use the “config chassis modules startup DPUx” to power ON a DPU. Example: “config chassis modules startup DPU0” +* The “config chassis modules shutdown DPUx” is used to power OFF a DPU. Example: “config chassis modules shutdown DPU0” +* The DPUs are powered down by configuring the admin_status as shown in the schema +* The config change event handler running on the chassisd of PMON listens to the config change and sets the corresponding switch configDB table and also triggers the module set_admin_state() API. +* The platform executes the power ON/OFF sequence #### config_db.json -CHASSIS_MODULE table holds the list and configuration of DPU modules in a smartswitch chassis. It allows user to administratively bring down a DPU +CHASSIS_MODULE table holds the list and configuration of DPU modules in a smartswitch chassis. It allows user to administratively bring down a DPU. The following example assumes four DPUs in the system. Unlike the existing modular chassis configuration in a SmartSwitch a DPU module has to be explicitly configured to be "up", otherwise it be configured in the DB to be down as shown in the example. The example assumes four DPUs in the system. ``` { "CHASSIS_MODULE": { @@ -99,9 +111,10 @@ CHASSIS_MODULE table holds the list and configuration of DPU modules in a smarts "admin_status": "down" }, "DPU1": { - "admin_status": "down" + "admin_status": "up" } } + Note: There is no config for DPU2, DPU3 and the default is "down" as shown below } ``` #### switch configDB @@ -111,6 +124,21 @@ Key: "CHASSIS_MODULE|DPU0" "value": { "admin_status":"down" } + }, + "CHASSIS_MODULE|DPU1": { + "value": { + "admin_status":"up" + } + }, + "CHASSIS_MODULE|DPU2": { + "value": { + "admin_status":"down" + } + }, + "CHASSIS_MODULE|DPU3": { + "value": { + "admin_status":"down" + } } ``` @@ -127,10 +155,11 @@ Key: "CHASSIS_MODULE|DPU0" * DPU: Maintenance, Critical alarm, Service migration, RMA #### DPU shutdown sequence * There could be two possible sources for DPU shutdown. 1. A configuration change to DPU "admin_status: down" 2. The GNOI logic can trigger it. -* The GNOI server runs on the DPU even after the DPU is shutdown. -* The host sends a GNOI signal to shutdown the DPU. The DPU does a pre-shutdown and sends an ack back to the host. -* Upon receiving the ack or on a timeout the host may trigger the switch PMON to shutdown the DPU. -* NPU-DPU (GNOI) soft reboot workflow will be captured in another document. +* The GNOI server runs on the DPU even after the DPU is pre-shutdown and listens until the graceful shutdown finishes. +* The host sends a GNOI signal to shutdown the DPU. The DPU does a graceful-shutdown if not already done and sends an ack back to the host. +* Upon receiving the ack or on a timeout the host may trigger the switch PMON vendor API to shutdown the DPU. +* If a vendor specific API is not defined, detachment is done via sysfs (echo 1 > /sys/bus/pci/devices/XXXX:XX:XX.X/remove). +* NPU-DPU (GNOI) soft reboot workflow is captured in [reboot-hld.md](https://github.com/sonic-net/SONiC/blob/26f3f4e282f3d2bd4a5c684608897850354f5c30/doc/smart-switch/reboot/reboot-hld.md) * In the first option the "admin_status: down" configDB status change event will send a message to the switch PMON. * The switch PMON will invoke the module class API "set_admin_state(self, up):" with the state being "down" and the platform in turn will call its API to gracefully shutdown the DPU. * The DPU upon receiving the shutdown message will do a graceful shutdown and send an ack back. The DPU graceful shutdown is vendor specific. The DPU power will be turned off after the graceful shutdown. In case of timeout the platform will force power down. @@ -158,14 +187,14 @@ Key: "CHASSIS_MODULE|DPU0" * DPU: Gracefully restart SONiC on DPU #### Trigger * Switch: “reboot” command issued on switch console. -* DPU: “reboot” command issued on DPU console or Switch Software issues GNOI +* DPU: “reboot” command issued to the DPU or Switch Software issues GNOI #### Use case * Switch: Fault recovery, Maintenance, debug -* DPU: Fault recover, Maintenance, debug, Service provisioning +* DPU: Fault recovery, Maintenance, debug, Service provisioning #### Onboarding requirements * The DPU must provide additional control-plane and data-plane state information, timestamp, etc to DPU_STATE table as explained in the DPU_STATE schema once booted up. -* When the DPU reboots itself, should log the reboot cause and update the previous-reboot-cause field in the stateDB +* When the DPU reboots itself, should log the reboot cause and update the previous-reboot-cause field in the stateDB and persist in on the Switch. * The reboot-cause history on the switch should provide a holistic view of the reboot cause of the SmartSwitch host CPU, and the reboot-cause of all the DPUs * The DPUs should be uniquely identified and the DPU upon boot may get this ID from the host and identify itself. * Implement the required API enhancements and new APIs for DPU management (see details in design section) @@ -179,6 +208,7 @@ Key: "CHASSIS_MODULE|DPU0" * The DPUs should provide their state to the host by updating the dpu state data in the DPU_STATE table in the host ChassisStateDB (explained in DB schema). This holds the minimal, viable set of low latency dpu state data and can be consumed by HA, LB, FaultManagement logics. * DPUs should be able to store the data using a redis call directly on to the switch chassisStateDB * The DPU must provide the state information once it boots its OS to DPU_STATE table. + * The chassisd on the NPU-PMON will update the dpu_midplane_link_state, dpu_midplane_link_time, dpu_midplane_link_reason fields whenever there is a dpu_midplane_link_state transition. * Thermal management * Besides additional DPU specific sensors, cooling device changes the logic remains the same. @@ -189,13 +219,15 @@ Key: "CHASSIS_MODULE|DPU0" * Show CLIs * Extend existing CLIs such as 'show platform fan/temperature' to support the new HW * Extend the modular chassis CLI 'show chassis modules status" to display the detailed DPU states. (See CLIs section) - * The data for the CLIs come either from the DBs or through the platform APIs. Example: The reboot-cause history and the DPU_STATE are stored in the ChassisStateDB directly using a redis call and the CLIs access them from the DB. It is upt to the platform implementation on how do they fetch the DPU inventory and health data and store them on the host. For the sake of consistency and efficiency we recommend using the redis call and storing them in the chassisStateDB on the host. + * The data for the CLIs come either from the DBs or through the platform APIs. Example: The DPU_STATE data are stored in the ChassisStateDB directly using a redis call and the CLIs access them from the DB. It is upto to the platform implementation on how do they fetch the DPU inventory and health data and store them on the host. For the sake of consistency and efficiency we recommend using the redis call and storing them in the chassisStateDB on the host. ### 2.3. Detect and Debug * Health * SmartSwitch DPUs should store their health data locally and also provide it to the host for a consolidated view of the CLIs * DPUs should support a CLI to display the health data “show system-health ...” (See CLIs section) - * The host pmon should use this data to support the host side CLIs. Though accessing this data from DPUs and storing them on the switch is implementation specific it is recommended to use redis call and store them on the switch chassisStateDB for faster access. use "UserDefinedChecker" class to provide this data to the CLIs. + * The host pmon should use this data to support the host side CLIs. Though accessing this data from DPUs and storing them on the switch is implementation specific it is recommended to use redis call and store them on the switch chassisStateDB for faster access. + * This is done in two phases. Please refer to section:3.1.5.1 for the HEALTH_INFO schema + * use "UserDefinedChecker" class to provide this data to the CLIs. * Vendor specific data such as interrupt events can also be placed in user defined fields under this DB * This table already exists in modular chassis design and the DPUs will use this just like a line card. * Alarm and Syslog @@ -219,7 +251,6 @@ SmartSwitch PMON block diagram ### 3.1. Platform monitoring and management * SmartSwitch design Extends the existing chassis_base class and module_base class as described below. -* Extend MODULE_TYPE in ModuleBase class with MODULE_TYPE_DPU and MODULE_TYPE_SWITCH to support SmartSwitch #### 3.1.1 ChassisBase class API enhancements is_modular_chassis(self): @@ -280,7 +311,7 @@ get_dpu_id(self, name): Retrieves the DPU ID for the given dpu-module name. Returns None for non-smartswitch chassis. Returns: - An integer, indicating the DPU ID Ex: name:DPU0 return value 1, name:DPU1 return value 2, name:DPUX return value X+1 + An integer, indicating the DPU ID Ex: name:DPU0 return value 0, name:DPU1 return value 1, name:DPUX return value X ``` is_smartswitch(self): @@ -291,42 +322,22 @@ is_smartswitch(self): True ``` -get_module_dpu_data_port(self, index): -``` - Retrieves the DPU data port NPU-DPU association represented for the DPU index. Platforms that need to overwrite the platform.json file will use this API - - This is valid only on the Switch and not on DPUs. On the DPUs this can return None - - Args: - index: An integer, the index of the module to retrieve - - Returns: - NPU-DPU port association: A string Ex: For index: 1 will return the dup0 port association which is "Ethernet224: Ethernet0" where the string left of ":" (Ethernet224) is the NPU port and the string right of ":" (Ethernet0) is the DPU port. -``` #### 3.1.3 NPU to DPU data port mapping -platform.json of NPU/switch will show the NPU to DPU data port mapping. This will be used by services early in the system boot. +platform.json of NPU/switch will show the NPU to DPU data port mapping. This will be used by services early in the system boot. ``` -{ - "DPUs" : [ + "DPUS": [ { - "dpu0": { - "interface": {"Ethernet224": "Ethernet0"} - } - }, - { - "dpu1": { - "interface": {"Ethernet232": "Ethernet0"} + "dpu0": { + "interface": {"Ethernet224": "Ethernet0"} + }, + "dpu1": { + "interface": {"Ethernet232": "Ethernet0"} + }, + "dpux": { + "interface": {"Ethernet2xx": "Ethernet0"} }, - }, - . - . - { - "dpuX": { - "interface": {"EthernetX": "EthernetY"} - } } ] -} ``` #### 3.1.4 ModuleBase class API enhancements get_base_mac(self): @@ -367,7 +378,7 @@ get_type(self): Returns: A string, the module-type from one of the predefined types: - MODULE_TYPE_SWITCH, MODULE_TYPE_DPU + MODULE_TYPE_DPU ``` get_oper_status(self): @@ -403,7 +414,7 @@ reboot(self, reboot_type): set_admin_state(self, up): ``` Request to keep the card/DPU in administratively up/down state. - Default state is up. + Default state is down. ``` get_maximum_consumed_power(self): @@ -436,34 +447,94 @@ is_midplane_reachable(self): #### 3.1.5 ModuleBase class new APIs ##### 3.1.5.1 Need for consistent storage and access of DPU reboot cause, state and health -1. The smartswitch needs to know the reboot cause for DPUs. Please refer to the CLI section for the various options and their effects when executed on the switch and DPUs. +#### 1 Reboot Cause +The smartswitch needs to know the reboot cause for the NPU and the DPUs. -* Each DPU will update its reboot cause history in the Switch ChasissStateDB upon boot up. The recent reboot-cause can be derived from that list of reboot-causes. -* The get_reboot_cause will return the current reboot-cause of the module. -* For persistent storage of the DPU reboot-cause and reboot-caue-history files use the existing host storage path and mechanism. +#### NPU Reboot Cause +* When the NPU undergoes a reboot the existing reboot-cause workflow for a switch still remains the same for a smartswitch as well. +* When the NPU reboots the "determine_reboot_cause" fetches the npu reboot-cause and persists the files in "/host/reboot-cause" path. +* A maximum of ten reboot-cause history entries per dpu will be persisted just like the npu. +* The "update_reboot_cause" function updates the stateDB with reboot-cause, time, user etc as shown in the schema. +* The existing "show reboot-cause" and "show reboot-cause history" CLIs will be backward compatible. Please refer to the CLI section. -#### Schema for REBOOT_CAUSE - switch stateDB +#### Schema for REBOOT_CAUSE of SWITCH on switch stateDB ``` Key: "REBOOT_CAUSE|2023_06_18_14_56_12" "REBOOT_CAUSE|2023_06_18_14_56_12": { "value": { - "cause": "REBOOT_CAUSE_HOST_RESET_DPU", + "cause": "Unknown", "comment": "N/A", - "device": "DPU5", "time": "2023_06_18_14_56_12", "user": "N/A" } - }, + } + +``` +#### DPU Reboot Cause +* The smartswitch needs to know the reboot cause for all the DPUs. +* The NPU hardware should be capable of providing the DPU reboot-cause even when the DPUs are dead. +* The get_reboot_cause will return the current reboot-cause of the module. +* For persistent storage of the DPU reboot-cause and reboot-cause-history files use the existing mechanism and host storage path under "/host/reboot-cause/module/dpux". +* The storage and retrieval of the reboot-cause of the Switch and PDUs are shown in the sequence diagram + + + +#### Reboot workflow +* The switch boots up. Determines the NPU reboot cause. +* Processes the previously stored NPU and DPU reboot-cause files and history files. +* A maximum of ten reboot-cause history entries per dpu will be persisted just like the npu. +* Updates the NPU reboot-cause into the StateDB and the DPU reboot-cause into the ChassisStateDB. +* The above process is a one-shot event on boot up. +* The module_db_update function in the NPU-PMON chassisd is an existing function constantly updating the operational status of the DPUs. +* This function looks for DPU operational status change events and when the DPUs come out of "offline" state, issues "get_reboot_cause" API to the platform. +* The dpu operational state transition from 'offline' to 'online' guarantees the reboot of a DPU. +* The platform code will extract the DPU reboot cause from the NPU hardware itself even when the DPU is not reachable. +* The DPU reboot cause will be mapped to one of the following existing reboot causes and returned back. + * REBOOT_CAUSE_POWER_LOSS = "Power Loss" + * REBOOT_CAUSE_THERMAL_OVERLOAD_CPU = "Thermal Overload: CPU" + * REBOOT_CAUSE_THERMAL_OVERLOAD_ASIC = "Thermal Overload: ASIC" + * REBOOT_CAUSE_THERMAL_OVERLOAD_OTHER = "Thermal Overload: Other" + * REBOOT_CAUSE_INSUFFICIENT_FAN_SPEED = "Insufficient Fan Speed" + * REBOOT_CAUSE_WATCHDOG = "Watchdog" + * REBOOT_CAUSE_HARDWARE_OTHER = "Hardware - Other" + * REBOOT_CAUSE_HARDWARE_BIOS = "BIOS" + * REBOOT_CAUSE_HARDWARE_CPU = "CPU" + * REBOOT_CAUSE_HARDWARE_BUTTON = "Push button" + * REBOOT_CAUSE_HARDWARE_RESET_FROM_ASIC = "Reset from ASIC" + * REBOOT_CAUSE_NON_HARDWARE = "Non-Hardware" +* When the platform wants to provide additional details to the reboot cause they can append it to the string with a ',' separation as shown below. +``` +return REBOOT_CAUSE_NON_HARDWARE + ', ' + 'kernel panic' +``` +* The switch reboot use case will follow the same sequence. +* The pmon container restart should not affect this sequence as the states are persisted either in the DB or in the file system. + +#### Schema for REBOOT_CAUSE of DPUs on switch ChassisStateDB +``` + Key: "REBOOT_CAUSE|DPU0|2024_06_06_09_31_18" + + "REBOOT_CAUSE|DPU0|2024_06_06_09_31_18": { + "value": { + "cause": "Software causes (Reboot)", + "comment": "User issued 'reboot' command [User: admin, Time: Thu Jun 6 09:46:43 AM UTC 2024]", + "device": "DPU0", + "time": "N/A", + "user": "N/A" + } + } ``` -2. Though the get_oper_status(self) can get the operational status of the DPU Modules, the current implementation only has limited capabilities. - * Can only state MODULE_STATUS_FAULT and can't show exactly where in the state progression the DPU failed. This is critical in fault isolation, DPU switchover decision, resiliency and recovery +#### 2. DPU State + * Added supporting fileds such as "time" and "reason" to show exactly where in the state progression the DPU failed. This will be helpful in fault isolation, DPU switchover decision, resiliency and recovery * Though this is platform implementation specific, in a multi vendor use case, there has to be a consistent way of storing and accessing the information. * Store the state progression (dpu_midplane_link_state, dpu_control_plane_state, dpu_data_plane_state) on the host ChassisStateDB. * get_state_info(self) will return an object with the ChassisStateDB data + * Two more APIs get_dataplane_state(self) and get_controlplane_state(self) are available to get the dataplane_state and controlplane_state are available to access them individually. + * The dpu_midplane_link_state will be monitored and updated by the NPU PMON chassisd. * Potential consumers: HA, LB, Switch CLIs, Utils (install/repair images), Life Cycle Manager * Use cases: HA, Debuggability, error recovery (reset, power cycle) and fault management, consolidated view of Switch and DPU state + #### DPU_STATE definition dpu_midplane_link_state: up refers to the pcie link between the NPU and DPU is operational. This will be updated by the switch pcied. @@ -490,13 +561,40 @@ dpu_data_plane_state: up refers to configuration downloaded, the pipeline stage ”dpu_data_plane_time": ”timestamp", ”dpu_data_plane_reason": ”Pipeline failure", ``` - -3. Each DPU has to store the health data in its local DB and should provide it to the switch. -* When the "show system-health ..." CLI is executed on the switch, the "UserDefinedChecker" class will collect the data and feed it to the CLI. It is up to the platform on how this is done. However, for faster access store it in the switch ChassisStateDB. -* The DPU is a complex hardware, to facilitate debug, a consistent way of storing and accessing the health record of the DPUs is critical in a multi vendor scenario even though it is a platform specific implementation. -* Both switch and the DPUs will follow to the [SONiC system health monitor HLD](https://github.com/sonic-net/SONiC/blob/ce313db92a694e007a6c5332ce3267ac158290f6/doc/system_health_monitoring/system-health-HLD.md) +#### 3 DPU Health +This feature is implemented in two phases. +#### Phase:1 +* Each DPU has to store the health info locally and should be available on the DPU when the "show system-health ..." CLI is executed on the DPU just like the switch. +#### Phase:2 +* Each DPU besides storing the health info locally, should also store the DPU health info in the switch ChassisStateDB. The schema for each DPU health info is the same as the switch and also is shown below. +* When the "show system-health