-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan #1709
Conversation
module_type_switch based on the implementation PR review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy Can you add a section for DPU dark mode support. In this case,
NPU's PMON should honor the user configuration to power OFF the DPU via platform API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor query, otherwise LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy In section 3.2 can you specify if the thermal management is in NPU or DPU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor Updated. It runs on the NPU.
#### REBOOT_CAUSE DB schema | ||
``` | ||
Key: "REBOOT_CAUSE|2023_06_18_14_56_12" | ||
* Each DPU will update its reboot cause history in the Switch ChassisStateDB upon boot up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy How? Which daemon/service?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dpu_db_util/system_health service will update the ChassisStateDB table.
* Though how DPU pmon updates this is vendor dependent, it is recommended to use the sonic telemetry agent to align with the existing SONiC implementation. | ||
* The DPUs will limit the number of history entries to a maximum of ten. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy Why DPU pmon updates needs to be vendor dependent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no guarantee that the SONiC running on the DPUs will necessarily be running Telemetry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small spelling mistake. Otherwise looks good to me.
@@ -60,7 +61,7 @@ The picture below highlights the PMON vertical and its association with other lo | |||
* The SmartSwitch host PMON should be able to Startup, Shutdown, Restart, and Soft Reboot the entire system or the individual DPUs. The DPU_MODULE will behave like the LINE_CARD_MODULE of a modular chassis with respect to these functions. | |||
|
|||
### SmartSwitch Power up/down sequence: | |||
* When the smartswitch device is booted, the host will boot first and leave the DPUs either up or down depending on the configuration. The DPUs will be up by default. | |||
* When the smartswitch device is booted, the host will boot first and leave the DPUs down by defualt. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: default spelling mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it.
removed the ID cloum in "show system-health dpu DPU0", changed the DPU admin_state default behavior, added the dpu_state transition update.
#### Schema for REBOOT_CAUSE - switch stateDB | ||
#### Reboot Cause | ||
1. The smartswitch needs to know the reboot cause for DPUs. Please refer to the CLI section for the various "show reboot-cause" options and their effects. | ||
* Each DPU will update its reboot cause history in the Switch ChassisStateDB upon boot up and also persist this on the host. The recent reboot-cause is derived from that list of reboot-causes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy it seems some agent in the DPU is responsible for pushing the reboot cause...so this will not work if the midplane is down for whatever reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy it seems some agent in the DPU is responsible for pushing the reboot cause...so this will not work if the midplane is down for whatever reason.
@prgeor 1. The switch hardware is capable of getting the DPU reboot-cause on the NPU side and the required software work is in progress to provide it when requested with "get_reboot_cause()" even if the DPU is not reachable though the midplane. 2. We will have support to indicate a DPU-reboot every time a DPU reboots. The existing workflow will remain the same but how the platform code fetches the reboot-cause will be as explained above when the implementation is ready. Updated this section.
* For persistent storage of the DPU reboot-cause and reboot-caue-history files use the existing host storage path and mechanism. | ||
|
||
#### Schema for REBOOT_CAUSE - switch stateDB | ||
#### Reboot Cause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy need separate section
- DPU reboot cause
- NPU reboot cause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor Added separate sections for DPU reboot cause and NPU reboot cause
* The switch boots up. Determines the NPU reboot cause. | ||
* Processes the previously stored NPU and DPU reboot-cause files and history files and updates the NPU reboot-cause into the StateDB and the DPU reboot-cause into the ChassisStateDB. | ||
* The above process is a one-shot event on boot up. | ||
* The module_db_update function in the NPU-PMON chassisd is an existing function constantly updating the operational status of the DPUs. This function looks for DPU operational status change events and when the DPUs come out of "offline" state, issues "get_reboot_cause" API to the platform. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy can you state explicityl that out of offline state must mean DPU rebooted just to make sure DPU offline -> online does not happen without reboot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
multiple places to be clear. Called out the output of the new CLI extensions on the DPUs.
#### 2.1.1 DPUs in dark mode | ||
* A smartswitch when configured to boot up with all the DPUs in it are powered down upon boot up is referred as DPUs in dark mode. | ||
* In the dark mode the platform.json file shown in section "3.1.3" will not have the dictionary for the DPUS. | ||
* The term dark mode is overloaded in some cases where the platform.json may have the dictionary but the config_db.json will have the admin_state of all DPU modules as "down". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the case when the platform.json has DPU information, but config DB doesn't have the DPU admin state configuration. The DPUs should be in downstate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv Done. This case is already shown in the example as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#closed
@@ -219,7 +250,7 @@ SmartSwitch PMON block diagram | |||
|
|||
### 3.1. Platform monitoring and management | |||
* SmartSwitch design Extends the existing chassis_base class and module_base class as described below. | |||
* Extend MODULE_TYPE in ModuleBase class with MODULE_TYPE_DPU and MODULE_TYPE_SWITCH to support SmartSwitch | |||
* Extend MODULE_TYPE in ModuleBase class with MODULE_TYPE_DPU to support SmartSwitch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is captured in Smart Switch reboot HLD. Why do we need to duplicate the information here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv Removed it. This was captured here way before that doc was generated. Now that the "Smart Switch reboot HLD" captures this, cleaned it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#closed
* The switch boots up. Determines the NPU reboot cause. | ||
* Processes the previously stored NPU and DPU reboot-cause files and history files. | ||
* A maximum of ten reboot-cause history entries per dpu will be persisted just like the npu. | ||
* Updates the NPU reboot-cause into the StateDB and the DPU reboot-cause into the ChassisStateDB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to differentiate between NPU and DPU configuration and store one in StateDB and anther in the ChassisStateDB? Won't it be easier to store everything in StateBD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv I raised the same question in one of the previous meetings and back then the consensus was to keep them in their respective DB. I'm ok either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#closed
|
||
``` | ||
2. Though the get_oper_status(self) can get the operational status of the DPU Modules, the current implementation only has limited capabilities. | ||
#### 2. DPU State | ||
Though the get_oper_status(self) can get the operational status of the DPU modules, the current implementation only has limited capabilities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which implementation has limitations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv The existing PMON can only indicate if a module is offline/online/faulty. It does not have the granularity to narrow it down to control-plane or data-plane issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation is not limited. It does exactly what is expected—provides a DPU operation state. We have a different API to query the software state. I suggest to remove the description of the limitation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#closed
* Can only state MODULE_STATUS_FAULT and can't show exactly where in the state progression the DPU failed. This is critical in fault isolation, DPU switchover decision, resiliency and recovery | ||
* Though this is platform implementation specific, in a multi vendor use case, there has to be a consistent way of storing and accessing the information. | ||
* Store the state progression (dpu_midplane_link_state, dpu_control_plane_state, dpu_data_plane_state) on the host ChassisStateDB. | ||
* Store the state progression (dpu_midplane_link_state, dpu_control_plane_state, dpu_data_plane_state) on the host ChassisStateDB using the push model specified in [section: 3.2.4 of SONiC Chassis Platform Management & Monitoring HLD](https://github.com/sonic-net/SONiC/blob/master/doc/pmon/pmon-chassis-design.md) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We agreed that the dpu_control_plane_state
and dpu_data_plane_state
will be pushed by the DPU. But the dpu_midplane_link_state
should be monitored and set by the NPU. Please update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv Updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#closed
* Thermal manager reads all thermal sensor data, run thermal policy and take policy action Ex. Set fan speed, set alarm, set syslog, set LEDs | ||
* Platform collects fan related data such as presence, failure and then applies fan algorithm to set the new fan speed | ||
* The north bound CLI/Utils/App use DB data to ”show environment”, ”show platform temp” show platform fan” | ||
* The DPUs will update the ChassisStateDB "TEMPERATURE_INFO" tables through redis client call which in turn will be pushed into the switch StateDB. | ||
* The existing "TEMPERATURE_INFO" schema will be used to store the values and is shown below for convenience. | ||
* For phase:1 implementation the sensor values collected by DPU will not be pushed to the chassisStateDB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? Won't be pushed by whom and on which platform?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv cleaned it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#closed
#### 3.4.1 Reboot Cause CLIs | ||
* There are two existing CLIs "show reboot-cause" and "show reboot-cause history" | ||
* These two CLIs are extended to "show reboot-cause all" and "show reboot-cause history \<option\>", where the "option" could be DPUx, all or SWITCH | ||
* When each DPU turns online the NPU chassisd will fetch the reboot-cause using the "get_reboot_cause()" API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to duplicate here information from #### Reboot workflow
section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv There is some overlap mainly to give a quick summary for the convenience of people just reading only CLI section. Cleaned it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* NPU-DPU (GNOI) soft reboot workflow will be captured in another document. | ||
* The GNOI server runs on the DPU even after the DPU is pre-shutdown and listens until the graceful shutdown finishes. | ||
* The host sends a GNOI signal to shutdown the DPU. The DPU does a graceful-shutdown if not already done and sends an ack back to the host. | ||
* Upon receiving the ack or on a timeout the host may trigger the switch PMON vendor API to shutdown the DPU. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: Here vendor API is to send PCI detachment but not shutdown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small nit, looks good to me otherwise.
Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan