SONiC Chassis Platform Requirements and Enhancements Analysis #945
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
@@ -0,0 +1,37 @@
Section 1: Requirements that are needed by default
1. On the LC, the reboot command should power-cycle the entire LC. The expectation is that the peer node detects link down when reboot is issued on the LC.
Sonic Reboot command
The SONiC reboot command that invokes the platform plugin to reboot.
@abdosi: remove "power-cycle" from this point.
2. On the RP, the reboot command should reboot the entire system (RP and LC). The expectation is that the peer node detects link down when reboot is issued on the RP.
Graceful reboot of the LC vs. power-cycling the LC? Open question. There is a possibility of SSD corruption without a graceful restart.
Basically, reboot of an LC from the RP is ungraceful.
Scenarios (high to low priority order):
- Supervisor goes down gracefully and LC goes down ungracefully (default).
- Supervisor goes down ungracefully and LC reacts to this. Needs further discussion. No conclusion yet.
- Supervisor goes down gracefully and LC goes down gracefully (orchestration starts from the RP). Needs further discussion. No conclusion yet.
Conclusion for the ideal case:
Enhance reboot to support a Supervisor-only reboot, with an option to choose the scope (Supervisor vs. entire chassis).
- Entire chassis: Supervisor goes down gracefully and LC goes down ungracefully. This is the default and must-have behavior.
- The Supervisor-only option is useful for:
a) Orchestrating a graceful reboot of the entire chassis via an external controller.
b) The dual-Supervisor case.
Needs more discussion: the worst case (Supervisor simply disappears, e.g. watchdog triggered on the supervisor):
a) Determine what happens now.
b) Enhancements will be needed to handle this.
If the line card can detect the sup going down, the LC should be kept in the down state (if the platform has LC shutdown capability); otherwise the LC self-reboots (this can repeat continuously if the SUP never comes up or is always in a bad state).
If the line card cannot detect the sup going down, the sup, after coming up, broadcasts to all LCs to self-reboot.
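The two worst-case rules above can be sketched as a small decision function. This is only an illustration of the proposal as worded in this comment, not an existing SONiC API; the names `LcAction` and `lc_action_on_sup_loss` are hypothetical.

```python
from enum import Enum


class LcAction(Enum):
    """Hypothetical LC reactions to an unplanned supervisor loss."""
    STAY_DOWN = "stay_down"                      # platform shuts the LC down
    SELF_REBOOT = "self_reboot"                  # LC reboots itself (may repeat)
    WAIT_FOR_SUP_BROADCAST = "wait_for_sup_broadcast"  # sup asks LCs to reboot later


def lc_action_on_sup_loss(lc_detects_sup_down: bool,
                          platform_can_shutdown_lc: bool) -> LcAction:
    """Pick the LC action according to the rules proposed in this comment."""
    if not lc_detects_sup_down:
        # The LC cannot tell the sup is gone; it keeps running until the sup
        # comes back up and broadcasts a self-reboot request to all LCs.
        return LcAction.WAIT_FOR_SUP_BROADCAST
    if platform_can_shutdown_lc:
        # Platform-specific shutdown capability keeps the LC in the down state.
        return LcAction.STAY_DOWN
    # No shutdown capability: the LC self-reboots, possibly repeatedly while
    # the sup never comes up or stays in a bad state.
    return LcAction.SELF_REBOOT
```

Keeping the decision in one pure function makes it easy to see that every combination of the two inputs has a defined outcome.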
if Line-card can not detect sup going down then sup after coming-up broadcast all LC's to self-reboot
This might have an issue! There are chassis/platforms where LCs come up on their own the moment power is applied to the chassis. So the Supervisor asking the LC to boot up may interrupt an LC that is already booting in chassis power-on/reload scenarios. Suggest keeping the Supervisor 'booting up' workflow the same regardless of SUP-only reload/boot-up or chassis reload.
if Line card detect sup going down
This would be done via keepalive/heartbeat exchange between SUP and LC.
Suggest adding another scenario: SUP detecting LC not there and discuss it.
Summary based on internal discussion:
- If the Supervisor goes down unplanned/ungracefully, there is no need to reboot the LC or take any other action.
- In the above scenario, the LC should generate syslog complaining that the Supervisor is not reachable (e.g. PMON trying to access Chassis DB to push data).
- This syslog can be used by alerting logic, and the necessary action can be taken from the LC/chassis perspective, such as isolation and a `config reload` on the LC by an external controller.
Regarding (1), "there is no need to reboot LC or any other action"
There are a number of concerns with letting linecards run headless while a supervisor reboots
- There will be a long window of time where no state exchange between linecards will be possible due to lack of chassis_db.
- The software on the linecard now has to deal with live disconnection/reconnection of connectivity to chassis_db or other supervisor
- There's no real stated benefit motivating this proposal. A strong working assumption behind the chassis architecture conception for the current set of use cases is that there is enough redundancy in the network to easily tolerate the entire chassis going down and coming back, especially when it is a rare event like an unplanned supervisor reboot.
In general, it is best to keep failure handling simple and predictable, and avoid divergent flows across different scenarios. So unless there is a specific concrete problem that is solved by keeping the linecards, it would be strongly preferable to always reboot linecards when a supervisor reboots.
At the very least, the headless operation should be made optional and not mandated.
Option 1 (Preferable)
If Supervisor goes down (headless operation) then LC should go down
Option 2
If LCs are running in headless mode, the LC should be able to send syslog as soon as possible (mid-plane connectivity should get restored) so that the LC can communicate the error to external management.
Supervisor platform services/code must be healthy for the linecards to run smoothly. When the supervisor platform code is not healthy (e.g. HW heartbeat down, SW heartbeat down), the supervisor is considered unhealthy or headless. In this case, the HW vendor has defined what the linecards do: linecards do not operate when the supervisor is detected down.
Can we rename this PR to better reflect this doc?
@rlhui Updated the PR title.
cc @Staphylo
cc @shyam77git
3. Config shut/unshut of LC will be supported as per the Chassis-d design.
4. Generate syslog for all critical events, and share the thresholds (for appropriate/needed components) in documents along with the recommended action for a given threshold range. The expectation is that we will bind syslog to our Alert Orchestration system and perform the recommended action based on the documents.
5. PCIe issue of not being able to detect FC ASICs and LC ASICs, and syslog for the same.
Integrate with the pcied process in PMON [sonic-platform-daemons/pcied at master · Azure/sonic-platform-daemons (github.com)]. Note: the current PCI daemon polling interval for PCI devices is 60 sec, which is a large poll interval. Does it need optimization?
should we make this change generic https://github.com/Azure/sonic-buildimage/blob/master/files/scripts/swss.sh#L193 ?
This can delay overall SW initialization. It is not backward-compatible as of now and can impact existing running systems.
Generate syslog for all critical events, and share the thresholds (for appropriate/needed components) in documents along with the recommended action for a given threshold range. The expectation is that we will bind syslog to our Alert Orchestration system and perform the recommended action based on the documents.
Per today's chassis workgroup sync-up, can we enhance this point to highlight the following:
a) For now (near-term solution): the external controller takes the recommended action (based on the document) in real time.
b) Enhancing it into a system-driven solution: I suggested having a platform-supplied policy (look-up) file of events and actions. On receiving an event (syslog), SONiC (LC, RP) or the external controller performs a lookup in this policy file and takes the recommended action.
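A minimal sketch of what the policy lookup in (b) could look like. No policy file format has been defined in this thread, so the rule structure, the message patterns and the action names below are all hypothetical placeholders.

```python
import re
from typing import NamedTuple, Optional, Sequence


class PolicyRule(NamedTuple):
    pattern: str  # regex matched against the syslog message (hypothetical)
    action: str   # recommended action name (hypothetical)


# Hypothetical platform-supplied policy; a real one would be loaded from a file.
EXAMPLE_POLICY = (
    PolicyRule(r"PCIe device .* not found", "reboot-module"),
    PolicyRule(r"temperature .* above critical threshold", "shutdown-module"),
)


def lookup_action(message: str,
                  policy: Sequence[PolicyRule] = EXAMPLE_POLICY) -> Optional[str]:
    """Return the action of the first rule matching the message, else None."""
    for rule in policy:
        if re.search(rule.pattern, message):
            return rule.action
    return None
```

First-match-wins ordering keeps the behavior predictable; whether the lookup runs on SONiC or on the external controller does not change the logic.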
The document can also state how many FC ASICs are needed to support a scenario with X FCs and Y LCs. Each platform vendor needs to provide this matrix.
6. Boot-up failure handling. Need to see the SONiC behaviour from a system perspective: docker status, and syslog generated with the required/correct information.
7. HW watchdog adhering to current SONiC behavior: started before reboot and explicitly disabled post reboot by SONiC (this means SONiC is booted up and services are fine).
The watchdog scheme will need enhancement if we want to detect some faults (e.g. CPU stuck in running state) where a given platform/vendor has a HW watchdog that can take recovery action. Currently, since the HW watchdog is disabled by SONiC post boot, such scenarios cannot be handled even if the platform/vendor can support it. It can be used for debugging purposes, where possible, by collecting dumps.
3. Config shut/unshut of LC will be supported as per the Chassis-d design.
not all platforms can support LC shut as there might not be power control on LC
@abdosi will update the point from supervisor.
Useful when the LC is not reachable via SSH or console.
These commands are invoked from the supervisor for a given LC:
shut: power off the card (platform dependent; if not supported, return "not supported").
unshut: bring power back (platform dependent; if not supported, return an error).
reboot: can be a power-cycle or a CPU reset only (best effort per platform capability; if not supported, return an error).
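The capability-dependent behavior of these three commands could be modeled roughly as below. This is a sketch only: `ModuleControl`, `NotSupportedError` and the returned strings are hypothetical and do not correspond to the existing platform API.

```python
class NotSupportedError(Exception):
    """Raised when the platform lacks the capability for an operation."""


class ModuleControl:
    """Hypothetical supervisor-side control of one line card."""

    def __init__(self, name: str, has_power_control: bool, has_cpu_reset: bool):
        self.name = name
        self.has_power_control = has_power_control
        self.has_cpu_reset = has_cpu_reset

    def shut(self) -> str:
        if not self.has_power_control:
            raise NotSupportedError(f"{self.name}: shut not supported")
        return "power-off"

    def unshut(self) -> str:
        if not self.has_power_control:
            raise NotSupportedError(f"{self.name}: unshut not supported")
        return "power-on"

    def reboot(self) -> str:
        # Best effort: prefer a full power-cycle, fall back to a CPU reset.
        if self.has_power_control:
            return "power-cycle"
        if self.has_cpu_reset:
            return "cpu-reset"
        raise NotSupportedError(f"{self.name}: reboot not supported")
```

Raising a dedicated error rather than silently doing nothing lets the CLI report "not supported" to the operator, as the comment asks.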
What does reboot-cause show on the LC when the above options are invoked on the supervisor? Need to revisit/discuss.
config chassis startup <module-name>
config chassis shutdown <module-name>
reboot <module-name>
module-name here refers to an LC. Based on the 03/18 update (see below), FCs are also in scope.
Power-cycle of an FC can also have implications for the kernel.
Does graceful FC handling also depend on kernel modules?
Do we support FC insertion/removal?
03/18: Update:
Possible steps for RMA of FC:
- Isolate the Chassis (No Traffic)
- Config chassis shut on FC (Enhancement possible: To see if we can have SONiC Service also gracefully stopped here. Need to check.)
- Unplug the FC
- Plug new FC (post-RMA)
- config chassis startup on FC
- Config reload on Supervisor
Steps to reload an FC (link/parity/cell errors are seen and the platform vendor recommends reloading the FC; not at the RMA stage but recoverable):
What about a bad LC ASIC? In such a case, action needs to be taken from the LC perspective based on the LC alert.
These are not frequent errors.
If N+1 redundancy:
- Config chassis shut on FC (Enhancement possible: To see if we can have SONiC Service also gracefully stopped here. Need to check.)
- config chassis startup on FC
else:
See above RMA process
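As an illustration, the two procedures above can be written as a step planner. The function name and the free-text steps are hypothetical; the command strings follow the `config chassis startup/shutdown <module-name>` form given earlier in this comment, and the non-redundant branch simply reuses the RMA steps as the comment says.

```python
from typing import List


def fc_recovery_steps(fc_name: str, has_n_plus_1_redundancy: bool) -> List[str]:
    """Return the ordered steps to reload a fabric card (hypothetical helper)."""
    shutdown = f"config chassis shutdown {fc_name}"
    startup = f"config chassis startup {fc_name}"
    if has_n_plus_1_redundancy:
        # With N+1 redundancy the FC can be bounced without isolating the chassis.
        return [shutdown, startup]
    # Otherwise follow the RMA process listed above.
    return [
        "isolate the chassis (no traffic)",
        shutdown,
        f"unplug {fc_name}",
        f"plug new {fc_name}",
        startup,
        "config reload on Supervisor",
    ]
```

For example, `fc_recovery_steps("FABRIC-CARD0", True)` yields only the shutdown and startup commands.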
FC reload/shutdown scenario to be discussed separately.
Beside above cases (graceful handling of FC shutdown/reload), another case to discuss:
If the chassis is running with fewer FCs/FC-NPUs, what is SONiC's expectation? Depending on the bandwidth impact, isolate the entire chassis, isolate (shut down) the impacted FC(s), or replace the impacted FC(s)?
`reboot`: similar to this, there should be a `shutdown` CLI to shut down the specified module (LC/FC).
`config chassis startup` and `config chassis shutdown`: the intent/goal of these commands is to do a config shutdown or startup (reload/bring-up) of the specified module.
Wondering why the 'chassis' keyword is in these commands?
In my opinion, when an FC goes bad/down, a syslog should be generated. For a system that has extra FC redundancy, I think losing one FC does not impact the overall operation of the chassis. So in this case the syslog should raise an alarm so that maintenance can be scheduled to replace the FC. If another FC is lost, or the chassis has no FC redundancy, then the FC-loss syslog should clearly indicate that the chassis is running in degraded mode with expected traffic impact. This syslog should cause the alarm service to detect it and trigger mitigation steps (whether admin user intervention or automation to start isolating it). The chassis itself cannot tell whether redundancy is built into the surrounding network, and if it triggers self-isolation it might cause more customer impact. This is just my own opinion.
config chassis startup <module-name>
config chassis shutdown <module-name>
reboot <module-name>
Note that "reboot" is an ungraceful reload of the LC. It should not be expected that the Supervisor will orchestrate a graceful reboot of the LC. The Sup will simply cycle power to the LC.
8. chassisd daemon support on both LC and RP with all fields of table "CHASSIS_MODULE_TABLE|xxxx" correctly populated.
9. chassisd daemon support for populating fields in table "CHASSIS_ASIC_TABLE|xxx"; this is used to start swss/syncd on the SUP when the fabric ASIC is ready.
10. Slot number in "CHASSIS_MODULE_TABLE|xxxx" need not be unique? Slot number is based on physical layout (e.g. LCs can be back-facing and numbered 0..n, and FCs can be front-facing and numbered 0..n). Can chassisd support this model?
- The physical slot ID needs to be unique and can use the sticker/label name defined by the given platform vendor. Use case: a technician identifying a given card by visual inspection.
FYI - this change requires changing the get_slot (in module.py) and get_supervisor_slot and get_my_slot (in chassis.py) PMON APIs to return a string instead of an int.
However, in platform api sonic-mgmt tests this field is expected to be an int (example: https://github.com/Azure/sonic-mgmt/blob/master/tests/platform_tests/api/test_module.py#L334)
Do we modify the tests to be either int or str for now to allow backward compatibility till all vendors implement this enhancement?
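One way to make the check backward compatible is to accept either type. The helper below is a hypothetical sketch, not the existing sonic-mgmt test code; it only shows the shape of a relaxed assertion.

```python
def is_valid_slot(slot) -> bool:
    """Accept an int slot number or a non-empty string label (hypothetical check)."""
    if isinstance(slot, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        return False
    if isinstance(slot, int):
        return True
    return isinstance(slot, str) and slot.strip() != ""
```

A test could then assert `is_valid_slot(module.get_slot())` until all vendors return the sticker/label string.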
10. psud power algorithm on supervisor as specified in chassis design document
11. PSU LED Status in the show command of supervisor
Ref for Point 10: https://github.com/Azure/SONiC/blob/master/doc/pmon/pmon-chassis-design.md. Please check whether there is a platform API for setting the master LED for the PSU. The API is there: https://github.com/Azure/sonic-platform-common/blob/master/sonic_platform_base/psu_base.py#L226
Ref for Point 11: `show platform psustatus` to display the current LED color (based on the current running status of the PSU).
12. TEMPERATURE_INFO table update into Chassis State DB from both Supervisor and LC. Local TEMPERATURE_INFO is also available in LC STATE_DB.
13. Fan speed algorithm on supervisor as specified in chassis design document
14. FAN LED Status in the show command of supervisor
Fan tray LED display enhancement. Need to check whether SONiC has a component for fan tray/fan drawer.
For now, a given vendor can overload the fan LED.
15. reboot-cause reason and history work correctly for both RP and LC
16. show commands for mid-plane switch as per Chassis Design Document. Add namespace parameter support for "show chassis midplane-status" command.
Need to check. `show ip interface -n <asic> -d all` should be displaying it.
6. Boot-up failure handling. Need to see the SONiC behaviour from a system perspective: docker status, and syslog generated with the required/correct information.
This is more to understand the behavior and identify any test gaps that need to be covered.
2. Section 2: General Chassis Enhancements that are needed
1. LC/FC Fabric Link down Handling
Data-path component. Not in the scope of platform. The expectation is to have at least monitoring and an alert/syslog in such cases, along with the action that needs to be taken.
2. Module/Chassis/Board LEDs. Need general infra enhancement of the LED daemon and show commands.
A new design document is needed. Need to discuss in the SONiC community.
3. Quicker LC/FC operational status detection using get_change_event() notification handling (to detect async card up/down events) rather than the current polling interval of 10 sec.
4. Generic console for LC. Possible using this: https://github.com/Azure/SONiC/blob/master/doc/console/SONiC-Console-Switch-High-Level-Design.md ?
Needs more analysis from platform vendors and will need a revisit.
5. Process for RMA of a card (Fabric/LC). This is just a discussion to document the correct process for doing so.
Platform vendor recommendation/guidance needed here.
6. Monit check on the supervisor to check whether the LCs are reachable. This is to alert if a linecard is down. Do we need Monit here, or can we use the above 10 sec polling?
If CHASSIS_MIDPLANE_TABLE has the information, monit can read it from there; otherwise we need to see if we can push it to the DB.
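A rough sketch of such a check, operating on table contents already read into a dict. It assumes each CHASSIS_MIDPLANE_TABLE entry carries an `access` field with the string `"True"` or `"False"`; that field name and the function `unreachable_modules` are assumptions for illustration, and the DB read itself is left out.

```python
from typing import Dict, List


def unreachable_modules(midplane_table: Dict[str, Dict[str, str]]) -> List[str]:
    """List modules whose midplane entry does not report access as true.

    midplane_table maps a module name to its fields, as read from the DB.
    A missing "access" field is treated as unreachable (assumption).
    """
    down = []
    for name, fields in sorted(midplane_table.items()):
        if str(fields.get("access", "False")).lower() != "true":
            down.append(name)
    return down
```

A monit script could log one syslog line per returned module name so that the alerting system can pick it up.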
7. Handling of parallel reboot of linecard and supervisor. This should not result in the chassis/linecard going down or becoming unreachable. (Mentioned by Arvind.) If we follow Section 1 Point 2, should this be handled?
Need to add more test cases to cover the different scenarios here.
8. Mechanism to recover a down/unreachable linecard without power-cycle or reboot of the whole chassis.
`config chassis startup <module-name>` ==> Power on the LC (if the platform can do it)
`config chassis shutdown <module-name>` ==> Power off the LC (if the platform can do it)
`reboot <module-name>` ==> Power on/off toggle for the LC (if the platform can do it) or CPU reset toggle for the LC
In the worst case, we need to power-cycle the chassis from an external agent.
For clarity to platforms/vendors: although these commands are under `config`, they are only executed, and not saved, until `config save` is issued.
9. Enhance the "show chassis module status" command so that linecards display their hostname instead of generic names like LINECARD1.
`show chassis module status` is a platform-specific command. May need another command or an enhancement in SONiC.
10. Support "show system-health detail/monitor-list/summary" commands in RP/LC
3. Section 3: Enhancements based on Significant Design Changes
1. Auto handling by platform SW to reboot/shutdown the HW component when detecting critical faults.
2. Temperature measurement category enhancements: more granular, with an increased polling interval. Also optimize the show command to not dump all sensors and to filter based on location.
Need SONiC Design Document. Cisco can propose something on this.
3. Move voltage and current sensor support from the existing sensorsd/libsensors model to the PMON/thermalCtld model. This provides the ability/mechanism in the SONiC NOS to poll the board's voltage and current sensors (from the platform) for the power algorithm.
Need SONiC Design Document. Cisco can propose something on this.
4. Midplane Switch Counters (Debugging) / Modifying QoS Properties if needed (Performance)
Each platform vendor can provide a document on how to debug midplane drops and on any optimization that we need to do.
Based on PR sonic-net/SONiC#945, we should return the sticker/label name on the chassis for the physical slot id in the get_supervisor_slot PMON API and the 'show chassis module status' command. For Nokia linecards, the sticker label for the supervisor is 'A'. Thus we need to allow a string as a possible return value, in addition to int.
Capture Platform specific Requirements for Chassis Platforms.