SONiC Chassis Platform Requirements and Enhancements Analysis #945

Merged
merged 11 commits into from
Jul 6, 2024

@abdosi abdosi commented Feb 23, 2022

Capture Platform specific Requirements for Chassis Platforms.

abdosi and others added 8 commits November 8, 2021 22:55
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
@@ -0,0 +1,37 @@
Section 1 Requirements that are needed by default:-
1. On LC the reboot command should power-cycle the entire LC. Expectation is that the peer node should detect link down when reboot is given on LC
Contributor Author

SONiC reboot command

Contributor Author

The SONiC reboot command that invokes the platform plugin to reboot.

Contributor Author

@abdosi to remove "power-cycle" from this point.

2. On RP the reboot command should reboot the entire system (RP and LC). Expectation is that the peer node should detect link down when reboot is given on RP
Contributor Author

@abdosi abdosi Feb 23, 2022

Open question: graceful reboot of the LC vs. power-cycling the LC? Without a graceful restart there is a possibility of SSD corruption.

Basically, a reboot of an LC from the RP is ungraceful.

Scenarios (high to low priority order):

  1. Graceful Supervisor going down and LC ungraceful (default).
  2. Ungraceful Supervisor going down and LC reacting to it. Needs further discussion; no conclusion yet.
  3. Graceful Supervisor going down and LC graceful (orchestration starts from the RP). Needs further discussion; no conclusion yet.

Conclusion for the ideal case:
Enhance reboot to support a Supervisor-only reboot, with an option to choose the scope (Supervisor vs. entire chassis).

  • Entire chassis: graceful Supervisor going down and LC ungraceful. This is the default and must-have behavior.

  • The Supervisor-only option is useful for:

    a) Orchestrating a graceful reboot of the entire chassis via an external controller.
    b) The dual-Supervisor case.

Needs more discussion for the worst case (the Supervisor simply disappears, e.g. a watchdog triggered on the Supervisor):
a) Determine what happens now.
b) Enhancements will be needed to handle this.

If the line card can detect the Supervisor going down, the LC should be kept in the down state (if the platform has an LC shutdown capability); otherwise the LC self-reboots (this can be continuous if the Supervisor never comes up or is always in a bad state).
If the line card cannot detect the Supervisor going down, the Supervisor, after coming up, broadcasts to all LCs to self-reboot.

Contributor

@shyam77git shyam77git Mar 7, 2022

> if Line-card can not detect sup going down then sup after coming-up broadcast all LC's to self-reboot

This might be an issue. There are chassis/platforms where the LCs come up on their own the moment power is applied to the chassis. So the Supervisor asking for an LC boot-up may interrupt an LC that is already booting in chassis power-on/reload scenarios. Suggest keeping the Supervisor "booting up" workflow the same regardless of whether it is a SUP-only reload/boot-up or a chassis reload.

> if Line card detect sup going down

This would be over a keepalive/heartbeat exchange between the SUP and the LC.
Suggest adding another scenario, the SUP detecting that an LC is not there, and discussing it.

Contributor Author

@abdosi abdosi Mar 9, 2022

Summary based on internal discussion:

  1. If the Supervisor goes down unplanned/ungracefully, there is no need to reboot the LC or take any other action.
  2. In the above scenario, the LC should generate a syslog complaining that the Supervisor is not reachable (e.g. PMON trying to access the Chassis DB to push data).
  3. The above syslog can be used by alerting logic, and the necessary action can be taken from the LC/chassis perspective, such as isolation and a config reload on the LC by an external controller.
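
Points 2 and 3 can be pictured as a small LC-side monitor that counts failed attempts to reach the Supervisor and logs once when a threshold is crossed. This is only a sketch under stated assumptions: the class name, logger name, threshold, and message text are hypothetical and not taken from PMON or chassisd, and real code would probe the Chassis DB rather than being fed booleans.

```python
import logging

logger = logging.getLogger("lc_sup_monitor")  # hypothetical logger name


class SupervisorReachability:
    """Tracks consecutive failed attempts to reach the Supervisor from an LC."""

    def __init__(self, max_misses=3):
        self.max_misses = max_misses  # assumed threshold, not from the PR
        self.misses = 0
        self.alerted = False

    def record(self, reachable):
        """Record one probe result; return True only when an alert is raised."""
        if reachable:
            # Supervisor is back: clear state so a later outage alerts again.
            self.misses = 0
            self.alerted = False
            return False
        self.misses += 1
        if self.misses >= self.max_misses and not self.alerted:
            self.alerted = True
            logger.error("Supervisor unreachable after %d attempts", self.misses)
            return True
        return False
```

Logging once per outage, rather than on every failed probe, keeps the alerting pipeline from being flooded while the Supervisor stays down.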

Contributor

Regarding (1), "there is no need to reboot LC or any other action":
There are a number of concerns with letting linecards run headless while a supervisor reboots:

  • There will be a long window of time during which no state exchange between linecards is possible, due to the lack of chassis_db.
  • The software on the linecard now has to deal with live disconnection/reconnection of connectivity to chassis_db or other supervisor
  • There's no real stated benefit motivating this proposal. A strong working assumption behind the chassis architecture conception for the current set of use cases is that there is enough redundancy in the network to easily tolerate the entire chassis going down and coming back, especially when it is a rare event like an unplanned supervisor reboot.

In general, it is best to keep failure handling simple and predictable, and to avoid divergent flows across different scenarios. So unless there is a specific concrete problem that is solved by keeping the linecards running, it would be strongly preferable to always reboot linecards when a supervisor reboots.

At the very least, the headless operation should be made optional and not mandated.

Contributor Author

Option 1 (Preferable)
If the Supervisor goes down (headless operation), then the LCs should go down.

Option 2
If the LCs are running in headless mode, they should be able to send syslog as soon as possible (mid-plane connectivity should be restored) so that they can communicate the error to external management.

Contributor

Supervisor platform services/code need to be healthy for the linecards to run smoothly. When the supervisor platform code isn't healthy (e.g. the HW heartbeat or SW heartbeat is down), the supervisor is considered unhealthy, or headless. For this case, the HW vendor has defined what the linecards do: linecards don't operate when the supervisor is detected as down.

Contributor

rlhui commented Mar 2, 2022

Can we rename this PR to better reflect this doc?

@abdosi abdosi changed the title from "PMON" to "SONiC Chassis Platform Requirements and Enhancements Analysis" Mar 2, 2022
Contributor Author

abdosi commented Mar 2, 2022

> Can we rename this PR to better reflect this doc?

@rlhui Updated the PR title.

Contributor Author

abdosi commented Mar 2, 2022

cc @Staphylo

Contributor Author

abdosi commented Mar 2, 2022

cc @shyam77git

Contributor Author

abdosi commented Mar 2, 2022

cc @mprabhu-nokia

3. Config shut/unshut of LC will be supported as per the Chassis-d design.
4. Generate syslog for all the critical events, and share in documents the thresholds (for appropriate/needed components) and the recommended action for a given threshold range. Expectation is that we will bind syslog to our Alert Orchestration system and perform the recommended action based on the documents.
5. PCIe issue of not being able to detect FC ASICs and LC ASICs, and syslog for the same.
Integrate with pcied process in PMON [sonic-platform-daemons/pcied at master · Azure/sonic-platform-daemons (github.com)]. Note: the current PCI daemon polling interval for PCI devices is 60 sec, which is a large poll interval. Does it need optimization?
Contributor Author

@abdosi abdosi Mar 9, 2022

This can delay overall SW initialization. It is not backward-compatible as of now and can impact existing running systems.

Contributor

> Generate syslog for all the critical events and share the threshold (for appropriate/needed components) in documents and recommended for given threshold range. Expectation is we will bind syslog to our Alert Orchestration system and perform recommended action based on the documents.

Per today's chassis workgroup sync-up, can we enhance this point to highlight the following:
a) For now (near-term solution): an external controller takes the recommended action (based on the document) in real time.
b) Enhancing it into a system-driven solution: I suggested having a platform-supplied policy (lookup) file of events and actions. On receiving an event (syslog), SONiC (LC, RP) or the external controller performs a lookup in this policy file and takes the recommended action.
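
The policy lookup suggested in (b) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the patterns, the action names, and the default are hypothetical, and a real policy file would be supplied by the platform vendor in whatever format the community agrees on.

```python
import re

# Hypothetical policy table: (regex on the syslog message, recommended action).
POLICY = [
    (r"fabric card .* down", "isolate_chassis"),
    (r"temperature .* above critical", "shutdown_module"),
]

DEFAULT_ACTION = "alert_only"  # assumed fallback when no rule matches


def lookup_action(syslog_message, policy=POLICY):
    """Return the action of the first matching rule, or DEFAULT_ACTION."""
    for pattern, action in policy:
        if re.search(pattern, syslog_message, re.IGNORECASE):
            return action
    return DEFAULT_ACTION
```

The same function could run on the LC, the RP, or the external controller, since it only needs the message text and the policy table.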

Contributor Author

The document can also state how many FC ASICs are required to support a scenario with X FCs and Y LCs. Each platform vendor needs to provide this matrix.

6. Boot-up failure Handling. Need to see the SONiC behaviour from system perspective/docker status/syslog getting generated with required/correct information
7. HW-Watchdog adhering to current SONiC behavior. Start before reboot and explicitly disabled post reboot by SONiC (This means SONiC is booted up and Services are fine)
Contributor Author

@abdosi abdosi Mar 9, 2022

The watchdog scheme will need enhancement if we want to detect certain faults (e.g. the CPU getting stuck in the running state) on platforms whose HW watchdog can take recovery action in such cases. Currently, since the HW watchdog is disabled by SONiC post boot, these scenarios cannot be handled even if the given platform/vendor can support it. Where possible, it can also be used for debugging purposes by collecting dumps.
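
To make the difference concrete, here is a toy model of a HW watchdog driven by an explicit clock. It is a sketch only: the class and its methods are hypothetical and do not represent the SONiC platform watchdog API or any vendor's hardware.

```python
class SimulatedWatchdog:
    """Toy HW watchdog; time is passed in explicitly, in seconds."""

    def __init__(self):
        self.deadline = None  # None means the watchdog is disarmed

    def arm(self, now, timeout):
        """Arm (or re-arm, i.e. 'kick') the watchdog to fire at now + timeout."""
        self.deadline = now + timeout

    def disarm(self):
        self.deadline = None

    def expired(self, now):
        """True if the watchdog is armed and its deadline has passed."""
        return self.deadline is not None and now >= self.deadline
```

With the current behavior (arm before reboot, disarm once SONiC is up), `expired()` can never become true after boot, so a CPU that hangs later goes undetected. In the enhanced scheme the watchdog stays armed and is re-armed periodically, so a missed re-arm eventually trips it.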

3. Config shut/unshut of LC will be supported as per the Chassis-d design.
Contributor Author

Not all platforms can support LC shut, as there might not be power control on the LC.

Contributor Author

@abdosi will update the point from supervisor.

Contributor Author

@abdosi abdosi Mar 2, 2022

Useful when the LC is not reachable via SSH or console.

These commands are invoked from the supervisor for a given LC:
shut: power off the card (platform dependent; if not supported, return "not supported").
unshut: bring power back (platform dependent; if not supported, return an error).
reboot: can be a power-cycle or a CPU reset only (best effort per platform capability; if not supported, return an error).
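
The capability-dependent behavior of these three operations can be sketched as follows. This is a minimal illustration with hypothetical names; it is not the chassisd or platform API, and preferring a power-cycle over a CPU reset when both are available is an assumption, not something the discussion settles.

```python
class NotSupportedError(Exception):
    """Raised when the platform cannot perform the requested operation."""


class LineCard:
    """Hypothetical model of an LC as seen from the supervisor."""

    def __init__(self, name, has_power_control=True, has_cpu_reset=True):
        self.name = name
        self.has_power_control = has_power_control
        self.has_cpu_reset = has_cpu_reset
        self.powered = True

    def shut(self):
        if not self.has_power_control:
            raise NotSupportedError(f"{self.name}: power shut not supported")
        self.powered = False

    def unshut(self):
        if not self.has_power_control:
            raise NotSupportedError(f"{self.name}: power unshut not supported")
        self.powered = True

    def reboot(self):
        """Return the reset method used; assumed order is power-cycle first."""
        if self.has_power_control:
            self.powered = True  # card ends up powered after the cycle
            return "power-cycle"
        if self.has_cpu_reset:
            return "cpu-reset"
        raise NotSupportedError(f"{self.name}: reboot not supported")
```

Raising a dedicated error lets the CLI layer print "not supported" instead of silently doing nothing on platforms without LC power control.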

Contributor Author

@abdosi abdosi Mar 2, 2022

What does reboot-cause show on the LC when the above options are invoked on the supervisor? Need to revisit/discuss.

Contributor Author

@abdosi abdosi Mar 2, 2022

config chassis startup <module-name>
config chassis shutdown <module-name>
reboot <module-name>

module-name here refers to LCs. Based on the 03/18 update (see below), FCs are also in scope.
A power-cycle of an FC can also have implications for the kernel.
Does graceful handling of FCs also depend on kernel modules?
Do we support FC insertion/removal?

03/18 update:

Possible steps for RMA of an FC:

  1. Isolate the chassis (no traffic).
  2. config chassis shutdown on the FC (possible enhancement: see if we can also have the SONiC services gracefully stopped here; need to check).
  3. Unplug the FC.
  4. Plug in the new FC (post-RMA).
  5. config chassis startup on the FC.
  6. Config reload on the Supervisor.

Steps to reload an FC (link/parity/cell errors are seen and the platform vendor recommends reloading the FC; not at the RMA stage, but recoverable):

What about a bad LC ASIC? In such a case, action needs to be taken from the LC perspective based on the LC alert.
These are not frequent errors.

If N+1 redundancy:

  1. config chassis shutdown on the FC (possible enhancement: see if we can also have the SONiC services gracefully stopped here; need to check).
  2. config chassis startup on the FC.

else
See the RMA process above.
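
The branch between the two FC procedures can be written down as a small helper that returns the ordered operator steps. This is a sketch for discussion only: the function name and the module name used in the example are hypothetical, and the step strings simply restate the lists above.

```python
def fc_reload_steps(fc_name, has_n_plus_1):
    """Return the ordered operator steps to reload a fabric card."""
    shutdown = f"config chassis shutdown {fc_name}"
    startup = f"config chassis startup {fc_name}"
    if has_n_plus_1:
        # With N+1 redundancy the discussion lists only shutdown and startup.
        return [shutdown, startup]
    # Without redundancy, fall back to the RMA sequence listed above.
    return [
        "isolate the chassis (no traffic)",
        shutdown,
        f"unplug {fc_name}",
        f"plug in the new {fc_name}",
        startup,
        "config reload on the Supervisor",
    ]
```

For example, `fc_reload_steps("FABRIC-CARD0", True)` yields just the shutdown and startup commands, while passing `False` yields all six RMA steps.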

Contributor

The FC reload/shutdown scenario should be discussed separately.
Besides the above cases (graceful handling of FC shutdown/reload), another case to discuss:
If the chassis is running on fewer FCs/FC-NPUs, what is SONiC's expectation? Depending on the bandwidth impact, isolate the entire chassis, isolate (shut down) the impacted FC(s), or replace the impacted FC(s)?

Contributor

@shyam77git shyam77git Mar 7, 2022

> reboot

Similar to this, there should be a 'shutdown' CLI to shut down a specified module (LC/FC).

> config chassis startup
> config chassis shutdown

The intent/goal of these commands is to do a config shutdown or startup (reload/bring-up) of the specified module.
Wondering why the 'chassis' keyword is in these commands?

Contributor

In my opinion, when an FC goes bad/down, a syslog should be generated. For a system that has extra FC redundancy, I think losing one FC does not impact the overall operation of the chassis, so in this case the syslog should raise an alarm that allows a maintenance service to be scheduled to replace the FC. If another FC is lost, or the chassis has no FC redundancy, then the FC-loss syslog should clearly indicate that the chassis is running in degraded mode with expected traffic impact. This syslog should cause the alarm service to detect it and trigger mitigation steps (whether admin user intervention or automation to start isolating it). The chassis itself cannot tell whether there is redundancy built into the surrounding network, and if it triggers self-isolation it might cause more customer impact... This is just my own opinion...
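
The distinction drawn here, between a loss absorbed by redundancy and a loss that degrades the chassis, can be sketched as a simple classification. The function name, the three labels, and the idea of a single "required FCs" number are assumptions for illustration; the actual requirement would come from the vendor-provided FC/LC matrix mentioned earlier in the thread.

```python
def fc_loss_severity(installed_fcs, required_fcs, failed_fcs):
    """Classify FC loss as 'ok', 'maintenance', or 'degraded'."""
    if failed_fcs <= 0:
        return "ok"
    healthy = installed_fcs - failed_fcs
    if healthy >= required_fcs:
        # Redundancy absorbed the loss: alarm so a replacement can be scheduled.
        return "maintenance"
    # Not enough healthy FCs left: syslog should state degraded mode.
    return "degraded"
```

Note that the function only labels the event. Consistent with the comment above, the decision to isolate is left to the alarm service or an operator, not to the chassis itself.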

Contributor

> config chassis startup <module-name>
> config chassis shutdown <module-name>
> reboot <module-name>

Note that "reboot" is an ungraceful reload of the LC. It should not be expected that the Supervisor will orchestrate a graceful reboot of the LC; the Sup will simply cycle power to the LC.

8. chassisd daemon support on both LC and RP with all fields of table "CHASSIS_MODULE_TABLE|xxxx" correctly populated
9. chassisd daemon support populating fields in table "CHASSIS_ASIC_TABLE|xxx"; this is used to start swss/syncd in SUP when FABRIC ASIC is ready.
10. Slot number in "CHASSIS_MODULE_TABLE|xxxx" need not be unique? Slot number is based on physical layout (e.g. LCs can be back-facing and numbered 0..n, and FCs can be front-facing and numbered 0..n). Can chassisd support this model?
Contributor Author

  1. The physical slot ID needs to be unique and can use the sticker/label name defined by the given platform vendor. Use case: a technician identifying a given card by visual inspection.


FYI - this change requires changing the get_slot (in module.py) and get_supervisor_slot and get_my_slot (in chassis.py) PMON APIs to return a string instead of an int.

However, in the platform API sonic-mgmt tests this field is expected to be an int (example: https://github.com/Azure/sonic-mgmt/blob/master/tests/platform_tests/api/test_module.py#L334)

Do we modify the tests to accept either int or str for now, to allow backward compatibility until all vendors implement this enhancement?
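
A backward-compatible check along the lines asked about here might look like the sketch below. The helper name is hypothetical and this is not the actual sonic-mgmt test code; it only shows one way to accept both the legacy int and a sticker/label string.

```python
def is_valid_slot(value):
    """Accept an int (legacy) or a non-empty str (sticker/label name)."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        return False
    if isinstance(value, int):
        return True
    return isinstance(value, str) and value.strip() != ""
```

A test could then assert `is_valid_slot(slot)` on the value returned by the platform API instead of asserting `isinstance(slot, int)`.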

10. psud power algorithm on supervisor as specified in chassis design document
11. PSU LED Status in the show command of supervisor
Contributor Author

Ref for Point 11: show platform psustatus to display LED current color (based on current running status of PSU)

12. TEMPERATURE_INFO table update into Chassis State DB from both Supervisor and LC. Local TEMPERATURE_INFO is also available in LC STATE_DB.
13. Fan speed algorithm on supervisor as specified in chassis design document
14. FAN LED Status in the show command of supervisor
Contributor Author

@abdosi abdosi Mar 16, 2022

Fan tray LED display enhancement. Need to check whether SONiC has a component for Fan-Tray/Fan-Drawer.
For now, a given vendor can overload the fan LED.

15. reboot-cause reason and history is working fine for both RP and LC
16. show commands for mid-plane switch as per Chassis Design Document. Add namespace parameter support for "show chassis midplane-status" command.
Contributor Author

Need to check. show ip interface -n <asic> -d all should be displaying it.

6. Boot-up failure Handling. Need to see the SONiC behaviour from system perspective/docker status/syslog getting generated with required/correct information
Contributor Author

This is more to understand the behavior and identify any missing test-gap to cover this.


2. Section2: General Chassis Enhancements that are needed:-
1. LC/FC Fabric Link down Handling
Contributor Author

@abdosi abdosi Mar 18, 2022

Data-path component; not in scope of platform. The expectation is to have at least monitoring and an alert/syslog in such cases, along with the action to be taken in such scenarios.


2. Module/Chassis/Board LEDs. Need general infra enhancement of led daemon and show commands
Contributor Author

A new design document is needed. Need to discuss in the SONiC community.

3. Quicker LC/FC operational status detection using get_change_event() notification handling (to detect async card up/down events) rather than the current polling interval of 10 sec
4. Generic console for LC using . Possible using this: https://github.com/Azure/SONiC/blob/master/doc/console/SONiC-Console-Switch-High-Level-Design.md ?
Contributor Author

Needs more analysis from platform vendors and will need to be revisited.

5. Process for RMA the card (Fabric/LC). This is just a discussion to document correct process for doing so.
Contributor Author

Platform vendor recommendation/guidance needed here.

6. Monit check on the supervisor to check if the LCs are reachable. This is to alert if the linecard is down. Do we need Monit here or use above 10 sec polling ?
Contributor Author

@abdosi abdosi Mar 23, 2022

If CHASSIS_MIDPLANE_TABLE has the information, Monit can read it from there; otherwise we need to see if we can push it to the DB.
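
Assuming the table contents have already been read into a dictionary, the reachability check itself is small. This is a sketch under assumptions: the `access` field name, its "True"/"False" string values, and the module names in the usage below are assumptions for illustration, and the DB read is deliberately left out.

```python
def unreachable_modules(midplane_table):
    """Return sorted names of modules whose 'access' field is not true.

    midplane_table maps a module name to a dict of its table fields.
    A missing 'access' field is treated as unreachable.
    """
    down = []
    for name, fields in midplane_table.items():
        access = str(fields.get("access", "False")).strip().lower()
        if access != "true":
            down.append(name)
    return sorted(down)
```

A Monit check script could exit non-zero whenever this list is non-empty, which would raise the alert that the linecard is down.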

7. Handling of parallel reboot of linecard and supervisor. This should not result in the chassis/linecard to go down or unreachable. (Mention by Arvind) . If we follow Section 1 Point 2 this should be handled ?
Contributor Author

Need to add more test cases to cover the different scenarios here.

8. Mechanism to recover a down/unreachable linecard without power-cycle or reboot of the whole chassis.
Contributor Author

@abdosi abdosi Mar 23, 2022

config chassis startup <module-name> ==> Power on LC (if the platform can do it)
config chassis shutdown <module-name> ==> Power off LC (if the platform can do it)
reboot <module-name> ==> Power off/on toggle for LC (if the platform can do it) or CPU reset toggle for LC

Worst case, we need to power-cycle the chassis from an external agent.

Contributor

For clarity to platforms/vendors: though these commands are under "config", they are only executed (and not saved) until config save is issued.

9. Enhance the "show chassis module status" command: for a linecard it should display the hostname instead of generic names like LINECARD1
Contributor Author

"show chassis module status" is a platform-specific command. May need another command/enhancement in SONiC.

10. Support "show system-health detail/monitor-list/summary" commands in RP/LC


3. Section3: Enhancements based on Significant Design Changes
1. Auto handling by platform SW to reboot/shutdown the HW component when detecting critical faults.
2. Temperature measuring category enhancements: more granular, and increase the polling interval for the same. Also optimize the show command to not dump all sensors and to filter based on location
Contributor Author

@abdosi abdosi Mar 23, 2022

Need Sonic Design Document. Cisco can propose something on this.

3. Move Voltage and Current sensors support from the existing sensorsd/libsensors model to the PMON/thermalCtld model. This provides an ability/mechanism in SONiC NOS to poll for the board's voltage and current sensors (from platform) for the power algorithm.
Contributor Author

Need SONiC Design Document. Cisco can propose something on this.

4. Midplane Switch Counters (Debugging) /Modifying QOS Properties if needed (Performance)
Contributor Author

@abdosi abdosi Mar 23, 2022

Each platform vendor can provide a document on how to debug midplane drops and on any optimization that we need to do.

sanmalho-git added a commit to sanmalho-git/sonic-mgmt that referenced this pull request Apr 15, 2022
Based on PR sonic-net/SONiC#945, we should return
the sticker/label name on the chassis for the physical slot id
in the get_supervisor_slot PMON API and 'show chassis module status' command.

For Nokia linecards, the sticker label for supervisor is 'A'.
Thus we need to allow for string as possible return value as well - apart for int.
@yxieca yxieca force-pushed the master branch 2 times, most recently from 8498931 to 8837dc2 Compare April 15, 2022 16:51
@abdosi abdosi added the chassis label Apr 21, 2022
judyjoseph pushed a commit to sonic-net/sonic-mgmt that referenced this pull request May 18, 2022
Based on PR sonic-net/SONiC#945, we should return
the sticker/label name on the chassis for the physical slot id
in the get_supervisor_slot PMON API and 'show chassis module status' command.

For Nokia linecards, the sticker label for supervisor is 'A'.
Thus we need to allow for string as possible return value as well - apart for int.
abdosi added 2 commits July 6, 2024 02:35
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
@rlhui rlhui merged commit 66af3be into sonic-net:master Jul 6, 2024
1 check passed

8 participants