Backup STATE_DB PORT_TABLE|Ethernet during warm-reboot #3111

Merged (2 commits) on May 30, 2024

@mihirpat1 mihirpat1 commented Jan 5, 2024

What I did

Currently, the entire PORT_TABLE in STATE_DB is deleted during warm-reboot. As a result, host_tx_ready changes to false after warm-reboot, which causes the link to remain down.

How I did it

The host_tx_ready, NPU_SI_SETTINGS_SYNC_STATUS and CMIS_REINIT_REQUIRED fields from `STATE_DB PORT_TABLE|Ethernet*` are now backed up during warm-reboot; the remaining fields of those keys are deleted.

How to verify it

Verified that host_tx_ready in STATE_DB PORT_TABLE is retained after warm-reboot and the link remains up. Also ensured that the CMIS_REINIT_REQUIRED and NPU_SI_SETTINGS_SYNC_STATUS fields are retained after warm-reboot.
Before warm-reboot

root@sonic:/home/admin# redis-cli -n 6 hgetall "PORT_TABLE|Ethernet0"
 1) "state"
 2) "ok"
 3) "netdev_oper_status"
 4) "up"
 5) "admin_status"
 6) "up"
 7) "mtu"
 8) "9100"
 9) "CMIS_REINIT_REQUIRED"
10) "false"
11) "NPU_SI_SETTINGS_SYNC_STATUS"
12) "NPU_SI_SETTINGS_DEFAULT"
13) "supported_speeds"
14) "40000,100000"
15) "supported_fecs"
16) "none,rs"
17) "host_tx_ready"
18) "true"
19) "speed"
20) "100000"
21) "fec"
22) "N/A"
root@sonic:/home/admin# 

After the warm-reboot script backs up PORT_TABLE and deletes unwanted fields

root@sonic:/home/admin# redis-cli -n 6 hgetall "PORT_TABLE|Ethernet0"
1) "CMIS_REINIT_REQUIRED"
2) "false"
3) "NPU_SI_SETTINGS_SYNC_STATUS"
4) "NPU_SI_SETTINGS_DEFAULT"
5) "host_tx_ready"
6) "true"
root@sonic:/home/admin# 

After switch boot-up post warm-reboot

root@sonic:/home/admin# redis-cli -n 6 hgetall "PORT_TABLE|Ethernet0"
 1) "state"
 2) "ok"
 3) "netdev_oper_status"
 4) "up"
 5) "admin_status"
 6) "up"
 7) "mtu"
 8) "9100"
 9) "supported_speeds"
10) "40000,100000"
11) "supported_fecs"
12) "none,rs"
13) "CMIS_REINIT_REQUIRED"
14) "false"
15) "NPU_SI_SETTINGS_SYNC_STATUS"
16) "NPU_SI_SETTINGS_DEFAULT"
17) "host_tx_ready"
18) "true"
19) "speed"
20) "100000"
21) "fec"
22) "N/A"
root@sonic:/home/admin# 
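
The retention check described in this section can also be scripted. The following Python sketch is not part of the PR; it parses redis-cli's numbered HGETALL output (abbreviated here from the dumps above, with the entries renumbered) and confirms that the three backed-up fields carry the same values before the warm-reboot and after boot-up. The helper name `parse_hgetall` is hypothetical.

```python
import re

KEPT_FIELDS = ("host_tx_ready", "NPU_SI_SETTINGS_SYNC_STATUS", "CMIS_REINIT_REQUIRED")


def parse_hgetall(text):
    """Turn redis-cli's numbered HGETALL output into a {field: value} dict."""
    # Each reply line looks like:  3) "netdev_oper_status"
    items = re.findall(r'^[ \t]*\d+\)[ \t]+"(.*)"[ \t]*$', text, flags=re.MULTILINE)
    # Even 0-based positions are fields, odd positions are their values.
    return dict(zip(items[0::2], items[1::2]))


# Abbreviated sample taken from the "Before warm-reboot" dump.
before = parse_hgetall('''
1) "state"
2) "ok"
3) "CMIS_REINIT_REQUIRED"
4) "false"
5) "NPU_SI_SETTINGS_SYNC_STATUS"
6) "NPU_SI_SETTINGS_DEFAULT"
7) "host_tx_ready"
8) "true"
''')

# Abbreviated sample taken from the "After switch boot-up" dump.
after_boot = parse_hgetall('''
1) "state"
2) "ok"
3) "CMIS_REINIT_REQUIRED"
4) "false"
5) "NPU_SI_SETTINGS_SYNC_STATUS"
6) "NPU_SI_SETTINGS_DEFAULT"
7) "host_tx_ready"
8) "true"
''')

# True only if every backed-up field exists in both dumps with the same value.
retained = all(f in before and after_boot.get(f) == before[f] for f in KEPT_FIELDS)
```

On a live device the two strings would come from running the `redis-cli -n 6 hgetall` command shown above before and after the reboot.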

Signed-off-by: Mihir Patel <patelmi@microsoft.com>
@mihirpat1 mihirpat1 changed the title Backup STATE_DB PORT_TABLE during warm-reboot Backup STATE_DB PORT_TABLE|Ethernet* during warm-reboot Jan 11, 2024
@mihirpat1 mihirpat1 changed the title Backup STATE_DB PORT_TABLE|Ethernet* during warm-reboot Backup STATE_DB PORT_TABLE|Ethernet during warm-reboot Jan 11, 2024
@mihirpat1 mihirpat1 requested a review from vaibhavhd January 11, 2024 17:31
if not string.match(k, 'FDB_TABLE|') and not string.match(k, 'WARM_RESTART_TABLE|') \
if string.match(k, 'PORT_TABLE|Ethernet') then
for i, f in ipairs(redis.call('hgetall', k)) do
if i % 2 == 1 then
Contributor

I am not clear on what this check is looking for - if i % 2 == 1

Contributor Author

@vaibhavhd - This check selects the field name from each field-value pair.

The hgetall command returns a flat list in which each field is followed by its corresponding value. Since Lua indices start at 1, the odd-indexed entries are the field names, so the check picks out each field name, and the corresponding field-value pair can then be deleted.

The snippet below has a total of 11 field-value pairs.

root@sonic:/home/admin# redis-cli -n 6 hgetall "PORT_TABLE|Ethernet0"
 1) "state"
 2) "ok"
 3) "netdev_oper_status"
 4) "up"
 5) "admin_status"
 6) "up"
 7) "mtu"
 8) "9100"
 9) "CMIS_REINIT_REQUIRED"
10) "false"
11) "NPU_SI_SETTINGS_SYNC_STATUS"
12) "NPU_SI_SETTINGS_DEFAULT"
13) "supported_speeds"
14) "40000,100000"
15) "supported_fecs"
16) "none,rs"
17) "host_tx_ready"
18) "true"
19) "speed"
20) "100000"
21) "fec"
22) "N/A"
root@sonic:/home/admin# 
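
To illustrate the odd-index check, here is a small Python model. The PR itself uses Lua inside a Redis eval, so this is only a sketch, and the function name `field_names` is hypothetical. It walks a flat HGETALL-style reply with 1-based indices, as Lua's ipairs does, and collects the entries at odd positions, which are the field names. The sample list is abbreviated from the output above.

```python
# Flat reply in the shape HGETALL returns: field, value, field, value, ...
flat_reply = [
    "state", "ok",
    "CMIS_REINIT_REQUIRED", "false",
    "host_tx_ready", "true",
    "speed", "100000",
]


def field_names(reply):
    """Return the entries at odd 1-based positions, i.e. the field names."""
    # enumerate(..., start=1) mirrors the 1-based index that Lua's ipairs yields.
    return [item for i, item in enumerate(reply, start=1) if i % 2 == 1]


fields = field_names(flat_reply)
```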

@@ -247,7 +247,17 @@ function backup_database()
# Delete keys in stateDB except FDB_TABLE|*, MIRROR_SESSION_TABLE|*, WARM_RESTART_ENABLE_TABLE|*, FG_ROUTE_TABLE|*
sonic-db-cli STATE_DB eval "
for _, k in ipairs(redis.call('keys', '*')) do
if not string.match(k, 'FDB_TABLE|') and not string.match(k, 'WARM_RESTART_TABLE|') \
if string.match(k, 'PORT_TABLE|Ethernet') then
Contributor

@zjswhhh do you think keeping port_table intact in state-db will cause any change in mux init logic for dualtor warmboot?

Contributor

No, it shouldn't.

@@ -247,7 +247,17 @@ function backup_database()
# Delete keys in stateDB except FDB_TABLE|*, MIRROR_SESSION_TABLE|*, WARM_RESTART_ENABLE_TABLE|*, FG_ROUTE_TABLE|*
Contributor

Please update the comment accordingly.

Comment on lines +253 to +255
if not string.match(f, 'host_tx_ready') \
and not string.match(f, 'NPU_SI_SETTINGS_SYNC_STATUS') \
and not string.match(f, 'CMIS_REINIT_REQUIRED') then
Contributor

  1. From the PR's description it sounds like you need only the host_tx_ready field. What is the reason for also persisting the NPU and CMIS fields? Please add the reasoning as a comment here too.
  2. Have you considered cross-branch and in-branch warm-reboots where the 3 fields might be modified (deleted, renamed, etc.) in the target image? How will the target image handle a scenario where STATE_DB has a field that it does not support? This question arises from the fact that you are not preserving the entire table, but just 3 fields from it. Existing cases preserved the entire table.

To keep us from running into the unlikely scenarios in #2 above, this handling can be done by the target image's portsorch. In other words, when the device boots into the target image, portsorch initializes these fields afresh after confirming, via the system warm-restart flag, that the system is undergoing warmboot. The downside of this approach is that portsorch will have no idea what these fields were set to in the base image.

The same argument applies to other fields such as the oper state (netdev_oper_status) and speeds. In the recovery path of warmboot these fields are set afresh based on SAI get calls (I think).
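
For readers following the thread, the cleanup under discussion can be modelled as below. This is a Python sketch over an in-memory dict standing in for STATE_DB, not the actual Lua script. The list of fully preserved key patterns is assumed from the script comment and the visible part of the condition quoted above (the full condition is truncated in the quoted diff), and the function, constant and sample key names are hypothetical. Plain substring tests are used because Lua patterns have no alternation, so the `|` in the quoted `string.match` calls is a literal character, and those matches are unanchored.

```python
# Assumed from the quoted script comment and the visible part of the condition.
PRESERVED_KEY_PARTS = (
    "FDB_TABLE|",
    "WARM_RESTART_TABLE|",
    "MIRROR_SESSION_TABLE|",
    "WARM_RESTART_ENABLE_TABLE|",
    "FG_ROUTE_TABLE|",
)
PORT_KEY_PART = "PORT_TABLE|Ethernet"
KEPT_PORT_FIELDS = ("host_tx_ready", "NPU_SI_SETTINGS_SYNC_STATUS", "CMIS_REINIT_REQUIRED")


def cleanup_state_db(db):
    """Mutate db (key -> {field: value}) the way the thread describes."""
    for key in list(db):
        if any(part in key for part in PRESERVED_KEY_PARTS):
            continue  # the whole entry survives the warm-reboot
        if PORT_KEY_PART in key:
            # Keep only the selected fields; substring test, like string.match.
            for field in list(db[key]):
                if not any(kept in field for kept in KEPT_PORT_FIELDS):
                    del db[key][field]
            if not db[key]:
                del db[key]  # Redis drops a hash once its last field is removed
        else:
            del db[key]
    return db


# Sample data; "OTHER_TABLE|example" is a made-up key for illustration.
db = {
    "PORT_TABLE|Ethernet0": {"state": "ok", "host_tx_ready": "true", "mtu": "9100"},
    "FDB_TABLE|Vlan1000:example": {"port": "Ethernet0"},
    "OTHER_TABLE|example": {"foo": "bar"},
}
cleanup_state_db(db)
```

Because the field test is a substring match, any future field whose name merely contains one of the three names would also be kept, which is relevant to the cross-branch concern raised in point 2.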

@longhuan-cisco
Contributor

Just curious, what was the reason to not back up the entire table? Is it because some of the fields (e.g. netdev_oper_status ) should be re-populated after warm-reboot?

@mihirpat1
Contributor Author

Just curious, what was the reason to not back up the entire table? Is it because some of the fields (e.g. netdev_oper_status ) should be re-populated after warm-reboot?

@longhuan-cisco - Yes, you are correct. Hence, we decided to preserve selected fields which xcvrd/OA cares about and delete other fields from STATE_DB.

@longhuan-cisco
Contributor

As discussed, I tested the change from this PR, host_tx_ready gets retained properly after warm-reboot and link stays up (especially for those CMIS modules).

@mihirpat1 @prgeor Could you please continue with the remaining work on this PR?

root@t0-dut:/home/cisco# show reboot-cause history
Name                 Cause        Time                             User    Comment
-------------------  -----------  -------------------------------  ------  ---------
2024_05_22_07_52_46  warm-reboot  Wed May 22 07:49:46 UTC 2024     cisco   N/A
...

May 22 07:55:26.154419 cmono-t0-dut NOTICE pmon#xcvrd[27]: XCVRD INIT: Wait for port config is done
May 22 07:55:26.156638 cmono-t0-dut NOTICE pmon#xcvrd[27]: XCVRD INIT: After port config is done
May 22 07:55:26.183632 cmono-t0-dut NOTICE pmon#xcvrd[27]: Start daemon main loop with thread count 3
May 22 07:55:26.183632 cmono-t0-dut NOTICE pmon#xcvrd[27]: Started thread CmisManagerTask
May 22 07:55:26.183675 cmono-t0-dut NOTICE pmon#xcvrd[27]: Started thread DomInfoUpdateTask
May 22 07:55:26.183675 cmono-t0-dut NOTICE pmon#xcvrd[27]: Started thread SfpStateUpdateTask
...
May 22 07:55:26.198509 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet32 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198532 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet56 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198554 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet0 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'false', 'state': 'ok', 'netdev_oper_status': 'down', 'admin_status': 'down', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs'}
May 22 07:55:26.198577 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet16 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198601 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet128 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198618 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet72 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198643 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet120 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198661 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet192 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
May 22 07:55:26.198689 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet200 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'false', 'state': 'ok', 'netdev_oper_status': 'down', 'admin_status': 'down', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs'}
May 22 07:55:26.198712 cmono-t0-dut WARNING pmon#xcvrd[27]: $$$ Ethernet176 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'host_tx_ready': 'true', 'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '200000,400000', 'supported_fecs': 'rs', 'speed': '400000'}
...

@prgeor prgeor merged commit a9720bf into sonic-net:master May 30, 2024
5 checks passed
@prgeor
Contributor

prgeor commented May 30, 2024

@StormLiangMS @yxieca @bingwang-ms please cherry-pick this to 202311. It is needed for warm-reboot support on platforms using CMIS optics.

mssonicbld pushed a commit to mssonicbld/sonic-utilities that referenced this pull request Jun 3, 2024
* Backup STATE_DB PORT_TABLE during warm-reboot

Signed-off-by: Mihir Patel <patelmi@microsoft.com>

* Backing up selected fields from STATE_DB PORT_TABLE|Ethernet* and deleting unwanted fields during warm-reboot

---------

Signed-off-by: Mihir Patel <patelmi@microsoft.com>
@mssonicbld
Collaborator

Cherry-pick PR to 202311: #3352

mssonicbld pushed a commit that referenced this pull request Jun 3, 2024
arfeigin pushed a commit to arfeigin/sonic-utilities that referenced this pull request Jun 16, 2024
@prgeor
Contributor

prgeor commented Jun 18, 2024

@bingwang-ms we need this in 202405

@bingwang-ms
Contributor

@prgeor It seems there is a cherry-pick conflict. Please double-check.

@mihirpat1
Contributor Author

@prgeor It seems there is a cherry-pick conflict. Please double-check.

@bingwang-ms I have removed the 202405 tags since this is already part of 202405.
