New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Config Reload Enhancement - Introduce the Transaction Mechanism #964

Draft

CharlieChenEC wants to merge 1 commit into sonic-net:master from CharlieChenEC:cfg_reload_xact_enhance

CharlieChenEC commented Mar 18, 2022

The current design of the CLI command config reload is like a simple oneshot command. It does not have any mechanism to prevent from the situations that a config reload session is issuing from the different login session when there is already an unfinished config reload session on the same device. It takes more time to let a device reach an operational state to start another 'config reload' when the previous one has not finished yet. Besides, the successful return of the CLI command on the current design only implies that the operations which are required to initiate the reloading of the configurations are executed without error. This enhancement will fork a new process together with the execution of 'config reload' to monitor the progress of the operation including the state of the critical sonic fundamental services and records down the operation result at the end of the transaction.

In addition to a use case that prevents multiple users from executing config reload on the same device when a processing config reload is not finished yet. Another use case that this enhancement might help is on the executions of sonic-mgmt test cases. It is known that some pytest code will execute config reload when abnormal states are detected. There is a chance that the config reload has been issued at the end of a test case and the abnormal states are detected again at the start of the next test case so that another config reload issues again when the previous one is still in progress. The execution of the second config reload just wastes time to recover the device. With this enhancement, it is able to realize that a config reload is already running and the pytest code could wait for the finish of the running config reload session and check the sanity status of the device after the session is done.


          Config Reload Enhancement - Introduce the Transaction Mechanism (soni…

6dba5c3

…c-net#964)

Signed-off-by: Charlie Chen <charlie_chen@edge-core.com>

CharlieChenEC force-pushed the cfg_reload_xact_enhance branch from dc6a3a7 to 6dba5c3 Compare

March 18, 2022 09:48

Collaborator

zhangyanzhao commented Mar 22, 2022

@Junchao-Mellanox Nvidia would like to be the reviewer for this feature, would you please help to review? Thanks.

zhangyanzhao requested a review from qiluo-msft

March 22, 2022 06:45

Collaborator

zhangyanzhao commented Mar 22, 2022

@qiluo-msft can you please add Junchao-Mellanox as reviewer? Thanks.

Junchao-Mellanox suggested changes

View reviewed changes

doc/cfg_reload_xact_enhance/Config_Reload_Enhancement_Xact_HLD.md


		'MONITOR_RELOAD_PROC' keeps monitoring the running status of 'CFG_RELOAD_PROC' by the passed pid. When the pid is gone, it checks the content of "CONFIG_RELOAD_OP_RESULT_FILE". If the content of "CONFIG_RELOAD_OP_RESULT_FILE" shows there is an error, 'MONITOR_RELOAD_PROC' writes the transaction result as a failure to "CONFIG_RELOAD_XACT_RESULT_FILE" and is terminated immediately. If the content shows it is OK, 'MONITOR_RELOAD_PROC' continues to monitor the status of the sonic related services and the sonic critical processes through the output of 'show system-health detail' to determine the transaction result. The details about the way to determine the transaction result is explained below.

		'MONITOR_RELOAD_PROC' invokes 'show system-health detail' periodically to know the transitions of the sonic related services and the sonic critical processes. The criterion to determine the success of the transaction is the status must be 'OK' on all of the 'Program' type entries in the output of 'show system-health detail'. A timeout mechanism is designed to avoid the infinite lock on the config reload transaction. A special note on the timeout mechanism is it will not be activated until 'CFG_RELOAD_PROC' is terminated. The rationale behind this is 'CFG_RELOAD_PROC' is usually launched by end users. Users are able to realize something wrong when the execution of 'config reload' cannot be terminated in a certain period of time at the first scene. Users may decide to interrupt the execution of 'config reload' if they think it is wrong to take so long. So 'MONITOR_RELOAD_PROC' will not interfere with the execution of 'CFG_RELOAD_PROC' and simply wait for its termination. Thus the measure of the time period begins at the termination of 'CFG_RELOAD_PROC'. If all of the services and critical processes cannot reach the expected states within the timeout period, 'MONITOR_RELOAD_PROC' writes the timeout result to "CONFIG_RELOAD_XACT_RESULT_FILE" and release the lock before the termination. If all of the components reach the expected states within the timeout period, 'MONITOR_RELOAD_PROC' writes the successful result to "CONFIG_RELOAD_XACT_RESULT_FILE", release the lock and terminates.

Contributor

Junchao-Mellanox Mar 22, 2022 •

edited

Loading

Should monitor "Service", "Process" type entries.

doc/cfg_reload_xact_enhance/Config_Reload_Enhancement_Xact_HLD.md


		The detection of running transaction enables the resilience. Assume that the transaction lock is not released correctly by 'MONITOR_RELOAD_PROC' due to an unexpected error upon the previous 'config reload', the error condition can be resolved automatically by the detection.

		Before the start of the 'config reload' operation, 'MONITOR_RELOAD_PROC' is forked and daemonized. The pid of 'CFG_RELOAD_PROC' is passed to 'MONITOR_RELOAD_PROC' at fork. Then 'CFG_RELOAD_PROC' continues to run the 'config reload' operation as is. At the end of 'CFG_RELOAD_PROC', it writes the result of the 'config reload' operation to a file named "CONFIG_RELOAD_OP_RESULT_FILE".

Contributor

Junchao-Mellanox Mar 22, 2022

How about starting MONITOR_RELOAD_PROC after "config reload" operation? I believe it will simplify the flow. Or is there any special purpose to start it early before "config reload" operation?

doc/cfg_reload_xact_enhance/Config_Reload_Enhancement_Xact_HLD.md


		'MONITOR_RELOAD_PROC' keeps monitoring the running status of 'CFG_RELOAD_PROC' by the passed pid. When the pid is gone, it checks the content of "CONFIG_RELOAD_OP_RESULT_FILE". If the content of "CONFIG_RELOAD_OP_RESULT_FILE" shows there is an error, 'MONITOR_RELOAD_PROC' writes the transaction result as a failure to "CONFIG_RELOAD_XACT_RESULT_FILE" and is terminated immediately. If the content shows it is OK, 'MONITOR_RELOAD_PROC' continues to monitor the status of the sonic related services and the sonic critical processes through the output of 'show system-health detail' to determine the transaction result. The details about the way to determine the transaction result is explained below.

		'MONITOR_RELOAD_PROC' invokes 'show system-health detail' periodically to know the transitions of the sonic related services and the sonic critical processes. The criterion to determine the success of the transaction is the status must be 'OK' on all of the 'Program' type entries in the output of 'show system-health detail'. A timeout mechanism is designed to avoid the infinite lock on the config reload transaction. A special note on the timeout mechanism is it will not be activated until 'CFG_RELOAD_PROC' is terminated. The rationale behind this is 'CFG_RELOAD_PROC' is usually launched by end users. Users are able to realize something wrong when the execution of 'config reload' cannot be terminated in a certain period of time at the first scene. Users may decide to interrupt the execution of 'config reload' if they think it is wrong to take so long. So 'MONITOR_RELOAD_PROC' will not interfere with the execution of 'CFG_RELOAD_PROC' and simply wait for its termination. Thus the measure of the time period begins at the termination of 'CFG_RELOAD_PROC'. If all of the services and critical processes cannot reach the expected states within the timeout period, 'MONITOR_RELOAD_PROC' writes the timeout result to "CONFIG_RELOAD_XACT_RESULT_FILE" and release the lock before the termination. If all of the components reach the expected states within the timeout period, 'MONITOR_RELOAD_PROC' writes the successful result to "CONFIG_RELOAD_XACT_RESULT_FILE", release the lock and terminates.

Contributor

Junchao-Mellanox Mar 22, 2022

What is the default timeout value? Is it configurable?

doc/cfg_reload_xact_enhance/Config_Reload_Enhancement_Xact_HLD.md

+              In order to guarantee only one 'config reload' session is allowed to be run in the system, it is required to acquire a transaction lock before the operation starts. When the command 'config reload' is executed, it acquires the lock. If the lock is acquired successfully, it continues the operation. If the lock cannot be acquired, it checks the running transaction by the existence of 'CFG_RELOAD_PROC' and 'MONITOR_RELOAD_PROC' through the output of 'ps -ax'. It continues the operation when no running transaction is found, and it is not necessary to do the lock operation in this condition because the lock is already in locked state. If any running transaction is detected, it aborts the operation and outputs the error message immediately. The flow chart shown below illustrates the flow to start a config reload transaction.
+              <img src="files/init-xact-flow-chart.png" alt="init-xact-flow-chart" width="1323px"/>
+              The detection of running transaction enables the resilience. Assume that the transaction lock is not released correctly by 'MONITOR_RELOAD_PROC' due to an unexpected error upon the previous 'config reload', the error condition can be resolved automatically by the detection.

Contributor

Junchao-Mellanox Mar 22, 2022

Not sure that we need a lock here since you are checking the the process existence by ps command.

Collaborator

venkatmahalingam Mar 22, 2022

I do have the same question, we could very well check the ps and determine whether to allow another 'config reload' or not.

venkatmahalingam reviewed

View reviewed changes

doc/cfg_reload_xact_enhance/Config_Reload_Enhancement_Xact_HLD.md


		### Overview

		The current design of the CLI command 'config reload' is like a simple oneshot command. It does not have any mechanism to prevent from the situations that a 'config reload' session is issuing from the different login session when there is already an unfinished 'config reload' session on the same device. It takes more time to let a device reach an operational state to start another 'config reload' when the previous one has not finished yet. Besides, the successful return of the CLI command on the current design only implies that the operations which are required to initiate the reloading of the configurations are executed without error. This enhancement will fork a new process together with the execution of 'config reload' to monitor the progress of the operation including the state of the critical sonic fundamental services and records down the operation result at the end of the transaction.

Collaborator

venkatmahalingam Mar 22, 2022

As discussed in today's meeting, we do have checks in place to prevent the parallel config reload operation i.e when the system is not up we don't allow another 'config reload' to go through, please state in which scenario you encountered the issue exactly.

https://github.com/Azure/sonic-utilities/blob/master/config/main.py#L1358

Author

CharlieChenEC Mar 23, 2022 •

edited

Loading

I have checked the commit history and find the code you mentioned are added by the PR listed below.

sonic-net/sonic-utilities#1664

PR #1664 can prevent the parallel config reload operations under the condition added by that commit. The intention of the PR is not the same as this enhancement. The intention of this enhancement is not only to avoid parallel operations but also to add more information on the processing status of the config reload operation and the final result of the operation. "Prevent parallel config reload operations" is one of the function brings in by this enhancement. Please allow me to have some time to rewrite the HLD to express my idea more clearly.

Author

CharlieChenEC Mar 28, 2022

After re-visiting the error logs of sonic-mgmt test cases which are related to the failures of 'config reload' operations in our internal reports, they can be categorized into two types.

Type 1: The config reload executes on the time that the device just boots up in a short time
In sonic-mgmt pytest 'test_vrf.py', it performs reboot in the function restore_config_db(). In our pytest environment, a 'config reload' operation will be executed between the two pytest sessions. Upon the end of 'test_vrf.py' session, the execution of 'config reload' gets failed. sonic-utilities PR #1664 already solves this problem.

Type 2: The config reload executes on the time that the syncd crashed
In the termination of a sonic-mgmt pytest session, 'syncd' is crashed due to some issues. Execute config reload at that time causes the failure similar to the issue described in Azure/sonic-utilities #1814. According to the content of the syslog when the issue happens, the failure of 'syncd' triggers the restart of 'swss' through the service auto restart mechanism. The systemctl operations from the service auto restart and from config reload are conflicted against swss and results in the similar condition described in #1814. This issue is resolved in #1814.

In summary, the proposed mechanism in this HLD does not solve the above two issues actually. I thought the HLD might help on the config reload issues that I've ever heard on the 'sonic-mgmt' test cases before I look into it in details. Sorry for the confusion. I will remove the paragraph regarding the issue in sonic-mgmt tests from the HLD and add more functionalities and description on the part directly related to "transaction concept". Thanks for your insightful comment to push me clarifying those issues.

venkatmahalingam reviewed

View reviewed changes

doc/cfg_reload_xact_enhance/Config_Reload_Enhancement_Xact_HLD.md


		The current design of the CLI command 'config reload' is like a simple oneshot command. It does not have any mechanism to prevent from the situations that a 'config reload' session is issuing from the different login session when there is already an unfinished 'config reload' session on the same device. It takes more time to let a device reach an operational state to start another 'config reload' when the previous one has not finished yet. Besides, the successful return of the CLI command on the current design only implies that the operations which are required to initiate the reloading of the configurations are executed without error. This enhancement will fork a new process together with the execution of 'config reload' to monitor the progress of the operation including the state of the critical sonic fundamental services and records down the operation result at the end of the transaction.

		In addition to a use case that prevents multiple users from executing 'config reload' on the same device when a processing 'config reload' is not finished yet. Another use case that this enhancement might help is on the executions of sonic-mgmt test cases. It is known that some pytest code will execute 'config reload' when abnormal states are detected. There is a chance that the 'config reload' has been issued at the end of a test case and the abnormal states are detected again at the start of the next test case so that another 'config reload' issues again when the previous one is still in progress. The execution of the second 'config reload' just wastes time to recover the device. With this enhancement, it is able to realize that a 'config reload' is already running and the pytest code could wait for the finish of the running 'config reload' session and check the sanity status of the device after the session is done.

Collaborator

venkatmahalingam Mar 22, 2022

I'm not sure why would we allow another test case execution when the config reload is not complete.

dgsudharsan suggested changes

View reviewed changes

doc/cfg_reload_xact_enhance/Config_Reload_Enhancement_Xact_HLD.md

+              This document describes the enhancement based on the existing design of the
+              Sonic CLI command 'config reload'. The enhancement introduces the transaction
+              mechanism to guarantee that only one 'config reload' session runs in the system
+              at the same time. The design regarding the determination of a transaction done

Contributor

dgsudharsan Mar 28, 2022

Please verify the existing behavior. Currently config reload cannot run in parallel unless force option is used.

yxieca force-pushed the master branch 2 times, most recently from 8498931 to 8837dc2 Compare

April 15, 2022 16:51

Collaborator

zhangyanzhao commented Jun 2, 2022

@CharlieChenEC would you please help to add the code PRs by referring to EVPN VxLAN update for platforms using P2MP tunnel based L2 forwarding by dgsudharsan · Pull Request #806 · Azure/SONiC (github.com) as an example?

Collaborator

zhangyanzhao commented Jun 2, 2022

@CharlieChenEC if this feature is not required anymore, please help to close it. Thanks.

CharlieChenEC marked this pull request as draft

June 2, 2022 10:54

Author

CharlieChenEC commented Jun 2, 2022

Sorry. I'm not available to work on this PR recently. Change it to draft.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

venkatmahalingam venkatmahalingam left review comments

Junchao-Mellanox Junchao-Mellanox requested changes

dgsudharsan dgsudharsan requested changes

qiluo-msft Awaiting requested review from qiluo-msft

At least 1 approving review is required to merge this pull request.

Labels

None yet