
Warmboot Manager HLD for warm reboot improvement #1485

Merged: 11 commits, Jun 4, 2024
280 changes: 280 additions & 0 deletions doc/warm-reboot/NSF_Manager_HLD.md
@@ -0,0 +1,280 @@
# NSF Manager HLD
**@qiluo-msft (Contributor), Oct 12, 2023:**

Could you prevent using the terminology "NSF" in this design doc? There is another active HLD related to the gNOI API, and we are discussing using the gNOI NSF terminology to represent SONiC fast-reboot instead of SONiC warm-reboot, since a gNOI WARM method already exists. Ref: https://github.com/openconfig/gnoi/blob/98d6b81c6dfe3c5c400209f82a228cf3393ac923/system/system.proto#L128

May I propose "Warm Reboot Manager" in this HLD? #Closed

**Author reply:**

I also added a comment in that design: I am not sure why WARM and NSF reboot methods are defined separately in the gNOI spec. In my opinion they both refer to the same operation. I would propose adding a new reboot method for fast reboot since it doesn't map to any of the current gNOI reboot methods.

**Contributor comment:**

Could you prevent using the terminology "NSF" in this design doc?

**Contributor comment:**

I somewhat prefer using the gNOI WARM terminology over NSF. Would you mind changing all NSF -> WARM in this HLD, including the PR title and markdown headers? I will not block this PR if WARM is used in the HLD.

**Author reply:**

Replaced NSF with Warmboot. Updated the document title, headings and diagrams.

**Contributor comment:**

Thanks!



## Table of Contents


- [Revision](#revision)
- [Scope](#scope)
- [Definitions/Abbreviations](#definitions-abbreviations)
- [Overview](#overview)
- [Requirements](#requirements)
- [Architecture Design](#architecture-design)

**Reviewer comment:**

Also add a section to cover where the manager runs and how it is started. Will it run in a docker container or on the host image?

**Author reply:**

Added details in the Architecture Design section.

- [High-Level Design](#high-level-design)
* [NSF Details Registration](#nsf-details-registration)
* [Shutdown Orchestration](#shutdown-orchestration)
+ [Phase 1: Freeze Components & Wait for Switch Quiescence](#phase-1--freeze-components---wait-for-switch-quiescence)
+ [Phase 2: State Verification (Optional)](#phase-2--state-verification--optional-)
+ [Phase 3: Trigger Checkpointing](#phase-3--trigger-checkpointing)
+ [Phase 4: Prepare and Perform Reboot](#phase-4--prepare-and-perform-reboot)

**Reviewer comment:**

Could you add sections covering the transition of applications from old to new reboot types, and backward compatibility?

**Author reply:**

Added a Backward Compatibility section with these details.

+ [Application Shutdown Optimization](#application-shutdown-optimization)
* [Reconciliation Monitoring](#reconciliation-monitoring)
* [Component Warmboot States](#component-warmboot-states)
- [SAI API](#sai-api)
- [Configuration and management](#configuration-and-management)
- [Warmboot and Fastboot Design Impact](#warmboot-and-fastboot-design-impact)
- [Restrictions/Limitations](#restrictions-limitations)
- [Testing Requirements/Design](#testing-requirements-design)
- [Open/Action items - if any](#open-action-items---if-any)

**Reviewer comment:**

A couple of sections could be added: one on the use cases covered (LACP etc.), and one on which NPU types are tested or supported.

**Author reply:**

Added NPU type details in the Testing Requirements/Design section.





### Revision


<table>
<tr>
<td>Rev
</td>
<td>Rev Date
</td>
<td>Author(s)
</td>
<td>Change Description
</td>
</tr>
<tr>
<td>v0.1
</td>
<td>9/28/2023
</td>
<td>Google
</td>
<td>Initial version
</td>
</tr>
</table>



### Scope

This document captures the high-level design for NSF Manager - a new daemon that is responsible for shutdown orchestration and reconciliation monitoring during warm reboot, and the location of this daemon in SONiC.


### Definitions/Abbreviations


<table>
<tr>
<td><strong>NSF</strong>
</td>
<td>Non-Stop Forwarding
</td>
</tr>
<tr>
<td><strong>Applications</strong>
</td>
<td>Higher-layer components that may or may not be running depending on the switch role e.g. Teamd, BGP, P4RT etc.
</td>
</tr>
<tr>
<td><strong>Infrastructure Components</strong>
</td>
<td>Switch stack components that are essential for switch operation irrespective of switch role e.g. Orchagent, Syncd and transceiver daemon
</td>
</tr>
</table>



### Overview

SONiC uses the [fast-reboot](https://github.com/sonic-net/sonic-utilities/blob/master/scripts/fast-reboot) and [finalize\_warmboot.sh](https://github.com/sonic-net/sonic-buildimage/blob/master/files/image_config/warmboot-finalizer/finalize-warmboot.sh) scripts for warm reboot. The former script is responsible for preparing the platform and database for reboot and orchestrating switch shutdown. The latter script is responsible for monitoring the reconciliation status of a fixed set of switch stack components and eventually removing the warm-boot flag from Redis DB. There are multiple drawbacks to using bash scripts:



* Separate binaries need to be developed for each component to send shutdown-related notifications.
**Contributor comment:**

Does this mean adding new scripts to handle shutdown per service?

The current approach does not require a separate script for every service. There is a templated way for all services; the build process auto-generates the scripts per service. A new script is needed only when a service requires special handling in the shutdown path.

Templates:
Start and stop sequence: https://github.com/sonic-net/sonic-buildimage/blob/master/files/scripts/service_mgmt.sh (warm reboot is handled in a generic fashion here).

Services only need to add a soft reference to this template; a separate binary is not needed. E.g.:
radv: https://github.com/sonic-net/sonic-buildimage/blob/master/files/scripts/radv.sh
snmp: https://github.com/sonic-net/sonic-buildimage/blob/master/files/scripts/snmp.sh

**Author reply:**

This is referring to the separate binaries that we have to send shutdown notifications to different components. For example, orchagent_restart_check for Orchagent freeze notification, syncd_request_shutdown for Syncd notifications etc.

We aren't proposing to have separate service management scripts.

* The rich set of Redis DB features provided by swss-common library cannot be utilized.
**Contributor comment:**

Please elaborate here. Perhaps w/ some examples?

**Author reply:**

swss-common enables us to create a pub/sub model wherein we can send notifications and subscribe to warm-boot state updates from multiple applications in Redis DB, as opposed to polling for this information from individual components.

* A framework to send notifications to components and wait for updates cannot be developed.
**Contributor comment:**

Is this not already happening in the current bash script? Notifications are sent and are waited upon with a timeout?

I do not fully understand the concern here; could you elaborate on how the current script limits us from managing the shutdown process?

**Author reply:**

The shutdown state machine for different components (proposed in this HLD) can be better monitored using a daemon that is leveraging swss-common libs as opposed to polling these warm-boot states of individual components using a bash script.
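
To ground the pub/sub discussion above, here is a minimal sketch of what such a channel could look like with existing swss-common primitives. The channel name `NSF_MANAGER_NOTIFICATIONS` and the dispatch logic are assumptions of this sketch, not part of the current SONiC code base:

```cpp
// Minimal sketch of a swss-common pub/sub channel (illustrative only; the
// channel name "NSF_MANAGER_NOTIFICATIONS" is an assumption of this sketch).
#include "dbconnector.h"
#include "notificationconsumer.h"
#include "notificationproducer.h"
#include "select.h"
#include "table.h"

#include <iostream>
#include <string>
#include <vector>

// NSF Manager side: broadcast a shutdown-phase notification to subscribers.
void sendFreezeNotification(swss::DBConnector& db)
{
    swss::NotificationProducer producer(&db, "NSF_MANAGER_NOTIFICATIONS");
    std::vector<swss::FieldValueTuple> values;
    producer.send("freeze", "all", values);
}

// Component side: block on the channel instead of being polled by a script.
void componentNotificationLoop(swss::DBConnector& db)
{
    swss::NotificationConsumer consumer(&db, "NSF_MANAGER_NOTIFICATIONS");
    swss::Select s;
    s.addSelectable(&consumer);

    swss::Selectable* sel = nullptr;
    while (s.select(&sel) == swss::Select::OBJECT)
    {
        std::string op, data;
        std::vector<swss::FieldValueTuple> fvs;
        consumer.pop(op, data, fvs);
        std::cout << "received notification: " << op << std::endl;
        // Dispatch to the component's freeze/checkpoint handler here.
    }
}
```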


Additionally, the current SONiC warm shutdown algorithm has custom shutdown-related notifications for different components that trigger their warm shutdown logic. For example, Orchagent uses a _freeze_ notification and reports its ready status via a notification channel. Syncd uses _pre-shutdown_ and _shutdown_ notifications and reports its status via Redis DB.

The proposal is to introduce a new daemon called NSF Manager that will be responsible for both shutdown orchestration and reconciliation monitoring during warm reboot. It will leverage Redis DB features provided by swss-common to create a common framework for warm-boot orchestration. As a result, there will be a unified framework for warm-boot orchestration for all switch stack components.
**Contributor comment:**

Does this mean everything that happens today will still happen w/ this new design, only that a wrapper around the current process will unify/abstract the different ways in which notifications are sent/received today?

**Author reply (@akarshgupta25), Oct 27, 2023:**

This framework will co-exist with the existing shutdown orchestration. This framework introduces additional notification channels for the newer unified shutdown notifications. So components can choose to take actions based on the notification channel from which they receive the shutdown notification. Our proposal is to be compatible with the existing SONiC warm-boot orchestration.



### Requirements

The current design covers the warm reboot orchestration daemon along with the warm shutdown and bootup sequences. It also covers the additional warmboot states introduced by the warm shutdown sequence. It does not cover fast reboot.


### Architecture Design

_NA_


### High-Level Design



![alt_text](img/warm-reboot-overall.png)
**Contributor comment:**

Minor: can you add svg format instead of png? The files are unreadable when zoomed in.

**Author reply:**

Updated freeze and reconciliation images. Hope they are readable now.




NSF Manager will replace the existing [fast-reboot](https://github.com/sonic-net/sonic-utilities/blob/master/scripts/fast-reboot) and [finalize\_warmboot.sh](https://github.com/sonic-net/sonic-buildimage/blob/master/files/image_config/warmboot-finalizer/finalize-warmboot.sh) scripts to perform shutdown orchestration and reconciliation monitoring during warm reboot. It will use a registration-based mechanism to determine the switch stack components that need to perform an orchestrated shutdown and bootup. This keeps the NSF Manager design generic and flexible, and avoids hardcoding component information in NSF Manager.
**Contributor comment:**

Just calling out the concern that was raised in the meeting: we would still like to keep the ability to fix the shutdown path on the fly. The script-based design enabled us to add ad hoc fixes (after testing) at runtime.

**Author reply:**

Can you please elaborate on the type of fixes that are introduced at runtime?



#### NSF Details Registration


```
Component = <name>
Docker = <name>
Freeze = <True>/<False>
Checkpoint = <True>/<False>
Reconciliation = <True>/<False>
```


Components that want to participate in warm-boot orchestration need to register the above details with NSF Manager. NSF Manager will use these registration details to determine the components that are going to participate in the orchestrated shutdown sequence, and to monitor reconciliation statuses during bootup. If a component doesn't register with NSF Manager, it will continue to operate normally until it is shut down in [Phase 4](#phase-4-prepare-and-perform-reboot). Components that modify switch state should register with NSF Manager: they can change the state of components such as Orchagent and Syncd that participate in the warm reboot orchestration, and can therefore impact the warm reboot process.

Components that want to participate in an orchestrated shutdown during warm reboot need to set _freeze = true_. NSF Manager will wait for the quiescence of all components that have _freeze = true_ in their registration. NSF Manager will wait for a component to complete checkpointing only if it set _checkpoint = true_. Components that want NSF Manager to monitor their reconciliation status need to set _reconciliation = true_.
**Contributor comment:**

There should be a formal definition of flags used here - freeze, checkpoint, quiescence, reconciliation

**Author reply:**

These couple of lines provide the brief definition of these flags. Do more details need to be added?
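
As an illustration, a component could publish these details to a STATE DB table at startup using swss-common. The table name `NSF_REGISTRATION_TABLE` below is hypothetical; the HLD does not fix the exact schema location:

```cpp
// Hypothetical registration write; the table name "NSF_REGISTRATION_TABLE"
// is illustrative, not a schema defined by this HLD.
#include "dbconnector.h"
#include "table.h"

#include <string>
#include <vector>

void registerWithNsfManager()
{
    swss::DBConnector stateDb("STATE_DB", 0);
    swss::Table regTable(&stateDb, "NSF_REGISTRATION_TABLE");

    // Key = component name; fields mirror the registration schema above.
    std::vector<swss::FieldValueTuple> fvs = {
        {"docker", "swss"},
        {"freeze", "true"},
        {"checkpoint", "true"},
        {"reconciliation", "true"},
    };
    regTable.set("orchagent", fvs);
}
```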



#### Shutdown Orchestration

During warm reboot, NSF Manager will orchestrate switch shutdown in a multi-phased approach that ensures that the switch stack is in a stable state before the intent/state is saved. It will perform shutdown in the following phases:



* Phase 1: Freeze components & wait for switch quiescence
* Phase 2: Perform state verification (optional)
* Phase 3: Trigger checkpointing
* Phase 4: Prepare and perform reboot


##### Phase 1: Freeze Components & Wait for Switch Quiescence



![alt_text](img/freeze.png)

NSF Manager will send a freeze notification to all registered components and will wait for the quiescence of only those components that have set _freeze = true_. Upon receiving the freeze notification, a component will complete processing its current request queue and stop generating new intents. Stopping new intents from being generated means that boundary components should stop processing requests from external components (external events), and all components should stop the periodic timers that generate new requests (internal events). For example:
**Contributor comment:**

Doing this simultaneously to all registered components may not be a good idea. We do want some components to keep operating:

  1. until a dependency is active (e.g., syncd should work as long as teamd is working)
  2. as long as possible (e.g., we do not want to freeze teamd even if OA is frozen or swss is stopped).

**Author reply:**

The freeze notification means that components should stop generating events for which they are the source. All components will continue to serve requests they receive, but will stop generating events that originate with them. This is similar to saying "I will not ask questions, but I will continue to answer questions". This means that components will continue to serve internal requests. After some time, all components will stop generating events and thus the switch will become quiescent.

This means that during the freeze phase, Orchagent and Syncd will continue to serve requests from other components such as Teamd.

**Reviewer comment:**

Do we expect all components to finish Phase 1 before NSF Manager triggers Phase 2?

**Author reply:**

No. Boundary applications that are independent of state changes in other switch components, such as UMF and P4RT, can aggregate Phases 1, 2 and 3 as part of the freeze notification (see the Application Shutdown Optimization section). But this is applicable to only a certain subset of the switch stack components. NSF Manager needs to wait for the entire switch to become quiescent before a global checkpoint can take place. For example, Syncd should only save internal state after all intent has been programmed, i.e. the switch is quiescent.




* UMF will stop listening to new gRPC requests from the controller.
* P4RT will stop listening to new gRPC requests from the controller and stop packet I/O.

**Reviewer comment:**

Please add a Step 4 under the P4RT block stating that the controller needs to be informed that warmboot was initiated. Also, can you please add details on how and what the P4RT server needs to communicate to the P4RT client when warmboot is initiated?

**Author reply:**

This design introduces the high-level principles for the new warm-boot orchestration framework, and I wanted to limit the scope of this design to those core principles. That is why this section talks about the actions of different components at a very high level, providing a brief idea of stopping self-sourced events.

From the P4RT perspective, the P4Runtime spec doesn't talk about reboots in general. During cold reboot, the client observes the same behavior as a server shutdown. Ideally, the client shouldn't observe different behavior during warm reboot: from the P4Runtime lens, the server was shut down (thus unavailable for some time) and came back up in the same state as before, thereby maintaining read-write symmetry. I would be happy to discuss this further in the System Orchestration workgroup or in a separate thread.

* xcvrd will stop listening to transceiver updates such as module presence.
* Syncd will stop listening to link state events from the BCM chip.
* Orchagent will stop periodic internal timers.
* BGP will stop exchanging packets with the peers.
* Teamd will stop exchanging LACP PDUs with the peers.
**Contributor comment:**

IMO, this is not a good idea. Having teamd stop exchanging LACPDUs in Phase 1 would put a lot of pressure on reconciling within the standard LAG session timeout of 90s. If we spend too much time in Phase 2+ then LAGs are guaranteed to flap in the recovery path.

**Contributor comment:**

Same for syncd: since we want teamd to continue sending LACPDUs as long as possible, we should not request syncd to reach quiescence at the same time as Orchagent.

**Contributor comment:**

Thus, I don't think we can have all components reaching quiescence as a prerequisite for going into Phase 2.

**Author reply:**

I understand the concern here. There are multiple ways to tackle this problem:

  1. After the freeze notification, Teamd continues to exchange LACP PDUs but does not forward any requests to Orchagent/Syncd, so that their quiescence doesn't change.
  2. Set up a control plane assistant that serves as a proxy for Teamd.
  3. Increase the LAG session timeout before stopping the LACP PDU exchange.


After all components have been frozen, the switch will eventually reach a state wherein each component stops generating new events and thus the switch becomes quiescent. This is because:



* Switch boundaries that generate new events have been stopped.
* All timers that generate new events have been stopped.
* All components have completed processing their pending requests and thus there are no in-flight messages.

After receiving the freeze notification, components will update their quiescent state in STATE DB when they receive a new request (i.e. they are no longer quiescent) and when they complete processing their current request queue (i.e. they become quiescent). NSF Manager will monitor the quiescent state of all components in STATE DB to determine that the switch has become quiescent and that further state changes won't occur. If all components are in the quiescent state, NSF Manager will declare that the switch has become quiescent and thus has attained its final state. NSF Manager will wait a bounded period of time for the switch to become quiescent, after which it will declare that warm reboot failed and abort the warm reboot operation.
**Contributor comment:**

> NSF Manager will wait for a period of time for the switch to become quiescent after which it will determine that warm reboot failed and abort the warm reboot operation.

The design should also account for restoration of component state (unfreeze) in the event of failure.

**Author reply:**

Does SONiC currently support unfreeze of applications in case of failures?
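
To make the quiescence wait concrete, here is a sketch of what it could look like inside NSF Manager. It assumes components report a `state` field in a STATE DB table; the table name `NSF_COMPONENT_TABLE` and field values are illustrative, and the timeout shown is per event rather than a global deadline:

```cpp
// Sketch of NSF Manager waiting for switch quiescence with a timeout.
// Table name "NSF_COMPONENT_TABLE" and the "state" field are illustrative.
#include "dbconnector.h"
#include "select.h"
#include "subscriberstatetable.h"
#include "table.h"

#include <set>
#include <string>

bool waitForQuiescence(const std::set<std::string>& frozenComponents,
                       int timeoutMs)
{
    swss::DBConnector stateDb("STATE_DB", 0);
    swss::SubscriberStateTable componentTable(&stateDb, "NSF_COMPONENT_TABLE");
    swss::Select s;
    s.addSelectable(&componentTable);

    std::set<std::string> quiescent;
    while (quiescent != frozenComponents)
    {
        swss::Selectable* sel = nullptr;
        // Per-event timeout for simplicity; a real implementation would
        // track an overall deadline before aborting the warm reboot.
        if (s.select(&sel, timeoutMs) != swss::Select::OBJECT)
        {
            return false;  // timed out: abort the warm reboot operation
        }
        swss::KeyOpFieldsValuesTuple kofv;
        componentTable.pop(kofv);
        for (const auto& fv : kfvFieldsValues(kofv))
        {
            if (fvField(fv) != "state") continue;
            if (fvValue(fv) == "quiescent")
                quiescent.insert(kfvKey(kofv));  // component became quiescent
            else
                quiescent.erase(kfvKey(kofv));   // component left quiescence
        }
    }
    return true;
}
```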



##### Phase 2: State Verification (Optional)



![alt_text](img/state-verification.png)

Since the switch is in a quiescent state, this will be its final state before reboot. NSF Manager will trigger state verification to ensure that the switch is in a consistent state. Reconciling from an inconsistent state might cause traffic loss, so it is important to ensure consistency before warm rebooting. NSF Manager will abort the warm reboot operation if state verification fails.
**Contributor comment:**

How is this phase different from what was done as last step in Phase 1? Also, please define consistent state.

**Author reply (@akarshgupta25), Oct 27, 2023:**

Phase 1 ensures that the switch reaches a stable state. Phase 2 (optional) audits the switch state. State verification enables auditing of the state of different components and ensuring that they are in an expected state. Consistency depends on the component. It can mean that the internal cache matches the switch state, or the switch intent matches the switch state etc. For example, BGP kernel routes match APPL DB (or ASIC DB) information, oper-status in Linux host interface matches APPL DB information etc.



##### Phase 3: Trigger Checkpointing

**Reviewer comment:**

Can NSF Manager merge Phase 2 and Phase 3 and let each component complete state verification and save its internal state together, so that we reduce the turnaround time of an additional phase? Each component could do state verification and/or save its internal state based on the component configuration.

**Author reply:**

This is an interesting point :)

Some context:
For some switch stack components such as Syncd, checkpoint notification results in saving the internal state, disconnecting from the ASIC and exiting. This is similar to processing the shutdown notification being sent by /usr/bin/syncd_request_shutdown binary in the existing warm-boot orchestration.
Additionally, we are exploring whether we should support unfreeze during warm shutdown. One possible situation is that we send freeze, then state verification and unfreeze on state verification failure.

With this background, if we support state verification and checkpoint as a single phase then Syncd can exit after checkpointing and state verification can fail for some other application. In such a case, we wouldn't be able to freeze --> state verification failure --> unfreeze unless we ensure that checkpointing isn't a point of no return for a switch stack component.

So we need to analyze the pros and cons of merging these 2 phases. Let me discuss it internally and get back.




![alt_text](img/checkpoint.png)

NSF Manager will send a checkpoint notification to all registered components and will wait only for those components that set _checkpoint = true_ to complete checkpointing, i.e. save internal state to either the DB or the persistent file system. Checkpointing only after state verification succeeds ensures that the switch reconciles from a consistent state. NSF Manager will wait a bounded period of time for the components to update STATE DB with their checkpointing status, after which it will abort the warm reboot operation.


##### Phase 4: Prepare and Perform Reboot



![alt_text](img/phase-4-shutdown.png)

At this point, the DB will be backed up, containers will be shut down, the platform will be prepared for reboot, and the switch will be rebooted.


##### Application Shutdown Optimization



![alt_text](img/shutdown-optimization.png)

Higher-layer applications are generally independent and their shutdown does not depend on the quiescence of other switch stack components. For example, upon receiving a freeze notification Teamd can stop exchanging LACP packets, perform state verification (optional) and checkpoint LACP state, i.e. transition from Phase 1 through Phase 3 without waiting for NSF Manager to send further notifications. The design allows applications to transition through these phases independently as long as they continue to update their warmboot state in STATE DB. NSF Manager will monitor these states and handle the applications' phase transitions. In such a scenario, applications need to set _freeze = true_ and _checkpoint = false_, and the freeze notification will result in the application transitioning from Phase 1 through Phase 3.
**Contributor comment:**

> The design allows applications to transition through these phases independently

But in warm shutdown we do need dependent transitions. Independent transitions can cause confusion and open the gates for multiple race conditions.

**Author reply:**

Phase 1 ensures that the switch stack components continue to serve requests from other components. Also, having a state machine during shutdown helps alleviate the race conditions. I would be happy to look at individual examples of race conditions and check if this design is able to handle them.



#### Reconciliation Monitoring


![alt_text](img/reconciliation.png)

Upon bootup, components will detect the warm-boot flag in STATE DB and reconcile to their pre-reboot state. Components that registered with _reconciliation = true_ will set their warm-boot state to RECONCILED in STATE DB after they have completed reconciliation. NSF Manager will monitor the reconciliation status of these registered components during warm bootup. Subsequently, the components will resume normal operation and listen for new requests. The proposed reconciliation monitoring mechanism is similar to the existing one, except that the components being monitored are derived from the NSF details registration rather than a hardcoded list.
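
On the component side, the bootup path can reuse the existing swss-common warm-restart helpers; `checkWarmStart()`, `isWarmStart()` and `setWarmStartState()` are existing APIs, while the restore/reconcile calls below are placeholders for component-specific logic:

```cpp
// Component bootup sketch using existing swss-common warm-restart helpers.
// restoreState() and reconcile() are placeholders for component logic.
#include "warm_restart.h"

void restoreState();   // placeholder: rebuild internal state from the DB
void reconcile();      // placeholder: reconcile to the pre-reboot state

void componentStartup()
{
    swss::WarmStart::initialize("orchagent", "swss");
    swss::WarmStart::checkWarmStart("orchagent", "swss");

    if (swss::WarmStart::isWarmStart())  // warm-boot flag detected
    {
        restoreState();
        reconcile();
        // NSF Manager monitors this state to track reconciliation progress.
        swss::WarmStart::setWarmStartState("orchagent",
                                           swss::WarmStart::RECONCILED);
    }
}
```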


#### Component Warmboot States

Based on the shutdown and reconciliation sequence mentioned above, a component will transition through the following states during warm reboot:



* Frozen (warm shutdown)
* Quiescent (warm shutdown)
* Checkpointed (optional, warm shutdown)
* Initialized (warm bootup)
* Reconciled (warm bootup)
* Failed (warm shutdown or bootup)



![alt_text](img/warmboot-states.png)

Upon receiving a freeze notification, a component will transition to the _frozen_ state once it has stopped generating new intents for which it is the source. It will then transition to the _quiescent_ state once it has completed processing all requests. A component may move back and forth between _frozen_ and _quiescent_ if it receives and processes new requests from other switch components while in freeze mode. If a component fails to perform its freeze routine, i.e. fails to stop generating new intents for which it is the source, it will transition to the _failed_ state. Upon receiving a checkpoint notification, a component will transition to the _checkpointed_ state after it has completed checkpointing, or to the _failed_ state if checkpointing failed.

**Reviewer comment:**

Is the warmboot state _failed_ just a state, or will it trigger any event such as a restart?

**Author reply:**

Since this is an initial NSF Manager design, the current proposal is to abort warm reboot if an application reports failed warm-boot state. However, we are exploring unfreeze on warm-boot failure as well. For example, if Phase 1 (freeze) fails then we can unfreeze the switch stack and return that warm reboot failed.


After warm reboot, a component will transition to _initialized_ state after it has completed its initialization routine. It will transition to _reconciled_ state after it has completed reconciliation. It will transition to _failed_ state if its initialization or reconciliation fails.

Components will update their state in STATE DB using [setWarmStartState()](https://github.com/sonic-net/sonic-swss-common/blob/master/common/warm_restart.cpp#L223) API during the different warm reboot stages. NSF Manager will monitor these NSF states in STATE DB to determine whether it needs to proceed with the next phase of the warm-boot orchestration or not.
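
As a sketch, a component could walk the shutdown side of this state machine as shown below. `setWarmStartState()` and `RECONCILED`/`INITIALIZED` exist in swss-common today; `FROZEN`, `QUIESCENT`, `CHECKPOINTED` and `FAILED` are new states proposed by this HLD, so the enum values below assume those additions land:

```cpp
// Illustrative shutdown-side state transitions. FROZEN, QUIESCENT,
// CHECKPOINTED and FAILED are states proposed by this HLD, assumed to be
// added to swss::WarmStart::WarmStartState.
#include "warm_restart.h"

void stopSelfSourcedEvents();  // placeholder: component freeze logic
bool saveInternalState();      // placeholder: checkpoint to DB or file system

void onFreezeNotification()
{
    stopSelfSourcedEvents();
    swss::WarmStart::setWarmStartState("orchagent", swss::WarmStart::FROZEN);

    // Once the request queue drains, report quiescence. The component flips
    // back to FROZEN if a new request from another component arrives.
    swss::WarmStart::setWarmStartState("orchagent", swss::WarmStart::QUIESCENT);
}

void onCheckpointNotification()
{
    if (saveInternalState())
        swss::WarmStart::setWarmStartState("orchagent",
                                           swss::WarmStart::CHECKPOINTED);
    else
        swss::WarmStart::setWarmStartState("orchagent",
                                           swss::WarmStart::FAILED);
}
```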


### SAI API

**Reviewer comment:**

Could you add a note that this feature's changes are independent of any SAI changes? There could be possible SAI changes needed for further improvement.

**Author reply:**

Added a note in SAI API section.

_NA_


### Configuration and management

_NA_


### Warmboot and Fastboot Design Impact

Users of the existing warmboot and fastboot mechanisms will not be impacted. All components will remain compatible with the existing SONiC design. An additional flag will be added in CONFIG DB to indicate whether state verification is required during warm-boot orchestration.
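
For illustration, the flag could live in the existing WARM_RESTART table in CONFIG DB. The `state_verification` field name below is a hypothetical placeholder, since the HLD does not fix the schema:

```cpp
// Hypothetical read of the proposed state-verification flag. The
// "state_verification" field name is illustrative only.
#include "dbconnector.h"
#include "table.h"

#include <string>

bool stateVerificationEnabled()
{
    swss::DBConnector configDb("CONFIG_DB", 0);
    swss::Table warmRestartTable(&configDb, "WARM_RESTART");

    std::string value;
    if (warmRestartTable.hget("system", "state_verification", value))
    {
        return value == "true";
    }
    return false;  // default: skip the optional verification phase
}
```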


### Restrictions/Limitations


### Testing Requirements/Design

NSF Manager will have unit and component tests to verify shutdown orchestration and reconciliation monitoring functionality. Component tests will be added to all switch stack components that will register with NSF Manager to ensure that they process notifications from NSF Manager and update STATE DB correctly.


### Open/Action items - if any

NOTE: All the sections and sub-sections given above are mandatory in the design document. Users can add additional sections/sub-sections if required.
Binary file added doc/warm-reboot/img/checkpoint.png
Binary file added doc/warm-reboot/img/freeze.png
Binary file added doc/warm-reboot/img/phase-4-shutdown.png
Binary file added doc/warm-reboot/img/reconciliation.png
Binary file added doc/warm-reboot/img/shutdown-optimization.png
Binary file added doc/warm-reboot/img/state-verification.png
Binary file added doc/warm-reboot/img/warm-reboot-overall.png
Binary file added doc/warm-reboot/img/warmboot-states.png