Gradual rollout #110
base: main
Conversation
Force-pushed from 401b378 to 6c79808
I sadly can't comment on the lines directly because GitHub won't let me, so linking instead:
https://github.com/telekom/das-schiff-network-operator/pull/110/files#diff-a96964d7107a0881e826cd2f6ac71e901d612df880362eb04cb0b54eb609b8e5L70-L90
This does not seem required anymore when the Layer2 Network Configurations are handled by the ConfigReconciler.
Also, what happens when a node joins(!) or leaves the cluster? Especially when the joining or leaving happens in the middle of a rollout?
Force-pushed from f73e182 to 7dd91cf
If a node joins the cluster, it should be configured in the next reconciliation loop iteration (I think; I will check that to be sure). But on node leave, the config will be tagged as invalid (as it should time out) and the configuration will be aborted. I'll try to fix that.
Would be nice to watch node events in the central manager to create that nodeconfig before the next reconcile loop; something like the sketch below.
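For illustration, a rough sketch of such a node watch using controller-runtime (assuming a pre-0.15 `Watches` signature; the controller name and wiring here are illustrative, not the PR's actual code):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// setupNodeWatch wires a watch on Node objects into the manager so a
// reconcile is triggered as soon as a node joins or leaves the cluster,
// instead of waiting for the next periodic reconcile loop.
func setupNodeWatch(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		Named("node-events").
		// Enqueue a reconcile request for every Node add/update/delete.
		Watches(&source.Kind{Type: &corev1.Node{}}, &handler.EnqueueRequestForObject{}).
		Complete(r)
}
```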
@chdxD1
Force-pushed from a192158 to 9cbcd4f
Force-pushed from 9cbcd4f to dc5ac0f
Force-pushed from 4e53170 to 31cd606
Signed-off-by: Patryk Strusiewicz-Surmacki <patryk-pawel.strusiewicz-surmacki@external.telekom.de>
Force-pushed from 31cd606 to 6c8cec3
```go
	}
}

func CopyNodeConfig(src, dst *NodeConfig, name string) {
```
Any reason not to use the generated nodeConfig.DeepCopy() functions instead?
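A minimal sketch of what that could look like, assuming `NodeConfig` is a regular API type with the controller-gen-generated `DeepCopyInto` method (the body here is illustrative, not the PR's actual code):

```go
// CopyNodeConfig relies on the generated DeepCopyInto instead of copying
// fields by hand, so the copy cannot silently go stale when NodeConfig
// gains new fields.
func CopyNodeConfig(src, dst *NodeConfig, name string) {
	src.DeepCopyInto(dst)
	dst.Name = name
}
```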
```diff
-    newTag: latest
+    newTag: v301
   - name: frr-exporter
     newName: ghcr.io/telekom/frr-exporter
-    newTag: latest
+    newTag: v301
```
is this intentional?
```diff
@@ -0,0 +1,394 @@
+package configmanager
```
Please change the directory to configmanager so it matches the package name (don't add an underscore to the package name instead, please).
```diff
@@ -0,0 +1,63 @@
+package configmap
```
Same here, please change the package name. Apart from that, why even create two packages here? I would merge them into a config package. Then it's also possible to rename the interface to Map and keep the struct private; something like the sketch below.
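For illustration, a sketch of the suggested single package; all identifiers below are illustrative, not lifted from the PR:

```go
package config

import "sync"

// NodeConfig stands in for the real API type from the PR.
type NodeConfig struct{ /* ... */ }

// Map is the exported interface callers program against;
// the backing struct stays private to the package.
type Map interface {
	Get(name string) (*NodeConfig, bool)
	Store(name string, cfg *NodeConfig)
}

type configMap struct {
	mu      sync.RWMutex
	configs map[string]*NodeConfig
}

// New returns the private implementation behind the Map interface.
func New() Map {
	return &configMap{configs: map[string]*NodeConfig{}}
}

func (m *configMap) Get(name string) (*NodeConfig, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	cfg, ok := m.configs[name]
	return cfg, ok
}

func (m *configMap) Store(name string, cfg *NodeConfig) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.configs[name] = cfg
}
```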
```go
		}
	}
	default:
		time.Sleep(defaultCooldownTime)
```
When no default is specified, select will just wait until it receives something on any of the channels. Removing this is more efficient than actively sleeping for a bit.
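For illustration, the blocking shape this describes; the receiver, channel, and handler names are assumptions, not the PR's actual identifiers:

```go
// watch blocks until shutdown or a config change. With no default case,
// the goroutine parks until a channel is ready instead of polling with
// time.Sleep(defaultCooldownTime).
func (cm *ConfigManager) watch(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case cfg := <-cm.changes:
			cm.processConfig(cfg) // hypothetical handler
		}
	}
}
```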
```go
}

// WatchConfigs waits for cm.changes channel.
func (cm *ConfigManager) WatchConfigs(ctx context.Context, errCh chan error) {
```
You could simplify this a bit by combining WatchConfigs and WatchDeletedNodes into a single method and selecting over both channels. It would of course be a bit slower if both happen at the same time, but that's probably not an issue anyway? Something like the sketch below.
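A sketch of such a combined watcher; the method and field names are assumptions based on the review context, not the PR's actual code:

```go
// Watch selects over both the config-change and node-deletion channels,
// replacing the two separate WatchConfigs/WatchDeletedNodes goroutines.
func (cm *ConfigManager) Watch(ctx context.Context, errCh chan error) {
	for {
		select {
		case <-ctx.Done():
			// Stop watching when the manager shuts down.
			errCh <- ctx.Err()
			return
		case cfg := <-cm.changes:
			cm.handleConfigChange(cfg) // hypothetical handler
		case node := <-cm.deletedNodes:
			cm.handleDeletedNode(node) // hypothetical handler
		}
	}
}
```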
```go
if !errors.Is(ctx.Err(), context.Canceled) {
	errCh <- fmt.Errorf("error watching configs: %w", ctx.Err())
} else {
	errCh <- nil
```
why even pipe in anything at all?
```go
select {
case <-ctx.Done():
	if !errors.Is(ctx.Err(), context.Canceled) {
		errCh <- fmt.Errorf("error watching configs: %w", ctx.Err())
```
Is it really an error when the context gets cancelled? Or is this context being cancelled via that cancel function in L87?
```go
var isDirty bool
var err error
// check previous leader's work
if isDirty, err = cm.isDirty(); err != nil {
```
I'd just do it this way; it doesn't make a difference scope-wise and is shorter:

```diff
-var isDirty bool
-var err error
-// check previous leader's work
-if isDirty, err = cm.isDirty(); err != nil {
+// check previous leader's work
+isDirty, err := cm.isDirty()
+if err != nil {
```
This PR implements gradual rollout as described in #98
There are 2 new CRDs added:

- `NodeConfig` - which represents node configuration
- `NodeConfigProcess` - which represents the global state of the configuration (can be `provisioning` or `provisioned`). This is used to check if the previous leader did not fail in the middle of the configuration process. If so, backups are restored.

New pod added - `network-operator-configurator` - this pod (daemonset) is responsible for fetching `vrfrouteconfigurations`, `layer2networkconfigurations` and `routingtables` and combining those into a `NodeConfig` for each node.

The `network-operator-worker` pod, instead of fetching separate config resources, will now only fetch `NodeConfig`. After configuration is done and connectivity is checked, it will back up the config on disk. If connectivity is lost after deploying a new config, the configuration will be restored using the local backup.

For each node there can be 3 NodeConfig objects created:
- `<nodename>` - current configuration
- `<nodename>-backup` - backup configuration
- `<nodename>-invalid` - last known invalid configuration

How does it work:
1. `network-operator-configurator` starts and leader election takes place.
2. The leader checks if `NodeConfigProcess` is in `invalid` or `provisioning` state, to check if the previous leader did not die amid the configuration process. If so, it will revert the configuration for all the nodes using the backup configuration.
3. On any change of a `vrfrouteconfigurations`, `layer2networkconfigurations` and/or `routingtables` object, `configurator` will:
   - create a new `NodeConfig` for each node
   - set the `NodeConfigProcess` state to `provisioning`
   - deploy the node configs, each in state `provisioning`.
4. `network-operator-worker` fetches the new config and configures the node. It checks connectivity and marks the config as `provisioned` on success or `invalid` on failure.
5. If the config is `provisioned`, the configurator proceeds with deploying the next node(s).
6. If the config is `invalid`, it aborts the deployment and reverts changes on all the nodes that were changed in this iteration.

Configurator can be set to update more than 1 node concurrently. The number of nodes for concurrent update can be set using the `update-limit` configuration flag (defaults to 1); a sketch of this batch logic follows below.
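For illustration, a minimal sketch of the batch behaviour the `update-limit` flag controls, assuming hypothetical helpers `deployConfigs` and `revertConfigs` (these are not functions from the PR):

```go
package main

import (
	"context"
	"fmt"
)

// deployConfigs and revertConfigs are stand-ins for the real worker
// interaction (deploy configs, wait for provisioned/invalid, restore
// from the on-disk backups).
func deployConfigs(ctx context.Context, nodes []string) error { return nil }
func revertConfigs(ctx context.Context, nodes []string)       {}

// rollout updates nodes in batches of at most updateLimit. If any config
// in a batch is reported invalid, the rollout aborts and every node
// changed in this iteration is reverted from its backup.
func rollout(ctx context.Context, nodes []string, updateLimit int) error {
	var deployed []string
	for start := 0; start < len(nodes); start += updateLimit {
		end := start + updateLimit
		if end > len(nodes) {
			end = len(nodes)
		}
		batch := nodes[start:end]
		if err := deployConfigs(ctx, batch); err != nil {
			revertConfigs(ctx, append(deployed, batch...))
			return fmt.Errorf("rollout aborted: %w", err)
		}
		deployed = append(deployed, batch...)
	}
	return nil
}
```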