Gradual rollout #110
base: main
Conversation
Force-pushed from 401b378 to 6c79808
I sadly can't comment on the lines directly because GitHub won't let me, so linking instead:
https://github.com/telekom/das-schiff-network-operator/pull/110/files#diff-a96964d7107a0881e826cd2f6ac71e901d612df880362eb04cb0b54eb609b8e5L70-L90
This does not seem required anymore when the Layer2 Network Configurations are handled by the ConfigReconciler.
Also, what happens when a node joins(!) or leaves the cluster? Especially when the joining or leaving happens in the middle of a rollout?
Force-pushed from f73e182 to 7dd91cf
If a node joins the cluster, it should be configured in the next reconciliation loop iteration (I think; I will check that to be sure). But on node leave, the config will be tagged as invalid (as it should time out) and the configuration will be aborted. I'll try to fix that.
Would be nice to watch node events in the central manager to create that nodeconfig before the next reconcile loop; something like the sketch below.
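For illustration, a rough sketch of such a node watch using controller-runtime (assuming a pre-0.15 `Watches` signature; the controller name and wiring here are illustrative, not the PR's actual code):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// setupNodeWatch wires a watch on Node objects into the manager so a
// reconcile is triggered as soon as a node joins or leaves the cluster,
// instead of waiting for the next periodic reconcile loop.
func setupNodeWatch(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		Named("node-events").
		// Enqueue a reconcile request for every Node add/update/delete.
		Watches(&source.Kind{Type: &corev1.Node{}}, &handler.EnqueueRequestForObject{}).
		Complete(r)
}
```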
@chdxD1
Force-pushed from a192158 to 9cbcd4f
Force-pushed from 9cbcd4f to dc5ac0f
Force-pushed from 4e53170 to 31cd606
Signed-off-by: Patryk Strusiewicz-Surmacki <patryk-pawel.strusiewicz-surmacki@external.telekom.de>
Force-pushed from 31cd606 to 6c8cec3
```go
	}
}

func CopyNodeConfig(src, dst *NodeConfig, name string) {
```
Any reason not to use the generated nodeConfig.DeepCopy() functions instead?
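A minimal sketch of what that could look like, assuming `NodeConfig` is a regular API type with the controller-gen-generated `DeepCopyInto` method (the body here is illustrative, not the PR's actual code):

```go
// CopyNodeConfig relies on the generated DeepCopyInto instead of copying
// fields by hand, so the copy cannot silently go stale when NodeConfig
// gains new fields.
func CopyNodeConfig(src, dst *NodeConfig, name string) {
	src.DeepCopyInto(dst)
	dst.Name = name
}
```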
```diff
-    newTag: latest
+    newTag: v301
   - name: frr-exporter
     newName: ghcr.io/telekom/frr-exporter
-    newTag: latest
+    newTag: v301
```
is this intentional?
```diff
@@ -0,0 +1,394 @@
+package configmanager
```
Please change the directory to configmanager so it matches the package name (don't add an underscore to the package name instead, please).
```diff
@@ -0,0 +1,63 @@
+package configmap
```
Same here, please change the package name. Apart from that, why even create two packages here? I would merge them into a config package. Then it's also possible to rename the interface to Map and keep the struct private; something like the sketch below.
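For illustration, a sketch of the suggested single package; all identifiers below are illustrative, not lifted from the PR:

```go
package config

import "sync"

// NodeConfig stands in for the real API type from the PR.
type NodeConfig struct{ /* ... */ }

// Map is the exported interface callers program against;
// the backing struct stays private to the package.
type Map interface {
	Get(name string) (*NodeConfig, bool)
	Store(name string, cfg *NodeConfig)
}

type configMap struct {
	mu      sync.RWMutex
	configs map[string]*NodeConfig
}

// New returns the private implementation behind the Map interface.
func New() Map {
	return &configMap{configs: map[string]*NodeConfig{}}
}

func (m *configMap) Get(name string) (*NodeConfig, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	cfg, ok := m.configs[name]
	return cfg, ok
}

func (m *configMap) Store(name string, cfg *NodeConfig) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.configs[name] = cfg
}
```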
```go
		}
	}
	default:
		time.Sleep(defaultCooldownTime)
```
When no default is specified, select will just wait until it receives something on any of the channels. Removing this is more efficient than actively sleeping for a bit.
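For illustration, the blocking shape this describes; the receiver, channel, and handler names are assumptions, not the PR's actual identifiers:

```go
// watch blocks until shutdown or a config change. With no default case,
// the goroutine parks until a channel is ready instead of polling with
// time.Sleep(defaultCooldownTime).
func (cm *ConfigManager) watch(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case cfg := <-cm.changes:
			cm.processConfig(cfg) // hypothetical handler
		}
	}
}
```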
```go
}

// WatchConfigs waits for cm.changes channel.
func (cm *ConfigManager) WatchConfigs(ctx context.Context, errCh chan error) {
```
You could simplify this a bit by combining WatchConfigs and WatchDeletedNodes into a single method and selecting over both channels. It would of course be a bit slower if both happen at the same time, but that's probably not an issue anyway? Something like the sketch below.
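A sketch of such a combined watcher; the method and field names are assumptions based on the review context, not the PR's actual code:

```go
// Watch selects over both the config-change and node-deletion channels,
// replacing the two separate WatchConfigs/WatchDeletedNodes goroutines.
func (cm *ConfigManager) Watch(ctx context.Context, errCh chan error) {
	for {
		select {
		case <-ctx.Done():
			// Stop watching when the manager shuts down.
			errCh <- ctx.Err()
			return
		case cfg := <-cm.changes:
			cm.handleConfigChange(cfg) // hypothetical handler
		case node := <-cm.deletedNodes:
			cm.handleDeletedNode(node) // hypothetical handler
		}
	}
}
```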
```go
if !errors.Is(ctx.Err(), context.Canceled) {
	errCh <- fmt.Errorf("error watching configs: %w", ctx.Err())
} else {
	errCh <- nil
```
why even pipe in anything at all?
```go
select {
case <-ctx.Done():
	if !errors.Is(ctx.Err(), context.Canceled) {
		errCh <- fmt.Errorf("error watching configs: %w", ctx.Err())
```
Is it really an error when the context gets cancelled? Or is this context being cancelled via that cancel function in L87?
```go
var isDirty bool
var err error
// check previous leader's work
if isDirty, err = cm.isDirty(); err != nil {
```
I'd just do it this way; it doesn't make a difference scope-wise and is shorter:

```diff
-var isDirty bool
-var err error
-// check previous leader's work
-if isDirty, err = cm.isDirty(); err != nil {
+// check previous leader's work
+isDirty, err := cm.isDirty()
+if err != nil {
```
This PR implements gradual rollout as described in #98
There are 2 new CRDs added:

- `NodeConfig` - which represents node configuration
- `NodeConfigProcess` - which represents the global state of the configuration (can be `provisioning` or `provisioned`). This is used to check if the previous leader did not fail in the middle of the configuration process. If so, backups are restored.

New pod added - `network-operator-configurator` - this pod (daemonset) is responsible for fetching `vrfrouteconfigurations`, `layer2networkconfigurations` and `routingtables` and combining those into a `NodeConfig` for each node.

The `network-operator-worker` pod, instead of fetching separate config resources, will now only fetch `NodeConfig`. After configuration is done and connectivity is checked, it will back up the config on disk. If connectivity is lost after deploying a new config, the configuration will be restored using the local backup.

For each node there can be 3 NodeConfig objects created:
- `<nodename>` - current configuration
- `<nodename>-backup` - backup configuration
- `<nodename>-invalid` - last known invalid configuration

How does it work:
1. `network-operator-configurator` starts and leader election takes place.
2. The leader checks if `NodeConfigProcess` is in `invalid` or `provisioning` state, to check if the previous leader did not die amid the configuration process. If so, it will revert the configuration for all the nodes using the backup configuration.
3. On any change of a `vrfrouteconfigurations`, `layer2networkconfigurations` and/or `routingtables` object, `configurator` will:
   - create a new `NodeConfig` for each node
   - set the `NodeConfigProcess` state to `provisioning`
   - deploy the node configs, each in state `provisioning`.
4. `network-operator-worker` fetches the new config and configures the node. It checks connectivity and marks the config as `provisioned` on success or `invalid` on failure.
5. If the config is `provisioned`, the configurator proceeds with deploying the next node(s).
6. If the config is `invalid`, it aborts the deployment and reverts changes on all the nodes that were changed in this iteration.

Configurator can be set to update more than 1 node concurrently. The number of nodes for concurrent update can be set using the `update-limit` configuration flag (defaults to 1); a sketch of this batch logic follows below.
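For illustration, a minimal sketch of the batch behaviour the `update-limit` flag controls, assuming hypothetical helpers `deployConfigs` and `revertConfigs` (these are not functions from the PR):

```go
package main

import (
	"context"
	"fmt"
)

// deployConfigs and revertConfigs are stand-ins for the real worker
// interaction (deploy configs, wait for provisioned/invalid, restore
// from the on-disk backups).
func deployConfigs(ctx context.Context, nodes []string) error { return nil }
func revertConfigs(ctx context.Context, nodes []string)       {}

// rollout updates nodes in batches of at most updateLimit. If any config
// in a batch is reported invalid, the rollout aborts and every node
// changed in this iteration is reverted from its backup.
func rollout(ctx context.Context, nodes []string, updateLimit int) error {
	var deployed []string
	for start := 0; start < len(nodes); start += updateLimit {
		end := start + updateLimit
		if end > len(nodes) {
			end = len(nodes)
		}
		batch := nodes[start:end]
		if err := deployConfigs(ctx, batch); err != nil {
			revertConfigs(ctx, append(deployed, batch...))
			return fmt.Errorf("rollout aborted: %w", err)
		}
		deployed = append(deployed, batch...)
	}
	return nil
}
```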