
Further refactoring zedrouter to make it much more robust #3322

Merged
merged 9 commits into lf-edge:master from zedrouter-refactoring2 on Jul 11, 2023

Conversation

milan-zededa (Contributor)

Testing of the refactored zedrouter showed that while it can handle incremental changes (one change at a time, e.g. add a network instance; connect a new app; etc.), there are some cracks when it comes to applying a new device config containing multiple changes at once (e.g. the set of currently deployed network instances is replaced with a completely different set). However unlikely such complex, sudden device config changes are in practice, it would be nice if zedrouter were robust enough to handle anything.

An eden test was prepared to stress zedrouter under complex config changes: lf-edge/eden#870
This test would fail even before the already-merged zedrouter refactoring; zedrouter was never robust enough to handle arbitrary config changes (even when both the previous and the new config are valid). This PR changes that and makes the test pass.

This PR continues (and hopefully concludes!) the zedrouter refactoring with several more commits, organized around three main topics.

More detailed info can be found in the descriptions of the individual commits.

@milan-zededa milan-zededa requested a review from uncleDecart on July 6, 2023 12:53
@milan-zededa milan-zededa force-pushed the zedrouter-refactoring2 branch from 1c94cf1 to c9b0b90 on July 6, 2023 13:18
@uncleDecart uncleDecart (Member) left a comment

LGTM, thanks for structuring the commits; it made your changes easy to read. Also, you have a typo in this commit.

Inline review comments on pkg/pillar/nireconciler/linux_reconciler.go (resolved)
@milan-zededa milan-zededa force-pushed the zedrouter-refactoring2 branch 2 times, most recently from cd799df to 4b23866 on July 7, 2023 16:30
@eriknordmark eriknordmark (Contributor) left a comment

I see you changed the sanity-check functions for the network instance to operate on the config and not the status. Is there ever a case where we keep a status around after the config has been deleted by zedagent, or is the delete of the status always immediate when the config is gone?
(I don't know why we were previously checking this on status.)

@eriknordmark eriknordmark (Contributor) commented Jul 8, 2023

@milan-zededa I see a golang panic when running ztests.
{"file":"/pillar/cmd/zedagent/zedagent.go:2136","func":"github.com/lf-edge/eve/pkg/pillar/cmd/zedagent.handleDPCLImpl","level":"info","msg":"handleDPCLImpl: has real change.","pid":1825,"source":"zedagent","time":"2023-07-07T23:46:32.510430094Z"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x2276290]

goroutine 242 [running]:
github.com/lf-edge/eve/pkg/pillar/nireconciler.(*LinuxNIReconciler).niBridgeIsCreatedByNIM(...)
/pillar/nireconciler/linux_config.go:1011
github.com/lf-edge/eve/pkg/pillar/nireconciler.(*LinuxNIReconciler).updateCurrentNIBridge(0xc000ae2800, {0xa5, 0x69, 0xd8, 0xa8, 0x21, 0x87, 0x4a, 0x1e, 0x95, ...})
/pillar/nireconciler/linux_currentstate.go:179 +0x210
github.com/lf-edge/eve/pkg/pillar/nireconciler.(*LinuxNIReconciler).runWatcher(0xc000ae2800, 0xc000cbe420)
/pillar/nireconciler/linux_reconciler.go:308 +0xb4e
created by github.com/lf-edge/eve/pkg/pillar/nireconciler.(*LinuxNIReconciler).init
/pillar/nireconciler/linux_reconciler.go:191 +0x50f

I've verified that this panic is due to niBridgeIsCreatedByNIM() being called with a nil argument. Guarding against that makes the ztests pass.
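For illustration only, here is a minimal, self-contained sketch of such a nil guard; the Bridge type, function name, and predicate below are hypothetical stand-ins, not the actual pillar code:

```go
package main

import "fmt"

// Bridge is a hypothetical stand-in for the bridge state that was nil.
type Bridge struct {
	CreatedByNIM bool
}

// bridgeIsCreatedByNIM mirrors the shape of the panicking call:
// dereferencing br without a nil check is what caused the SIGSEGV above.
func bridgeIsCreatedByNIM(br *Bridge) bool {
	if br == nil {
		// The guard: treat a missing bridge as "not created by NIM"
		// instead of dereferencing a nil pointer.
		return false
	}
	return br.CreatedByNIM
}

func main() {
	var missing *Bridge // nil, as in the reported panic
	fmt.Println(bridgeIsCreatedByNIM(missing)) // prints "false" instead of panicking
}
```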

Signed-off-by: Milan Lenco <milan@zededa.com>
…between NIs

This ensures that the reconciliation process will not enter invalid (even if intermediate) states, such as running multiple instances of dnsmasq over the same bridge interface or assigning the same IP to different bridges.

This could in theory happen if a network instance is being created while another one is still being deleted.

The dependencies of dnsmasq, the HTTP server, and radvd had to be described in more detail to disallow such invalid states.

Signed-off-by: Milan Lenco <milan@zededa.com>
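As a toy illustration of the invariant these dependencies enforce (the Item type and validator below are hypothetical, not pillar's actual dependency-graph mechanism), a state check could reject any state that runs two dnsmasq instances over one bridge:

```go
package main

import "fmt"

// Item is a simplified stand-in for a reconciler config item.
type Item struct {
	Type   string // e.g. "dnsmasq", "bridge", "radvd"
	Bridge string // bridge interface the item is attached to
}

// stateIsValid rejects states the reconciler must never enter, here the
// specific case of two dnsmasq instances running over the same bridge.
func stateIsValid(items []Item) bool {
	dnsmasqOnBridge := make(map[string]bool)
	for _, it := range items {
		if it.Type != "dnsmasq" {
			continue
		}
		if dnsmasqOnBridge[it.Bridge] {
			return false // second dnsmasq on the same bridge
		}
		dnsmasqOnBridge[it.Bridge] = true
	}
	return true
}

func main() {
	// An NI being created while another is still being deleted could
	// transiently yield two dnsmasq items on the same bridge.
	state := []Item{
		{Type: "dnsmasq", Bridge: "bn1"},
		{Type: "dnsmasq", Bridge: "bn1"},
	}
	fmt.Println(stateIsValid(state)) // false
}
```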
Signed-off-by: Milan Lenco <milan@zededa.com>
Signed-off-by: Milan Lenco <milan@zededa.com>
The implementation is able to handle any change in the config of network instances and is much better at detecting and collecting status updates.

Signed-off-by: Milan Lenco <milan@zededa.com>
This avoids including deleted NIs in subnet overlap checks
(the status may still exist while the config is already deleted).

Signed-off-by: Milan Lenco <milan@zededa.com>
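For reference, a minimal sketch of what a subnet overlap check over NI configs can look like, using only the standard library (a hypothetical helper, not the actual zedrouter code):

```go
package main

import (
	"fmt"
	"net"
)

// subnetsOverlap reports whether two CIDR subnets share any addresses.
// Two CIDR ranges overlap exactly when one contains the other's base IP.
func subnetsOverlap(a, b *net.IPNet) bool {
	return a.Contains(b.IP) || b.Contains(a.IP)
}

func main() {
	_, n1, _ := net.ParseCIDR("10.1.0.0/16")
	_, n2, _ := net.ParseCIDR("10.1.2.0/24")
	_, n3, _ := net.ParseCIDR("10.2.0.0/16")
	fmt.Println(subnetsOverlap(n1, n2)) // true: the /24 lies inside the /16
	fmt.Println(subnetsOverlap(n1, n3)) // false: disjoint ranges
}
```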
This can be very useful for troubleshooting purposes (especially in
combination with netdumps).
Note that processes are not stopped/started very often, so this will not
generate that many new log entries.

Signed-off-by: Milan Lenco <milan@zededa.com>
IPReserve is an item representing the allocation and use of an IP address
(by a bridge). The purpose of this item is to ensure that the same IP
address will not be used by multiple bridges at the same time
(this includes intermediate reconciliation states).
This works by having the bridge depend on the reservation and
by requesting re-creation of IPReserve when it changes, thus triggering
re-creation of the bridge and of all higher-layer items that depend on it.

Signed-off-by: Milan Lenco <milan@zededa.com>
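A toy model of the reservation idea (the real IPReserve is an item in the reconciler's dependency graph; the map-based registry below is a hypothetical stand-in):

```go
package main

import (
	"fmt"
	"net"
)

// reservations maps an IP address to the bridge currently holding it.
// It is a toy stand-in for IPReserve: at most one bridge may hold a given
// IP at any time, including in intermediate reconciliation states.
type reservations map[string]string

// reserve grants ip to bridge br, failing while another bridge holds it.
func (r reservations) reserve(ip net.IP, br string) error {
	if holder, ok := r[ip.String()]; ok && holder != br {
		return fmt.Errorf("IP %s is still held by bridge %s", ip, holder)
	}
	r[ip.String()] = br
	return nil
}

// release frees the reservation; only then can another bridge take the IP,
// which models the re-create ordering enforced by the dependency graph.
func (r reservations) release(ip net.IP) {
	delete(r, ip.String())
}

func main() {
	res := reservations{}
	ip := net.ParseIP("10.1.0.1")
	fmt.Println(res.reserve(ip, "bn1")) // <nil>: reservation granted
	fmt.Println(res.reserve(ip, "bn2")) // error: bn1 still holds the IP
	res.release(ip)                     // bn1's bridge is deleted first
	fmt.Println(res.reserve(ip, "bn2")) // <nil>: now bn2 may take it
}
```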
An inter-NI conflict (e.g. subnet overlap) may happen even when going from
one valid intended state to another. This is because zedrouter receives
create/modify/delete NI notifications from zedagent separately for each NI,
and a single change on its own may not be valid. For example, swapping the
subnets of two NIs arrives as two separate modify notifications, and after
only the first one both NIs transiently reference the same subnet.

Signed-off-by: Milan Lenco <milan@zededa.com>
@milan-zededa milan-zededa force-pushed the zedrouter-refactoring2 branch from 4b23866 to 065a9c8 on July 10, 2023 09:02
@milan-zededa milan-zededa (Contributor, Author)

I see a golang panic when running ztests.

@eriknordmark Should be fixed now. The code path where the variable could be nil should no longer be reachable.

@milan-zededa milan-zededa (Contributor, Author)

I see you changed the sanity-check functions for the network instance to operate on the config and not the status. Is there ever a case where we keep a status around after the config has been deleted by zedagent, or is the delete of the status always immediate when the config is gone?
(I don't know why we were previously checking this on status.)

Yes, the status lives a bit longer than the config when a network instance is deleted. The config is removed immediately by zedagent, but the status disappears only after zedrouter has completed removal of the NI.
Using the config instead of the status for validation avoids including a NI that is being deleted in subnet overlap checks (against newly submitted NIs). But that's a rather special case; more importantly, I think it simply makes more sense to validate the config.

@eriknordmark eriknordmark (Contributor)

Yes, the status lives a bit longer than the config when a network instance is deleted. The config is removed immediately by zedagent, but the status disappears only after zedrouter has completed removal of the NI. Using the config instead of the status for validation avoids including a NI that is being deleted in subnet overlap checks (against newly submitted NIs). But that's a rather special case; more importantly, I think it simply makes more sense to validate the config.

Could there be cases where this causes the conflict check to pass (on config), but then there is an error deep in the code due to the status and related kernel state not being cleaned up yet?

@milan-zededa milan-zededa (Contributor, Author)

Yes, the status lives a bit longer than the config when a network instance is deleted. The config is removed immediately by zedagent, but the status disappears only after zedrouter has completed removal of the NI. Using the config instead of the status for validation avoids including a NI that is being deleted in subnet overlap checks (against newly submitted NIs). But that's a rather special case; more importantly, I think it simply makes more sense to validate the config.

Could there be cases where this causes the conflict check to pass (on config), but then there is an error deep in the code due to the status and related kernel state not being cleaned up yet?

These complexities have now been moved to the reconciler layer, which handles the config items of all network instances and their dependencies. If something is still being deleted for an obsolete network instance, it may delay the creation of a new network instance (if some config items overlap), but eventually it will iterate into the intended state. An error from one network instance, e.g. a failed delete, may negatively affect the creation/modification of another network instance, and the error will in some form propagate to the status of the other NI; at the very least, the other NI will show as still being in progress.

Checking status errors of network instances at this higher level would not work very well, because at that level it is harder to tell whether there is actually any overlap between the config items of the NIs and whether the error relates to that overlap. So I think it is OK to let the validation pass and fail later, but only if there is an actual conflict with another NI.

@eriknordmark eriknordmark (Contributor) left a comment

Ignoring the three yetus complaints on existing, unmodified code.

@eriknordmark eriknordmark merged commit c5e3475 into lf-edge:master Jul 11, 2023