Further refactoring zedrouter to make it much more robust #3322
LGTM, thanks for structuring commits. Was easy to read your changes. And you have a typo in this commit
I see you changed the sanitycheck functions for the network instance to operate on the config and not the status. Is there ever a case when we keep a status around after the config has been deleted by zedagent, or is the delete of the status always immediate when the config is gone?
(I don't know why we were previously checking this on status.)
@milan-zededa I see a golang panic when running ztests (goroutine 242 [running]). I've verified that this panic is due to niBridgeIsCreatedByNIM() being called with a nil argument. Guarding against that makes the ztests pass.
Signed-off-by: Milan Lenco <milan@zededa.com>
…between NIs This ensures that the reconciliation process will not enter invalid (even if intermediate) states, such as running multiple instances of dnsmasq over the same bridge interface or assigning the same IP to different bridges. This could in theory happen if a network instance is being created while another one is still being deleted. The dependencies of dnsmasq, the HTTP server and radvd had to be described more precisely to disallow such invalid states. Signed-off-by: Milan Lenco <milan@zededa.com>
Signed-off-by: Milan Lenco <milan@zededa.com>
Signed-off-by: Milan Lenco <milan@zededa.com>
The implementation is able to handle any change in the config of network instances and is much better at detecting and collecting status updates. Signed-off-by: Milan Lenco <milan@zededa.com>
This avoids including deleted NIs in subnet overlap checks (status may still exist while config is already deleted). Signed-off-by: Milan Lenco <milan@zededa.com>
This can be very useful for troubleshooting purposes (especially in combination with netdumps). Note that processes are not stopped/started very often, so this will not generate that many new log entries. Signed-off-by: Milan Lenco <milan@zededa.com>
IPReserve is an item representing the allocation and use of an IP address (for a bridge). The purpose of this item is to ensure that the same IP address will not be used by multiple bridges at the same time (this includes intermediate reconciliation states). This works by having the bridge depend on the reservation and by requesting re-creation of IPReserve when it changes, thus triggering re-creation of bridges and all higher-layer items that depend on it. Signed-off-by: Milan Lenco <milan@zededa.com>
Inter-NI conflict (e.g. subnet overlap) may happen even when going from one valid intended state to another. This is because zedrouter receives create/modify/delete NI notifications separately for each NI from zedagent and a single change on its own may not be valid. Signed-off-by: Milan Lenco <milan@zededa.com>
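As a rough illustration of the config-based subnet overlap check mentioned in the commits above, here is a minimal Go sketch. The `niConfig` type and the `subnetsOverlap`, `findOverlap` and `mustCIDR` helpers are hypothetical names invented for this example and are not taken from zedrouter; the only real API used is the standard `net` package.

```go
package main

import (
	"fmt"
	"net"
)

// niConfig is a hypothetical, trimmed-down stand-in for a network instance
// config. The real zedrouter types are not shown in this thread.
type niConfig struct {
	Name   string
	Subnet *net.IPNet
}

// subnetsOverlap reports whether two CIDR ranges share any address.
// Two CIDR blocks are either disjoint or one contains the other, so it is
// enough to test each network address against the other block.
func subnetsOverlap(a, b *net.IPNet) bool {
	if a == nil || b == nil {
		return false
	}
	return a.Contains(b.IP) || b.Contains(a.IP)
}

// findOverlap checks a candidate against the currently configured NIs only.
// An NI whose config was already deleted is simply absent from the slice,
// even if its status still exists while the NI is being removed.
func findOverlap(candidate niConfig, configured []niConfig) (string, bool) {
	for _, other := range configured {
		if other.Name == candidate.Name {
			continue
		}
		if subnetsOverlap(candidate.Subnet, other.Subnet) {
			return other.Name, true
		}
	}
	return "", false
}

// mustCIDR parses a CIDR string and panics on invalid input (example only).
func mustCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}

func main() {
	configured := []niConfig{
		{Name: "ni1", Subnet: mustCIDR("10.1.0.0/16")},
		{Name: "ni2", Subnet: mustCIDR("192.168.5.0/24")},
	}
	candidate := niConfig{Name: "ni3", Subnet: mustCIDR("10.1.20.0/24")}
	if name, ok := findOverlap(candidate, configured); ok {
		fmt.Printf("%s overlaps with %s\n", candidate.Name, name)
	}
}
```

Because each NI change arrives separately, a check like this can fail for an intermediate state even when the final intended state is valid, so a caller would presumably need to re-run it when other NIs change.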
@eriknordmark Should be fixed now. The place in the code where the variable was nil should no longer be reachable.
Yes, the status lives a bit longer than the config when a network instance is deleted. The config is removed immediately by zedagent, but the status disappears only after zedrouter has completed removal of the NI.
Could there be cases where this causes the conflict check to pass (on config), but then there is an error deep in the code due to the status and related kernel state not being cleaned up yet?
These complexities have now been moved to the reconciler layer. It handles config items of all network instances and their dependencies. If something is still being deleted for an obsolete network instance, it may delay creation of a new network instance (if some config items overlap), but eventually it will iterate into the intended state. An error from one network instance, e.g. a failed delete, may negatively affect creation/modification of another network instance, and the error will in some form propagate to the status of the other NI. Or at least the other NI will show as still being in progress. Checking status errors of network instances at this higher level would not work very well, because it is harder to tell at that level whether there is actually any overlap between config items of NIs and whether the error relates to the overlap. So I think it is OK to let the validation pass and fail later, but only if there is an actual conflict with another NI.
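To make the dependency-driven behaviour discussed here more concrete (a bridge depending on an IP reservation, dnsmasq depending on the bridge), the following is a minimal Go sketch. The `graph` type, its methods and the item names are hypothetical and invented for illustration; this is not the reconciler's actual API, only a toy model of "re-creating an item forces re-creation of everything that depends on it".

```go
package main

import (
	"fmt"
	"sort"
)

// graph is a hypothetical toy model of config items and their dependencies.
type graph struct {
	deps    map[string][]string // item -> items it depends on
	created map[string]bool     // items that currently exist
	log     []string            // operations in the order they were executed
}

// dependents returns the created items that directly depend on name.
func (g *graph) dependents(name string) []string {
	var out []string
	for item, deps := range g.deps {
		for _, d := range deps {
			if d == name && g.created[item] {
				out = append(out, item)
			}
		}
	}
	sort.Strings(out) // deterministic order for the example
	return out
}

// create makes sure all dependencies exist before creating the item itself.
func (g *graph) create(name string) {
	if g.created[name] {
		return
	}
	for _, d := range g.deps[name] {
		g.create(d)
	}
	g.created[name] = true
	g.log = append(g.log, "create "+name)
}

// remove deletes everything that depends on the item before the item itself.
func (g *graph) remove(name string) {
	if !g.created[name] {
		return
	}
	for _, d := range g.dependents(name) {
		g.remove(d)
	}
	g.created[name] = false
	g.log = append(g.log, "delete "+name)
}

// recreate removes the item (and, transitively, its dependents) and then
// restores every item that existed before.
func (g *graph) recreate(name string) {
	var before []string
	for item, ok := range g.created {
		if ok {
			before = append(before, item)
		}
	}
	sort.Strings(before)
	g.remove(name)
	for _, item := range before {
		g.create(item)
	}
}

func main() {
	g := &graph{
		deps: map[string][]string{
			"bridge":  {"ipreserve"}, // bridge needs its IP reservation
			"dnsmasq": {"bridge"},    // dnsmasq runs over the bridge
		},
		created: map[string]bool{},
	}
	g.create("dnsmasq")
	g.recreate("ipreserve") // e.g. the reservation changed
	for _, op := range g.log {
		fmt.Println(op)
	}
}
```

In this model the old bridge and dnsmasq are always gone before the reservation is released and re-taken, which is the kind of ordering that rules out two bridges holding the same IP in an intermediate state.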
Ignoring the three yetus complaints on existing unmodified code.
Testing of refactored zedrouter showed that while it can handle incremental changes (one change at a time, e.g.: add network instance; connect new app; etc.), there are some cracks when it comes to applying new device config containing multiple changes at once (e.g. the set of currently deployed network instances is replaced with a completely different set). However unlikely it is to encounter such complex sudden device config changes in practice, it would be nice if zedrouter was robust enough to handle anything.
An eden test was prepared to stress zedrouter under complex config changes: lf-edge/eden#870
This test would fail even before the already merged zedrouter refactoring. Zedrouter was never robust enough to handle any config change (even if both the previous and the new config are valid). This PR changes that and makes the test pass.
This PR continues (and hopefully ends!) the zedrouter refactoring and adds some commits. The three main topics are:
LinuxNIReconciler
for more robustness
More detailed info can be found in the descriptions of individual commits.