9.3.0: Add missing check for SenderStatusCertMiss and more robust cert fetch #3031
Backport of #3020
cc: @eriknordmark
cc: @dautovri
FYI: the branch probably has to be tagged as 9.3.1. Ruslan, you know better.
This needs a careful review to look for any other holes where we can miss updating the controller's certificates in EVE, e.g., if there is a network outage. If helpful, we can set up a call for that.
The refactoring in 985b142 missed the check for SenderStatusCertMiss.
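To illustrate the kind of check that was lost, here is a minimal sketch, assuming a cert-miss status on a controller response should trigger a refetch of the controller certificates. The type `senderStatus`, its constants, and `needsCertRefetch` are hypothetical stand-ins for illustration, not the actual EVE definitions.

```go
package main

// senderStatus is a hypothetical stand-in for the status that the
// send path reports back to its caller.
type senderStatus int

const (
	statusNone     senderStatus = iota // request completed normally
	statusCertMiss                     // response referenced a cert we do not have
	statusOther                        // any unrelated failure (hypothetical)
)

// needsCertRefetch reports whether the caller should trigger a fetch of
// the controller certificates. Dropping the statusCertMiss case here is
// the kind of omission a refactoring can introduce without any compile error.
func needsCertRefetch(s senderStatus) bool {
	switch s {
	case statusCertMiss:
		return true
	default:
		return false
	}
}
```

Keeping the decision in one small function makes it easy to cover with a unit test, which is one way to catch a regression of this kind.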
Given that we didn't detect that in testing, the second commit runs the fetch periodically (default every 24 hours) as extra safety.
Third commit: It turns out that the first fetch of the certs in zedagent can fail if the network isn't up yet, and the fetch on boot in client.go can be skipped if the device has already been onboarded. It therefore makes sense to retry more frequently when that happens. (Note that all of this retry logic is a second layer of protection should the SenderStatusCertMiss checks be broken again as in the above commit.)
We also add some more logging to be able to observe at least the failures, without logging every periodic fetch.
The fourth commit makes sure we clear the cache inside the zedcloud.authen.go code when we update the certs.
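The idea of invalidating a derived cache whenever the certs change can be sketched as follows. The `certStore` type and its methods are hypothetical and do not reflect the real data structures in the zedcloud code; the point is only that the cert update and the cache reset happen together under one lock.

```go
package main

import "sync"

// certStore is a hypothetical holder for the controller certs plus a
// cache of results derived from them (keyed here by a cert hash string).
type certStore struct {
	mu       sync.Mutex
	certs    []byte
	verified map[string]bool
}

func newCertStore() *certStore {
	return &certStore{verified: make(map[string]bool)}
}

// remember records that the cert with the given hash has been verified.
func (s *certStore) remember(hash string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.verified[hash] = true
}

// isCached reports whether a verification result is cached for hash.
func (s *certStore) isCached(hash string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.verified[hash]
}

// update replaces the certs and drops every cached result, so nothing
// derived from the old certs can be served after the update.
func (s *certStore) update(certs []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.certs = certs
	s.verified = make(map[string]bool)
}
```

Doing both steps inside `update` means no caller can refresh the certs and forget the cache.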
The fifth commit is unrelated, but when debugging and testing the above it looked odd that a device was marked as booting when it was running fine using the checkpointed config.