Only update last reconcile status when there are resource changes #1281
While rolling out Terranetes, I noticed that on some of our clusters (the ones with a larger number of Configuration resources) terranetes was continuously reconciling all resources even when there were no changes. The `drift` controller was causing this, and CPU usage for the terranetes-controller pod increased significantly. This can be seen on the graph of the workqueue add rate.

Upon further investigation, I found that the drift controller was continuously updating the timestamp at `Configuration.Status.LastReconcile.Time`. The issue only manifested on some clusters because it depends on how long the drift reconciler takes to run.

The `Configuration.Status.LastReconcile.Time` field is serialized with seconds resolution (e.g. "2024-03-04T14:00:19Z"), so if the reconciler takes less than 1 second to run, the serialized value will be the same and the Configuration resource will remain unchanged. This is the scenario I saw on some smaller clusters where this issue wasn't present.

However, if reconciling takes longer than 1 second, the controller updates the resource, the informer notices the change and enqueues a new reconciliation, which can again take more than 1 second, and the process repeats ad infinitum. This could happen because there are too many Configuration resources or because of other constraints on the controller Pod.
Now to the proposed fix. I can see two different paths for a solution here. The first would be adding an event filter to the drift controller to ignore changes to the lastReconcile timestamps. The problem with this is that it would only apply to the Terranetes controllers, and every external controller or operator watching Terranetes resources would also have to apply that same filter to avoid being spammed with reconciliations.
The other option is the one I'm submitting in this PR. I propose that the LastReconcile timestamps be updated only if there are other changes to the resource, since changing only the timestamp isn't helpful to watchers. This is similar to the behavior of timestamps on Conditions, which only track the latest change. I also added some basic unit tests to verify this new behavior.