Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug]: Re-triggering reconcile does not work when status is updated #554

Open
buehler opened this issue Apr 17, 2023 · 15 comments
Open

[bug]: Re-triggering reconcile does not work when status is updated #554

buehler opened this issue Apr 17, 2023 · 15 comments
Labels
bug Something isn't working

Comments

@buehler
Copy link
Owner

buehler commented Apr 17, 2023

Describe the bug

As discussed in #551, the timed reconcile return value is ignored when a status update is performed during the reconciliation loop.

To reproduce

  1. Create a controller
  2. Update the status of the instance
  3. Return with timed reconcile event
  4. No timed reconcile is called

Expected behavior

No response

Screenshots

No response

Additional Context

Events = _localEvents
.Merge(watcherEvents)
.Where(EventRetryCountIsLessThanMax)
.GroupBy(e => e.Resource.Uid())
.Select(
group => group
.Select(ProcessDelay)
.Switch())
.Merge()
.Select(UpdateResourceData)
.Merge()
.Where(EventTypeIsNotFinalizerModified);

@buehler buehler added the bug Something isn't working label Apr 17, 2023
@Karthik2893
Copy link

Following up on our discussion to get more context on this:

  1. What are the likely cases where duplicate events might be queued?
  2. As per my understanding, the "Reconciliation" loop should always check the state of the cluster and ensure if it is in the desired state or not. Ideally, controller should be handling duplicate events efficiently (resulting in a no-op for ex) and they should not be a problem. Correct me if I am wrong.

My understanding is that replacing "Switch" with "Merge" should fix this issue. I want to be sure that this won't cause any unintended side-effects.

Events = _localEvents
.Merge(watcherEvents)
.Where(EventRetryCountIsLessThanMax)
.GroupBy(e => e.Resource.Uid())
.Select(
group => group
.Select(ProcessDelay)
.Switch())
.Merge()
.Select(UpdateResourceData)
.Merge()
.Where(EventTypeIsNotFinalizerModified);

Thank you.

@prochnowc
Copy link

prochnowc commented Aug 10, 2023

@buehler I think we're also being hit by this bug - I couldnt debug our operator yet.

The issue is that after some reconcile loop's most of the resources are never reconciled again, even if the resource was updated in kubernetes (not the status). Strangely this does not happen for all resources.

After restarting the operator all resources are reconciled again for some time and stops working for most resources again.

I will open a new issue as it seems unrelated.

@tomitesh
Copy link

tomitesh commented Nov 20, 2023

@buehler I think we're also being hit by this bug - I couldnt debug our operator yet.

The issue is that after some reconcile loop's most of the resources are never reconciled again, even if the resource was updated in kubernetes (not the status). Strangely this does not happen for all resources.

After restarting the operator all resources are reconciled again for some time and stops working for most resources again.

I will open a new issue as it seems unrelated.

@buehler

We encountered the same issue last week. We perform reconciliation for custom resources every minute, during which we update the status of each custom resource. There are seven resources undergoing reconciliation every minute. However, at random intervals, such as after 2 days or 5 days, we've noticed that the reconciliation process stops for some of these seven resources.

We did not observed any error/exception in logs.

We are using eks 1.24 and kubeops 7.6.1 library.

Is there any fix/workaround available?

@Karthik2893
Copy link

@tomitesh Just wanted to double check that you are not returning "Null" for these objects (for which reconciliation process is stopping) as part of the reconcile loop.
I think the issue that I opened has been fixed in the update made by @prochnowc above.

@tomitesh
Copy link

tomitesh commented Nov 21, 2023

@tomitesh Just wanted to double check that you are not returning "Null" for these objects (for which reconciliation process is stopping) as part of the reconcile loop. I think the issue that I opened has been fixed in the update made by @prochnowc above.

Thanks @Karthik2893 for your reply.

We aim to reconcile every minute, opting for option 1 as outlined in the readme to achieve periodic reconciliation. However, after a random interval of 2-3-5 days, reconciliation ceases for certain resources.

While utilizing option 2, reconciliation occurs only once during startup (unless i am missing any configuration).

Note : we also update status as part of "// reconcile logic"

option 1

    public async Task<ResourceControllerResult?> ReconcileAsync(V1DemoEntity entity)
    {
        _logger.LogInformation($"entity {entity.Name()} called {nameof(ReconcileAsync)}.");
        await _finalizerManager.RegisterFinalizerAsync<DemoFinalizer>(entity);

      // reconcile logic

        return ResourceControllerResult.RequeueEvent(TimeSpan.FromSeconds(60));
    }

Option 2

    public async Task<ResourceControllerResult?> ReconcileAsync(V1DemoEntity entity)
    {
        _logger.LogInformation($"entity {entity.Name()} called {nameof(ReconcileAsync)}.");
        await _finalizerManager.RegisterFinalizerAsync<DemoFinalizer>(entity);

      // reconcile logic

        return null;
    }

@Karthik2893
Copy link

@tomitesh Do you happen to know what you WatcherHttpTimeout is set to? Setting it to a higher value (>120min or so) resulted in failing to reconcile after a certain period of time. But if that is the case, it should fail to reconcile for all the entities and not a fraction of entities. For your case, I am suspecting one of your codepaths might be returning "null" for the entities that are failing to reconcile? If not, then we will have to ask others to look into it and it will be helpful if we have code snippets :)

@emouawad
Copy link

I can confirm, experiencing the same issue on (8.0.0-pre.29, 8.0.0-pre.34 - didn't try other versions) - during reconcile i update the status and publish events (not sure if related) then requeue.
The reconcile is stopped for all my entity instances (only have 2 for now).
If i delete those 2 and reapply the reconcile/requeue is restarted for some time and then just stops in less than an hour.
Plz let me know if i can help with more logs or testing.
Not sure for now - will test again but i am thinking the issue is in the watcher?

@buehler
Copy link
Owner Author

buehler commented Nov 30, 2023

Hmm. Good question.
It is such a weird behaviour. Do you experience this issue when you test locally (i.e. with docker Kubernetes or minikube or whatever) as well?

Events should not impact the watcher, since they are completely different objects in Kubernetes. However, status updates could impact the requeue cache.

But when you update the status and then requeue the resource, it should retrigger the reconcile.

I'll conduct a test of my own :)

@sicavz
Copy link

sicavz commented Dec 19, 2023

[Checked on v8.0.0-pre.38]
Actually, the flow is pretty simple, and the behaviour is related to timing.

If in the ReconcileAsync one would simply

  • update the entity status and then
  • requeue it,

due to the asynchronous nature of the system, even if the entity status is changed first, the "Modified" event for it might occur AFTER the requeue.
If the spec of the entity is not changed (and this is the case), something like
"Entity "X" modification did not modify generation. Skip event."
will be logged and the requeue will never happen on a scheduled basis (see the _queue.RemoveIfQueued(entity); in the ResourceWatcher::OnEvent - https://github.com/buehler/dotnet-operator-sdk/blob/9bc1efaa94355103e33dc0ceded1a0b666b69629/src/KubeOps.Operator/Watcher/ResourceWatcher%7BTEntity%7D.cs#L168C8-L168C39

You can check the attached screenshot (

  • the first two flow have an "expected" sequence: save status, get Modified event, requeue
  • the last flow exposes the issue (the order is different): save status, requeue, get Modified event
    )

screenshot

@sicavz
Copy link

sicavz commented Dec 19, 2023

It's even more obvious that the requeue is skipped if you look at the log below (after the KubernetesClient reconnects):

[14:38:51.216 - DBG - WebhookOperator.Controller.UscSystemEntityController - requeueing...
[14:38:51.216 - VRB - KubeOps.Abstractions.Queue.EntityRequeue - Requeue entity "X" in 5000ms.
[14:38:51.216 - DBG - WebhookOperator.Controller.XEntityController - requeued

[14:38:52.883 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - The watcher was closed

[14:38:52.883 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - Create watcher for entity of type "WebhookOperator.Entities.XEntity".
[14:38:52.888 - VRB - KubeOps.Operator.Watcher.ResourceWatcher - Received watch event "Added" for "X/alpha".
[14:38:52.888 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - Entity "X/alpha" modification did not modify generation. Skip event.

NOTHING HAPPENS ANYMORE HERE EVEN IF THE THIRD LINE REQUEUED THE ENTITY!

@buehler
Copy link
Owner Author

buehler commented Jan 18, 2024

This should be overhauled in v8.

@buehler buehler closed this as completed Jan 18, 2024
@sicavz
Copy link

sicavz commented Jan 30, 2024

Unfortunately, my previous two comments were for v8.0.0-pre.38.

@buehler
Copy link
Owner Author

buehler commented Jan 30, 2024

Hey @sicavz, so it does not work? when using the framework as documented, I could not reproduce the error. You need to use the returned entities from the client to have the updated resource versions and stuff. And status update does not update the resource version which should be fine.

@sicavz
Copy link

sicavz commented Jan 31, 2024

I've sketched the flow as I understand it (from the ResourceWatcher)

kubeops drawio (1)

As you see, the issue is dependent on timing, so it's a bit harder to reproduce... (Please review the screenshot I've posted with my first comment)

@buehler buehler reopened this Jan 31, 2024
@nullexceptiondev
Copy link

nullexceptiondev commented Feb 1, 2024

I'm struggling with the same issue in v8. When running in VS locally, everything runs perfectly fine. When deployed to kubernetes I get maybe 2 or 3 reconcile calls before it stops

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants