[bug]: Re-triggering reconcile does not work when status is updated #554

buehler · 2023-04-17T14:58:56Z

Describe the bug

As discussed in #551, the timed reconcile return value is ignored when a status update is performed during the reconciliation loop.

To reproduce

Create a controller
Update the status of the instance
Return with timed reconcile event
No timed reconcile is called

Expected behavior

No response

Screenshots

No response

Additional Context

dotnet-operator-sdk/src/KubeOps/Operator/Controller/EventQueue.cs

Lines 44 to 55 in 3f42bbc

    Events = _localEvents
        .Merge(watcherEvents)
        .Where(EventRetryCountIsLessThanMax)
        .GroupBy(e => e.Resource.Uid())
        .Select(
            group => group
                .Select(ProcessDelay)
                .Switch())
        .Merge()
        .Select(UpdateResourceData)
        .Merge()
        .Where(EventTypeIsNotFinalizerModified);

Karthik2893 · 2023-04-19T00:16:19Z

Following up on our discussion to get more context on this:

What are the likely cases where duplicate events might be queued?
As I understand it, the reconciliation loop should always check the state of the cluster and verify whether it is in the desired state. Ideally, the controller should handle duplicate events efficiently (resulting in a no-op, for example), so they should not be a problem. Correct me if I am wrong.

My understanding is that replacing "Switch" with "Merge" should fix this issue. I want to be sure that this won't cause any unintended side effects.

dotnet-operator-sdk/src/KubeOps/Operator/Controller/EventQueue.cs

Lines 44 to 55 in 3f42bbc

    Events = _localEvents
        .Merge(watcherEvents)
        .Where(EventRetryCountIsLessThanMax)
        .GroupBy(e => e.Resource.Uid())
        .Select(
            group => group
                .Select(ProcessDelay)
                .Switch())
        .Merge()
        .Select(UpdateResourceData)
        .Merge()
        .Where(EventTypeIsNotFinalizerModified);
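
To make the Switch vs. Merge question concrete, here is a minimal sketch using System.Reactive (it needs the System.Reactive NuGet package). It is not the KubeOps code: the subjects `first` and `second` are hypothetical stand-ins for the inner observables that ProcessDelay would produce for two events of the same resource, assuming ProcessDelay yields one inner observable per event as the `.Select(ProcessDelay).Switch()` shape suggests.

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;
using System.Reactive.Subjects;

// Outer stream: one inner observable per incoming event (hypothetical stand-in
// for what a per-resource group would carry after .Select(ProcessDelay)).
var outer = new Subject<IObservable<string>>();
var first = new Subject<string>();   // e.g. a delayed, timed requeue
var second = new Subject<string>();  // e.g. a later event for the same resource

var switched = new List<string>();
var merged = new List<string>();

// Switch only listens to the most recent inner observable.
using var switchSubscription = outer.Switch().Subscribe(x => switched.Add(x));
// Merge listens to every inner observable it has seen.
using var mergeSubscription = outer.Merge().Subscribe(x => merged.Add(x));

outer.OnNext(first);
outer.OnNext(second);                // Switch unsubscribes from 'first' here

first.OnNext("timed requeue");       // dropped by Switch, delivered by Merge
second.OnNext("later event");

Console.WriteLine(string.Join(", ", switched)); // prints "later event"
Console.WriteLine(string.Join(", ", merged));   // prints "timed requeue, later event"
```

If the pipeline behaves like this sketch, Merge would keep the pending requeue alive, at the cost of possibly delivering more than one event per resource, which is where the duplicate-event question above comes from.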

Thank you.

prochnowc · 2023-08-10T13:49:16Z

@buehler I think we're also being hit by this bug, although I couldn't debug our operator yet.

The issue is that after some reconcile loops most of the resources are never reconciled again, even if the resource was updated in Kubernetes (not the status). Strangely, this does not happen for all resources.

After restarting the operator, all resources are reconciled again for some time, and then reconciliation stops working for most resources again.

I will open a new issue as it seems unrelated.

tomitesh · 2023-11-20T13:06:12Z

@buehler

We encountered the same issue last week. We reconcile our custom resources every minute and update the status of each custom resource during reconciliation. Seven resources are reconciled every minute. However, at random intervals, such as after 2 days or 5 days, we've noticed that reconciliation stops for some of these seven resources.

We did not observe any errors or exceptions in the logs.

We are using EKS 1.24 and the KubeOps 7.6.1 library.

Is there any fix or workaround available?

Karthik2893 · 2023-11-20T17:37:25Z

@tomitesh Just wanted to double check that you are not returning "Null" for these objects (for which reconciliation process is stopping) as part of the reconcile loop.
I think the issue that I opened has been fixed in the update made by @prochnowc above.

tomitesh · 2023-11-21T08:25:55Z

Thanks @Karthik2893 for your reply.

We aim to reconcile every minute and chose option 1, as outlined in the README, to achieve periodic reconciliation. However, after a random interval of 2 to 5 days, reconciliation ceases for certain resources.

With option 2, reconciliation occurs only once, during startup (unless I am missing some configuration).

Note: we also update the status as part of "// reconcile logic".

Option 1

    public async Task<ResourceControllerResult?> ReconcileAsync(V1DemoEntity entity)
    {
        _logger.LogInformation($"entity {entity.Name()} called {nameof(ReconcileAsync)}.");
        await _finalizerManager.RegisterFinalizerAsync<DemoFinalizer>(entity);

        // reconcile logic

        return ResourceControllerResult.RequeueEvent(TimeSpan.FromSeconds(60));
    }

Option 2

    public async Task<ResourceControllerResult?> ReconcileAsync(V1DemoEntity entity)
    {
        _logger.LogInformation($"entity {entity.Name()} called {nameof(ReconcileAsync)}.");
        await _finalizerManager.RegisterFinalizerAsync<DemoFinalizer>(entity);

        // reconcile logic

        return null;
    }

Karthik2893 · 2023-11-21T15:33:16Z

@tomitesh Do you happen to know what your WatcherHttpTimeout is set to? In my case, setting it to a higher value (>120 min or so) resulted in reconciliation failing after a certain period of time. But if that were the cause, reconciliation should fail for all entities and not just a fraction of them. In your case, I suspect one of your code paths might be returning "null" for the entities that fail to reconcile. If not, then we will have to ask others to look into it, and it would be helpful to have code snippets :)

emouawad · 2023-11-25T14:30:40Z

I can confirm that I am experiencing the same issue (on 8.0.0-pre.29 and 8.0.0-pre.34; I didn't try other versions). During reconcile I update the status and publish events (not sure if that is related), then requeue.
The reconcile stops for all my entity instances (I only have 2 for now).
If I delete those 2 and reapply them, the reconcile/requeue restarts for some time and then just stops in less than an hour.
Please let me know if I can help with more logs or testing.
I'm not sure yet and will test again, but I am thinking the issue is in the watcher?

buehler · 2023-11-30T09:03:34Z

Hmm. Good question.
It is such a weird behaviour. Do you experience this issue when you test locally (i.e. with docker Kubernetes or minikube or whatever) as well?

Events should not impact the watcher, since they are completely different objects in Kubernetes. However, status updates could impact the requeue cache.

But when you update the status and then requeue the resource, it should retrigger the reconcile.

I'll conduct a test of my own :)

sicavz · 2023-12-19T11:39:03Z

[Checked on v8.0.0-pre.38]
Actually, the flow is pretty simple, and the behaviour is related to timing.

If in ReconcileAsync one simply

updates the entity status and then
requeues it,

then, due to the asynchronous nature of the system, the "Modified" event for the status change might arrive AFTER the requeue, even though the status was changed first.
If the spec of the entity is not changed (and this is the case here), something like
"Entity "X" modification did not modify generation. Skip event."
will be logged and the scheduled requeue will never happen (see the _queue.RemoveIfQueued(entity); call in ResourceWatcher::OnEvent: https://github.com/buehler/dotnet-operator-sdk/blob/9bc1efaa94355103e33dc0ceded1a0b666b69629/src/KubeOps.Operator/Watcher/ResourceWatcher%7BTEntity%7D.cs#L168C8-L168C39).

In the attached screenshot:

the first two flows have the "expected" sequence: save status, get Modified event, requeue
the last flow exposes the issue (the order is different): save status, requeue, get Modified event
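
The ordering problem described above can be modelled with a small, dependency-free sketch. This is only a model under stated assumptions, not the KubeOps implementation: `RequeueRaceSketch`, `RequeueSurvives`, and the `pending` set are hypothetical names, and the handler assumes that the watcher first removes any queued requeue (as the linked `_queue.RemoveIfQueued(entity);` line suggests) and then skips the event when the generation is unchanged.

```csharp
using System;
using System.Collections.Generic;

// "Expected" order: save status, get Modified event, requeue.
Console.WriteLine(RequeueRaceSketch.RequeueSurvives(modifiedEventArrivesFirst: true));  // prints "True"
// Problematic order: save status, requeue, get Modified event.
Console.WriteLine(RequeueRaceSketch.RequeueSurvives(modifiedEventArrivesFirst: false)); // prints "False"

static class RequeueRaceSketch
{
    // Returns true if a timed requeue is still pending after both steps ran.
    public static bool RequeueSurvives(bool modifiedEventArrivesFirst)
    {
        var pending = new HashSet<string>(); // hypothetical stand-in for the requeue queue
        const string uid = "alpha";
        long cachedGeneration = 1;  // generation seen by the last reconcile
        long currentGeneration = 1; // a status-only update leaves the generation unchanged

        // Assumed watcher behaviour: drop the queued requeue, then skip the
        // event because the generation did not change.
        void OnModified()
        {
            pending.Remove(uid);
            if (currentGeneration == cachedGeneration)
            {
                return; // "modification did not modify generation. Skip event."
            }
            pending.Add(uid); // a real reconcile would run and requeue again
        }

        // What the controller does when it returns a timed requeue.
        void Requeue() => pending.Add(uid);

        if (modifiedEventArrivesFirst)
        {
            OnModified();
            Requeue();
        }
        else
        {
            Requeue();
            OnModified();
        }

        return pending.Contains(uid);
    }
}
```

In this model the requeue is lost only when the Modified event for the status update arrives after the requeue was scheduled, which would explain why the behaviour depends on timing and is hard to reproduce.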

sicavz · 2023-12-19T13:24:04Z

It's even more obvious that the requeue is skipped if you look at the log below (after the KubernetesClient reconnects):

[14:38:51.216 - DBG - WebhookOperator.Controller.UscSystemEntityController - requeueing...
[14:38:51.216 - VRB - KubeOps.Abstractions.Queue.EntityRequeue - Requeue entity "X" in 5000ms.
[14:38:51.216 - DBG - WebhookOperator.Controller.XEntityController - requeued

[14:38:52.883 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - The watcher was closed

[14:38:52.883 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - Create watcher for entity of type "WebhookOperator.Entities.XEntity".
[14:38:52.888 - VRB - KubeOps.Operator.Watcher.ResourceWatcher - Received watch event "Added" for "X/alpha".
[14:38:52.888 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - Entity "X/alpha" modification did not modify generation. Skip event.

NOTHING HAPPENS ANYMORE HERE EVEN IF THE THIRD LINE REQUEUED THE ENTITY!

buehler · 2024-01-18T08:02:14Z

This should be overhauled in v8.

sicavz · 2024-01-30T06:27:47Z

Unfortunately, my previous two comments were for v8.0.0-pre.38.

buehler · 2024-01-30T08:03:55Z

Hey @sicavz, so it does not work? When using the framework as documented, I could not reproduce the error. You need to use the entities returned from the client so that you have the updated resource versions and so on. And a status update does not update the resource version, which should be fine.

sicavz · 2024-01-31T07:18:54Z

I've sketched the flow as I understand it (from the ResourceWatcher)

As you see, the issue is dependent on timing, so it's a bit harder to reproduce... (Please review the screenshot I've posted with my first comment)

nullexceptiondev · 2024-02-01T19:26:59Z

I'm struggling with the same issue in v8. When running locally in VS, everything runs perfectly fine. When deployed to Kubernetes, I get maybe 2 or 3 reconcile calls before it stops.

buehler added the bug Something isn't working label Apr 17, 2023

prochnowc mentioned this issue Aug 14, 2023

[bug]: Resource watcher does not reconnect #585

Closed

buehler closed this as completed Jan 18, 2024

buehler reopened this Jan 31, 2024
