The Pod WatchAll will fail in about half an hour #157

Closed
d4ilys opened this issue Jul 6, 2024 · 15 comments

d4ilys commented Jul 6, 2024

Hello, how does the client communicate with the api-server? Is there a retry mechanism?

d4ilys (Author) commented Jul 6, 2024

var eventStream = kubeApiClient.PodsV1()
    .WatchAll(kubeNamespace: kubeNamespace);
eventStream.Select(resourceEvent => resourceEvent.Resource).Subscribe(
    subsequentEvent =>
    {
        // do ....
    },
    error => LogError($"Listening to pod fail."),
    () =>
    {
        LogInfo("Listening to pod completed.");
    });

With the code above, OnCompleted is called after roughly 30 to 60 minutes.

tintoy (Owner) commented Jul 7, 2024

Does the OnError handler ever get called? Are there any events in that time?

tintoy self-assigned this Jul 7, 2024
tintoy added the "investigating" (The issue is being investigated) label Jul 7, 2024

tintoy (Owner) commented Jul 7, 2024

how does the client communicate with api-server?

The client mostly uses HTTP/HTTPS, but uses WebSockets for some actions (such as Exec).

Is there a retry mechanism?

Not currently, but this is a good idea; we should be able to hide it behind the IObservable implementation. What may be tricky is determining whether the socket was closed by the server or by the client; observable sequences are always meant to call OnCompleted.

Maybe we need a slightly more granular API surface here; something to optionally maintain the connection, reopening it if it is closed by the server?

It’s been a while, but I think Watch can also be done via WebSockets; maybe it’s worth looking into that.
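
A minimal sketch of how an opt-in reconnect could be layered on top of System.Reactive follows. It is an illustration under stated assumptions, not the library's implementation: `ConnectWatch` is a hypothetical stand-in for a call such as `kubeApiClient.PodsV1().WatchAll(...)`, and `Reconnecting` and the delay value are made-up names. `Observable.Defer` makes every re-subscription open a fresh connection, and `Repeat` re-subscribes whenever the inner sequence completes; errors still reach OnError.

```csharp
using System;
using System.Reactive.Linq;

// Hypothetical stand-in for a watch call: each subscription represents one
// connection that emits two events and then completes, the way a watch
// completes when the server closes the connection.
int connectionCount = 0;
IObservable<string> ConnectWatch() => Observable.Defer(() =>
{
    int connection = ++connectionCount;
    return new[] { $"conn{connection}:event1", $"conn{connection}:event2" }.ToObservable();
});

// Hypothetical helper: re-subscribe each time the inner sequence completes,
// pausing for `delay` between attempts. Errors are not swallowed.
IObservable<T> Reconnecting<T>(Func<IObservable<T>> connect, TimeSpan delay) =>
    Observable.Defer(connect)
        .Concat(Observable.Empty<T>().DelaySubscription(delay)) // pause before the next attempt
        .Repeat();

// Demo only: take five events (spanning three connections) and block until done.
var events = Reconnecting<string>(ConnectWatch, TimeSpan.FromMilliseconds(10))
    .Take(5)
    .ToList()
    .Wait();

Console.WriteLine(string.Join(", ", events));
// prints: conn1:event1, conn1:event2, conn2:event1, conn2:event2, conn3:event1
```

Because the caller opts in by wrapping the watch, a plain subscription keeps today's behaviour and still sees OnCompleted when the server closes the connection. With a real watch, a reconnect that does not resume from a known position may replay existing resources, so a consumer should be prepared to see repeated events.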

tintoy (Owner) commented Jul 7, 2024

What version of Kubernetes are you using? Do you get the same behaviour when watching pods in a single namespace?

tintoy (Owner) commented Jul 7, 2024

Just had a look at the implementation; we use the k8s list operation with watch=true. I suspect we may need to be a bit smarter about it, and keep track of the last event seen and resume with that event as the high water mark once the connection times out and is then re-established.
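
The high-water-mark idea can be sketched as below. This is an assumption-laden illustration rather than the client's actual code: `WatchFrom` is a hypothetical function that accepts the last resource version seen (it is not known here whether `WatchAll` accepts one), and `serverLog` is a fake server that drops the connection after two events. Kubernetes exposes a `resourceVersion` on each object, and the sketch treats it as an opaque string that is only ever compared for equality.

```csharp
using System;
using System.Linq;
using System.Reactive.Linq;

// Fake server state (illustrative): five events, each tagged with a resource version.
var serverLog = Enumerable.Range(1, 5)
    .Select(v => (ResourceVersion: v.ToString(), Pod: $"pod-{v}"))
    .ToArray();

// Hypothetical watch: returns events after `lastSeen` (null means "from the
// start"), but at most two per connection, then completes as if the server
// had closed the connection.
IObservable<(string ResourceVersion, string Pod)> WatchFrom(string? lastSeen)
{
    // FindIndex returns -1 when nothing matches, so start becomes 0.
    int start = Array.FindIndex(serverLog, e => e.ResourceVersion == lastSeen) + 1;
    return serverLog.Skip(start).Take(2).ToObservable();
}

// Remember the last resource version seen and resume from it on reconnect.
// The outer Defer gives each subscriber its own high-water mark.
IObservable<(string ResourceVersion, string Pod)> ResumableWatch() =>
    Observable.Defer(() =>
    {
        string? highWaterMark = null;
        return Observable.Defer(() => WatchFrom(highWaterMark))
            .Do(e => highWaterMark = e.ResourceVersion)
            .Repeat();
    });

// Demo only: Take(5) ends the otherwise endless Repeat once all events are seen.
var seen = ResumableWatch().Take(5).ToList().Wait();
Console.WriteLine(string.Join(", ", seen.Select(e => e.Pod)));
// prints: pod-1, pod-2, pod-3, pod-4, pod-5
```

Each event is delivered exactly once across the three simulated connections. A real implementation would also need to handle the case where the saved resource version is too old for the server to resume from (the API server answers 410 Gone), in which case the client has to list again and restart the watch.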

tintoy (Owner) commented Jul 7, 2024

Possibly related:

kubernetes-client/csharp#533 (comment)

d4ilys (Author) commented Jul 8, 2024

Kubernetes version

Kubernetes v1.23.6

Does the OnError handler ever get called? Are there any events in that time?

OnError was never called, so the connection appears to be closed normally.

If a connection timeout had caused it, an exception would have been caught.

After about 40-50 minutes, OnCompleted is triggered.

Temporary solution

void InternalWatch()
{
    var eventStream = kubeApiClient.PodsV1()
        .WatchAll(kubeNamespace: kubeNamespace);
    eventStream.Select(resourceEvent => resourceEvent.Resource).Subscribe(
        subsequentEvent =>
        {
            // do ...
        },
        error => LogError($"Listening to pod fail."),
        () =>
        {
            LogInfo("Listening to pod completed.");
            // Retrigger the watch
            InternalWatch();
        });
}

Do you plan to fix this problem as a next step? Thank you for your positive reply.

tintoy (Owner) commented Jul 8, 2024

Thanks, that confirms my initial impression. I’ll look into adding a flag argument to automatically reconnect (since it’s a change in behaviour I’ll be making it opt-in). 🙂

tintoy (Owner) commented Jul 10, 2024

Could you do me a favor and call .ToString() on the exception in the error callback (where you call LogError)? It would be helpful to see the exception type and stack trace.
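
As a small illustration of that request: interpolating the exception object calls `Exception.ToString()`, which includes the exception type, the message, and the stack trace if the exception was thrown. The sketch below uses `Observable.Throw` with a made-up `TimeoutException` in place of a failing watch; the message text and the `logged` variable are illustrative, not taken from the reporter's code.

```csharp
using System;
using System.Reactive.Linq;

// Stand-in for a watch that fails (illustrative exception and message).
var failingWatch = Observable.Throw<string>(new TimeoutException("connection timed out"));

string? logged = null;
failingWatch.Subscribe(
    _ => { },
    // {error} calls Exception.ToString(): type, message, and stack trace if any.
    error => logged = $"Listening to pods failed: {error}",
    () => { });

Console.WriteLine(logged);
// prints: Listening to pods failed: System.TimeoutException: connection timed out
```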

d4ilys (Author) commented Jul 11, 2024

@tintoy
This is the scenario I was debugging locally, and nothing unusual happened.

[Two screenshots attached]

tintoy added a commit that referenced this issue Jul 13, 2024

tintoy (Owner) commented Jul 13, 2024

This will require targeting netstandard2.0 instead of netstandard1.4 (more recent versions of System.Reactive require netstandard2.0), but I don't think that's likely to be a problem for consumers given how old netstandard1.4 is.

tintoy (Owner) commented Jul 13, 2024

If you have time, would you be able to try out the latest development version (2.5.9-develop.3) of the package?

Development package feed is at https://www.myget.org/F/dotnet-kube-client/api/v3/index.json

tintoy (Owner) commented Aug 7, 2024

I’ll be publishing these changes to NuGet in a day or two.

raman-m commented Aug 8, 2024

@d4ilys, as the user who reported the issue, could you please confirm whether the bug fix from #159 was implemented a month ago?

tintoy mentioned this issue Aug 11, 2024
tintoy linked a pull request Aug 11, 2024 that will close this issue
tintoy removed a link to a pull request Aug 11, 2024

tintoy (Owner) commented Aug 11, 2024

Fixed in v2.5.10.

tintoy closed this as completed Aug 11, 2024