kubernetes watch API is behaving oddly #1370

Closed
sameer2800 opened this issue Nov 9, 2020 · 16 comments · Fixed by #1498
Comments

@sameer2800

sameer2800 commented Nov 9, 2020

I am running a Kubernetes watch on a list of ConfigMaps. The watch runs in a background thread, continuously looking for watch events and updating my cache whenever an object is added or modified. What I am observing is that sometimes the watch events stop arriving altogether.

Watch<V1ConfigMap> watch = Watch.createWatch(
        apiClient,
        api.listNamespacedConfigMapCall(serviceDiscoveryConfig.getNamespace(), null, null, null,
                null, null, null, null, null, true, null),
        new TypeToken<Watch.Response<V1ConfigMap>>() {}.getType());

while (true) {
    TimeUnit.SECONDS.sleep(2);
    for (Watch.Response<V1ConfigMap> configMap : watch) {
        if (configMap.object != null) {
            log.info("Received configmap watch event. updating the configmap: {} , metadata: {}", configMap.object.getMetadata().getName(), configMap.object.getMetadata().toString());
            namespaceConfigsCache.insertOrUpdateValue(configMap.object.getMetadata().getName(), configMap.object.getData());
        }
    }
}

This entire piece of code runs in a background thread. The moment the code misses an event, I don't see any more watch events after that point. Is there any chance that the watch is being stopped? If so, how do I check its status?

@brendandburns
Contributor

You should not expect a watch to run forever; you need to list/watch in a loop.

The code is actually semi-thorny to get right, you are probably better off using the Informer class in this library that handles much of this logic for you.
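As a rough illustration of what "list/watch in a loop" means, here is a minimal sketch using only the JDK. `ConfigSource`, `Snapshot`, `Event`, and `ListWatchLoop` are hypothetical stand-ins invented for this example and are not part of the Kubernetes Java client. The `maxRounds` bound exists only so the demo terminates; a real loop would run until shut down and would normally back off between rounds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical change event: an object name and its new value. */
final class Event {
    final String name;
    final String value;
    Event(String name, String value) { this.name = name; this.value = value; }
}

/** Hypothetical result of a full list: current items plus the version they reflect. */
final class Snapshot {
    final Map<String, String> items;
    final String resourceVersion;
    Snapshot(Map<String, String> items, String resourceVersion) {
        this.items = items;
        this.resourceVersion = resourceVersion;
    }
}

/** Hypothetical client abstraction; a watch iterator simply ends when the stream closes. */
interface ConfigSource {
    Snapshot list();
    Iterator<Event> watch(String resourceVersion);
}

final class ListWatchLoop {
    static void run(ConfigSource source, Map<String, String> cache, int maxRounds) {
        for (int round = 0; round < maxRounds; round++) {
            // 1. List: resynchronize the cache with the full current state.
            Snapshot snapshot = source.list();
            cache.keySet().retainAll(snapshot.items.keySet());
            cache.putAll(snapshot.items);
            // 2. Watch: apply incremental events until the stream ends.
            Iterator<Event> events = source.watch(snapshot.resourceVersion);
            while (events.hasNext()) {
                Event event = events.next();
                cache.put(event.name, event.value);
            }
            // 3. The stream ended (timeout, dropped connection, ...): go around again.
        }
    }

    /** In-memory demo: the first watch ends after one event; "c" appears while no watch is open. */
    static Map<String, String> demo() {
        final int[] round = {0};
        ConfigSource fake = new ConfigSource() {
            @Override public Snapshot list() {
                round[0]++;
                Map<String, String> items = new HashMap<>();
                items.put("a", "1");
                if (round[0] > 1) {
                    items.put("b", "2");
                    items.put("c", "3");
                }
                return new Snapshot(items, String.valueOf(round[0]));
            }
            @Override public Iterator<Event> watch(String resourceVersion) {
                List<Event> events = new ArrayList<>();
                if (round[0] == 1) {
                    events.add(new Event("b", "2"));
                }
                return events.iterator();
            }
        };
        Map<String, String> cache = new ConcurrentHashMap<>();
        run(fake, cache, 2);
        return cache;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the re-list step is that anything changed while no watch was open (here, key `c`) is still picked up on the next round, which a single long-lived watch cannot guarantee.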

@sameer2800
Author

Thanks @brendandburns. I will try out Informer and let you know.

@sameer2800
Author

sameer2800 commented Nov 10, 2020

@brendandburns I am seeing similar behavior even with the Informer class. Let me know if I have configured something wrong here. It received events for the first few minutes and then suddenly stopped receiving change events.

// configmaps informer
SharedIndexInformer<V1ConfigMap> configInformer =
        factory.sharedIndexInformerFor(
                (CallGeneratorParams params) -> {
                    return api.listNamespacedConfigMapCall(
                            serviceDiscoveryConfig.getNamespace(),
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            params.resourceVersion,
                            params.timeoutSeconds,
                            params.watch,
                            null);
                },
                V1ConfigMap.class,
                V1ConfigMapList.class);

configInformer.addEventHandler(
        new ResourceEventHandler<V1ConfigMap>() {
            @Override
            public void onAdd(V1ConfigMap configMap) {
                log.info("Received configmap watch add event. updating the configmap: {} , metadata: {}", configMap.getMetadata().getName(), configMap.getMetadata().toString());
            }

            @Override
            public void onUpdate(V1ConfigMap oldConfigMap, V1ConfigMap newConfigMap) {
                log.info("Received configmap watch update event. updating the configmap: {} , metadata: {}", newConfigMap.getMetadata().getName(), newConfigMap.getMetadata().toString());
            }

            @Override
            public void onDelete(V1ConfigMap configMap, boolean deletedFinalStateUnknown) {
            }
        });

factory.startAllRegisteredInformers();

@brendandburns
Contributor

Is it possible that your thread is throwing an exception? If you throw an uncaught exception inside the thread, the thread will terminate.

I would try:

public void run() {
  try {
    // your code here
  } catch (Throwable e) {
    e.printStackTrace();
  }
}

And see if any exceptions occur. Your code for using the informer looks correct.
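When the thread's `run()` method is not yours to wrap, the JDK's `Thread.setUncaughtExceptionHandler` is another way to find out whether a thread died from an exception. The sketch below is a minimal, stdlib-only illustration; the class name `UncaughtDemo`, the method `runAndCapture`, and the thread name are made up for this example, and whether a handler can be attached to a given library's internal threads is not something this sketch assumes.

```java
import java.util.concurrent.atomic.AtomicReference;

final class UncaughtDemo {
    /**
     * Runs the task on its own thread and returns the exception that killed
     * the thread, or null if the task finished normally.
     */
    static Throwable runAndCapture(Runnable task) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(task, "watch-worker");
        // Invoked on the dying thread just before it terminates.
        worker.setUncaughtExceptionHandler((thread, error) -> failure.set(error));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return failure.get();
    }

    public static void main(String[] args) {
        Throwable error = runAndCapture(() -> {
            throw new IllegalStateException("handler failed");
        });
        System.out.println("worker died with: " + error);
    }
}
```

Without a handler or a `try`/`catch`, the thread simply ends, and work that was supposed to keep running on it silently stops.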

@yue9944882
Member

Is it possible that your thread is throwing an exception? If you throw an uncaught exception inside the thread, the thread will terminate.

I think so.

@sameer2800
Author

@brendandburns @yue9944882 I am not running this in a separate thread, because factory.startAllRegisteredInformers(); starts the informers in a background thread. I am running this in the main method itself. Which part do you want me to put in a try/catch block? Initializing the informer and adding an event handler is a one-time task, and I don't see errors there. startAllRegisteredInformers itself runs in the background.

@sameer2800
Author

sameer2800 commented Nov 12, 2020

@brendandburns I have not changed any timeout variables. I see the default for listNamespacedConfigMapCall is set to 5 minutes.

listerWatcher.watch(
                  new CallGeneratorParams(
                      Boolean.TRUE,
                      lastSyncResourceVersion,
                      Long.valueOf(Duration.ofMinutes(5).toMillis()).intValue())); 

Actually, in my case, the kube API server becomes unavailable once in a while. Do you think increasing the timeout will work?
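As a side note on units, the quoted expression can be evaluated on its own with just the JDK. It yields a millisecond count; whether the receiving argument is interpreted as seconds is an assumption here, suggested only by the field name `params.timeoutSeconds` in the informer snippet earlier in this thread. The class name `TimeoutUnits` is made up for the example.

```java
import java.time.Duration;

final class TimeoutUnits {
    /** The value produced by the expression quoted above. */
    static int asQuoted() {
        return Long.valueOf(Duration.ofMinutes(5).toMillis()).intValue();
    }

    /** Five minutes expressed in seconds, for comparison. */
    static int fiveMinutesInSeconds() {
        return (int) Duration.ofMinutes(5).getSeconds();
    }

    public static void main(String[] args) {
        System.out.println(asQuoted());              // 300000
        System.out.println(fiveMinutesInSeconds());  // 300
        // If 300000 were read as seconds, it would be this many whole hours:
        System.out.println(Duration.ofSeconds(asQuoted()).toHours()); // 83
    }
}
```

If that assumption holds, the effective watch timeout would be far longer than 5 minutes, so it is worth checking how the value is consumed before tuning it.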

@yue9944882
Member

Your code looks good. The informer will retry connecting to the kube-apiserver every 1 second if the server becomes unavailable, and the watch connection will be re-established once the server is up.

OkHttpClient httpClient =
apiClient.getHttpClient().newBuilder().readTimeout(0, TimeUnit.SECONDS).build();

Did you set the read timeout to infinite, as the example above shows?

@sameer2800
Author

sameer2800 commented Nov 12, 2020

@yue9944882 yes.

apiClient = ClientBuilder.cluster().build();
// infinite timeout
OkHttpClient httpClient =
        apiClient.getHttpClient().newBuilder().readTimeout(0, TimeUnit.SECONDS).build();
apiClient.setHttpClient(httpClient);
this.api = new CoreV1Api();

I tried changing the timeouts too, but it didn't help. In my last run, I could see it working for hours, and then it stopped receiving events. I started it with debug mode on, and I see neither exceptions nor errors.

@brendandburns
Contributor

I'm actually not sure that you want an infinite timeout. In a flaky network, is it possible that something is not sending a TCP reset when a network connection is severed? I've seen situations where a TCP reset isn't sent and the system holds a TCP connection open, but there's no traffic flowing.

I would actually set a non-infinite timeout (5 minutes?) and see if that fixes things.
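To illustrate why a finite read timeout helps with a connection that stays open but carries no traffic, here is a small stdlib-only sketch. It uses plain `java.net` sockets rather than OkHttp or the Kubernetes client, and the class name `ReadTimeoutDemo` is invented for the example. A local server accepts the connection and then sends nothing, which stands in for a peer that went away without a TCP reset.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ReadTimeoutDemo {
    /**
     * Opens a loopback connection whose server side never writes, then reads
     * with the given timeout. Returns true if the read timed out.
     */
    static boolean readTimesOut(int timeoutMillis) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket silentPeer = server.accept()) {
            // A timeout of 0 would mean "block forever" on a silent connection.
            client.setSoTimeout(timeoutMillis);
            try {
                client.getInputStream().read();
                return false; // data or end-of-stream arrived
            } catch (SocketTimeoutException e) {
                return true;  // no traffic within the timeout: the caller can reconnect
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

With a finite timeout the reader gets an exception it can react to by re-establishing the connection, instead of waiting indefinitely on a connection that will never deliver anything.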

@sameer2800
Author

@brendandburns Thanks for the suggestion. I will try it and let you know.

@tony-clarke-amdocs
Contributor

@sameer2800 were you able to resolve this?

We are starting to see the same symptoms on AKS (Azure Kubernetes Service, K8S 1.18.8) for a CR instance. After about 5 minutes the informer stops seeing any changes (new/update/delete). We were running the 9.0.1 release. We updated to 10.0.1, but it made no difference.

@brendandburns you suggested running with a read timeout that is not zero, but the 10.0.0 release was updated to disallow any read timeout other than zero. See this commit. Any other suggestions to try?

@brendandburns
Contributor

cc @yue9944882

See some related discussion here:
kubernetes/kubernetes#65012

@tony-clarke-amdocs for AKS specifically see the discussion here:
Azure/AKS#1755

I think we should:
a) re-enable non-zero timeouts
b) make sure we're sending TCP Keep-Alive

eventually:

c) switch from Web Sockets to HTTP/2 and add health checks.

@tony-clarke-amdocs
Contributor

tony-clarke-amdocs commented Nov 25, 2020

@brendandburns @yue9944882 I noticed that the watch call sets the timeout to 5 minutes. See here. Given that we stop seeing watch events after 5 minutes, I tend to think this is not a coincidence.
Any idea how we can make sure TCP keep-alive is sent? Looking at the code, I don't think we are using WebSockets today.

@tony-clarke-amdocs
Contributor

@brendandburns @yue9944882 I think I have figured this out. The standard client doesn't include the HTTP/2 protocol.

ApiClient apiClient = ClientBuilder.standard().build();

We need to add the following to enable HTTP/2 and set a ping interval.

apiClient.setHttpClient(apiClient
        .getHttpClient()
        .newBuilder()
        .protocols(Arrays.asList(Protocol.HTTP_2, Protocol.HTTP_1_1))
        .readTimeout(Duration.ZERO)
        .pingInterval(1, TimeUnit.MINUTES)
        .build());

With the above change, the watch no longer hangs and all is good.

Does it make sense for the standard build to include something like this by default? I think it should at least include the HTTP_2 protocol.

@brendandburns
Contributor

That change seems fine to me.
@yue9944882 wdyt?
