
Fixes around exceptional cases in watchers #2671

Merged
merged 3 commits on Dec 17, 2020

Conversation

@bbeaudreault (Contributor) commented Dec 16, 2020

Description

These changes have been running for us internally for six months or so, but I never got a chance to upstream them. We'd like to bring our fork more in line with upstream, so here is a group of changes related to watcher reliability:

  • If the target of a watch is deleted, it will never satisfy the watch condition
    We hit this case when watching a resource and waiting for it to reach a particular condition: if the resource is deleted while you wait, the watcher simply runs on until the timeout. This change makes the watch fail much faster in a case where it can never succeed (see the first sketch after this list).

  • Path segment based filters have spotty support in newer kube api versions. Field selector seems to work everywhere
    It looks like this was added 3.5 years ago because field selectors weren't working. The patch to make CRDs support field selectors was then added a few months later. If you look at the API spec (even for the quite old v1.11), the read API URL we are using here does not support the watch query param. To watch an object, you need to use either the Watch API (with the /api/v1/watch path) or the List API, which does support the watch query param. We noticed that watches for specific resources were not working correctly for us; instead, they were spamming our API servers because they were actually short polling (essentially a GET that returns instantly). Note that I linked to the Pods API, but the same seems to be true for CustomResources (or, as the original PR called them, API Groups). You can test this out pretty easily (a Java sketch of the URL change also follows this list):

```sh
kubectl proxy
# then in another terminal:

# this returns instantly, with no watch:
curl http://127.0.0.1:8001/apis/zk.hubspot.com/v1/namespaces/zk/zookeeperclusters/zk-test-qa\?watch=1

# this properly watches only the resource in question:
curl http://127.0.0.1:8001/apis/zk.hubspot.com/v1/namespaces/zk/zookeeperclusters/\?watch=1\&fieldSelector=metadata.name=zk-test-qa

# this similarly works:
curl http://127.0.0.1:8001/apis/zk.hubspot.com/v1/watch/namespaces/zk/zookeeperclusters/\?watch=1\&fieldSelector=metadata.name=zk-test-qa

# this watches ALL resources of the type, as you'd expect:
curl http://127.0.0.1:8001/apis/zk.hubspot.com/v1/namespaces/zk/zookeeperclusters/\?watch=1
```

This is obviously a CRD we have internally; you can create your own CRD of any type to test it, and it works the same on standard APIs. I chose to simply remove the apiGroup handling here so that we always use fieldSelector filtering. Another change we could make is to move over to the /watch API, but that would be more complicated because we'd need a different URL than the one getNamespacedUrl() gives us. I could hook that up if you want, but it would mean somewhat wider-scope changes. I'd prefer to handle that separately/later if possible.

  • Handle unsuccessful watch statuses better -- we should either be retrying or passing an error event back to the user. Throwing an exception results in a silent failure that no user can handle, because it happens on another thread
    We noticed our watchers were failing silently with no real feedback. The client was throwing an exception from the HTTP callback, which does not execute on the main thread, so the exception went nowhere. I refactored the callback so that when !isSuccessful() we perform the same status handling as for a successful response, just passing in the unsuccessful Status object. This way the watcher receives an appropriate event and the caller is notified that the watch failed (see the status-handling sketch below).

  • Long polls return a response once data is available. We should only back off in exceptional cases; otherwise, reconnect immediately so we receive follow-up data in a timely manner
    We also saw some watch events failing to reach us promptly, because WatchHTTPManager always backed off after processing a response body. The way long polling works, when data is ready the connection immediately returns a response with the data in the body; at that point the long poll just needs to be reinitialized. That is not an error state, so there is no reason to delay. I updated the onResponse callback and scheduleReconnect to back off only on errors (see the reconnect sketch below).
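
To make these concrete, here are four minimal Java sketches, one per change. None of this is the PR's actual code; class and method names beyond the fabric8 and OkHttp types referenced are illustrative. First, failing fast on deletion: a condition watcher can treat a DELETED event as terminal instead of silently waiting out the timeout. This assumes the fabric8 Watcher interface (whose onClose parameter type differs across client versions).

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

import java.util.concurrent.CompletableFuture;
import java.util.function.Predicate;

// Illustrative only: waits for a condition, but fails the future as soon as
// the watched resource is deleted, since the condition can then never be met.
class ConditionWatcher implements Watcher<Pod> {
  private final Predicate<Pod> condition;
  final CompletableFuture<Pod> future = new CompletableFuture<>();

  ConditionWatcher(Predicate<Pod> condition) {
    this.condition = condition;
  }

  @Override
  public void eventReceived(Action action, Pod resource) {
    if (action == Action.DELETED) {
      // Fail fast instead of letting the caller block until the timeout.
      future.completeExceptionally(
          new IllegalStateException("watch target was deleted"));
    } else if (condition.test(resource)) {
      future.complete(resource);
    }
  }

  @Override
  public void onClose(WatcherException cause) {
    if (cause != null) {
      future.completeExceptionally(cause);
    }
  }
}
```

A caller would register this through a resource's watch(...) method and then block on the future with a timeout; deletion now surfaces immediately instead of after the full wait.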
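Second, the fieldSelector change. This contrasts the two URL shapes using OkHttp's HttpUrl.Builder (which the client's watch code builds URLs with); the two static methods are hypothetical names, not the client's API.

```java
import okhttp3.HttpUrl;

class WatchUrlSketch {
  // Before: appending the name as a path segment targets the read API, which
  // ignores ?watch and degrades the "watch" into an instant GET (short poll).
  static HttpUrl pathSegmentStyle(HttpUrl listUrl, String name) {
    return listUrl.newBuilder()
        .addPathSegment(name)
        .addQueryParameter("watch", "true")
        .build();
  }

  // After: stay on the List API, which honors ?watch, and narrow the stream
  // to a single resource with a fieldSelector on metadata.name.
  static HttpUrl fieldSelectorStyle(HttpUrl listUrl, String name) {
    return listUrl.newBuilder()
        .addQueryParameter("watch", "true")
        .addQueryParameter("fieldSelector", "metadata.name=" + name)
        .build();
  }
}
```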
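Third, the unsuccessful-status handling. The idea is to turn a failed HTTP response into a Status and push it through the same path as a Status received on a live stream, so the Watcher gets an ERROR event instead of an exception thrown on a thread nobody observes. onStatus(...) here is a hypothetical hook, not the client's real method.

```java
import io.fabric8.kubernetes.api.model.Status;
import io.fabric8.kubernetes.api.model.StatusBuilder;
import okhttp3.Response;

abstract class WatchResponseSketch {
  // Hypothetical hook standing in for the shared status handling.
  abstract void onStatus(Status status);

  void onResponse(Response response) {
    if (!response.isSuccessful()) {
      // Reuse the success-path status handling with an unsuccessful Status,
      // so the failure surfaces as a watcher event rather than a stray
      // exception on the HTTP callback thread.
      Status status = new StatusBuilder()
          .withCode(response.code())
          .withMessage(response.message())
          .build();
      onStatus(status);
      return;
    }
    // ... normal processing of the watch event stream ...
  }
}
```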
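Finally, the reconnect change. A long poll returning normally is the expected way to receive data, so the client should reconnect immediately and reserve exponential backoff for failures. The scheduling details below are a sketch, not WatchHTTPManager's actual code.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ReconnectSketch {
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();
  private int attempt = 0;

  void scheduleReconnect(Runnable connect, boolean wasError) {
    if (!wasError) {
      // A healthy long poll completed: reset the backoff and reconnect right
      // away so follow-up events are not artificially delayed.
      attempt = 0;
      executor.execute(connect);
      return;
    }
    // Exceptional case: back off exponentially, capped at 32 seconds.
    long delayMillis = Math.min(1000L << Math.min(attempt++, 5), 32_000L);
    executor.schedule(connect, delayMillis, TimeUnit.MILLISECONDS);
  }
}
```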

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Chore (non-breaking change which doesn't affect codebase;
    test, version modification, documentation, etc.)

Checklist

  • Code contributed by me aligns with current project license: Apache 2.0
  • I added a CHANGELOG entry regarding this change
  • I have implemented unit tests to cover my changes
  • I have added/updated the javadocs and other documentation accordingly
  • No new bugs, code smells, etc. in SonarCloud report
  • I tested my code in Kubernetes
  • I tested my code in OpenShift

@centos-ci

Can one of the admins verify this patch?

@manusa (Member) commented Dec 17, 2020

Timing is impeccable on this one. We identified #2668 yesterday.

I'm still unsure why these lines were added in the first place:

```java
if (baseOperation.isApiGroup()) {
  httpUrlBuilder.addPathSegment(name);
}
```

Do you think it has something to do with K8s initial CRD implementations/spec?

> Path segment based filters have spotty support in newer kube api versions. Field selector seems to work everywhere
> It looks like this was added 3.5 years ago because field selectors weren't working. The patch to make CRDs support field selectors was then added a few months later. […]

Sorry, missed this earlier.

It makes sense given the timeline.

> Path segment based filters have spotty support in newer kube api versions.

I can't really think of any resource API that allows this.

As Rohan stated in his comment, this was added a long time back for certain OpenShift resources (#821). However, I can't really find any documentation for these endpoints where watches can be applied to a specific resource and not the whole list.

@manusa (Member) commented Dec 17, 2020

Relates to #2616 #2614

@manusa (Member) left a comment

LGTM

@manusa (Member) commented Dec 17, 2020

I confirm this PR also fixed the issue reported in #2668

@bbeaudreault marked this pull request as ready for review on December 17, 2020, 14:45
@manusa (Member) commented Dec 17, 2020

[merge]
