fix watching with a specified resource version #109

juliantaylor · 2018-12-11T19:44:34Z

The watch code reset the resource version to the last one found in the
response.
When you first list existing objects and then start watching from that
resource version, the existing versions are older than the version you
asked for, and the watch starts from the wrong version after the first
restart.
This leads, for example, to already deleted objects ending up in the stream
again.

Fix this by not resetting to an older version than the one specified in
the watch.
It does not handle overflows of the resource version, but they are 64-bit
integers, so they should not realistically overflow even in the most loaded
clusters.

Closes kubernetes-client/python#700
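
To make the failure mode concrete, here is a minimal sketch in plain Python that does not use the Kubernetes client at all. `fake_watch` and `buggy_stream` are hypothetical stand-ins for the API server and for a retry loop that resumes from the last version it saw; the integer comparison lives only in the toy server and is an assumption made to keep the example small.

```python
def fake_watch(events, resource_version):
    """Hypothetical server: return only events newer than resource_version."""
    return [e for e in events if int(e["rv"]) > int(resource_version)]


def buggy_stream(events, start_rv, connections):
    """Reconnect `connections` times, resuming from the last version seen."""
    last_seen = "0"        # initial value before any event arrives
    requested = start_rv   # the first connection uses the caller's version
    delivered = []
    for _ in range(connections):
        for event in fake_watch(events, requested):
            delivered.append(event["name"])
            last_seen = event["rv"]
        # The problem: if nothing arrived, this falls back to "0".
        requested = last_seen
    return delivered


events = [
    {"name": "a", "rv": "1"},
    {"name": "b", "rv": "2"},
    {"name": "c", "rv": "3"},
]
# The caller already listed everything up to version "3".
replayed = buggy_stream(events, start_rv="3", connections=2)
```

In this sketch the first connection yields nothing, so the second one asks for version "0" and the three old objects come back through the stream.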

max-rocket-internet · 2018-12-19T12:08:12Z

/assign @yliaog

max-rocket-internet · 2019-01-02T14:32:40Z

Any update @yliaog ? Thanks!

yliaog · 2019-01-02T18:28:55Z

watch/watch.py

@@ -83,13 +83,14 @@ def unmarshal_event(self, data, return_type):
            obj = SimpleNamespace(data=json.dumps(js['raw_object']))
            js['object'] = self._api_client.deserialize(obj, return_type)
            if hasattr(js['object'], 'metadata'):
-                self.resource_version = js['object'].metadata.resource_version
+                self.resource_version = int(


resource version is an opaque string, it cannot be assumed to be an int

yliaog · 2019-01-02T18:29:08Z

watch/watch.py

            # For custom objects that we don't have model defined, json
            # deserialization results in dictionary
            elif (isinstance(js['object'], dict) and 'metadata' in js['object']
                  and 'resourceVersion' in js['object']['metadata']):
-                self.resource_version = js['object']['metadata'][
-                    'resourceVersion']
+                self.resource_version = int(


yliaog · 2019-01-02T18:30:31Z

watch/watch.py

@@ -122,6 +123,7 @@ def stream(self, func, *args, **kwargs):
        return_type = self.get_return_type(func)
        kwargs['watch'] = True
        kwargs['_preload_content'] = False
+        min_resource_version = int(kwargs.get('resource_version', 0))


resource version is a string; the resource version string "0" is treated somewhat specially on the server side, but please don't assume "0" is the minimum resource version

yliaog · 2019-01-02T18:33:41Z

watch/watch.py

+                # continue to watch from the requested resource version
+                # does not handle overflow though that should take a few
+                # hundred years
+                kwargs['resource_version'] = max(


similarly, no max or min can be done on resource version. Instead, there is only special "0" versus non-special other opaque strings. so what can be done is to test for equal to "0" or not

do you know where I can find the exact definition of the resource version?
There has to be some kind of ordering possible; what else is the resource version for?

kubernetes-client/python#693 (comment)

has some useful references

do you have a suggestion how to fix this issue then?
my best idea is to ditch the while loop that caused the issue in the first place if a resource version is passed in by the caller

the while loop is still needed. I think you could check at the beginning whether kwargs['resource_version'] is present; if it is, start watching with kwargs['resource_version'] instead of starting with "0" (the value self.resource_version is set to in __init__)

hm, assuming we receive all events in order from the first watch, it would work: you'd restart at a lower resource version than the one passed in, but if you have received them all you should still be fine. It makes the tests a bit trickier to implement, but I'll post an update soon
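
The approach discussed above can be sketched roughly as follows. This is not the code from watch/watch.py: `MiniWatch`, its `max_connections` parameter, and `fake_list` are hypothetical names used only for illustration. The point is that the version is copied and passed along as an opaque string, with no `int()` and no `max()`.

```python
class MiniWatch:
    """Hypothetical, stripped-down stand-in for the watcher."""

    def __init__(self):
        self.resource_version = "0"

    def stream(self, func, max_connections, **kwargs):
        # Seed from the caller's version so a reconnect never falls
        # back to "0"; the value is only copied, never compared.
        if "resource_version" in kwargs:
            self.resource_version = kwargs["resource_version"]
        for _ in range(max_connections):
            for event in func(**kwargs):
                self.resource_version = event["metadata"]["resourceVersion"]
                yield event
            # Resume from the most recent version we know about.
            kwargs["resource_version"] = self.resource_version


seen = []
responses = [[], [{"metadata": {"resourceVersion": "abc-7"}}], []]


def fake_list(**kwargs):
    """Record the requested version and return a canned response."""
    seen.append(kwargs["resource_version"])
    return responses[len(seen) - 1]


events = list(MiniWatch().stream(fake_list, 3, resource_version="abc-5"))
```

The deliberately non-numeric versions ("abc-5", "abc-7") show that nothing in the loop depends on ordering; the empty first response leads to a second request for "abc-5" rather than "0".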

max-rocket-internet · 2019-01-15T10:48:17Z

Any progress @juliantaylor ?

juliantaylor · 2019-01-19T12:13:52Z

finally figured out a way to write a test, and updated the change to treat the rv as opaque. Please have a look again.

max-rocket-internet · 2019-01-22T11:00:16Z

@juliantaylor you need to lint the files, as some of them are failing the style checks:
https://api.travis-ci.org/v3/job/481731414/log.txt

juliantaylor · 2019-01-22T18:10:11Z

should be fixed

yliaog · 2019-01-22T18:11:10Z

watch/watch_test.py

@@ -62,6 +64,69 @@ def test_watch_with_decode(self):
        fake_resp.close.assert_called_once()
        fake_resp.release_conn.assert_called_once()

+    def test_watch_resource_version_set(self):
+        # gh-700 ensure watching from a resource version does reset to resource


what is gh-700 ?

github issue 700, the convention I'm used to. How do you reference issues in this project?

better to use a direct link to the issue

yliaog · 2019-01-22T18:14:43Z

watch/watch_test.py


 from .watch import Watch

+callcount = 0


better to avoid using global

I don't know a way to avoid it, any ideas?

could you move it to class WatchTests?

right, just putting it in the class works ... I had only tested a function-local variable ...
updated
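
One way to keep such a counter on the test case rather than in a module-level global is sketched below. The class and test names are hypothetical, and storing the counter in `setUp` is an assumption about how "putting it in the class" might look, not a copy of the final test.

```python
import unittest
from unittest.mock import Mock


class CounterInTestCase(unittest.TestCase):
    def setUp(self):
        # Per-test counter kept on the test case instead of a module global.
        self.callcount = 0

    def test_first_read_is_empty(self):
        values = ['{"type": "ADDED"}\n']

        def get_values(*args, **kwargs):
            # Assigning through self needs neither `global` nor `nonlocal`.
            self.callcount += 1
            if self.callcount == 1:
                return []  # simulate a connection that yields nothing
            return values

        fake_resp = Mock()
        fake_resp.read_chunked = Mock(side_effect=get_values)

        self.assertEqual(fake_resp.read_chunked(), [])
        self.assertEqual(fake_resp.read_chunked(), values)
        self.assertEqual(self.callcount, 2)
```

A plain function-local counter does not work here because the nested function would have to rebind it, which is why the attribute on `self` is convenient.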

yliaog · 2019-01-22T18:15:37Z

watch/watch_test.py

+        fake_resp.close = Mock()
+        fake_resp.release_conn = Mock()
+        values = [
+            '{"type": "ADDED", "object": {"metadata": {"name": "test1",'


could you use a real k8s object, like a pod? that could make the test easier to understand

one could though it does not matter for the test
none of the existing tests use real objects, should all be updated?

fine to leave it as is

yliaog · 2019-01-22T18:19:12Z

watch/watch_test.py

+        fake_api.get_namespaces.__doc__ = ':return: V1NamespaceList'
+
+        w = Watch()
+        count = 1


what does count do? the number of the for loop iterations? usually it starts with 0

yes, it's the loop count, again a copy-paste from other tests; using enumerate would be a bit nicer
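
For illustration, the `enumerate` variant mentioned above might look like the sketch below. `fake_stream` is a hypothetical generator standing in for the watch stream; `start=1` preserves the "count starts at 1" behaviour without a manually incremented variable.

```python
def fake_stream(values, repeats):
    """Hypothetical stand-in for a watch stream that replays `values`."""
    for _ in range(repeats):
        for value in values:
            yield value


values = ["event-1", "event-2", "event-3"]
seen = []
for count, event in enumerate(fake_stream(values, repeats=5), start=1):
    seen.append(event)
    if count == len(values) * 3:
        break  # the test in this PR calls w.stop() at this point
```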

yliaog · 2019-01-22T18:27:43Z

watch/watch_test.py

+            if count == len(values) * 3:
+                w.stop()
+
+        fake_api.get_namespaces.assert_has_calls(calls)


how do you simulate the case the 'connection got reset'?

via the global variable changing the return value of the mock read_chunked

        if 'resource_version' in kwargs:
            self.resource_version = kwargs['resource_version']

===================================================
In the above test, the code in the function stream() from its beginning to "while True" is executed only once, i.e. the two lines of fix code you added (quoted above) are executed only once.
When those two lines are executed, kwargs is None, so essentially the two lines you added have no effect. I'm wondering how the test code tests the fix. Am I missing anything?

the test code does run with kwargs['resource_version'] set, which sets the rv member of the watch object.
The function then calls the get_namespaces mock. Against a real cluster this call would ask k8s for a watch from the provided resource version; in our mock the version is just ignored and nothing is returned, simulating the k8s API not returning anything and resetting the connection.
Now, instead of resetting the resource version member variable back to zero and issuing a watch from zero in the next iteration, it reuses the input resource version, which is what we want.
The mock returns something in the second call, but what it returns doesn't really matter; we just want to register that the second mocked call was made with the input resource version and not with zero, which is what causes the linked issues.

here is the failure when you remove the two lines I added; it matches what happens in the real test case I posted in the linked issue:

E AssertionError: Calls not found.
E Expected: [call(_preload_content=False, resource_version='5', watch=True),
E  call(_preload_content=False, resource_version='3', watch=True),
E  call(_preload_content=False, resource_version='3', watch=True)]
E Actual: [call(_preload_content=False, resource_version='5', watch=True),
E  call(_preload_content=False, resource_version=0, watch=True), <<<<<<<<< ERROR HERE
E  call(_preload_content=False, resource_version='3', watch=True),
E  call(_preload_content=False, resource_version='3', watch=True)]

yliaog · 2019-01-22T18:28:21Z

watch/watch_test.py

+                calls.append(call(_preload_content=False, watch=True,
+                                  resource_version=rv))
+            # returned
+            if count == len(values) * 3:


so you expect the loop above iterates 3*3 times?

yes, the mock needs a way to identify when it is done, this and other tests in the file use the loop iteration count

yes, one could reduce it to 4; the number of iterations doesn't matter as long as we iterate at least twice in the watch.
Going a bit beyond what is needed is currently unnecessary, but may make the test more robust to future changes.

correct, assert_has_calls does not care about calls before and after the wanted ones, I'll update the test to be more strict

done, changed it a bit to be hopefully more clear what result we expect
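
The difference between the lenient and the strict check can be shown with `unittest.mock` alone. This is a small sketch with a hypothetical `fake_api` mock, not the test from this PR: `assert_has_calls` accepts extra calls before and after the expected sequence, while comparing `call_args_list` directly does not.

```python
from unittest.mock import Mock, call

fake_api = Mock()
for rv in ["5", "3", "3", "3"]:
    fake_api(watch=True, resource_version=rv)

expected = [
    call(watch=True, resource_version="5"),
    call(watch=True, resource_version="3"),
]

# Lenient: passes because `expected` appears as a consecutive run,
# even though two more calls follow it.
fake_api.assert_has_calls(expected)

# Strict: the full call list must match exactly, so this is False here.
is_exact = fake_api.call_args_list == expected
```

Comparing the complete `call_args_list` is one way to make a test fail on any unexpected extra call, such as a stray request with `resource_version=0`.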

The watch code reset the version to the last found in the response. When you first list existing objects and then start watching from that resource version, the existing versions are older than the version you wanted and the watch starts from the wrong version after the first restart. This leads, for example, to already deleted objects ending up in the stream again.

Fix this by setting the minimum resource version to reset from to the input resource version. As long as k8s returns all objects in order in the watch this should work. We cannot use the integer value of the resource version to order it, as one should treat the value as opaque.

Closes kubernetes-client/python#700

codecov-io · 2019-01-23T18:49:51Z

Codecov Report

Merging #109 into master will increase coverage by 0.2%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master     #109     +/-   ##
=========================================
+ Coverage   91.74%   91.94%   +0.2%     
=========================================
  Files          13       13             
  Lines        1187     1217     +30     
=========================================
+ Hits         1089     1119     +30     
  Misses         98       98

Impacted Files	Coverage Δ
watch/watch_test.py	`98.51% <100%> (+0.38%)`	⬆️
watch/watch.py	`100% <100%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8497dfb...3c30a30.

yliaog · 2019-01-23T19:52:11Z

/lgtm

yliaog · 2019-01-23T21:08:33Z

/approve

k8s-ci-robot · 2019-01-23T21:08:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliantaylor, yliaog

Needs approval from an approver in each of these files:

~~OWNERS~~ [yliaog]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

max-rocket-internet · 2019-03-14T13:52:09Z

This PR won't stop the 410 error, right? It will just stop the out of order processing?

juliantaylor · 2019-03-14T18:05:06Z

the 410 "resource version too old" error is not handled by this.
It still needs to be handled by the application, e.g. by restarting the watch at the last known resource version (which this PR does fix).

k8s-ci-robot requested review from mbohlool and roycaihw December 11, 2018 19:44

k8s-ci-robot added the cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and size/M (denotes a PR that changes 30-99 lines, ignoring generated files) labels Dec 11, 2018

juliantaylor mentioned this pull request Dec 11, 2018

CRD watch stream starts processing very old deleted resources kubernetes-client/python#693

Closed

juliantaylor force-pushed the fix-watch-reset branch 3 times, most recently from 5cab3de to 5eb6633 Compare December 12, 2018 17:51

k8s-ci-robot assigned yliaog Dec 19, 2018

yliaog reviewed Jan 2, 2019

max-rocket-internet mentioned this pull request Jan 15, 2019

"resourceVersion now: 13748114" and processing old versions of CRD resources max-rocket-internet/newrelic-controller#4

Closed

juliantaylor force-pushed the fix-watch-reset branch from c40c9cc to 3219b84 Compare January 19, 2019 12:12

juliantaylor force-pushed the fix-watch-reset branch from 3219b84 to 1e559d9 Compare January 19, 2019 12:17

juliantaylor force-pushed the fix-watch-reset branch from 1e559d9 to 9d521a1 Compare January 22, 2019 17:38

yliaog reviewed Jan 22, 2019

juliantaylor force-pushed the fix-watch-reset branch 3 times, most recently from 97174be to 82c710c Compare January 23, 2019 18:37

juliantaylor force-pushed the fix-watch-reset branch from 82c710c to 3c30a30 Compare January 23, 2019 18:38

k8s-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Jan 23, 2019

k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jan 23, 2019

k8s-ci-robot merged commit 2d69e89 into kubernetes-client:master Jan 23, 2019
