Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix multiListWatch resourceVersion mismatch if watch reconnected #1377

Conversation

KielChan
Copy link
Contributor

@KielChan KielChan commented Feb 8, 2021

Signed-off-by: KielChan qingya.chen520@gmail.com

What this PR does / why we need it: bug fix

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes # none

I have a problem that report:

Failed to watch *v1.Pod: expected resource version to have 2 parts to match the number of ListerWatchers

which I wanna collect two namespaces pod metrics with kube-state-metrics. The origin code will split resource version combined by List method, but if it is expired or timeout, the client-go will use the latest event.Object resource version to re-watch resources which will lead to the problem.

The same problem occurred here: link. It seemed like using same code.

Signed-off-by: KielChan <qingya.chen520@gmail.com>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 8, 2021
@k8s-ci-robot
Copy link
Contributor

Welcome @KielChan!

It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 8, 2021
@lilic
Copy link
Member

lilic commented Feb 8, 2021

The same problem occurred here: link. It seemed like using same code.

Yes same engineers worked on both things. :)

Copy link
Member

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Yes I had this in the back of my head for some time as I worked on the prom-operator bug fix as well, but never actually ran into this in kube-state-metrics in either our customer or my personal clusters.

Do you mind just pasting the logs and version this error occurred in, so we know where to aim the patch towards?

@KielChan
Copy link
Contributor Author

KielChan commented Feb 9, 2021

Logs found in container std console, but now containers are deleted. :(

I found some messages like Failed to watch *v1.Pod: failed to list *v1.Pod: resourceVersion: Invalid value: "22119557269/22119557695": strconv.ParseUint: parsing "22119557269/22119557695": invalid syntax after first rough bug fix, and just set mlw watch resource version as List method provided. Note that two namespaces pod list meta resource version are not same.

The codes are in master or main branch. I will try to reproduce it today.

@KielChan
Copy link
Contributor Author

KielChan commented Feb 9, 2021

I have reproduced it now:

➜  standard git:(master) ✗ kubectl logs kube-state-metrics-69bb88f949-fq5m8 -n observability -f
time="2021-02-09T03:06:36Z" level=warning msg="ForkExec: /etc/after_start.sh, Error: no such file or directory"
I0209 03:06:36.940032       1 main.go:108] Using resources pods
I0209 03:06:36.940075       1 main.go:123] Using [arkbaseserverless paascoretestfive] namespaces
I0209 03:06:36.940108       1 main.go:138] metric allow-denylisting: Excluding the following lists that were on denylist:
I0209 03:06:36.951104       1 main.go:239] Testing communication with server
I0209 03:06:38.963175       1 main.go:244] Running with Kubernetes cluster version: v1.16+. git version: v1.16.3-553. git tree state: clean. commit: 6e4579bc8bb0c038bb8809c38568f46212b88020. platform: linux/amd64
I0209 03:06:38.963200       1 main.go:246] Communication with server successful
I0209 03:06:38.963458       1 main.go:202] Starting metrics server: [::]:8080
I0209 03:06:38.963492       1 metrics_handler.go:96] Autosharding disabled
I0209 03:06:38.963502       1 main.go:191] Starting kube-state-metrics self metrics server: [::]:8081
I0209 03:06:38.963825       1 builder.go:164] Active resources: pods
I0209 03:06:38.985824       1 main.go:63] levelinfomsgTLS is disabled.http2false
I0209 03:06:38.985931       1 main.go:63] levelinfomsgTLS is disabled.http2false


E0209 03:15:25.722644       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.3/tools/cache/reflector.go:156: Failed to watch *v1.Pod: expected resource version to have 2 parts to match the number of ListerWatchers
E0209 03:20:32.435794       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.3/tools/cache/reflector.go:156: Failed to watch *v1.Pod: expected resource version to have 2 parts to match the number of ListerWatchers

@KielChan
Copy link
Contributor Author

@lilic hi,any progresses ?

@lilic
Copy link
Member

lilic commented Mar 3, 2021

Hey, will take the time to review this very soon, sorry for the delay!

@lilic lilic self-assigned this Mar 3, 2021
@lilic
Copy link
Member

lilic commented Mar 17, 2021

Asked @s-urbaniak to have a review as he fixed it in Prometheus operator.

@s-urbaniak
Copy link

Thank you @KielChan for the contribution and the fix! As you correctly identified the existing code relies on the invariant that a Watch always happens before a List. The reality is that a Watch can happen independently, invoked by the inner workings of client-go. Hence, your fix looks good to me 👍

Having said that even with the proposed fix multilistwatcher relies on internal behavior of client-go which can again change any-time. After having talked to various maintainers of api-machiner tweaking the ResourceVersion field is a no-go and an anti-pattern as it is to be used internally (by client-go) only and not externally.

In the long-term I highly suggest to migrate to native Informers per namespace, see https://github.com/prometheus-operator/prometheus-operator/tree/master/pkg/informers.

Generally i suggest to create a help-wanted issue to factor out multilistwatcher in favor of native informers.

@lilic
Copy link
Member

lilic commented Mar 17, 2021

Thanks @s-urbaniak I proposed this already in #1413.

/lgtm

Thank you for the fix!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 17, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: KielChan, lilic

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2021
@k8s-ci-robot k8s-ci-robot merged commit 030b089 into kubernetes:master Mar 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants