
Dashboard sidecar stops working after 10 minutes #18

Closed
PeterGerrard opened this issue Sep 14, 2020 · 23 comments · Fixed by #2057

@PeterGerrard

Describe the bug
kiwigrid/k8s-sidecar#85
After 10 minutes the dashboard sidecar stops receiving watch notifications

Originally posted on old repo: helm/charts#23565

Version of Helm and Kubernetes:
Helm version N/A
Kubernetes 1.16.10

Which chart:
stable/grafana

What happened:
After 10 minutes the dashboard sidecar stops receiving watch notifications

What you expected to happen:
The sidecar to always pick up new dashboards

How to reproduce it (as minimally and precisely as possible):
See kiwigrid/k8s-sidecar#85

@qaiserali

+1

I have the exact same issue: after a couple of minutes the sidecar stops receiving notifications on ConfigMap changes. It also stops detecting and importing new dashboards provided via ConfigMaps.

All dashboard changes and new dashboards only appear after restarting the Grafana pod or the sidecar container.

unguiculus added a commit to unguiculus/grafana-helm-charts that referenced this issue Dec 9, 2020
Signed-off-by: Reinhard Nägele <unguiculus@gmail.com>
unguiculus added a commit to unguiculus/grafana-helm-charts that referenced this issue Dec 10, 2020
Signed-off-by: Reinhard Nägele <unguiculus@gmail.com>

dry4ng commented Dec 23, 2020

Same issue.


bergerx commented Jan 14, 2021

using https://github.com/OmegaVVeapon/kopf-k8s-sidecar instead of kiwigrid/k8s-sidecar may address the issue, here is a comment from @djsly indicating success with that one: kiwigrid/k8s-sidecar#85 (comment)

@qaiserali

Any updates on this?


bergerx commented Feb 8, 2021

I tried using OmegaVVeapon/kopf-k8s-sidecar as a replacement for kiwigrid/k8s-sidecar but just hit this issue: OmegaVVeapon/kopf-k8s-sidecar#14


bergerx commented Feb 9, 2021

Since OmegaVVeapon/kopf-k8s-sidecar#14 turned out to be a dead end for using it with this chart, I came up with this very ugly workaround. I'm not proud of it, but it's what I could put together in two hours. We still need to craft a PR to address the underlying issue in kiwigrid/k8s-sidecar#85.

Here is the values file snippet I used:

rbac:
  # The extra rules below are required to run the `kubectl exec ...`
  # command in the force-grafana-dashboard-reload container below.
  extraRoleRules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get"] 
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
# This is a workaround for an upstream bug;
# it forces Grafana dashboards to be reloaded every 15 minutes.
# It's 15 minutes because the grafana-sc-dashboard pod would regularly end up
# in CrashLoopBackOff if we didn't let it run for at least 10 minutes;
# the CrashLoopBackOff timer is reset after 10 minutes without any problems.
# See these issues for details:
# https://github.com/kiwigrid/k8s-sidecar/issues/85
# https://github.com/grafana/helm-charts/issues/18
extraContainers: |
  - name: force-grafana-dashboard-reload
    image: bitnami/kubectl:1.19
    env:    
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    command: ["bash", "-c"]
    args:   
    - while sleep 900; do 
        echo Forcing grafana sidecar to restart ...;
        kubectl exec -n $POD_NAMESPACE $POD_NAME -c grafana-sc-dashboard -- kill 1; 
      done    
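(For context on why this works: `kill 1` terminates PID 1 of the grafana-sc-dashboard container, the kubelet then restarts just that container per the pod's restart policy, and on startup the sidecar re-lists all matching ConfigMaps, which forces the dashboards to be written out fresh. At least, that's my understanding of the mechanism.)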

And the status of the grafana pod:
[screenshot: grafana pod status showing the force-grafana-dashboard-reload container running alongside the others]


OmegaVVeapon commented Feb 11, 2021

The original issue on k8s-sidecar has been open for 176 days at the time of writing so I think an update on the situation is warranted.

Having the process constantly hang is what prompted me to do an entire rewrite in kopf-k8s-sidecar, and it's what made @bergerx spawn a container with exec RBAC permissions to periodically kill the original sidecar. I think it's fair to say we're reaching a point of desperation here. 😄

I've been giving this some thought and these are the possible solutions I see for this problem atm, from best to worst:

1. Fix the original kiwigrid/k8s-sidecar

I'm not hopeful this is going to happen...
After going through all the comments on the original issue my feeling is that, if it were that easy, it wouldn't have taken nearly half a year (and counting) to do. I went through the codebase several times before deciding to rewrite it...

2. Add the functionality to kopf-k8s-sidecar to run once only (issue)

I've inquired about how to accomplish such a behaviour in this question and it looks like it won't be trivial... I could give it a try regardless.

3. Avoid initContainers!

This is the solution I'm currently running, since it can be adopted with minimal tweaking.

After inspecting this Helm chart, I discovered that it uses the k8s-sidecar as an initContainer ONLY IF either of these two values is true:
a) .Values.sidecar.datasources.enabled
b) .Values.sidecar.notifiers.enabled

If one of them is, the sidecar is started as an initContainer to load up the files and die. There are no dynamic updates for Grafana datasource or notifier files: whatever got loaded at the start is what you keep for the remainder of that pod's life.

So, given that this is a one-off operation, I decided to simply disable .Values.sidecar.datasources.enabled and use this chart's ability to set the datasources.
[screenshot: sidecar.datasources disabled, datasources set via the chart's values]

and use kopf-k8s-sidecar for the image
[screenshot: sidecar image set to kopf-k8s-sidecar]
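For reference, the values changes in the two screenshots above amount to something like this (a sketch only: the kopf-k8s-sidecar image reference is my assumption about how the image is published, and the exact value keys can differ between chart versions):

sidecar:
  # Assumed image reference; check the OmegaVVeapon/kopf-k8s-sidecar repo
  # for the published name and a pinned tag.
  image: omegavveapon/kopf-k8s-sidecar:latest
  dashboards:
    enabled: true    # keep the long-running dashboard watcher
  datasources:
    enabled: false   # skip the one-shot datasource initContainer entirely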

I currently have a fully-functional Grafana from which I'm able to add/modify/delete dashboards without issues via configmaps/secrets.

The only downside I see here is if for whatever reason you can't provide your datasources/notifiers via the values.yaml and you HAVE to read them from configmaps/secrets... which brings me to solution 4.

4. Allow users to use different sidecar images depending on the use case

So, if you've followed along so far, the tl;dr is that the original sidecar isn't very good at staying alive, and my sidecar isn't very good at dying. So why not let each sidecar do what it does best?

Rather than using the same .Values.sidecar.image for the datasources, notifiers and dashboard containers, maybe it'd be beneficial to give the user more flexibility to override each image if desired.
For example, we could set .Values.sidecar.datasources.image to use the kiwigrid sidecar and .Values.sidecar.dashboards.image to use my sidecar, and everything would run as intended.

I could submit a small PR for this if deemed useful.
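To make it concrete, the values could end up looking something like this (the per-container image keys are hypothetical; they don't exist in the chart today, which is exactly what the PR would add):

sidecar:
  datasources:
    enabled: true
    # Hypothetical per-container override: the kiwigrid sidecar is fine
    # for the run-once initContainer case.
    image: kiwigrid/k8s-sidecar:1.10.0   # illustrative tag
  dashboards:
    enabled: true
    # Hypothetical per-container override: the kopf rewrite for the
    # long-running watch case.
    image: omegavveapon/kopf-k8s-sidecar:latest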


Anyways, those are my current thoughts on the state of things. Feedback is welcome.


bergerx commented Feb 11, 2021

Solution 3 seems reasonable to me, but I don't get this part:

The only downside I see here is if for whatever reason you can't provide your datasources/notifiers via the values.yaml and you HAVE to read them from configmaps/secrets... which brings me to solution 4.

This exact design has already been a problem for us: in some clusters we want to be able to define datasources on the go as we create new ConfigMaps/Secrets. The current design prevents that, and it also somewhat conflicts with the idea of declaring datasources as ConfigMaps.


OmegaVVeapon commented Feb 11, 2021

This exact design has already been a problem for us: in some clusters we want to be able to define datasources on the go as we create new ConfigMaps/Secrets. The current design prevents that, and it also somewhat conflicts with the idea of declaring datasources as ConfigMaps.

Yeah... I found this to be odd as well.
It looks like Grafana is perfectly capable of loading/modifying datasources on the fly according to this documentation,
so I'm not sure why the decision was made to inject them only via an initContainer.

Maybe one of the Grafana maintainers can elaborate a bit more on that.

That being said, I think I know how we can add new datasources on-the-fly but I'll have to test it tomorrow.


OmegaVVeapon commented Feb 12, 2021

@bergerx I have good and bad news.

So I tested loading up datasources dynamically (as part of the normal long-running sidecar, not the init one) and it worked... sorta.

t=2021-02-12T22:18:37+0000 lvl=info msg="inserting datasource from configuration " logger=provisioning.datasources name=Prometheus uid=
t=2021-02-12T22:18:37+0000 lvl=info msg="inserting datasource from configuration " logger=provisioning.datasources name=Something uid=
t=2021-02-12T22:18:37+0000 lvl=info msg="HTTP Server Listen" logger=http.server address=[::]:3000 protocol=http subUrl= socket=

The Prometheus and Something datasources came from ConfigMaps that were injected by the normal sidecar, and they worked as expected.

However... once you reach the "HTTP Server Listen" part, Grafana no longer cares what YAMLs the sidecar throws into /etc/grafana/provisioning/datasources/... you can delete, modify, or add YAMLs and they won't be picked up.

So it's a no-go unless the Grafana devs add the same reloading logic for datasources as they have for dashboards, I'm afraid...

This also sucks because, if Grafana picked up datasources at any point in time, we could ABSOLUTELY get rid of both sidecar initContainers (who cares whether the files get loaded before or after Grafana starts?). That would vastly decrease the complexity of the chart and the Deployment: one container instead of potentially three.

In my case, the datasource loading worked even without initContainers, but I can't guarantee my sidecar (or kiwigrid's) will always beat Grafana to initializing; you'd be rolling the dice.

So the only real choices remaining are to:
a) Use the chart's .Values.datasources and .Values.notifiers (see the sketch below)
b) Keep using initContainers

Either way, you won't get "on the go" updates.
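For option (a), a minimal sketch of what static provisioning through the chart looks like, using Grafana's standard datasource provisioning format (the Prometheus entry and URL are illustrative):

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
    - name: Prometheus                # illustrative entry
      type: prometheus
      access: proxy
      url: http://prometheus-server   # assumed in-cluster service name
      isDefault: true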

@zanhsieh (Collaborator)

FYI, sidecar upgrade merged.

#211

@qaiserali

@zanhsieh Has the issue with the sidecar been fixed by the just-merged #211?

@zanhsieh (Collaborator)

@qaiserali
Could you give it a try yourself, please?


djsly commented Feb 16, 2021

I looked at all the diffs between 1.1.0 and 1.10.0, and the only related fix would be

https://github.com/kiwigrid/k8s-sidecar/releases/tag/1.3.0

While this does not fix the kubernetes watch issues in #98 it
at least removes the unnecessary requests.

Doesn't look too promising.

@OmegaVVeapon

It looks like no one is jumping at the chance to test 1.10.0, so I'll share what I'm seeing. 😅
[screenshot: sidecar logs showing both test dashboards being picked up]

As you can see, it picked up my initial testing-dashboard-2.json; a few seconds later I created another ConfigMap for testing-dashboard.json, and that got picked up too. Great.

Then I left it alone while I worked on other stuff for a couple of hours or so...

I came back and deleted the ConfigMap that held testing-dashboard.json. No change.
Re-added the ConfigMap: still no change...
I did it a few more times while exec'ed into the container (maybe it was still working but no longer logging?), and the dashboard file wasn't being touched at all.
It had hung again.

So I went back to my sidecar image...

@qaiserali

Are there any updates on this issue?

@zanhsieh (Collaborator)

@OmegaVVeapon
Would you PR your sidecar image so we can go from there?

@OmegaVVeapon

@OmegaVVeapon
Would you PR your sidecar image so we can go from there?

Not sure, will need to sleep on it.

My image is basically a rewrite of the k8s-sidecar so the PR would be quite massive.

I'd need confirmation from the original maintainers of the k8s-sidecar that this is something they would even be interested in accepting before trying to change their entire codebase.

Perhaps a better starting point would be for people struggling with this issue to switch to my image and see if it resolves it for them?

I documented the very simple instructions for using it with this Helm chart here.

If it helps enough people, then PR'ing it back to k8s-sidecar becomes a much easier sell.

Otherwise, I'll just leave it as an alternative.


djsly commented May 19, 2021

@OmegaVVeapon I think he was asking you to PR it into the helm-charts repo. Maybe we could add a toggle to switch between the original sidecar and yours.

@OmegaVVeapon

Ah, I see.

If that's the case, I think I'll still leave it as an opt-in via the values.yaml for now.

There are a few more things I want to add before PR'ing it as a default.

@gowrisankar22 (Contributor)

The issue is reproducible with the 1.14.2 sidecar as well.

[2021-11-10 03:47:48] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))


prein commented Feb 7, 2022

Reading the comments on the two issues linked above leaves me unsure about the current status of this. Is it fixed in kiwigrid v1.15.4? Is there a PR coming for the chart to allow setting .Values.sidecar.datasources.image to one sidecar and .Values.sidecar.dashboards.image to another?

Arabus (Contributor) commented Dec 8, 2022

The issue seems to persist with the 1.19.2 sidecar. Upgrading to 1.20 made it disappear.

Arabus added a commit to syseleven/helm-charts that referenced this issue Dec 8, 2022
Fixes grafana#18

Signed-off-by: Bernd May <b.may@syseleven.de>