streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1; INTERNAL_ERROR #676

Closed
juan-lee opened this issue Oct 3, 2018 · 64 comments

Comments

@juan-lee
Contributor

juan-lee commented Oct 3, 2018

Symptoms
Pods using the in-cluster config to perform a watch on a resource will see intermittent timeouts and the following error in the pod log.

streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1; INTERNAL_ERROR

If the client performing the watch isn't handling errors gracefully, applications can get into an inconsistent state. Impacted applications include, but are not limited to, nginx-ingress and tiller (helm).

A specific manifestation of this bug is the following error when attempting a helm deployment.

Error: watch closed before Until timeout

Root Cause

  1. The configuration of azureproxy results in unexpected timeouts for outbound watches targeting kubernetes.default.svc.cluster.local.
  2. On the client side, the client-go implementation of watch.Until does not handle intermittent network failures gracefully. See watch closed before Until timeout, Fix broken watches, and Fix waiting in kubectl rollout status for more details.

Workaround
For the pods/containers that see the INTERNAL_ERROR in their logs, add the following environment variables to the container spec. Be sure to replace <your-fqdn-prefix> and <region> so the AKS kube-apiserver FQDN is correct.

spec:
  containers:
  - name: <container-name>
    env:
    - name: KUBERNETES_PORT_443_TCP_ADDR
      value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
    - name: KUBERNETES_PORT
      value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
    - name: KUBERNETES_PORT_443_TCP
      value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
    - name: KUBERNETES_SERVICE_HOST
      value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
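
If you prefer the CLI to the portal, the FQDN can be looked up with the Azure CLI; a minimal sketch, with the cluster name and resource group as placeholders to substitute:

# Read the cluster's kube-apiserver FQDN from the Azure CLI.
az aks show -n <cluster-name> -g <resource-group> --query "fqdn" --output tsv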
@DenisBiondic
Copy link

Is this not the same root cause as the performance degradation issue, which is also still open? #620

@necevil

necevil commented Oct 4, 2018

I saw this tagged from the TLS Handshake error thread — #14
Can anyone confirm this as one of the issues leading to the handshake timeout?

Would the environment-variable workaround above also help with the timeout issue?

@juan-lee
Contributor Author

juan-lee commented Oct 4, 2018

I saw this tagged from the TLS Handshake error thread — #14
Can anyone confirm this as one of the issues leading to the handshake timeout?

Would the environment-variable workaround above also help with the timeout issue?

@necevil I don't think this is related to #14. This issue is about calling kube-apiserver from inside of the cluster vs. outside.

@digeler

digeler commented Oct 7, 2018

Any ETA for a permanent fix?

@juan-lee
Contributor Author

juan-lee commented Oct 7, 2018

Any ETA for a permanent fix?

I don't have an ETA yet, but the fix is actively being worked on. The permanent fix will be to apply the same mitigation automatically for customer pods. If you're running into this issue, I recommend applying the mitigation until the fix can be rolled out automatically.

@DenisBiondic

Wait, are you saying the workaround from above is the proper fix for this issue, and you are not working on a solution to reliably access the master plane without being routed over the public internet?

@juan-lee
Contributor Author

juan-lee commented Oct 8, 2018

Wait, are you saying the workaround from above is the proper fix for this issue, and you are not working on a solution to reliably access the master plane without being routed over the public internet?

That's correct @DenisBiondic, we will apply the same change automatically. Just to clarify, the traffic will be routed over the Azure backplane, similar to how it works now, except we will bypass the hop through azureproxy (kube-svc-redirect).

@DenisBiondic

@juan-lee Will tunnelfront be removed then?

@holmesb

holmesb commented Oct 9, 2018

This is affecting cert-manager too. Errors continue after deleting & recreating the pod.

@DenisBiondic

Yeah, it affects every container that has a connection to the master plane (various operators, external-dns, ingress controllers, cert-manager, etc.).

@mattfarina

mattfarina commented Oct 10, 2018

@holmesb if you're installing cert-manager via Helm you can add the following to your custom values.yaml (passed in with the -f flag):

extraEnv:
- name: KUBERNETES_PORT_443_TCP_ADDR
  value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
- name: KUBERNETES_PORT
  value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
- name: KUBERNETES_PORT_443_TCP
  value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
- name: KUBERNETES_SERVICE_HOST
  value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io

You can get your specific URL from the portal.
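
As a hedged example, assuming cert-manager was installed from the stable repo under the release name cert-manager (adjust the release name, chart source, and values file path to your setup), the extra values could be applied with:

# Hypothetical helm v2 invocation; release name, chart, and values path are assumptions.
helm upgrade cert-manager stable/cert-manager -f values.yaml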

Note: tiller uses client-go, so it inherits client-go's bugs, as was previously pointed out.

@ttelfer

ttelfer commented Oct 15, 2018

Also running into this problem using https://github.com/argoproj/argo-events. Currently using Azure AKS 1.11.3.

Does anyone know if there is a fix planned?

@dkipping

dkipping commented Oct 16, 2018

Workaround
For the pods/containers that see the INTERNAL_ERROR in their logs, add the following environment variables to the container spec. Be sure to replace <your-fqdn-prefix> and <region> so the AKS kube-apiserver FQDN is correct.

        env:
        - name: KUBERNETES_PORT_443_TCP_ADDR
          value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
        - name: KUBERNETES_PORT
          value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
        - name: KUBERNETES_PORT_443_TCP
          value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
        - name: KUBERNETES_SERVICE_HOST
          value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io

Is there a possibility to inject this into a tiller-deploy pod already running on my cluster, or does this have to be done at initialisation time?
Can I even pass these environment variables as parameters, or is this only possible by changing the yaml?

EDIT:

Never mind, I found out that you have to add these lines to the deployment with kubectl edit deployment <deployment-name> -n <namespace>. Afterwards the pods will get updated.
BTW: using the standard editor (vi) caused problems for updating in my terminal, so using KUBE_EDITOR="nano" kubectl edit deployment <deployment-name> -n <namespace> might be a good idea if you are having issues too.
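
An alternative sketch that avoids the interactive editor is kubectl set env; the deployment name, namespace, and FQDN below are placeholders to substitute:

# Inject the workaround variables into an existing deployment without editing YAML by hand.
kubectl set env deployment/<deployment-name> -n <namespace> \
  KUBERNETES_PORT_443_TCP_ADDR=<your-fqdn-prefix>.hcp.<region>.azmk8s.io \
  KUBERNETES_PORT=tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443 \
  KUBERNETES_PORT_443_TCP=tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443 \
  KUBERNETES_SERVICE_HOST=<your-fqdn-prefix>.hcp.<region>.azmk8s.io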

@evandervalk

evandervalk commented Oct 16, 2018

I'm running into the same issue.

I've tried the workaround described by @dkipping to no avail. Setting the environment variables on the tiller-deploy doesn't seem to work.

Edit:
I am also having trouble identifying which pod is having the issues. So far I've only seen it happen in the logs of kube-ingress-nginx-ingress-controller. Setting the environment variables on tiller-deploy doesn't feel right; I would expect these variables on kube-ingress-nginx-ingress-controller. Then again, setting these variables at all feels like a hack.
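
To help find which pods are still hitting the error, a rough sketch that greps recent logs across the cluster (assumes read access to all namespaces; for multi-container pods add -c <container> to the logs call):

# List pods whose recent logs contain the watch-stream error.
for pod in $(kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'); do
  ns=${pod%%/*}; name=${pod##*/}
  if kubectl logs -n "$ns" "$name" --tail=500 2>/dev/null | grep -q 'INTERNAL_ERROR'; then
    echo "$ns/$name"
  fi
done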

@dkipping

dkipping commented Oct 16, 2018

@evandervalk were you able to apply the values to the tiller-deploy deployment at least? That is, are they saved and still there when you print out or inspect the spec of tiller-deploy again?

If they got applied, could you check that they got applied to a (newly spawned) tiller-deploy-<id> pod too?

Edit:
Regarding your edit: according to @juan-lee, you should set these values not on tiller-deploy but on kube-ingress-nginx-ingress-controller if the error only occurs in that pod's logs.
But if the error occurs during an install via helm / tiller, you should carefully check tiller's logs. If it occurs there, setting these values on tiller-deploy might be needed too.

@evandervalk

Editing the deployment went as you described. I can edit the values in the tiller-deploy and it prints them out when describing the resource.

After editing the configuration it created a new pod with the earlier-mentioned environment variables defined. So that also seems to work. I have the variables defined in the tiller-deploy-<id> pod.

However, after editing the configuration and making sure everything propagated, I am still seeing:
streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1677; INTERNAL_ERROR multiple times on the kube-ingress-nginx-ingress-controller.

If I describe the pod for kube-ingress-nginx-ingress-controller I do not see these environment variables.

I would assume these environment variables should be defined on kube-ingress-nginx-ingress-controller rather than tiller-deploy?
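
One way to confirm where the variables actually landed is to inspect the deployment's pod template directly; names below are placeholders:

# Show the env entries in the deployment's pod template.
kubectl get deployment <deployment-name> -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[*].env}'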

@dkipping

I would assume these environment variables should be defined on kube-ingress-nginx-ingress-controller rather than tiller-deploy?

Yes, sounds like it.
I guess adding the environment variables there can't break things any more than not having it running at all.

@evandervalk

evandervalk commented Oct 16, 2018

I will try and report back.

Edit: I've removed the environment variables from tiller-deploy and added them to kube-ingress-nginx-ingress-controller. Monitoring the logs to see if I come across the error message again. So far it's looking good.

Edit 2: Setting the environment variables on the kube-ingress-nginx-ingress-controller solved my issues. Thank you @dkipping for taking the time to walk me through it.

@wutingbupt

wutingbupt commented Oct 16, 2018 via email

@fhoy

fhoy commented Nov 5, 2018

@weinong : Good to know. Is this being rolled out region by region, and if so, is there a set sequence? I am particularly interested in when this will be rolled out in region westeurope.

@weinong
Contributor

weinong commented Nov 5, 2018

West Europe will be later this week. It will be applied to all new clusters at first. Then we will roll it out to all existing clusters shortly after.

@alibengtsson

Hi, when will the fix be released for centralus and southeastasia?

Thanks in advance.

@martinhartig

The fix has reached westeurope.
Indeed, it has solved several issues. Thanks.

@LanceTheDev

Still getting the Issue:
2018-11-12T07:49:47.964622Z error Unable to decode an event from the watch stream: stream error: stream ID 1529; INTERNAL_ERROR
2018-11-12T07:49:47.964882Z error Unable to decode an event from the watch stream: stream error: stream ID 1519; INTERNAL_ERROR

Do I need to reinstall Istio for this fix to work?

@alibengtsson

Still getting the Issue:
2018-11-12T07:49:47.964622Z error Unable to decode an event from the watch stream: stream error: stream ID 1529; INTERNAL_ERROR
2018-11-12T07:49:47.964882Z error Unable to decode an event from the watch stream: stream error: stream ID 1519; INTERNAL_ERROR

Do I need to reinstall Istio for this fix to work?

Try restarting or deleting only the Envoy ingress gateway and your pods, then check the ingress controller logs again.
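
For example, assuming the default istio-system namespace and the standard istio=ingressgateway label (adjust to your installation):

# Delete the ingress gateway pods so they are recreated and pick up the fix.
kubectl delete pod -n istio-system -l istio=ingressgateway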

@wutingbupt

Hi,

Do we need to do anything for our cluster, or will it be applied automatically?

Br,
Tim

@andig

andig commented Nov 12, 2018

I understood:

@strtdusty It's a feature flag they set on AKS. For now only applied to new clusters. Not sure if the feature can be disabled manually.

So you'll have to recreate the cluster.

@mmosttler

@adinunzio84 You mentioned that with this fix an additional ServiceEntry may be needed for Istio. We are having the issue described and also have Istio installed, so I am curious whether you have the fix and what ServiceEntry you had to add.

@juan-lee
Contributor Author


Hi,

Do we need to do anything for our cluster, or will it be applied automatically?

Br,
Tim

The fix is no longer behind a feature flag. All new clusters will get it automatically. Existing clusters will need to do a scale or upgrade to get the fix.
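
For example, either of these Azure CLI operations should roll the fix out to an existing cluster; resource group, cluster name, node count, and version are placeholders:

# Trigger the fix on an existing cluster via a scale or an upgrade operation.
az aks scale -g <resource-group> -n <cluster-name> --node-count <count>
az aks upgrade -g <resource-group> -n <cluster-name> --kubernetes-version <version>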

@wutingbupt

@juan-lee Thanks for your reply. We rebuilt our cluster yesterday; however, I can still see this problem this morning.
We are in westeurope, and the cluster version is 1.11.3

Br,
Tim

@alibengtsson

Hi, I redeployed with Terraform in the westeurope and centralus regions, with the NGINX ingress controller. I can confirm from my side that the messages are gone from the log file.
I used version 1.11.2.

thanks for the fix.

@juan-lee
Contributor Author

@juan-lee Thanks for your reply. We rebuilt our cluster yesterday; however, I can still see this problem this morning.
We are in westeurope, and the cluster version is 1.11.3

Br,
Tim

Can you provide some more details? Does the pod in question have the appropriate KUBERNETES_ environment variables set?

@m1o1

m1o1 commented Nov 13, 2018

@mmosttler Not fully tested, but this kinda works:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: azmk8s-ext
  namespace: default
spec:
  hosts:
  - ${FQDN}
  location: MESH_EXTERNAL
  ports:
  - name: https
    number: 443
    protocol: HTTPS
  resolution: DNS

I did cat serviceentry.yml | envsubst | kubectl apply -f - with FQDN=$(az aks show -n ${CLUSTER_NAME} -g ${CLUSTER_RG} --query "fqdn" --output tsv)

This is the closest I got to it working. The javascript Kubernetes client is happy with this, but the Go client is not.

Please let me know if you come up with something better though

@DenisBiondic

I can confirm that the problem is gone from our clusters. (west EU)

@weinong
Contributor

weinong commented Dec 3, 2018

Thanks for reporting back. I'm closing the issue for now.

@weinong weinong closed this as completed Dec 3, 2018
@dharmeshkakadia

I am still seeing this issue even this week in east us.

@juan-lee
Contributor Author

juan-lee commented Dec 4, 2018

I am still seeing this issue even this week in east us.

Can you elaborate on your scenario? Also, keep in mind that pods will need to be restarted in order to get the fix. You can check whether a pod has the fix by verifying that the KUBERNETES_PORT, etc. environment variables are set for each container.
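
For example (pod name and namespace are placeholders; add -c <container> for multi-container pods):

# Print the container's environment and look for the injected overrides.
kubectl exec -n <namespace> <pod-name> -- env | grep '^KUBERNETES_'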
