Intermittent connectivity to api-server when using with Istio #527

Closed
m1o1 opened this issue Dec 5, 2018 · 17 comments

m1o1 commented Dec 5, 2018

Client-go seems to have a problem when running on AKS with Istio. We've spent a lot of time trying to figure out which of the three (AKS, Istio, or client-go) causes the issue, and at this point we suspect client-go.

Environment:

  • Azure Kubernetes Service (Kubernetes 1.11.5)
  • Istio 1.0.2 (bugs in later versions)
  • client-go (tried 7.0.0 and 9.0.0)

Description:
We noticed that if our client calls the API server every 10 seconds, it keeps working. However, after roughly (if not exactly) 5 minutes of idle time, requests start failing; after another 10-15 minutes or so they work again, and then the cycle repeats. This is reproducible with a slightly modified version of the in-cluster example (see the sketch below).
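
For reference, a rough sketch of the kind of polling loop involved, based on client-go's in-cluster example (illustrative rather than our exact code; the List signature matches the client-go versions listed above, newer releases also take a context):

package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config uses the pod's service account token plus the
	// KUBERNETES_SERVICE_HOST / KUBERNETES_SERVICE_PORT environment variables.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	interval := 10 * time.Second // works; after ~5 minutes of idle time the calls start failing
	for {
		fmt.Printf("%s: Making request\n", time.Now().Format("2006-01-02 15:04:05"))
		pods, err := clientset.CoreV1().Pods("").List(metav1.ListOptions{})
		if err != nil {
			panic(err) // the "Panic" line in the log below
		}
		fmt.Printf("There are %d pods in the cluster\n", len(pods.Items))
		time.Sleep(interval)
	}
}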

Error log:

...
2018-12-05 18:37:27: Making request
E1205 18:37:37.333173       8 round_trippers.go:169] CancelRequest not implemented by *rest.tokenSourceTransport
2018-12-05 18:37:37: Panic
Error Get https://${FQDN}:443/api/v1/pods: net/http: request canceled (Client.Timeout exceeded while awaiting headers):

(replaced my apiserver with ${FQDN})

We're also seeing:

...
Error Get https://${FQDN}:443/api/v1/pods: unexpected EOF:

(replaced my apiserver with ${FQDN})

The main factors involved seem to be a combination of:

  • Using environment variables on the Pod to access the API server
  • Using an Istio ServiceEntry to reach the API server (required because of the previous factor)
  • Using client-go to access the API server

The ServiceEntry that needs to be created looks like this:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: azmk8s-ext
  namespace: default
spec:
  hosts:
  - ${FQDN}
  location: MESH_EXTERNAL
  ports:
  - name: https
    number: 443
    protocol: HTTPS
  resolution: DNS

Why is this ServiceEntry necessary?

Why client-go?

  • The issue occurs with client-go but not with client-node. It also only happens on AKS and not on GKE, but that is probably because of the ServiceEntry requirement, which applies on AKS and not on GKE. client-node works in both environments, provided the ServiceEntry is present on AKS.

Other things we've noticed:

  • Making raw REST calls to the API server with the token and CA cert stored on the pod works fine, both via curl and via Go's http.Client (see the sketch after this list).
  • The problems are pod-specific - that is, one pod can make requests every 10 seconds while another makes them every 6 minutes, and the 10-second one keeps working while the 6-minute one does not
  • Deleting the ServiceEntry and recreating it seems to "restart" the error cycle, so it works soon after, but fails again after that brief period.
  • We can use any 2 of AKS, Istio, and client-go, but not all 3
    • If we do not use Istio, the sample app works okay, even with the AKS environment variables set
    • If we do not use AKS (and therefore no ServiceEntry), it works okay
    • If we do not use client-go, but use client-node, it works okay
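
For completeness, roughly what the "raw" Go call looks like (standard in-cluster service-account paths and environment variables; illustrative, not our exact code):

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
)

func main() {
	// Standard in-cluster service account credentials.
	token, err := ioutil.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}
	caCert, err := ioutil.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
	if err != nil {
		panic(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caCert)

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: caPool},
		},
	}

	url := fmt.Sprintf("https://%s:%s/api/v1/pods",
		os.Getenv("KUBERNETES_SERVICE_HOST"), os.Getenv("KUBERNETES_SERVICE_PORT"))
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+string(token))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
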
m1o1 changed the title from "Intermittent connectivity when using with Istio" to "Intermittent connectivity to api-server when using with Istio" on Dec 5, 2018

liggitt commented Dec 5, 2018

cc @mikedanese


liggitt commented Dec 5, 2018

looks like the token source transport doesn't implement optional transport methods


liggitt commented Dec 5, 2018

/assign @mikedanese


liggitt commented Dec 5, 2018

looks like it also doesn't implement RoundTripperWrapper
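
For context, a wrapping transport that satisfies those optional interfaces looks roughly like this (an illustrative sketch, not the actual rest.tokenSourceTransport code):

package example

import "net/http"

// wrappingTransport is an illustrative stand-in for a transport wrapper such as
// rest.tokenSourceTransport; only the shape of the optional methods matters here.
type wrappingTransport struct {
	base http.RoundTripper
}

func (t *wrappingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return t.base.RoundTrip(req)
}

// CancelRequest is the optional method the round_trippers.go message above complains
// about; it should forward cancellation to the underlying transport when supported.
func (t *wrappingTransport) CancelRequest(req *http.Request) {
	if c, ok := t.base.(interface{ CancelRequest(*http.Request) }); ok {
		c.CancelRequest(req)
	}
}

// WrappedRoundTripper exposes the underlying transport, which is what the
// RoundTripperWrapper interface in k8s.io/apimachinery/pkg/util/net asks for.
func (t *wrappingTransport) WrappedRoundTripper() http.RoundTripper {
	return t.base
}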

@mikedanese

Thanks for debugging this.


m1o1 commented Dec 5, 2018

No problem - if you have any trouble reproducing the issue, I'm happy to help demonstrate it


liggitt commented Dec 5, 2018

note that I'm not sure the missing method is the cause of the error, just that I noticed it was missing. the 5 minute delay is suspicious, given the expiry time on the token transport

@mikedanese

Yup, that was my thought as well. I'll work on a repro, but cancellation seems like a nice-to-have in the token transport (if only to get rid of the error messages).


m1o1 commented Dec 5, 2018

I should mention that it isn't an individual request that fails after 5 minutes - when a request works, it returns quickly. It's when the app goes 5 minutes without making any requests to the apiserver that errors start to occur on any requests made after that point (from client-go).


m1o1 commented Dec 7, 2018

We spoke with folks at Microsoft, and they said it's possibly due to the fact that Azure load balancing will drop a connection after a certain amount of idle time but does not send a TCP reset, and certain clients that don't handle this case will experience problems. I'm not much of a networking person so I'm not really sure what implications this has, but wanted to share.

They're not sure why this would happen only in the case where Istio is installed though. I've sent some instructions to help them reproduce the issue, so they will be investigating from their side as well.

Hope this helps!

@mikedanese

Is Envoy in the datapath between your application and Azure? Is it in the datapath between your application and the Kubernetes API server?


m1o1 commented Dec 10, 2018

Yes, the traffic should be routed through the Envoy proxy in both the client-node and client-go examples I've tried. I also think it's the same case either way, since this is AKS, which is managed - so the requests go to the (Azure-hosted) api-server.

Edit, to add: I also tried creating a VirtualService, which Istio requires for egress traffic that uses TLS (#2), but that did not solve the problem.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: azmk8s-vs
spec:
  hosts:
  - ${FQDN}
  tls:
  - match:
    - port: 443
      sni_hosts:
      - ${FQDN}
    route:
    - destination:
        host: ${FQDN}
        port:
          number: 443
      weight: 100


m1o1 commented Dec 18, 2018

@mikedanese I can send you a couple of sample apps that reproduce the issue directly if you'd like (though open-sourcing them might not be an option at this point without jumping through some hurdles). I set them up so you just need an AKS cluster and an Azure Container Registry; they run the az CLI commands necessary to connect to them (once you've logged in to the right subscription).


m1o1 commented Jan 15, 2019

We had some help from Microsoft again, and they helped us get it working. Details of the cause:

The issue appears to be caused by the SNAT that happens to all outbound traffic from your instances. When traffic leaves your instances the load balancer performs a SNAT and has a default timeout of 5 minutes. Unfortunately that timeout value can't be changed. What this means is the load balancer terminates any outbound connections that are idle for 5 minutes.

They were also able to replicate the issue with client-go without Istio. I have not, but if that's the case, these fixes might be required for client-go to work on AKS at all (though TCP resets might fix it from the AKS side).

There are two pieces to the fix (both of which we successfully implemented):

  • Disable keep-alives on calls to the apiserver in the Go client, using a custom http transport with DisableKeepAlives: true (a sketch follows the DestinationRule below)
  • Create a DestinationRule (Istio CRD) as shown below so that the Envoy proxies don't try to reuse the same connection for multiple requests

DestinationRule:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: azmk8s-dr
spec:
  host: $FQDN
  trafficPolicy:
    connectionPool:
      http:
        maxRequestsPerConnection: 1
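
A minimal sketch of the first fix (disabling keep-alives), using rest.TLSConfigFor to build the TLS config for a custom transport; treat it as illustrative rather than our exact code:

package main

import (
	"net/http"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newClientWithoutKeepAlives builds a clientset whose transport opens a fresh
// TCP connection for every request instead of reusing idle ones.
func newClientWithoutKeepAlives() (*kubernetes.Clientset, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}

	// Build the TLS config from the rest.Config, then supply our own transport.
	tlsConfig, err := rest.TLSConfigFor(config)
	if err != nil {
		return nil, err
	}
	config.Transport = &http.Transport{
		TLSClientConfig:   tlsConfig,
		DisableKeepAlives: true,
	}
	// client-go does not allow a custom Transport together with TLS file/data
	// options on the config, so clear them now that the transport carries them.
	config.TLSClientConfig = rest.TLSClientConfig{}

	return kubernetes.NewForConfig(config)
}

func main() {
	clientset, err := newClientWithoutKeepAlives()
	if err != nil {
		panic(err)
	}
	_ = clientset // use as usual, e.g. clientset.CoreV1().Pods("").List(...)
}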

Microsoft said that TCP Reset is available in preview for standard load balancers, but AKS does not use standard load balancers yet. They should soon though.

Does this fix affect client-go's ability to "watch" resources (or anything else)? If not, then I think the issue is resolved with these fixes.

Edit: one last thing I can think of that could be causing this is that Envoy proxies don't send TCP keepalive packets (envoyproxy/envoy#3634). I assume that without TCP keepalives and without TCP resets, the connection really is idle for 5 minutes, so the LB closes it.


m1o1 commented Apr 3, 2019

Update regarding this - we just decided to disable egress traffic in Istio, but from what I heard, setting the "IdleConnTimeout" on the http transport to something less than 5 minutes solves the issue. I haven't confirmed this though.


prune998 commented Apr 17, 2019

Azure load balancers time out idle connections after 4 minutes (not configurable at this time). I would set IdleConnTimeout to 3 minutes to be sure not to hit the threshold.
This will not save you from the case where you are using too many connections - then the LB may drop some of them sooner.
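
A sketch of that variant, reusing the same custom-transport approach as the keep-alives fix earlier (3 minutes is just a value safely under the limit):

package example

import (
	"net/http"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newClientWithShortIdleTimeout keeps connection reuse, but drops idle pooled
// connections well before the load balancer's ~4 minute idle timeout, so a
// silently-dropped connection is never picked up again.
func newClientWithShortIdleTimeout() (*kubernetes.Clientset, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	tlsConfig, err := rest.TLSConfigFor(config)
	if err != nil {
		return nil, err
	}
	config.Transport = &http.Transport{
		TLSClientConfig: tlsConfig,
		IdleConnTimeout: 3 * time.Minute, // below the 4 minute threshold mentioned above
	}
	config.TLSClientConfig = rest.TLSClientConfig{} // required when Transport is set
	return kubernetes.NewForConfig(config)
}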

Note that sending keepalive packets is a workaround for the issue, not a fix. AKS is dropping TCP connections without a RESET, which is dirty and breaks normal TCP behaviour. That is something you might tolerate occasionally on external (WAN) networks, but NOT when trying to reach the K8s API itself.

This is a serious issue with AKS and is, by itself, a NO GO for using AKS.


m1o1 commented Jun 6, 2019

I imagine this can be closed now - in my case the problem was with Azure LBs not sending a TCP Reset and my applications having long timeout periods. I think the fix is to use low IdleConnTimeouts along with low request timeouts and wait for AKS to support TCP Resets in its LBs.

I don't know if there is anything client-go can do. It could maybe change the default IdleConnTimeout to 3 or 4 minutes, but as @prune998 described, that will not solve the case where the LBs close connections sooner due to high congestion, and since LBs are not obligated to do anything in response to TCP keepalives, keepalives won't solve it either. Maybe forcing a new TCP connection when a request fails due to a broken connection (if there even is a way to do that) is something client-go could do. Otherwise, any further issues are probably best filed against Azure/AKS.
