Intermittent connectivity to api-server when using with Istio #527
Comments
cc @mikedanese
Looks like the token source transport doesn't implement optional transport methods.
/assign @mikedanese
Looks like it also doesn't implement RoundTripperWrapper.
Thanks for debugging this.
No problem - if you have any trouble reproducing the issue, I'm happy to help demonstrate it.
Note that I'm not sure the missing method is the cause of the error, just that I noticed it was missing. The 5 minute delay is suspicious, given the expiry time on the token transport.
Yup, that was my thought as well. I'll work on a repro, but cancellation seems like a nice-to-have in the token transport (if only to get rid of the error messages).
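For illustration only, here is a minimal sketch of the optional-method plumbing being discussed, assuming a hypothetical tokenRoundTripper that injects a bearer token. The CancelRequest passthrough and the WrappedRoundTripper accessor are the two optional pieces a wrapping transport can expose so that callers which type-assert for them keep working:

```go
package main

import "net/http"

// tokenRoundTripper is a hypothetical wrapper that adds a bearer token to
// each request before delegating to the underlying RoundTripper.
type tokenRoundTripper struct {
	token string
	base  http.RoundTripper
}

func (t *tokenRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	r := req.Clone(req.Context())
	r.Header.Set("Authorization", "Bearer "+t.token)
	return t.base.RoundTrip(r)
}

// CancelRequest forwards cancellation to the base transport when it supports
// the optional CancelRequest method (as *http.Transport does).
func (t *tokenRoundTripper) CancelRequest(req *http.Request) {
	if c, ok := t.base.(interface{ CancelRequest(*http.Request) }); ok {
		c.CancelRequest(req)
	}
}

// WrappedRoundTripper exposes the underlying transport so code that unwraps
// round tripper chains (e.g. to adjust connection settings) can reach it.
func (t *tokenRoundTripper) WrappedRoundTripper() http.RoundTripper {
	return t.base
}

func main() {
	client := &http.Client{Transport: &tokenRoundTripper{token: "example", base: http.DefaultTransport}}
	_ = client
}
```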
I should mention that the individual request doesn't fail after 5 minutes. When the request works, it returns quickly. It's when the app goes 5 minutes without making any requests to the apiserver - that's when the errors start to occur on any requests made after that point (from client-go).
We spoke with folks at Microsoft, and they said it's possibly due to the fact that Azure load balancing will drop a connection after a certain amount of idle time but does not send a TCP reset, and certain clients that don't handle this case will experience problems. I'm not much of a networking person so I'm not really sure what implications this has, but wanted to share. They're not sure why this would happen only in the case where Istio is installed, though. I've sent some instructions to help them reproduce the issue, so they will be investigating from their side as well. Hope this helps!
Is Envoy in the datapath between your application and Azure? Is it in the datapath between your application and the Kubernetes API server?
Yes, the traffic should be routed through the Envoy proxy in both the client-node and client-go examples I've tried. Also, I think it's the same case, as it's on AKS, which is managed, so the requests are sent to the (Azure-hosted) api-server. Edit, to add: I also tried creating a VirtualService, which Istio requires for egress traffic that uses TLS (#2), but that did not solve the problem.
@mikedanese I can send you a couple sample apps that reproduce the issue directly if you'd like (though open-sourcing it might not be an option at this point without having to jump through some hurdles). I made it so you just need an AKS cluster and Azure Container Registry, then it runs the
We had some help from Microsoft again, and they helped us get it working. Details of the cause:
They also were able to replicate the issue with client-go without Istio. I have not, but if that's the case, these fixes might be required to make client-go work on AKS at all (though TCP resets might fix it from the AKS side). There are two pieces to the fix (both of which we successfully implemented):
DestinationRule:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: azmk8s-dr
spec:
  host: $FQDN
  trafficPolicy:
    connectionPool:
      http:
        maxRequestsPerConnection: 1
```

Microsoft said that TCP Reset is available in preview for standard load balancers, but AKS does not use standard load balancers yet. They should soon, though. Does this fix affect client-go's ability to "watch" resources (or anything else)? If not, then I think the issue is resolved with these fixes. Edit: One last thing I can think of that could be causing this is that Envoy proxies don't send TCP keepalive packets (envoyproxy/envoy#3634). I assume that without TCP keepalives and without TCP resets, the connection actually is idle for 5 minutes, so the LB closes it.
Update regarding this - we just decided to disable egress traffic in Istio, but from what I heard, setting the "IdleConnTimeout" on the http transport to something less than 5 minutes solves the issue. I haven't confirmed this though.
Azure load balancers time out idle connections after 4 minutes (not configurable at this time). I would set the IdleConnTimeout to something lower than that. Note that sending TCP keepalives does not reliably help either, since load balancers are not obligated to do anything in response to them and may still close connections under high congestion. This is a serious issue with AKS and is, by itself, a NO GO to using AKS.
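For reference, a minimal sketch (mine, not from the thread) of a plain http.Transport tuned along these lines; the 3 minute idle timeout and 30 second timeouts are just example values below the ~4 minute LB limit:

```go
package main

import (
	"net"
	"net/http"
	"time"
)

func main() {
	transport := &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second, // TCP keepalives; the LB may ignore these
		}).DialContext,
		// Close idle connections well before the ~4 minute Azure LB idle
		// timeout so a quiet client never reuses a silently dropped socket.
		IdleConnTimeout:     3 * time.Minute,
		TLSHandshakeTimeout: 10 * time.Second,
	}
	client := &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second, // per-request timeout so broken sockets fail fast
	}
	_ = client
}
```

client-go's rest.Config exposes hooks for supplying or wrapping the underlying transport, which is where a setting like this would be applied; the exact fields vary by client-go version.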
I imagine this can be closed now - in my case the problem was with Azure LBs not sending a TCP reset and my applications having long timeout periods. I think the fix is to use low IdleConnTimeouts along with low request timeouts, and to wait for AKS to support TCP resets in its LBs. I don't know if there is anything client-go can do: maybe change the default IdleConnTimeout to 3 or 4 minutes, but as @prune998 described, that will not solve the issue of LBs closing connections due to high congestion, and since LBs are not obligated to do anything in response to TCP keepalives, keepalives won't solve it either. Maybe if there's a way to force a new TCP connection when a request fails due to a broken connection (if there even is a way to do this), that could be something client-go could do. Otherwise, any further issues might be best placed in Azure/AKS's issue tracker.
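As a rough sketch of the "force a new TCP connection when a request fails" idea (an application-level workaround, not something client-go provides), one could drop the pooled connections and retry once; doWithRetry and the single-retry policy are assumptions for illustration:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// doWithRetry performs a GET and, if it fails (for example because a load
// balancer silently dropped the idle TCP connection), discards pooled
// connections and retries once so the second attempt uses a fresh socket.
func doWithRetry(client *http.Client, transport *http.Transport, url string) (*http.Response, error) {
	resp, err := client.Get(url)
	if err == nil {
		return resp, nil
	}
	transport.CloseIdleConnections() // drop possibly-stale pooled connections
	return client.Get(url)
}

func main() {
	transport := &http.Transport{}
	client := &http.Client{Transport: transport, Timeout: 30 * time.Second}
	resp, err := doWithRetry(client, transport, "https://example.com/")
	if err != nil {
		fmt.Println("request failed twice:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```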
Client-go seems to have a problem while running on AKS with Istio. We've spent a lot of time trying to figure out which of the three (client-go, AKS, or Istio) causes the issue, but we suspect it to be client-go at this point.
Environment:
Description:
We noticed that if our client makes calls to the API server every 10 seconds, it continues to work - however, after (pretty close to if not exactly) 5 minutes of idle time, requests start failing. After another period of time (about 10-15 minutes), it works again before repeating the issue. This is noticeable in (a slightly modified version of) the in-cluster example.
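For context, here is a rough reconstruction (not the actual repro app) of the kind of loop described, based on client-go's in-cluster example; the 10 second cadence matches the description above, and newer client-go versions take a context in List:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster configuration, as in the client-go in-cluster example.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Poll the API server on a fixed interval. With a 10s interval the calls
	// keep succeeding; after an idle gap of roughly 5 minutes they start failing.
	const pollInterval = 10 * time.Second // assumption: the 10s cadence from the description
	for {
		pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			fmt.Println("list pods failed:", err)
		} else {
			fmt.Printf("there are %d pods in the cluster\n", len(pods.Items))
		}
		time.Sleep(pollInterval)
	}
}
```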
Error log (replaced my apiserver with ${FQDN}):

We're also seeing (replaced my apiserver with ${FQDN}):

The main factors involved seem to be a combination of:
The ServiceEntry that needs to be created looks like this:
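The manifest itself isn't reproduced above, so as a placeholder, a typical ServiceEntry for TLS egress to the API server host might look like the sketch below (the name azmk8s-se and the port/protocol settings are assumptions, with $FQDN standing in for the apiserver hostname as elsewhere in this issue):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: azmk8s-se
spec:
  hosts:
  - $FQDN
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: tls
    protocol: TLS
  resolution: DNS
```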
Why is this ServiceEntry necessary?
Why client-go?
Other things we've noticed relate to curl and Go's http.Client.