Watch Method stops watching #533
Comments
Having spent a lot of time looking into this, I believe it's due to a bug in the SSL stream, potentially a deadlock; essentially the await calls on the stream's read method just hang indefinitely and are never completed.
Did your server close the connection?
Not that I can see. Either way, were that the case I would expect an exception to be thrown, which doesn't happen.
It's quite possible also that something in the network path timed out. You need to send keep-alives in many cases to keep the TCP stream open (or you need to time out the watch at a lower number, like 5 minutes); a sketch of that second mitigation follows below.
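(A minimal sketch of that mitigation, not from the thread: it assumes the generated C# client's list call exposes a timeoutSeconds parameter and that the Watch extension takes onEvent/onError/onClosed callbacks; exact signatures vary by client version.)

```csharp
using System;
using System.Threading.Tasks;
using k8s;
using k8s.Models;

class WatchWithTimeout
{
    static async Task Main()
    {
        var config = KubernetesClientConfiguration.BuildConfigFromConfigFile();
        IKubernetes client = new Kubernetes(config);

        while (true)
        {
            var closed = new TaskCompletionSource<bool>();

            // Ask the API server to end the watch after 4 minutes, i.e. below the
            // ~5 minute idle window discussed here, then re-establish it.
            using var watcher = client.ListNamespacedPodWithHttpMessagesAsync(
                    "default", watch: true, timeoutSeconds: 240)
                .Watch<V1Pod, V1PodList>(
                    (type, pod) => Console.WriteLine($"{type}: {pod.Metadata.Name}"),
                    ex => Console.WriteLine($"watch error: {ex}"),
                    () => closed.TrySetResult(true));

            await closed.Task;
            Console.WriteLine("watch closed, restarting");
        }
    }
}
```

The trade-off is extra list/watch churn, but the connection never sits idle long enough for an intermediate hop to drop it silently.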
Maybe, but I'm using the default implementation of this client, and by default it isn't working; it doesn't throw an exception, it just seems to get stuck awaiting.
@pbalexlear I think I have the same issue. Just out of curiosity, are you using Windows or Linux?
I was doing this on a Mac initially, but as the task debugger is better in VS for Windows I tried on Windows with the same results and was able to see in more detail what was happening. I've since deleted the Windows VM, so I no longer have that information, but given this happens to me every time after maybe 5 minutes of no events, I would have thought it would be fairly easy to reproduce. It does look like it is potentially a bug in SSL streams, as this behaviour doesn't happen when you do a kube proxy and watch locally over the HTTP address.
Ideally there should be some exception thrown so I could restart the watcher, but as there isn't, I cannot capture the event; and as the 'Watching' indicator is set in the same method that is stuck awaiting a task result, it never gets changed either, so I cannot tell whether a watcher truly is watching or not.
Could you please create a test case in the e2e project? I will then try to understand and fix it.
@tg123 Could you explain a bit more what you mean by creating a test case in the e2e project? I'm happy to assist in any way I can, although I think the best way for you to understand this issue is to recreate it.
Create a test case for your bug (see csharp/tests/E2E.Tests/MnikubeTests.cs, line 19 at a614d95).
I have the same issue when using https://github.com/falox/csharp-operator-sdk, which is based on this project. It works for 5-10 minutes, then hangs; no errors are thrown.
@tg123 I have created a test here:
Also, please be aware that in order to run this test there needs to be a namespace called 'testns' in the k8s cluster on your current config.
@pbalexlear I am trying to understand your case. Here is what I got after cherry-picking your test case; should it fail?
Hello, it's not possible for the case to run in 1.7 minutes, as it has a 5-minute Task.Delay, which is when the issue occurs. The scenario is supposed to be:
I have spent some time on this today. First, I cannot find any specific reason for this to happen, but I did find a similar case in the Java SDK, kubernetes-client/java#1370. I am not sure this will be the solution, as I have not had time to confirm it.
Yes, I think there is a bug in .NET Core, particularly as this only happens over SSL streams.
Allow me some time.
I still cannot repro this. Here is how I tested it (latest master, 0e3cb94):
Server
Client code (from the watch example)
Create a pod periodically (random delay); a sketch of this step follows after this list
Here is the log output:
I did the same thing on Windows; can't repro either.
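(The server, client, and log snippets from that comment are not reproduced in this extract. Below is a hedged sketch of the "create a pod periodically" step with the C# client; the namespace, image, and naming are placeholders, and the flat CreateNamespacedPodAsync call may be grouped differently in newer client versions.)

```csharp
using System;
using System.Threading.Tasks;
using k8s;
using k8s.Models;

class PodCreator
{
    static async Task Main()
    {
        var client = new Kubernetes(KubernetesClientConfiguration.BuildConfigFromConfigFile());
        var rng = new Random();

        while (true)
        {
            var pod = new V1Pod
            {
                Metadata = new V1ObjectMeta { Name = $"watch-test-{Guid.NewGuid():N}" },
                Spec = new V1PodSpec
                {
                    Containers = new[] { new V1Container { Name = "pause", Image = "k8s.gcr.io/pause:3.2" } },
                    RestartPolicy = "Never"
                }
            };

            await client.CreateNamespacedPodAsync(pod, "default");
            Console.WriteLine($"created {pod.Metadata.Name}");

            // Random delay so some gaps are long enough to hit the reported idle window.
            await Task.Delay(TimeSpan.FromMinutes(rng.Next(1, 8)));
        }
    }
}
```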
Hi, strange that you cannot reproduce it; I reproduced it using the same example. My app is running inside an AKS service, but it also fails on Windows. I am pretty confident that a long duration without any changes (about 5 minutes) is the trigger. I will publish an example later today or tomorrow.
I did a tcpdump on the connection to monitor what happened; after ~30 minutes the client shut down the connection. Please also run tcpdump when you test, to see where the connection gets stuck.
When onClose is triggered, everything is good. I copied @tg123's approach with a minikube installation, and it worked just fine. I then booted an AKS cluster, and it did not. I expected nothing from the tcpdump, but I got a reset without onClose being triggered.
Nothing more appeared in the tcpdump after this point. I logged the status of watcher.Watching every minute:
See my later reply and workaround: the AWS LB (maybe) sent an RST to close the connection after 5 minutes idle.
Let's do this and see if it mitigates the issue. The underlying connection did not honor the RST. I do not have such an environment, so I will try to mirror the behavior of the RST and update my findings here.
Hello, I have only experienced this issue when using AKS and an HTTPS connection. If you use the same setup I would expect you to be able to replicate it, as it happens every time for me with AKS and HTTPS. Thanks, Alex
I created an AKS cluster and tested. We can set TCP keep-alive to avoid the remote side kicking us, and set the keep-alive parameters.
Here is my workaround:
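(The workaround code itself is not reproduced in this extract. As a rough sketch of the idea, not necessarily @tg123's exact code: enable TCP keep-alive probes on the underlying socket through a SocketsHttpHandler ConnectCallback on .NET 5+. The probe intervals are illustrative, and how the handler is wired into the Kubernetes client depends on the client version.)

```csharp
using System.Net.Http;
using System.Net.Sockets;

var handler = new SocketsHttpHandler
{
    ConnectCallback = async (ctx, ct) =>
    {
        // Keep-alive probes stop an idle watch connection from being silently
        // dropped by an intermediate load balancer.
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 60);     // first probe after 60 s idle
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 30); // then every 30 s
        try
        {
            await socket.ConnectAsync(ctx.DnsEndPoint, ct);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }
};

// The handler then carries the client's HTTP traffic; TLS is layered on top by the handler.
var httpClient = new HttpClient(handler);
```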
Update: log after 10 minutes idle:
kube-apiserver has an option
Thanks @tg123, I am happy to upgrade to .NET 5. Will this fix be added to a release?
@tg123 great work, sorry for being so quiet, but Christmas...
@roblapp did you try my workaround?
@roblapp I followed the changes suggested by @tg123 and got it running stably. There might still be an issue when running it in a hosted service. In our case, the root cause was the Azure LB, not AWS. To make this bomb-proof, I added a timer: when the duration between events passes the TTL limit, we recreate the watch. This only occurs in stale environments, but it provides another layer of stability, as it will resolve all network issues causing a half-open connection. It is not beautiful :) (A sketch of such a watchdog follows below.)
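(The timer itself is not shown in the thread; a minimal sketch of that kind of watchdog, with all names illustrative and RecreateWatch standing in for whatever disposes the old watcher and starts a new one, could look like this.)

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative watchdog: record when the last watch event arrived and recreate the
// watch whenever the gap exceeds a TTL, which also recovers from half-open connections.
class WatchWatchdog
{
    private readonly TimeSpan _ttl;
    private readonly Action _recreateWatch;   // disposes the old watcher and starts a new one
    private DateTimeOffset _lastEvent = DateTimeOffset.UtcNow;

    public WatchWatchdog(TimeSpan ttl, Action recreateWatch)
    {
        _ttl = ttl;
        _recreateWatch = recreateWatch;
    }

    // Call this from the watch onEvent callback.
    public void OnEvent() => _lastEvent = DateTimeOffset.UtcNow;

    public async Task RunAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            await Task.Delay(TimeSpan.FromMinutes(1), ct);
            if (DateTimeOffset.UtcNow - _lastEvent > _ttl)
            {
                _recreateWatch();
                _lastEvent = DateTimeOffset.UtcNow;
            }
        }
    }
}
```

In quiet environments this occasionally recreates a perfectly healthy watch, which is the price of covering every failure mode.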
@eskaufel Are you passing the resource version in when establishing the watch? I added support for it, but then I wasn't sure how you were supposed to obtain it initially. I'd be very curious to see your solution, even if it's in pseudo code... I am interested in seeing how you are doing it. My watch functionality was kicked off by using an
This code calls this method:
If anyone sees something wrong with this, please let me know and I will be eternally grateful. I ended up moving to a polling-based solution, which works for my use case, but I would much rather get the watch working correctly. P.S. It might be worth noting that my cluster is in EKS. I am curious to see if that is going to be a commonality for those experiencing this issue.
@tg123 I did not try that solution yet. I may give it a shot within the next week or so. Thanks!
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hello, I am encountering something very similar to this issue using version 5.0.13 of the KubernetesClient on .NET 5. Could somebody clarify whether this issue is expected to be fixed or not? If not, I can try to dig deeper and determine whether it is the same issue.
@paulsavides Unfortunately, I kept running into this issue. The watch would disconnect and there was nothing to indicate what happened. After days of not being able to resolve it, I abandoned the watch functionality completely in favor of polling. I posted the code that I was using to perform the watch here. In my new solution, I have a background process that polls the V1Job API on a 15-second interval. For my use case this is good enough, as I don't need to capture events instantly. For testing, I simulated thousands of requests per minute against the API server and saw no noticeable impact on the cluster or on the Kubernetes API server. Unfortunately, I don't think a solution was found, nor was I able to figure it out, which is why I changed my approach to polling.
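(For illustration only: the posted code is not reproduced here, the namespace is a placeholder, and on newer client versions the list call is grouped under a BatchV1 property rather than exposed as a flat method.)

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using k8s;
using k8s.Models;

// Hypothetical polling loop: list jobs every 15 seconds instead of keeping a
// long-lived watch connection open.
class JobPoller
{
    private readonly IKubernetes _client;

    public JobPoller(IKubernetes client) => _client = client;

    public async Task RunAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            V1JobList jobs = await _client.ListNamespacedJobAsync("default", cancellationToken: ct);
            foreach (var job in jobs.Items)
            {
                // Inspect job.Status (e.g. Succeeded / Failed counts) and react here.
            }
            await Task.Delay(TimeSpan.FromSeconds(15), ct);
        }
    }
}
```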
@tg123, is the modification to the underlying message handler still required (referring to #533 (comment)) to enable TCP keep-alive messages? Or is it enabled by default as part of the creation of the client?
@karok2m it is enabled by default on .NET 5+.
I have not resolved this issue myself, but if it helps anyone, this zsh function is a workaround when using the CLI/kubectl:
slm-watch(){
  # Pick a CRD interactively (or pass one as $1), then keep re-attaching the watch whenever it drops.
  if [[ -z $1 ]]; then
    echo "\nCurrent context has these CRDs:"
    i=-1
    kubectl get crd -A | awk '{ print $1 }' | while read line; do
      i=$((i + 1))
      # skip the header row
      if [[ $i -eq 0 ]]; then
        continue
      fi
      CHOICES[$i]=$line
      echo $i. $line
    done
    echo
    read "CHOICE?Select a number between 1 and $i: "
    CHOICE=$CHOICES[$CHOICE]
  else
    CHOICE=$1
  fi
  echo OK, getting $CHOICE objects ...
  while true; do
    kubectl get $CHOICE -A -w | while read line; do
      echo "$line $(date +%T)"
    done
    echo "\nDamn, lost the watcher! What happened? ..."
    sleep 2
  done
}
# example: `slm-watch storagesnapshots`
Quite obviously a hack, and you lose the declarativeness because events "rebuild" on re-connection, but it works for me. This is on a MacBook running minikube.
When using a watch method, it appears to stop watching after a certain time period. I am not sure if this is a task deadlock or a stream timeout, but after around 30 minutes new events are no longer triggered even if changes are made on the cluster. The watch object also says it is still watching, but when debugging and looking at the tasks, they are awaiting StreamReader.ReadLine.