More unstable cluster #329
Hmm, while looking into causes I see your comment here :^) kubernetes/kubernetes#66689 (comment)
Yes. :-) I have such a wait in place, but it does not really resolve it. I thought it was just a race condition, but in fact it seems the namespace is simply not created properly. I am guessing some core service dies or something.
So I watch all events and print them out. During a run of my CI I am noticing events like these, unrelated to what I am doing in our tests:
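For reference, a minimal sketch of what such an event watch might look like with the Python Kubernetes client; this is an illustrative assumption, not the reporter's actual script:

```python
# Hypothetical sketch: stream and print all cluster events with the
# official Python Kubernetes client.
from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(v1.list_event_for_all_namespaces):
    obj = event["object"]
    print(f"{obj.last_timestamp} {obj.involved_object.kind}/"
          f"{obj.involved_object.name}: {obj.reason} - {obj.message}")
```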
Not sure why a new node would be registered in the middle of a CI run? Maybe because it died before and was now recreated? Also, I have an example situation which showcases this issue, and I would just like to make sure it is not something I am doing wrong. This is example log output from the CI script I wrote:
So after the systems are up in their pods, I start tests against them. This is done by creating a job. After I create the job, I use a watch to wait for its pod to appear. I am assuming some core service has issues running on the kind cluster, and this is why no pod appears and why my CI script complains. The question is which service has issues and why.
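A rough sketch of that pattern with the Python client; the job spec, names, and polling details are assumptions for illustration, not the reporter's actual code:

```python
# Hypothetical sketch: create a Job in the test namespace and poll for its
# pod to appear. If no pod ever shows up, the loop at the bottom is where a
# CI script would hang or time out.
import time
from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()
core = client.CoreV1Api()

namespace = "ci-tests"  # assumed namespace name
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="run-tests"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(name="tests", image="tests:latest")],
            )
        )
    ),
)
batch.create_namespaced_job(namespace, job)

# The job controller labels its pods with "job-name=<job name>".
deadline = time.time() + 300
while time.time() < deadline:
    pods = core.list_namespaced_pod(namespace, label_selector="job-name=run-tests")
    if pods.items:
        break
    time.sleep(5)
else:
    raise RuntimeError("no pod created for job 'run-tests' within 5 minutes")
```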
@mitar are you reusing the same cluster for all tests, or does each test get its own cluster? If things start hanging, the first thing to look at is the API server. You should be able to jump into the control plane container and start looking around at the different containers. Also, it would be useful to track your machine's resources if you deploy kind for a long period of time.
So there is a series of tests inside one CI run, and the cluster is created only once per CI run. I cannot really jump into the "CI run" because it is running on GitLab workers. So maybe there is some resource starvation or something, but it would be useful if that were somehow reported. Or make it so that core systems are never removed. I would understand, and could debug more easily, if my test pods got killed with "out of resources" by the cluster. But not core pods.
Definitely agree on the core systems never being removed; unfortunately there are some Kubernetes limitations there where core workloads can be rescheduled. Kubernetes is designed to have reserved overhead, but we can't fully do that in kind right now. I spent some time looking at another user's logs with some similar issues but we haven't pinned it down yet. We still need to debug and solve this as much as possible, though, and the tooling for that needs improvement.
This may be related: #303
Rescheduled to where? If this is a one-node cluster, where would it go? :-)
Re-created on the same node. Some of these (like daemonsets) are expected to change, I think... RE: #303, I think these mounts may fix some of the issues with repeated nesting. The other user seemed to have tracked it down to the host systemd killing things in their environment. I can't replicate this yet (not enough details, and it doesn't happen in my environments so far...).
Hm, would there be anything in the log if the host kills a container?
You may see a signal handler log in the pod / container logs IIRC; I haven't seen this occur first-hand yet. It would not be normal on a "real" cluster. Another user with issues: #136 (comment). I will spend some time later this week looking at how to improve debug-ability and bring this up in our subproject meeting tomorrow. If it's related to mounting the host cgroups it may be particularly tricky to identify though... 😬 kind has been remarkably stable in the environments I have access to ... which is of course wildly unhelpful for identifying causes of instability 😞
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
I think we can close this for now. It looks relatively stable recently. |
Aside: I've disabled that bot in this repo, and I do hope we've made progress on ensuring stability in more configurations. |
So now I am running kind on our CI (GitLab) and I am noticing more instability than I would hope for. What we do is create a kind cluster and then create a namespace, create a few pods in the namespace, run some jobs inside the namespace (one job generally takes around an hour or so), clean up the namespace, and repeat with another namespace. We do this a few times. I use the Python Kubernetes client.
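A condensed sketch of that cycle with the Python client might look roughly like this; the namespace naming, loop count, and deletion wait are assumptions, not the reporter's actual CI code:

```python
# Hypothetical sketch of the per-round cycle: create a namespace, run a test
# round inside it, then delete the namespace and wait until it is gone.
import time
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
core = client.CoreV1Api()

def run_round(index):
    ns = f"ci-round-{index}"  # assumed naming scheme
    core.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=ns)))
    try:
        pass  # create pods, run the job, wait for results (omitted)
    finally:
        core.delete_namespace(ns)
        # Namespace deletion is asynchronous; wait until it really disappears
        # before starting the next round.
        while True:
            try:
                core.read_namespace(ns)
            except ApiException as exc:
                if exc.status == 404:
                    break
                raise
            time.sleep(5)

for i in range(3):  # "We do this a few times."
    run_round(i)
```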
What happens sometimes is that at some point commands do not seem to get through. A typical example is that after creating a namespace, commands start failing with a "default service account does not exist" error message. I added a check to wait for it to be created, but it seems it never is. And this happens only occasionally.
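A minimal version of such a check with the Python client might look like this; the timeout and error handling are assumptions:

```python
# Hypothetical sketch: wait for the "default" ServiceAccount to appear in a
# freshly created namespace before creating pods that reference it.
import time
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
core = client.CoreV1Api()

def wait_for_default_service_account(namespace, timeout=120):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            core.read_namespaced_service_account("default", namespace)
            return
        except ApiException as exc:
            if exc.status != 404:
                raise
        time.sleep(2)
    raise RuntimeError(
        f"default service account never appeared in namespace {namespace!r}"
    )
```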
Another example is that I have a watch observing and waiting for some condition (like all pods being ready), and it just dies on me and the connection gets closed.
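For what it's worth, one way such a watch is often made resilient is to wrap it in a reconnect loop; a hedged sketch, where the readiness check and timeouts are assumptions rather than the reporter's code:

```python
# Hypothetical sketch: re-establish the watch when the connection drops,
# instead of letting a single closed connection fail the whole CI step.
import time
import urllib3
from kubernetes import client, config, watch

config.load_kube_config()
core = client.CoreV1Api()

def wait_for_all_pods_ready(namespace, timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        w = watch.Watch()
        try:
            # timeout_seconds bounds each watch on the server side, so the
            # loop periodically re-establishes the watch even without errors.
            for _ in w.stream(core.list_namespaced_pod,
                              namespace=namespace, timeout_seconds=60):
                pods = core.list_namespaced_pod(namespace).items
                if pods and all(
                    (p.status.container_statuses or [])
                    and all(cs.ready for cs in p.status.container_statuses)
                    for p in pods
                ):
                    return
        except (urllib3.exceptions.ProtocolError,
                urllib3.exceptions.ReadTimeoutError):
            time.sleep(2)  # connection dropped; retry the watch
        finally:
            w.stop()
    raise RuntimeError(f"pods in {namespace!r} not ready within {timeout}s")
```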
I am attaching kind logs for one such failed CI session.
results-43316.zip
I see errors like: