Timeout waiting for process kube-apiserver to stop #1571

Closed
ysksuzuki opened this issue Jun 29, 2021 · 14 comments · Fixed by kubernetes-sigs/kubebuilder#2379
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@ysksuzuki

What happened:

envtest.Environment.Stop() returns the error "timeout waiting for process kube-apiserver to stop" with kube-apiserver 1.21.2, which was set up using setup-envtest use -p env.

  Unexpected error:
      <errors.aggregate | len:1, cap:1>: [
          <*errors.errorString | 0xc000328c00>{
              s: "timeout waiting for process kube-apiserver to stop",
          },
      ]
      timeout waiting for process kube-apiserver to stop
  occurred

How to reproduce it (as minimally and precisely as possible):

Run this envtest with kube-apiserver 1.21.2.

Environment:

  • controller-runtime: 0.9.2
  • kube-apiserver: 1.21.2
@stijndehaes

Using 1.20.2, for example, works fine.

@varshaprasad96
Member

+1, hit the same issue and using 1.20 binaries works fine.

@erikgb
Contributor

erikgb commented Aug 21, 2021

+1, I also tested with the 1.22 binaries and still hit the same issue.

It would be nice if someone with the skills could take a look ASAP. 😉

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 21, 2021
@joelanford
Member

I'm not able to reproduce this issue with the envtest 1.22.0 binaries and controller-runtime v0.9.6.

To try to reproduce, I bumped to the latest k8s, envtest, and c-r versions here

@hickeyma

hickeyma commented Sep 7, 2021

Tested using the cronjob-tutorial, results as follows:

  • Kubernetes v1.20.2: No issue
  • Kubernetes v1.21.4: Issue as stated
  • Kubernetes v1.22.0: Issue as stated

So it looks like the issue affects Kubernetes v1.21 and v1.22.

@hickeyma

hickeyma commented Sep 7, 2021

This is what I have found so far. From Kubernetes 1.21 onward, cleaning up the test environment clashes with a custom controller created during testing. It would seem that the controller is still running, so kube-apiserver does not respond to shutdown during teardown.

I can get it to work by changing the controller goroutine from:

go func() {
		err = k8sManager.Start(ctrl.SetupSignalHandler())
		Expect(err).ToNot(HaveOccurred())
}()

to:

go func() {
		defer GinkgoRecover()
		err = k8sManager.Start(ctrl.SetupSignalHandler())
		Expect(err).ToNot(HaveOccurred(), "failed to run manager")
		gexec.KillAndWait(4 * time.Second)

		// Tear down the test environment once the controller is finished.
		// Otherwise, from Kubernetes 1.21+, teardown times out waiting on
		// kube-apiserver to return.
		err := testEnv.Stop()
		Expect(err).ToNot(HaveOccurred())
}()

and disabling the teardown in AfterSuite.

@tenstad
Contributor

tenstad commented Oct 13, 2021

Is it a mistake that the ctx with cancel in suite_test.go#L65 is replaced by context.Background() in suite_test.go#L97?

@erikgb
Contributor

erikgb commented Oct 13, 2021

Thx @tenstad! I think this issue can be closed, as it is not a problem here; see cybozu-go/coil#189.

I will prepare a PR to fix the scaffolding (and docs) in kubebuilder; see the workaround implemented in kubernetes-sigs/kubebuilder#2302. CC: @hickeyma

@tenstad
Contributor

tenstad commented Oct 13, 2021

Wondering if canceling the manager context is a workaround for ps.Cmd.Process.Signal(syscall.SIGTERM) not being able to stop the kube-apiserver process 🤔

@tenstad
Contributor

tenstad commented Oct 18, 2021

I think the problem source is a pending watch request, initiated by the controller after
the kube-apiserver has received SIGTERM. The request does not appear to have a timeout,
and is not answered by the apiserver. Graceful shutdown introduced in 1.21 appears to
wait for the pending request to terminate before shutting the apiserver completely.

To avoid the error, one must thus avoid sending watch requests after SIGTERMing the apiserver,
or avoid the watch request pending for more than StopTimeout (default 20s).
Each of the following methods avoids the "timeout waiting for process kube-apiserver to stop" error:

  1. cancel() the context used in k8sManager.Start(ctx) before stopping the test environment / apiserver
  2. Set Timeout in rest.Config
  3. Start the kube-apiserver with --request-timeout (probably due to it controlling ShutdownTimeout)
  4. Disable graceful shutdown (--feature-gates=GracefulNodeShutdown=false? - I have not yet been able to verify that this works)

1 will avoid sending requests to the terminating apiserver,
while 2 and 3 will ensure that the apiserver is terminated inside the allowed StopTimeout interval (still causing unnecessary delays).
4 will result in the same behavior as in 1.20, terminating the apiserver regardless of pending requests.

IMO, 1 is the cleanest solution. It is already implemented in kubernetes-sigs/kubebuilder#2379. A context that is canceled in envtest.Environment.Stop() could however be integrated into envtest, making it a bit easier for users.

FYI: I am not experienced with this codebase and the core of k8s, please do not take these as hard facts without verifying yourself.
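
To illustrate the mechanism described above outside of Kubernetes, here is a minimal sketch that uses only Go's net/http. It is an analogy and not kube-apiserver or envtest code: the blocking handler stands in for a pending watch, http.Server.Shutdown stands in for graceful shutdown, and the names shutdownWithWatch, stopWithPendingWatch and stopAfterCancel as well as the timeouts are made up for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// shutdownWithWatch starts a server with one pending "watch" request and then
// shuts it down gracefully, waiting at most stopTimeout. If cancelFirst is
// true, the client cancels its request before the shutdown begins.
func shutdownWithWatch(cancelFirst bool, stopTimeout time.Duration) error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	started := make(chan struct{})
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		close(started)
		<-r.Context().Done() // never answers; returns only once the client is gone
	})}
	go srv.Serve(ln)
	defer srv.Close() // force-close whatever is left when we return

	watchCtx, cancel := context.WithCancel(context.Background())
	defer cancel()
	clientDone := make(chan struct{})
	go func() {
		defer close(clientDone)
		req, err := http.NewRequestWithContext(watchCtx, http.MethodGet, "http://"+ln.Addr().String(), nil)
		if err != nil {
			return
		}
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
		}
	}()

	// Wait until the request is actually pending on the server.
	select {
	case <-started:
	case <-clientDone:
		return errors.New("watch request ended before reaching the server")
	}

	if cancelFirst {
		cancel()     // the analogue of canceling the manager context
		<-clientDone // the client has dropped its connection
	}

	ctx, stop := context.WithTimeout(context.Background(), stopTimeout)
	defer stop()
	return srv.Shutdown(ctx) // waits for in-flight requests to finish
}

// stopWithPendingWatch shuts down while the request is still pending.
func stopWithPendingWatch() error { return shutdownWithWatch(false, 200*time.Millisecond) }

// stopAfterCancel cancels the request first and then shuts down.
func stopAfterCancel() error { return shutdownWithWatch(true, 2*time.Second) }

func main() {
	fmt.Println("pending watch: ", stopWithPendingWatch())
	fmt.Println("canceled first:", stopAfterCancel())
}
```

In this sketch the first call runs into the shutdown deadline because the pending request is never answered, while the second one returns without error because the client went away before the shutdown started. That corresponds to method 1 above; I have not checked how closely kube-apiserver's own shutdown path matches this.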

jsanda added a commit to jsanda/k8ssandra-operator that referenced this issue Oct 18, 2021
burmanm pushed a commit to burmanm/k8ssandra-operator that referenced this issue Oct 18, 2021
jsanda added a commit to k8ssandra/k8ssandra-operator that referenced this issue Oct 18, 2021
* Add multigroup structure to apis

Move controllers to multigroup struct, refactor some testing

Fix create-clientconfig

Rename apis/core to apis/k8ssandra

* fix api server timeouts during shutdown

Need to cancel the context. It is discussed in
kubernetes-sigs/controller-runtime#1571 (comment).

Co-authored-by: John Sanda <john.sanda@gmail.com>
JohnStrunk added a commit to JohnStrunk/volsync that referenced this issue Nov 8, 2021
This cancels the context to cause the API server to properly shut down
at the conclusion of the tests.

Ref: kubernetes-sigs/controller-runtime#1571
Ref: kubernetes-sigs/kubebuilder#2379

Signed-off-by: John Strunk <jstrunk@redhat.com>
@Aisuko

Aisuko commented Jan 5, 2022

Hi, guys. I hit the same issue here with 1.22.1-darwin-amd64.

I have tried adding cancel() to AfterSuite(), but it does not seem to solve the issue.

In my situation it looks like the watch request is left pending by the server.

var _ = AfterSuite(func() {
	// https://github.com/kubernetes-sigs/controller-runtime/issues/1571
	cancel()
	By("tearing down the test environment,but I do nothing here.")
	err := testEnv.Stop()
	Expect(err).NotTo(HaveOccurred())
})
W0105 18:10:20.398854   63353 reflector.go:441] pkg/mod/k8s.io/client-go@v0.22.1/tools/cache/reflector.go:167: watch of *v1.ReservedIP ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
STEP: tearing down the test environment,but I do nothing here.

@Aisuko

Aisuko commented Jan 5, 2022

I know this is not a real solution, but if anyone hits the same issue as me, this may work for you. I tried the solution above, but moving testEnv.Stop from AfterSuite to BeforeSuite may cause killing the kube-apiserver process to fail.

var _ = AfterSuite(func() {
	// https://github.com/kubernetes-sigs/controller-runtime/issues/1571
	cancel()
	By("tearing down the test environment,but I do nothing here.")
	err := testEnv.Stop()
	// 4 seconds is an arbitrary value
	if err != nil {
		time.Sleep(4 * time.Second)
	}
	err = testEnv.Stop()
	Expect(err).NotTo(HaveOccurred())
})

SchSeba added a commit to SchSeba/sriov-network-operator-1 that referenced this issue Jul 28, 2022
issue: kubernetes-sigs/controller-runtime#1571
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
SchSeba added a commit to SchSeba/sriov-network-operator that referenced this issue Aug 2, 2022
issue: kubernetes-sigs/controller-runtime#1571
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
SchSeba added a commit to SchSeba/sriov-network-operator that referenced this issue Aug 24, 2022
issue: kubernetes-sigs/controller-runtime#1571
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
stanistan pushed a commit to cashapp/cmmc that referenced this issue Jan 5, 2023
- go@1.19
- hermit
- linter (and configuration so things pass)
- kubebuilder Makefile (so it's consistent with changes)
- fixed tests kubernetes-sigs/controller-runtime#1571
- controller-runtime@0.11.1
SchSeba added a commit to SchSeba/sriov-network-operator that referenced this issue Jan 16, 2023
issue: kubernetes-sigs/controller-runtime#1571
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
vassilismourikis pushed a commit to web-servers/jws-operator that referenced this issue Jan 17, 2023
@psaia

psaia commented Feb 20, 2023

I know this is not a real solution, but if anyone hits the same issue as me, this may work for you. I tried the solution above, but moving testEnv.Stop from AfterSuite to BeforeSuite may cause killing the kube-apiserver process to fail.

var _ = AfterSuite(func() {
	// https://github.com/kubernetes-sigs/controller-runtime/issues/1571
	cancel()
	By("tearing down the test environment,but I do nothing here.")
	err := testEnv.Stop()
	// 4 seconds is an arbitrary value
	if err != nil {
		time.Sleep(4 * time.Second)
	}
	err = testEnv.Stop()
	Expect(err).NotTo(HaveOccurred())
})

For the impatient:

	err := (func() (err error) {
		// Need to sleep if the first stop fails due to a bug:
		// https://github.com/kubernetes-sigs/controller-runtime/issues/1571
		sleepTime := 1 * time.Millisecond
		for i := 0; i < 12; i++ { // Exponentially sleep up to ~4s
			if err = testEnv.Stop(); err == nil {
				return
			}
			sleepTime *= 2
			time.Sleep(sleepTime)
		}
		return
	})()
	Expect(err).NotTo(HaveOccurred())

RyanMillerC added a commit to RyanMillerC/cat-facts-operator that referenced this issue Feb 28, 2023
The test works fine but for some reason the AfterSuite tear down
function isn't working as expected. It times out after trying to shut
down the API server. The test comes back as passed but because the API
server won't shut down in time, `make test` fails.

kubernetes-sigs/controller-runtime#1571
justinsb added a commit to justinsb/k8s-config-connector that referenced this issue Jan 10, 2024
As reported in
kubernetes-sigs/controller-runtime#1571, not
doing this can cause timeout errors waiting for kube-apiserver to
stop.
@UgurTheG

In case someone still encounters this problem, setting ControlPlaneStopTimeout manually helped me:

	testEnv = &envtest.Environment{
		CRDDirectoryPaths: []string{
			filepath.Join("..", "..", "config", "crd", "bases"),
		},
		ErrorIfCRDPathMissing:   true,
		ControlPlaneStopTimeout: 1 * time.Minute, // Or any duration that suits your tests
	}
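
As a rough picture of why a longer stop timeout can turn the error into a clean stop, here is a sketch of a bounded wait. It is not envtest's implementation: waitForStop, slowProcess and stopsWithin are made-up names, and a goroutine that closes a channel after a delay stands in for the kube-apiserver process finishing its shutdown.

```go
package main

import (
	"fmt"
	"time"
)

// waitForStop blocks until exited is closed or timeout elapses, whichever
// comes first. name is used only in the error message.
func waitForStop(name string, exited <-chan struct{}, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-exited:
		return nil
	case <-timer.C:
		return fmt.Errorf("timeout waiting for process %s to stop", name)
	}
}

// slowProcess returns a channel that is closed after d, imitating a process
// that needs d to complete its graceful shutdown.
func slowProcess(d time.Duration) <-chan struct{} {
	exited := make(chan struct{})
	go func() {
		time.Sleep(d)
		close(exited)
	}()
	return exited
}

// stopsWithin reports whether a process that needs 100ms to exit is seen as
// stopped when the caller waits at most timeoutMs milliseconds.
func stopsWithin(timeoutMs int) bool {
	exited := slowProcess(100 * time.Millisecond)
	return waitForStop("kube-apiserver", exited, time.Duration(timeoutMs)*time.Millisecond) == nil
}

func main() {
	fmt.Println(stopsWithin(10), stopsWithin(1000))
}
```

A larger timeout only helps when the process does exit eventually; if a pending request keeps it alive indefinitely, canceling the manager context as described earlier in the thread is still needed.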
