Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graceful shutdown for Nginx #1257

Merged
merged 1 commit into from
Aug 29, 2017
Merged

Conversation

maxlaverse
Copy link
Contributor

@maxlaverse maxlaverse commented Aug 28, 2017

Issue

Nginx is not being gracefully shut down when a Nginx Ingress Controller container is stopped, leading to broken connections. It looks like the SIGTERM sent to the controller to stop it is simply forwarded to the Nginx server.

The problem is that SIGTERM means "Exit as quick as you can" for Nginx (Nginx doc. For a graceful shutdown we should rather send a SIGQUIT.

A solution to this issue is to prevent the Nginx Ingress Controller to forward all the signals it gets to Nginx, allowing us to control what's sent to the Nginx process. When shutting down, the controller would then send a SIGQUIT to Nginx. Nginx itself will wait 10secondes (worker_shutdown_timeout) for the worker to shutdown.

Testing

You need a backend exposed on the Ingress that responds slowly. I used one with a 10 secondes response time. Trigger an Ingress deployment and call this slow URL. If your slow URL get an answer before the Nginx is tear down, re-execute it.

We expected this request to be answered to. You should see in the Ingress logs that it's not immediately exiting but waiting for the request to finish.
Additionally, you can strace the Nginx master process to make sure SIGQUIT is not being sent instead of SIGTERM.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 28, 2017
@k8s-reviewable
Copy link

This change is Reviewable

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 28, 2017
@aledbf
Copy link
Member

aledbf commented Aug 28, 2017

@maxlaverse I am not sure we should add this to the interface. We cannot assume the implementations are running a process.
This is the point where we handle the SIGTERM
https://github.com/kubernetes/ingress/blob/master/controllers/nginx/pkg/cmd/controller/main.go#L34

// TODO: If we keep the modified Controller interface
// should this be moved as a generic timeout control for the backend to stop
// on time ?
timer := time.NewTimer(60 * time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we cannot rely using a timer because if the user sets terminationGracePeriodSeconds: 5 this will not work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. The value should then be configurable.

While running more tests this morning I realized that Nginx has a timeout already kicking in for that scenario:
worker_shutdown_timeout]

This might not be needed at all

@aledbf
Copy link
Member

aledbf commented Aug 28, 2017

@maxlaverse also please check https://github.com/kubernetes/ingress/blob/master/controllers/nginx/pkg/cmd/controller/nginx.go#L179
We already have some code to start a new master process if it dies

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 43.479% when pulling 79a3195 on maxlaverse:graceful_shutdown into a8bfdc2 on kubernetes:master.

@aledbf
Copy link
Member

aledbf commented Aug 28, 2017

@maxlaverse please test quay.io/aledbf/nginx-ingress-controller:0.198
This image contains #1258

@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 29, 2017
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 29, 2017
@maxlaverse maxlaverse force-pushed the graceful_shutdown branch 2 times, most recently from 3d0dbcf to f8c4d63 Compare August 29, 2017 11:44
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Aug 29, 2017
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 43.734% when pulling f8c4d6373506effa1f242e18cfef627725092861 on maxlaverse:graceful_shutdown into 7844415 on kubernetes:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 43.734% when pulling f8c4d6373506effa1f242e18cfef627725092861 on maxlaverse:graceful_shutdown into 7844415 on kubernetes:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 43.709% when pulling f8c4d6373506effa1f242e18cfef627725092861 on maxlaverse:graceful_shutdown into 7844415 on kubernetes:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 43.734% when pulling f8c4d6373506effa1f242e18cfef627725092861 on maxlaverse:graceful_shutdown into 7844415 on kubernetes:master.

@maxlaverse
Copy link
Contributor Author

maxlaverse commented Aug 29, 2017

@aledbf I change the description of this PR to explain how I test it. I pushed some changes to main.go I was thinking of after I had read your first comment.

I had a look at your pull-request and made some comments. It's not working because of the way we detect that Nginx has stopped after we gracefully ask it to. Checking if it's answering is probably not an option as it's not accepting incoming connections.

I'm not sure what's the best way to disable the automatic restart process and get notified about Nginx having being stopped. Except if we find another way to detect that Nginx is properly shut down ?

@aledbf
Copy link
Member

aledbf commented Aug 29, 2017

Except if we find another way to detect that Nginx is properly shut down ?

We can use https://github.com/mitchellh/go-ps to check if there's a nginx process running.

@aledbf
Copy link
Member

aledbf commented Aug 29, 2017

Merging. I will fix the dependencies issue in another PR

@aledbf
Copy link
Member

aledbf commented Aug 29, 2017

@maxlaverse thanks!

@aledbf aledbf merged commit d4e86fe into kubernetes:master Aug 29, 2017
@maxlaverse maxlaverse deleted the graceful_shutdown branch August 29, 2017 19:29
@aledbf aledbf mentioned this pull request Aug 29, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants