
Zero downtime upgrade #6928

Open
kamilzzz opened this issue Mar 4, 2021 · 32 comments
Labels
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/support: Categorizes issue or PR as a support question.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@kamilzzz

kamilzzz commented Mar 4, 2021

The docs (https://github.com/kubernetes/ingress-nginx/tree/master/charts/ingress-nginx#upgrading-with-zero-downtime-in-production) mention that, by default, an upgrade causes a service interruption because the pods are restarted, and they link to another article explaining how to fix that.

But when looking at values.yaml I can see that both terminationGracePeriodSeconds and the preStop hook are already set correctly in order to drain connections:

terminationGracePeriodSeconds: 300

lifecycle:
  preStop:
    exec:
      command:
        - /wait-shutdown

Is the documentation a bit outdated, or is there something else I should set to get near-zero-downtime upgrades?
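
For reference, a minimal Helm values override that pins those two settings explicitly might look like the sketch below (the controller: nesting follows the chart's values.yaml layout quoted above; the filename is only an example):

# values-override.yaml (example name)
controller:
  # Total grace period Kubernetes allows (preStop hook + SIGTERM handling) before SIGKILL.
  terminationGracePeriodSeconds: 300
  lifecycle:
    preStop:
      exec:
        # Default hook shipped with the chart; waits for NGINX to drain connections.
        command:
          - /wait-shutdown

It would be applied with something like helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx -f values-override.yaml (assuming both the release and the Helm repo alias are named ingress-nginx).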

@kamilzzz kamilzzz added the kind/support Categorizes issue or PR as a support question. label Mar 4, 2021
@thirdeyenick

I was actually wondering about the same thing, but when looking at the command that gets executed ("wait-shutdown") and the corresponding code, I noticed that a SIGTERM signal still gets sent. The linked blog post says that a SIGQUIT signal needs to be used for a graceful shutdown. I wonder if changing the signal sent by the wait-shutdown command would already lead to zero-downtime deployments?

@gamunu

gamunu commented May 23, 2021

@thirdeyenick the SIGTERM is for the NGINX controller process, not NGINX itself. The controller issues a SIGQUIT to the NGINX process.

cmd := n.command.ExecCommand("-s", "quit")

@thirdeyenick

Thank you very much for the information. This means that zero downtime deployments are already implemented, right?

@Baklap4

Baklap4 commented Jun 11, 2021

This means that zero downtime deployments are already implemented, right?

I'd like to see a confirmation from the team, but according to the linked sources it should already be implemented, I think...

Maybe open a pull request to remove it from the README if that's the case.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 9, 2021
@jontro

jontro commented Sep 21, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 21, 2021
@svonliebenstein

Does anyone know the answer to @Baklap4's question? It would be great to update the README if that is the case. This is especially useful for everyone deciding whether to run this controller in production environments.

@longwuyuan
Contributor

@jontro

jontro commented Oct 11, 2021

@longwuyuan I'm not sure this answers the question. According to the README there is a possibility of disruption when doing an upgrade. However, this might not be the case and the doc just needs to be updated.

@kfox1111

There may be other factors at play. Say, for example, you are exposing ingress-nginx outside the cluster via svc.type=LoadBalancer. If you are also using MetalLB, the migration of the VIP can cause connection issues.

@longwuyuan
Contributor

There are fewer eyes here than on Slack, so I suggest closing this issue and discussing it at kubernetes.slack.com in the ingress-nginx-users channel. There are seasoned developers and users there. This is a vast, complicated topic that cannot be easily triaged on GitHub.

@rouke-broersma

Why would this be closed? The question is whether the current documentation is up to date or not, because it looks like the documentation on zero-downtime upgrades is outdated and the required changes have since been incorporated into the chart.

@longwuyuan
Contributor

I think I listed all the reasons I could think of in the earlier comment, but apologies if that does not seem like a great idea; tracking this in one place would help. This is not a trivial matter, and if one can fathom the intricacies, then an educated guess is also not out of reach. To begin with, a change in an Ingress object's configuration can require a reload of the controller pod's nginx.conf. Some more elaborate text is at the link I posted earlier.

Keep this issue open if that seems to work well. Just note that there are many more developers and engineers on Slack, so the likelihood of getting comments on this is higher there.

@rouke-broersma

If the current docs are not sufficient, then perhaps they should be amended with a suggestion to carefully consider other facets of upgrades and to consult with experienced engineers on Slack.

@jontro

jontro commented Oct 12, 2021

@longwuyuan I have posted on Slack and am happy to discuss there. However, the README refers to https://medium.com/codecademy-engineering/kubernetes-nginx-and-zero-downtime-in-production-2c910c6a5ed8. All of those steps already seem to be incorporated into the main chart. There might be other factors in play, but at least that problem is solved.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 10, 2022
@jontro

jontro commented Jan 10, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 10, 2022
@strongjz
Member

/assign @theunrealgeek

@k8s-ci-robot
Contributor

@strongjz: GitHub didn't allow me to assign the following users: theunrealgeek.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @theunrealgeek

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@iamNoah1
Contributor

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Apr 12, 2022
@heylongdacoder

heylongdacoder commented May 30, 2022

Sharing some information: by default, when the controller receives SIGTERM, it directly issues a SIGQUIT to the NGINX process without waiting (https://github.com/kubernetes/ingress-nginx/blob/main/internal/ingress/controller/nginx.go#L373). I set --shutdown-grace-period to 15 and the downtime seems to be reduced.
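
For anyone who wants to try the same flag through the Helm chart, here is a sketch of how it could be passed (assuming the chart's controller.extraArgs passthrough; 15 is just the value used above):

controller:
  extraArgs:
    # Seconds the controller waits after receiving SIGTERM before it sends
    # SIGQUIT to the NGINX process.
    shutdown-grace-period: 15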

@mtparet
Contributor

mtparet commented May 30, 2022

Yes @heylongdacoder, and I think this value should default to at least 5s to mitigate the common cases. The documentation should indicate that if an external load balancer is directly connected to the pods, the value should be increased to allow external draining before terminating NGINX.

@mtparet
Contributor

mtparet commented Jun 1, 2022

On an AWS load balancer, this must be 120s according to AWS support. We have not had any issues since we set this value.
kubernetes-sigs/aws-load-balancer-controller#2366 (comment)
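
A sketch of how that could look for an externally connected load balancer that needs roughly 120s of draining (the numbers come from the comments above and are assumptions, not chart defaults):

controller:
  # Must comfortably exceed shutdown-grace-period plus the time NGINX needs
  # to finish in-flight requests after SIGQUIT.
  terminationGracePeriodSeconds: 300
  extraArgs:
    # Keep the old pod accepting traffic while the external load balancer
    # deregisters and drains it.
    shutdown-grace-period: 120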

@iamNoah1
Contributor

/help

@k8s-ci-robot
Contributor

@iamNoah1:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jun 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 12, 2022
@jontro

jontro commented Sep 13, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 13, 2022
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 8, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Ghilteras

Ghilteras commented Jul 13, 2023

@thirdeyenick the SIGTERM is for the NGINX controller process, not NGINX itself. The controller issues a SIGQUIT to the NGINX process.
ingress-nginx/internal/ingress/controller/nginx.go
Line 397 in 59a7f51
cmd := n.command.ExecCommand("-s", "quit")

It doesn't look like this happens, though. With this wait-shutdown the container dies immediately while still receiving traffic, which is not graceful at all; we had to change the preStop hook to sleep 30s to make all the upstream errors go away.
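
For anyone considering the same workaround, overriding the chart's default hook with a plain sleep would look roughly like this (a workaround sketch, not a chart default; 30s is the value mentioned above, and it assumes a shell is available in the controller image):

controller:
  lifecycle:
    preStop:
      exec:
        # Replaces the default /wait-shutdown hook; delays SIGTERM by 30s so
        # upstream load balancers have time to stop routing to this pod.
        command:
          - /bin/sh
          - -c
          - sleep 30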

@mosoka

mosoka commented Feb 13, 2024

If someone is still looking for a simple solution to achieve an upgrade without any downtime, these parameters are working perfectly for us:

controller:
  minReadySeconds: 10
  extraArgs:
    shutdown-grace-period: 30

Also, if you're using an injected Linkerd proxy, you'll need:

controller:
  podAnnotations:
    config.alpha.linkerd.io/proxy-wait-before-exit-seconds: 30

Without these set there was always some minor downtime during upgrade.

@dmotte

dmotte commented Aug 27, 2024

@mosoka it worked!!! I spent hours trying to find a solution to this annoying downtime issue, and yours is the only thing that worked for me. Thank you so much, you made my day 😄
