Implement delayed termination to achieve zero downtime upgrades #1159

Merged: 4 commits merged Oct 20, 2023

@pleshakov (Contributor) commented Oct 18, 2023

Proposed changes

Problem:
During an upgrade of NGF, external clients can experience downtime.

Solution:

  • Introduce configurable delayed termination.
    • Add a sleep subcommand to the gateway binary.
    • Add lifecycle parameters to the Helm chart for both the nginx-gateway
      and nginx containers.
    • Add a terminationGracePeriodSeconds parameter to the Helm chart.
    • Add an affinity parameter to the Helm chart (primarily needed for
      testing, to prevent pods from running on the same node).
  • Rerun the zero-downtime non-functional tests.

Testing:

  • Manual testing

Closes #1155

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

@github-actions bot added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels Oct 18, 2023
Resolved review thread: cmd/gateway/commands.go (outdated)
Resolved review thread: cmd/gateway/commands.go (outdated)
Resolved review thread: deploy/helm-chart/values.yaml
Resolved review thread: deploy/helm-chart/values.yaml
Resolved review thread: cmd/gateway/commands_test.go (outdated)
@pleshakov pleshakov marked this pull request as ready for review October 19, 2023 21:00
@pleshakov pleshakov requested a review from a team as a code owner October 19, 2023 21:00
@bjee19 (Contributor) commented Oct 19, 2023

Do you think it would be a good idea to set a maximum sleep duration?

Resolved review thread: docs/cli-help.md (outdated)

@kate-osborn (Contributor) left a comment

nice!

@ciarams87 (Member) left a comment

🚀

@pleshakov (Contributor, Author)

@bjee19

Do you think it would be a good idea to set a maximum sleep duration?

It is hard to come up with a reasonable maximum.
It cannot exceed the configured termination grace period, but determining that period dynamically is not simple and not really worth it.
Also, the sleep command we use for the nginx container doesn't seem to have a maximum, so limiting one without the other doesn't make sense.

@pleshakov pleshakov requested a review from bjee19 October 20, 2023 14:42
…al.md

Co-authored-by: Saylor Berman <s.berman@f5.com>
@bjee19 (Contributor) left a comment

🚀

@pleshakov pleshakov merged commit 29b45e3 into nginxinc:main Oct 20, 2023
23 checks passed
@pleshakov pleshakov deleted the enh/zero-downtime-upgrade branch October 20, 2023 17:36
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)
Projects: Status: Done
Development: successfully merging this pull request may close the issue "Implement delayed termination to achieve zero downtime upgrades".
5 participants