
nginx-ingress pod crashloops if it cannot re-download the GeoIP2 database #8059

Closed
jsalatiel opened this issue Dec 20, 2021 · 24 comments
Labels
  • kind/support: Categorizes issue or PR as a support question.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-priority
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@jsalatiel

NGINX Ingress controller version : 1.1.0

Kubernetes version : v1.21.7

Environment:

  • Cloud provider or hardware configuration: BareMetal
  • OS (e.g. from /etc/os-release): Ubuntu
  • Kernel (e.g. uname -a): 5.4.0-91-generic
  • How was the ingress-nginx-controller installed: Helm chart ingress-nginx 4.0.13, controller 1.1.0

What happened:
If, for any reason, the nginx-ingress-controller fails during pod start to download the GeoIP2 database from https://download.maxmind.com/app/geoip_download..., the pod will crashloop even if a previously downloaded database is already present in /etc/nginx/geoip on a persistent volume.

What you expected to happen:
It should fail to start only if it cannot download the GeoIP2 database AND it cannot find a previously downloaded database in /etc/nginx/geoip. As it stands, any problem with the MaxMind website prevents the ingresses from starting.
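Expressed as code, the ask is roughly the following. This is a minimal sketch, not the controller's actual code; the directory, file pattern, and function names are illustrative assumptions only.

package main

import (
    "log"
    "path/filepath"
)

const geoIPDir = "/etc/nginx/geoip"

// downloadGeoIP2 stands in for the controller's real download step.
func downloadGeoIP2() error { return nil }

// hasExistingDatabases reports whether any .mmdb files from a previous run
// are already present, e.g. on a persistent volume.
func hasExistingDatabases(dir string) bool {
    matches, err := filepath.Glob(filepath.Join(dir, "*.mmdb"))
    return err == nil && len(matches) > 0
}

func main() {
    if err := downloadGeoIP2(); err != nil {
        if hasExistingDatabases(geoIPDir) {
            // A database from a previous run exists: warn and keep starting up.
            log.Printf("GeoIP2 download failed (%v); using existing databases in %s", err, geoIPDir)
        } else {
            // No database at all: fail hard, as the controller does today.
            log.Fatalf("GeoIP2 download failed and no databases found in %s: %v", geoIPDir, err)
        }
    }
    // ... rest of controller startup ...
}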

How to reproduce it:

  1. Deploy nginx-ingress to use a persistent volume mounted at /etc/nginx/geoip/ and also set the maxmindLicenseKey.
  2. After the first successful run, the GeoIP2 databases will be in /etc/nginx/geoip/.
  3. Cut access to download.maxmind.com using a firewall to simulate a failure on their website
  4. Delete the nginx ingress pod to force it to be recreated. It will never start until it can connect to download.maxmind.com.

/kind bug

@jsalatiel jsalatiel added the kind/bug Categorizes issue or PR as related to a bug. label Dec 20, 2021
@k8s-ci-robot
Contributor

@jsalatiel: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Dec 20, 2021
@longwuyuan
Contributor

longwuyuan commented Dec 22, 2021

/remove-kind bug
/kind support

Please add more information. It will help to elaborate on why it is a bug in the ingress-nginx controller if the download of the controller image, or of a dependency like the GeoIP DB, is broken due to a firewall.

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Dec 22, 2021
@jsalatiel
Author

The container image is downloaded and run, but the logs show the GeoIP DB download failing, so the container never reaches the ready state.

@longwuyuan
Contributor

If your firewall is blocking the download, it's obviously not a bug in the controller. Can you elaborate on what you want the controller to do if your firewall is blocking the download of the GeoIP DB?

@jsalatiel
Author

Hi @longwuyuan, sorry, I probably did not express myself the right way. I used the firewall example just to show how the bug can be reproduced; I have no firewall blocking that.
The real problem is: in the last couple of months, download.maxmind.com was down three times while I was doing maintenance on my cluster, and because of that the nginx controller was not able to start EVEN though it had already downloaded the databases on a previous run and saved them on a persistent volume. If a previous database exists AND nginx cannot update those databases on container start, that should not be fatal, IMHO.

@longwuyuan
Contributor

longwuyuan commented Dec 24, 2021 via email

@longwuyuan
Contributor

@theunrealgeek, have you worked with MaxMind GeoIP DB internals before?

@theunrealgeek
Contributor

> have you worked with MaxMind GeoIP DB internals before?

Not really with the internals of the MaxMind DB itself.

But I think the problem here is somewhere else, probably in the controller code that performs the initial download before nginx starts up. I haven't looked at that part of the code yet.

@longwuyuan
Contributor

longwuyuan commented Dec 25, 2021 via email

@jsalatiel
Author

I don't think running with a potentially outdated GeoIP database is really a problem. If that database is on the persistent volume, it is the one that was actively in use before the nginx pod was restarted or killed. Also, since a database is there, any nginx rules relying on GeoIP will keep working. And this is much better than being unable to recover your cluster after a node failure (without disabling the GeoIP download) when MaxMind (an external site) is offline. Of course, this is also just my opinion.
I can't code, so I am unable to submit a PR.

@longwuyuan
Contributor

longwuyuan commented Dec 25, 2021 via email

@NeckBeardPrince

NeckBeardPrince commented Jan 25, 2022

Running into this same issue when running locally for dev after upgrading from 0.45.0 to 1.1.1.

This issue started in release 0.49.0 (https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v0.49.0). Issue #7242 looks like it may be the cause.

0.45.0 - Locally, we define a dummy MaxMind license. You can see that although nginx logged an error, it didn't crash, and it continued to boot without problems.

flags.go:310] "downloading maxmind GeoIP2 databases"
flags.go:312] "unexpected error downloading GeoIP2 database" err="HTTP status 401 Unauthorized"
store.go:886] The GeoIP2 feature is enabled but the databases are missing. Disabling

1.1.1 - Same setup and configuration, just an updated version. The pod crashes and keeps looping. If I remove --maxmind-license-key, it starts without issue.

I0125 20:17:01.611725       8 flags.go:344] "downloading maxmind GeoIP2 databases"
E0125 20:17:01.789163       8 flags.go:346] "unexpected error downloading GeoIP2 database" err="HTTP status 401 Unauthorized"
F0125 20:17:01.789425       8 main.go:67] HTTP status 401 Unauthorized
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
    k8s.io/klog/v2@v2.9.0/klog.go:1026 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x28a9ac0, 0x3, {0x0, 0x0}, 0xc0004138f0, 0x1, {0x1f86a9b, 0xc000062400}, 0xc00001ba70, 0x0)
    k8s.io/klog/v2@v2.9.0/klog.go:975 +0x63d
k8s.io/klog/v2.(*loggingT).printDepth(0x1, 0x1, {0x0, 0x0}, {0x0, 0x0}, 0x445131, {0xc00001ba70, 0x1, 0x1})
    k8s.io/klog/v2@v2.9.0/klog.go:735 +0x1ba
k8s.io/klog/v2.(*loggingT).print(...)
    k8s.io/klog/v2@v2.9.0/klog.go:717
k8s.io/klog/v2.Fatal(...)
    k8s.io/klog/v2@v2.9.0/klog.go:1494
main.main()
    k8s.io/ingress-nginx/cmd/nginx/main.go:67 +0x1d3

goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
    k8s.io/klog/v2@v2.9.0/klog.go:1169 +0x6a
created by k8s.io/klog/v2.init.0
    k8s.io/klog/v2@v2.9.0/klog.go:420 +0xfb

goroutine 15 [IO wait]:
internal/poll.runtime_pollWait(0x7f2f773eb530, 0x72)
    runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc00015eb00, 0xc0000de480, 0x0)
    internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
    internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00015eb00, {0xc0000de480, 0x1a52, 0x1a52})
    internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc00015eb00, {0xc0000de480, 0xc0000de57f, 0x1a})
    net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc000418028, {0xc0000de480, 0x417186, 0xc0004cd7f8})
    net/net.go:183 +0x45
crypto/tls.(*atLeastReader).Read(0xc0002b20c0, {0xc0000de480, 0x0, 0x409dcd})
    crypto/tls/conn.go:777 +0x3d
bytes.(*Buffer).ReadFrom(0xc00003c5f8, {0x1b70980, 0xc0002b20c0})
    bytes/buffer.go:204 +0x98
crypto/tls.(*Conn).readFromUntil(0xc00003c380, {0x1b73000, 0xc000418028}, 0x1958)
    crypto/tls/conn.go:799 +0xe5
crypto/tls.(*Conn).readRecordOrCCS(0xc00003c380, 0x0)
    crypto/tls/conn.go:606 +0x112
crypto/tls.(*Conn).readRecord(...)
    crypto/tls/conn.go:574
crypto/tls.(*Conn).Read(0xc00003c380, {0xc0000fe000, 0x1000, 0xc000402b60})
    crypto/tls/conn.go:1277 +0x16f
bufio.(*Reader).Read(0xc000396e40, {0xc0004c88f8, 0x9, 0xc0003fede0})
    bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x1b707e0, 0xc000396e40}, {0xc0004c88f8, 0x9, 0x9}, 0x9)
    io/io.go:328 +0x9a
io.ReadFull(...)
    io/io.go:347
net/http.http2readFrameHeader({0xc0004c88f8, 0x9, 0xc000094db0}, {0x1b707e0, 0xc000396e40})
    net/http/h2_bundle.go:1555 +0x6e
net/http.(*http2Framer).ReadFrame(0xc0004c88c0)
    net/http/h2_bundle.go:1813 +0x95
net/http.(*http2clientConnReadLoop).run(0xc0004cdf98)
    net/http/h2_bundle.go:8608 +0x130
net/http.(*http2ClientConn).readLoop(0xc000492480)
    net/http/h2_bundle.go:8531 +0x6f
created by net/http.(*http2Transport).newClientConn
    net/http/h2_bundle.go:7325 +0xb85

@kd7lxl
Contributor

kd7lxl commented Jan 26, 2022

The current behavior is to fatal when a Maxmind license is defined and the database cannot be downloaded. There are various reasons this is desirable, such as compliance requirements that certain countries be blocked, etc.

One way to work around a Maxmind outage is to run a Maxmind mirror or caching proxy for your organization. This is recommended as a best practice by Maxmind to minimize load on their infrastructure. It works well for us.
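As a rough illustration of that approach, here is a minimal sketch of a mirror that forwards GeoIP2 download requests to MaxMind, keeps the last successful response on disk, and serves that copy when MaxMind is unreachable. The cache directory, port, and cache-by-filename behavior are simplifying assumptions, and how you point the controller at the mirror is out of scope here.

package main

import (
    "io"
    "log"
    "net/http"
    "os"
    "path/filepath"
)

const (
    upstream = "https://download.maxmind.com"
    cacheDir = "/var/cache/geoip-mirror"
)

func handler(w http.ResponseWriter, r *http.Request) {
    // Simplification: the cache key is just the last path element; a real
    // mirror would also key on the edition_id query parameter.
    cachePath := filepath.Join(cacheDir, filepath.Base(r.URL.Path))

    resp, err := http.Get(upstream + r.URL.RequestURI())
    if err == nil {
        defer resp.Body.Close()
        if resp.StatusCode == http.StatusOK {
            f, ferr := os.Create(cachePath)
            if ferr != nil {
                log.Printf("cannot update cache: %v", ferr)
                io.Copy(w, resp.Body)
                return
            }
            defer f.Close()
            // Refresh the cached copy while streaming it to the client.
            io.Copy(io.MultiWriter(w, f), resp.Body)
            return
        }
    }

    // Upstream failed or returned an error: serve the last cached copy, if any.
    log.Printf("upstream unavailable for %s, serving cached copy", r.URL.Path)
    http.ServeFile(w, r, cachePath)
}

func main() {
    if err := os.MkdirAll(cacheDir, 0o755); err != nil {
        log.Fatal(err)
    }
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}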

Another solution could be some way for you to declare that download failures are allowed in your systems, or allowed only when a file is already present. Maybe a new config option could be added to allow you to define this policy.
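To make that idea concrete, a sketch of how such a policy might look follows. No flag like this exists in ingress-nginx today; the flag name and policy values are invented purely for illustration.

package main

import (
    "flag"
    "log"
    "path/filepath"
)

// downloadGeoIP2 stands in for the controller's real download step.
func downloadGeoIP2() error { return nil }

func main() {
    // Hypothetical flag: fatal (today's behavior) | allow | allow-if-cached.
    policy := flag.String("maxmind-download-failure-policy", "fatal",
        "what to do when the GeoIP2 download fails")
    flag.Parse()

    if err := downloadGeoIP2(); err != nil {
        cached, _ := filepath.Glob("/etc/nginx/geoip/*.mmdb")
        switch {
        case *policy == "allow",
            *policy == "allow-if-cached" && len(cached) > 0:
            log.Printf("GeoIP2 download failed, continuing per policy %q: %v", *policy, err)
        default:
            log.Fatalf("GeoIP2 download failed: %v", err)
        }
    }
}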

@NeckBeardPrince

NeckBeardPrince commented Jan 26, 2022

> The current behavior is to fatal when a Maxmind license is defined and the database cannot be downloaded. There are various reasons this is desirable, such as compliance requirements that certain countries be blocked, etc.

I don't believe that; when the database could not be downloaded in 0.45.0, nginx would disable Maxmind without a fatal.

This "The GeoIP2 feature is enabled but the databases are missing. Disabling" message alone does not support your claim. As I pointed out above, in 0.49.0 a new feature (#7242) to retry Maxmind downloads was added. This feature would output a message saying that it was retrying the download (maxmind.go:114] "download failed on attempt 1"). In the error I posted above, that doesn't even show up.

My post #8059 (comment) does not show a controlled exit of the process, but a full crash and stacktrace. Why would a stacktrace be the desired behavior?

@kd7lxl
Contributor

kd7lxl commented Jan 26, 2022

> The current behavior is to fatal when a Maxmind license is defined and the database cannot be downloaded. There are various reasons this is desirable, such as compliance requirements that certain countries be blocked, etc.
>
> I don't believe that; when the database could not be downloaded in 0.45.0, nginx would disable Maxmind without a fatal.
>
> This "The GeoIP2 feature is enabled but the databases are missing. Disabling" message alone does not support your claim. As I pointed out above, in 0.49.0 a new feature (#7242) to retry Maxmind downloads was added. This feature would output a message saying that it was retrying the download (maxmind.go:114] "download failed on attempt 1"). In the error I posted above, that doesn't even show up.

There was no retry attempt in your test case because 401 Unauthorized is an unrecoverable error. 4xx errors mean the client has done something wrong and needs to fix it. In this case, the license key is wrong. No amount of retrying will fix that, so it doesn't retry this case. This is best practice when writing API clients. The desired and implemented behavior for this case in ingress-nginx versions 1.x is to print an error and exit. Yes, this behavior changed as part of the v0.x to v1.x release.
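For illustration, a minimal sketch of that retry policy follows: transient failures (network errors, 5xx responses) are retried, while 4xx responses such as 401 Unauthorized fail immediately, since no retry can fix a bad license key. This is not the controller's actual code; the function name and retry parameters are assumptions.

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

func downloadWithRetry(url string, attempts int, delay time.Duration) ([]byte, error) {
    var lastErr error
    for i := 1; i <= attempts; i++ {
        resp, err := http.Get(url)
        if err != nil {
            lastErr = err // network-level error: worth retrying
        } else {
            body, readErr := io.ReadAll(resp.Body)
            resp.Body.Close()
            switch {
            case readErr != nil:
                lastErr = readErr
            case resp.StatusCode == http.StatusOK:
                return body, nil
            case resp.StatusCode >= 400 && resp.StatusCode < 500:
                // Client error (e.g. a bad license key): retrying cannot help.
                return nil, fmt.Errorf("HTTP status %s", resp.Status)
            default:
                lastErr = fmt.Errorf("HTTP status %s", resp.Status) // 5xx: retry
            }
        }
        time.Sleep(delay)
    }
    return nil, lastErr
}

func main() {
    _, err := downloadWithRetry("https://download.maxmind.com/app/geoip_download", 3, 2*time.Second)
    if err != nil {
        fmt.Println("download failed:", err)
    }
}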

> My post #8059 (comment) does not show a controlled exit of the process, but a full crash and stacktrace. Why would a stacktrace be the desired behavior?

It does indeed make a controlled call to Fatal(). Your stacktrace indicates this happened on line 67 of main.go, which is easy to read:

if err != nil {
    klog.Fatal(err)
}

Stacktraces are desirable so that you can trace the problem back in the code. Sometimes reading the relevant line of code can help you understand how to change your config to a valid one.

What did you think about my idea to allow configuring a Maxmind DB failure policy?

@jsalatiel
Author

"Another solution could be some way for you to declare that download failures are allowed in your systems, or allowed only when a file is already present. Maybe a new config option could be added to allow you to define this policy."

For me the best solution would be this one.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2022
@NeckBeardPrince

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2022
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2022
@jsalatiel
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 27, 2022
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 24, 2022
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
