
NIC removes all listeners when rejecting a new one on a reserved port #4775

Closed
brad0000 opened this issue Dec 10, 2023 · 5 comments · Fixed by #5205
Labels: backlog (Pull requests/issues that are backlog items), proposal (An issue that proposes a feature request)

brad0000 commented Dec 10, 2023

Describe the bug
When we add a new listener on a reserved port (e.g. 9113), NIC correctly rejects the listener, but it also tears down all existing listeners.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy a TCP service in AKS / NIC with a listener in GlobalConfiguration and a TransportServer
  2. Edit GlobalConfiguration and add a second listener with a reserved port (e.g. 9113); a sketch of the edited resource follows the log entries below

KubeEvent log entry: "GlobalConfiguration XXXXX is invalid and was rejected: spec.listeners[110].port: Forbidden: port 9113 is forbidden"
KubeEvent log entry: "Listener XXXX doesn't exist" - repeated for all working listeners

Expected behavior

In this case I expected NGINX to reject the bad config and revert to the last good config, and the docs suggest this is what should happen:
https://docs.nginx.com/nginx-ingress-controller/configuration/global-configuration/globalconfiguration-resource/#:~:text=the%20Ingress%20Controller%20will%20ignore%20the%20new%20version

Your environment

  • NIC 3.0.2
  • K8s 1.25
  • AKS
  • NGINX open source

Additional context

We're using NGINX Ingress Controller 3.0.2 (NGINX 1.23.3) in AKS on a couple of AKSUbuntu-2204gen2containerd-202309.06.0 nodes. We do regular Helm release installs of a single-tenanted TCP & HTTP service for our customers. We had a P1 issue when we added a listener for a new customer to GlobalConfiguration and set the port number to 9113. NGINX rejected the change because 9113 is reserved for Prometheus, which is fair enough. But in response it immediately deleted all other existing listeners, which broke 100 TransportServers and blocked access to 100 customers. Surely this is not the intended behaviour.
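
For reference, each customer's TransportServer binds to a GlobalConfiguration listener by name, roughly like this (all names, ports and the service are hypothetical, and the apiVersion may differ between NIC versions); this is why an invalidated GlobalConfiguration breaks every TransportServer at once:

```yaml
apiVersion: k8s.nginx.org/v1alpha1
kind: TransportServer
metadata:
  name: customer-a
spec:
  listener:
    name: tcp-customer-a   # must match a listener in the GlobalConfiguration;
    protocol: TCP          # if the GC is invalidated, this lookup fails
  upstreams:
    - name: customer-a-upstream
      service: customer-a-svc
      port: 5353
  action:
    pass: customer-a-upstream
```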

Hi @brad0000 thanks for reporting!

Be sure to check out the docs and the Contributing Guidelines while you wait for a human to take a look at this 🙂

Cheers!

@danielnginx danielnginx assigned danielnginx and jjngx and unassigned danielnginx Dec 11, 2023
@vepatel vepatel assigned vepatel and unassigned jjngx Dec 11, 2023
vepatel (Contributor) commented Dec 11, 2023

Hi @brad0000, the behaviour you saw comes from the fact that, because of the presence of a reserved port, the GlobalConfiguration resource failed structural validation. That invalidated the GC, which, as far as TransportServers are concerned, is equivalent to a non-existing GlobalConfiguration.
The older config is restored only when the IC is already running, structural validation passes, and an error then occurs in the IC's further validation. Hope this makes it clearer.

Now, I do understand the issues this behaviour has caused, and we will be making further improvements here.
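
A hypothetical illustration of that distinction (ports and names are invented, and which check catches what is our assumption, not something the maintainers have confirmed): the reserved port fails the first, structural pass and the whole resource is treated as non-existent, whereas a clash that only the IC's later validation can detect would leave the last good config in place:

```yaml
spec:
  listeners:
    # fails structural validation: 9113 is a reserved port, so the
    # entire GlobalConfiguration is invalidated and every listener
    # disappears as far as TransportServers are concerned
    - name: tcp-customer-b
      port: 9113
      protocol: TCP
    # assumed example of a later-stage failure: each listener is
    # structurally valid on its own, but the pair clashes; per the
    # comment above, the IC would keep the previous good config
    - name: tcp-customer-c
      port: 6000
      protocol: TCP
    - name: tcp-customer-d
      port: 6000
      protocol: TCP
```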

brianehlert (Collaborator) commented:
Thank you for bringing this to our attention @brad0000 and I am terribly sorry for how this was discovered.
No, this is not the experience we want customers to have.

And while the behavior is consistent with how the project processes changes to individual resources, the GlobalConfiguration resource is unique and should be treated as such.

We are definitely taking a serious look at how we validate and apply listeners and will be making some enhancements here.

@brianehlert brianehlert added the proposal An issue that proposes a feature request label Dec 12, 2023
brad0000 (Author) commented:
OK, thanks for the reply @vepatel, but that sounds even stranger to me. If the config fails structural validation then it's clearly and obviously wrong, so surely in that case the IC should fall back to the last-known-good config.

If it's structurally valid but fails comprehensive validation then it's less obviously wrong, and perhaps in that case I could understand the IC not falling back to the last-known-good config. Although even then it seems strange.

Surely no config changes should be applied until all validation passes?

brad0000 (Author) commented:
Thanks @brianehlert, good to hear.

@brianehlert brianehlert added the backlog Pull requests/issues that are backlog items label Jan 2, 2024
@danielnginx danielnginx added this to the v3.5.0 milestone Feb 7, 2024
@danielnginx danielnginx assigned haywoodsh and unassigned vepatel Feb 7, 2024
@danielnginx danielnginx modified the milestones: v3.5.0, v3.6.0 Mar 6, 2024