
Service latency degradation #1905

Closed

Lookyan opened this issue Jan 16, 2018 · 13 comments
Labels
help wanted — Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
priority/important-soon — Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

Lookyan commented Jan 16, 2018

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

NGINX Ingress controller version: 0.9.0 (quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.9.0)

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:48:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:23:29Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: bare-metal
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel:
Linux srv0551 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) x86_64 GNU/Linux
  • Others: calico v2.6.3

What happened:
We use the ingress controller for almost all our services, and one of them carries a large load of about 30k rps. This traffic is handled by three ingress controllers (on three separate physical servers). We are having a problem with the deploy process: whenever a service that shares these ingress controllers updates its pods (endpoints), ingress reloads its configuration, and at the same time latency increases for all network communication on every service that uses these controllers (even from a service to its database, or to localhost). Our monitoring shows a rising number of established connections and increased latency.

What you expected to happen:
Deploying one service shouldn't affect other services.

How to reproduce it (as minimally and precisely as possible):
The easiest way to reproduce it is to deploy a service, point a load-testing tool at it, and then run nginx -t on the ingress controller. Because ingress tests the configuration before every reload, nginx -t is invoked during each service deploy; run it a few hundred times under load and you will see the degradation.
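A minimal repro sketch, assuming wrk as the load generator ("any perf tool" works, per the above); the namespace and pod name are placeholders to substitute for your own setup:

```sh
# Load phase: sustain >5k rps against a service behind the ingress.
wrk -t8 -c512 -d600s http://<ingress-ip>/ &

# Reload phase: run the config test repeatedly, as the controller does
# before every endpoint-triggered reload.
for i in $(seq 1 200); do
  kubectl -n ingress-nginx exec nginx-ingress-controller-<pod-id> -- nginx -t
done
```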
These are our metrics (screenshots, 2018-01-16): latency and established-connections graphs.

Anything else we need to know:
This problem can only be reproduced under heavy load, starting from about 5k rps to a single ingress controller.
I also saw this issue and we have applied this patch, but we still have problems: kubernetes/kubernetes#48358

Lookyan commented Jan 16, 2018

Also, removing the reuseport option from the ingress template slightly improves the situation: latency during reloads now rises from 2-3ms to about 15ms, but we still observe degradation.
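For reference, the config difference is a single listen parameter (a sketch, assuming the stock template of that era). One plausible contributor, not confirmed in this thread: with reuseport each worker accepts on its own socket with its own accept queue, and connections still queued on an old worker's socket are dropped when that socket closes during a reload.

```nginx
# default template: each worker accepts on its own listening socket
listen 80 reuseport;

# the workaround above: drop reuseport so all workers share one socket
listen 80;
```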

aledbf commented Jan 17, 2018

> Also, removing the reuseport option from the ingress template slightly improves the situation

I will add an option to make this configurable via the ConfigMap.

aledbf commented Jan 17, 2018

> This problem can only be reproduced under heavy load, starting from about 5k rps to a single ingress controller.
> I also saw this issue and we have applied this patch, but we still have problems: kubernetes/kubernetes#48358

The only way we can fix this is to remove the need to reload nginx on endpoint changes. Right now the only way to do that is to use Lua, more specifically balancer_by_lua_block.
This is the reason I added Lua support back in the custom nginx image (#1852).
After the next release I will start a POC to test/prove whether Lua can solve this issue once and for all.
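For the curious, a minimal sketch of the idea (assuming OpenResty's lua-resty-core; pick_endpoint() is a hypothetical helper that would read the live endpoint list from a shared dict instead of baking endpoints into nginx.conf):

```nginx
upstream dynamic_backend {
    server 0.0.0.1;  # placeholder; the real peer is chosen in Lua

    balancer_by_lua_block {
        local balancer = require("ngx.balancer")
        -- pick_endpoint() is a hypothetical lookup against endpoint data
        -- kept in sync with Kubernetes outside the nginx config.
        local host, port = pick_endpoint()
        local ok, err = balancer.set_current_peer(host, port)
        if not ok then
            ngx.log(ngx.ERR, "failed to set peer: ", err)
            return ngx.exit(500)
        end
    }
}
```

Because the upstream no longer hard-codes endpoints, endpoint churn becomes a data update rather than a config reload.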

@aledbf added the enhancement, help wanted and priority/important-soon labels on Jan 17, 2018
valeriano-manassero commented Mar 14, 2018

Since PR #2174 will implement the "no reload" feature, I'm asking what the next steps will be.
I mean, this issue is still present if you add virtual hosts or SSL certs while many virtual hosts are running. In the worst case, not only does latency increase, but nginx processes can end up completely stuck, continuing to serve old configurations for a long time (the only fix is to kill the ingress controller container).

We already discussed this issue while I was proposing to implement Lua for virtual hosts and SSL too.

@aledbf @ElvinEfendi were involved in that discussion. What are the next steps?

ElvinEfendi commented

@valeriano-manassero thanks for starting the discussion. IMHO the next immediate step after #2174 gets merged should be coming up with a way to test Lua code. One option here would be to use the current e2e test framework.

After the test process is in place I suggest we focus on the remaining tasks in the following priority:

  1. Matching all the load-balancing functionality Nginx offers
  2. Adding support for dynamic SSL (using certificate_by_lua; a sketch follows this list)
  3. Implementing dynamic virtual-host changes
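For item 2, a minimal sketch of what certificate_by_lua-based dynamic SSL could look like (assuming OpenResty's ngx.ssl module; lookup_cert() is a hypothetical helper returning a DER cert/key pair for the SNI name from a shared dict; the real design was still under discussion):

```nginx
server {
    listen 443 ssl;
    # A placeholder certificate is still required at startup; it is
    # replaced per-handshake below.
    ssl_certificate     /etc/nginx/placeholder.pem;
    ssl_certificate_key /etc/nginx/placeholder.key;

    ssl_certificate_by_lua_block {
        local ssl = require("ngx.ssl")
        ssl.clear_certs()
        -- lookup_cert() is a hypothetical lookup keyed by SNI name.
        local der_cert, der_key = lookup_cert(ssl.server_name())
        ssl.set_der_cert(der_cert)
        ssl.set_der_priv_key(der_key)
    }
}
```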

aledbf commented Mar 18, 2018

@Lookyan can you test quay.io/aledbf/nginx-ingress-controller:0.343, adding the flag
--enable-dynamic-configuration in the deployment?
This image contains the next release, 0.12.0.
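For anyone else testing, the flag goes into the controller container's args; a Deployment fragment might look like this (only the image and flag come from the comment above; the other names are illustrative):

```yaml
containers:
  - name: nginx-ingress-controller
    image: quay.io/aledbf/nginx-ingress-controller:0.343
    args:
      - /nginx-ingress-controller
      - --enable-dynamic-configuration
      # ...plus your existing flags (--configmap, --default-backend-service, ...)
```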

Lookyan commented Mar 20, 2018

Yes, thank you, I'll test it.

valeriano-manassero commented

@ElvinEfendi @aledbf sorry, I was very busy over the last few days. Regarding dynamic reconfiguration, I was in a hurry to deploy something usable for my company, so I was forced to create a fork:
https://github.com/NCCloud/fluid

There are obviously huge differences from the point where I forked, but I hope we can bring this back together in the future. Feel free to comment, or suggest a way to merge those functionalities here.

ElvinEfendi commented

@valeriano-manassero there is some good stuff in that fork! It would be great to get the consistent-hashing and dynamic-certificates features ported from your fork to this repository.

This week I'm going to refactor balancer.lua a bit and document how to add a new load-balancing algorithm.

aledbf commented Apr 8, 2018

Closing. This is fixed by the new dynamic configuration feature. We will enable it by default in a couple of releases.

ElvinEfendi commented

@Lookyan have you tried --enable-dynamic-configuration? It would be great to hear your experience with it.

Lookyan commented Jul 29, 2018

Sorry, I forgot to write back here.
Yes, dynamic reconfiguration helps. We now have problems only when new services are deployed, since that still triggers an nginx config reload.

valeriano-manassero commented

@Lookyan did you try the Fluid fork (https://github.com/NCCloud/fluid)? We had the same issues and tried to avoid reloads entirely. The fork is not perfect, but maybe it can help you.
