
Service latency degradation #1905

Closed

Lookyan opened this issue Jan 16, 2018 · 13 comments
Labels
help wanted — Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
priority/important-soon — Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

Lookyan commented Jan 16, 2018

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

NGINX Ingress controller version: 0.9.0 (quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.9.0)

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:48:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:23:29Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: bare-metal
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel:
Linux srv0551 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) x86_64 GNU/Linux
  • Others: calico v2.6.3

What happened:
We use the ingress controller for almost all our services, and one of them carries a large load of about 30k rps. This traffic is handled by three ingress controllers (on three separate physical servers). We are having a problem with the deploy process: whenever a service that shares these ingress controllers updates its pods (endpoints), ingress reloads its configuration, and at the same time latency increases for all network communication on every service that uses these controllers (even from a service to its database, or to localhost). Our monitoring shows a rising number of established connections and increased latency.

What you expected to happen:
Deploying one service shouldn't affect other services.

How to reproduce it (as minimally and precisely as possible):
The easiest way to reproduce it is to deploy a service, point a load-testing tool at it, and then run nginx -t on the ingress controller. Because ingress tests the configuration before every reload, nginx -t is invoked during each service deploy; run it a few hundred times under load and you will see the degradation.
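A minimal repro sketch, assuming wrk as the load generator ("any perf tool" works, per the above); the namespace and pod name are placeholders to substitute for your own setup:

```sh
# Load phase: sustain >5k rps against a service behind the ingress.
wrk -t8 -c512 -d600s http://<ingress-ip>/ &

# Reload phase: run the config test repeatedly, as the controller does
# before every endpoint-triggered reload.
for i in $(seq 1 200); do
  kubectl -n ingress-nginx exec nginx-ingress-controller-<pod-id> -- nginx -t
done
```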
These are our metrics (screenshots, 2018-01-16): latency and established-connections graphs.

Anything else we need to know:
This problem can only be reproduced under heavy load, starting from about 5k rps to a single ingress controller.
I also saw this issue and we have applied this patch, but we still have problems: kubernetes/kubernetes#48358

Lookyan commented Jan 16, 2018

Also, removing the reuseport option from the ingress template slightly improves the situation: latency during reloads now rises from 2-3ms to about 15ms, but we still observe degradation.
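For reference, the config difference is a single listen parameter (a sketch, assuming the stock template of that era). One plausible contributor, not confirmed in this thread: with reuseport each worker accepts on its own socket with its own accept queue, and connections still queued on an old worker's socket are dropped when that socket closes during a reload.

```nginx
# default template: each worker accepts on its own listening socket
listen 80 reuseport;

# the workaround above: drop reuseport so all workers share one socket
listen 80;
```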

aledbf commented Jan 17, 2018

> Also, removing the reuseport option from the ingress template slightly improves the situation

I will add an option to make this configurable via the ConfigMap.

aledbf commented Jan 17, 2018

> This problem can only be reproduced under heavy load, starting from about 5k rps to a single ingress controller.
> I also saw this issue and we have applied this patch, but we still have problems: kubernetes/kubernetes#48358

The only way we can fix this is to remove the need to reload nginx on endpoint changes. Right now the only way to do that is to use Lua, more specifically balancer_by_lua_block.
This is the reason I added Lua support back in the custom nginx image (#1852).
After the next release I will start a POC to test/prove whether Lua can solve this issue once and for all.
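For the curious, a minimal sketch of the idea (assuming OpenResty's lua-resty-core; pick_endpoint() is a hypothetical helper that would read the live endpoint list from a shared dict instead of baking endpoints into nginx.conf):

```nginx
upstream dynamic_backend {
    server 0.0.0.1;  # placeholder; the real peer is chosen in Lua

    balancer_by_lua_block {
        local balancer = require("ngx.balancer")
        -- pick_endpoint() is a hypothetical lookup against endpoint data
        -- kept in sync with Kubernetes outside the nginx config.
        local host, port = pick_endpoint()
        local ok, err = balancer.set_current_peer(host, port)
        if not ok then
            ngx.log(ngx.ERR, "failed to set peer: ", err)
            return ngx.exit(500)
        end
    }
}
```

Because the upstream no longer hard-codes endpoints, endpoint churn becomes a data update rather than a config reload.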

@aledbf added the enhancement, help wanted and priority/important-soon labels on Jan 17, 2018
valeriano-manassero commented Mar 14, 2018

Since PR #2174 will implement the "no reload" feature, I'm asking what the next steps will be.
I mean, this issue is still present if you add virtual hosts or SSL certs while many virtual hosts are running. In the worst case, not only does latency increase, but nginx processes can end up completely stuck, continuing to serve old configurations for a long time (the only fix is to kill the ingress controller container).

We already discussed this issue while I was proposing to implement Lua for virtual hosts and SSL too.

@aledbf @ElvinEfendi were involved in that discussion. What are the next steps?

ElvinEfendi commented

@valeriano-manassero thanks for starting the discussion. IMHO the next immediate step after #2174 gets merged should be coming up with a way to test Lua code. One option here would be to use the current e2e test framework.

After the test process is in place I suggest we focus on the remaining tasks in the following priority:

  1. Matching all the load-balancing functionality Nginx offers
  2. Adding support for dynamic SSL (using certificate_by_lua; a sketch follows this list)
  3. Implementing dynamic virtual-host changes
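For item 2, a minimal sketch of what certificate_by_lua-based dynamic SSL could look like (assuming OpenResty's ngx.ssl module; lookup_cert() is a hypothetical helper returning a DER cert/key pair for the SNI name from a shared dict; the real design was still under discussion):

```nginx
server {
    listen 443 ssl;
    # A placeholder certificate is still required at startup; it is
    # replaced per-handshake below.
    ssl_certificate     /etc/nginx/placeholder.pem;
    ssl_certificate_key /etc/nginx/placeholder.key;

    ssl_certificate_by_lua_block {
        local ssl = require("ngx.ssl")
        ssl.clear_certs()
        -- lookup_cert() is a hypothetical lookup keyed by SNI name.
        local der_cert, der_key = lookup_cert(ssl.server_name())
        ssl.set_der_cert(der_cert)
        ssl.set_der_priv_key(der_key)
    }
}
```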

aledbf commented Mar 18, 2018

@Lookyan can you test quay.io/aledbf/nginx-ingress-controller:0.343, adding the flag
--enable-dynamic-configuration in the deployment?
This image contains the next release, 0.12.0.
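For anyone else testing, the flag goes into the controller container's args; a Deployment fragment might look like this (only the image and flag come from the comment above; the other names are illustrative):

```yaml
containers:
  - name: nginx-ingress-controller
    image: quay.io/aledbf/nginx-ingress-controller:0.343
    args:
      - /nginx-ingress-controller
      - --enable-dynamic-configuration
      # ...plus your existing flags (--configmap, --default-backend-service, ...)
```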

Lookyan commented Mar 20, 2018

Yes, thank you, I'll test it.

valeriano-manassero commented

@ElvinEfendi @aledbf sorry, I was very busy over the last few days. Regarding dynamic reconfiguration, I was in a hurry to deploy something usable for my company, so I was forced to create a fork:
https://github.com/NCCloud/fluid

There are obviously huge differences from the point where I forked, but I hope we can bring this back together in the future. Feel free to comment, or suggest a way to merge those functionalities here.

ElvinEfendi commented

@valeriano-manassero there is some good stuff in that fork! It would be great to get the consistent-hashing and dynamic-certificates features ported from your fork to this repository.

This week I'm going to refactor balancer.lua a bit and document how to add a new load-balancing algorithm.

aledbf commented Apr 8, 2018

Closing. This is fixed by the new dynamic configuration feature. We will enable it by default in a couple of releases.

ElvinEfendi commented

@Lookyan have you tried --enable-dynamic-configuration? It would be great to hear your experience with it.

Lookyan commented Jul 29, 2018

Sorry, I forgot to write back here.
Yes, dynamic reconfiguration helps. We now have problems only when new services are deployed, since that still triggers an nginx config reload.

valeriano-manassero commented

@Lookyan did you try the Fluid fork (https://github.com/NCCloud/fluid)? We had the same issues and tried to avoid reloads entirely. The fork is not perfect, but maybe it can help you.
