[WIP] EWMA based on utilization, not response time #162
base: master
Conversation
Thanks for starting this, I have a feeling this is indeed a useful optimization. That said, I'm not convinced that the fallback to response time is something we can do here, at least not without some normalization. Could you explain its correctness?
For testing, we should now be able to send a percentage of traffic to the web-perf worker cluster (proxied through the staging ingress).
end

return durationTable
end
See duration_parsing.lua, there's a helper for this already.
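For illustration only, here is a rough sketch of the kind of helper being referenced; the function name and pattern are hypothetical, and the existing helper in duration_parsing.lua may well look different:

```lua
-- Hypothetical sketch: parse a Server-Timing header such as
-- "util;dur=0.0625, db;dur=12.3" into a table of metric -> duration.
-- The existing helper in duration_parsing.lua may differ from this.
local function parse_server_timing(header)
  local durationTable = {}
  if not header then
    return durationTable
  end
  -- each entry looks like "<name>;dur=<number>", separated by commas
  for name, dur in string.gmatch(header, "([%w_]+);dur=([%d%.]+)") do
    durationTable[name] = tonumber(dur)
  end
  return durationTable
end

-- parse_server_timing("util;dur=0.0625").util --> 0.0625
```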
@kirs can you post the DD link for that dashboard? Are you sure you are looking at only Shopify Core app servers per cluster? Here's what I'm seeing for a specific cluster over the past day: You can see the same pattern for the other clusters.
I think it's because it was the easiest thing to get that didn't require any upstream modifications, and it also demonstrated an improvement over the load balancing algorithm we were using at the time (I think it was weighted round robin).
Because of Unicorn's single-threaded architecture, I don't think we should be using CPU utilization to calculate EWMA scores. It's easy for a Unicorn to be tied up even if CPU utilization is low (if it's talking to a slow downstream resource), which could incorrectly cause more load to be sent to a node. I think it would be better to use the same signal as the load shedder and look at utilization from the perspective of active workers (as a percentage). So if 8 out of 16 workers are available, we'd have a utilization of 50%. It might be possible to write this as a patch to Unicorn and/or other Ruby application servers so that we could use it everywhere.
I see, so the EWMA would send a request to a Unicorn pod with available workers based on that metric? Not based on response time?
Yeah, that's what I was thinking. I think it may be better than response time: I believe the current implementation is trying to infer utilization from the response time (slower average response time indicates higher utilization), but I think we can get a better signal by just measuring the utilization directly through available workers. I still think that EWMA would be better than round robin, given the high variance in response times of Shopify Core requests. I am curious if/how we may need to change the EWMA decay if we were to make this change. I think the optimal decay would have some relation to the maximum request time, but I'm not sure of the math to determine that.
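For reference, here is a minimal sketch of a time-decayed EWMA update along the lines of what the upstream module does (the constant and the names are illustrative, not the actual code); the decay constant is the knob that would need revisiting if the input changes from response time to utilization:

```lua
-- Illustrative sketch of a time-decayed EWMA update.
-- DECAY_TIME controls how quickly old samples stop mattering; if the input
-- becomes utilization instead of response time, this is the knob to revisit.
local DECAY_TIME = 10 -- seconds, illustrative value

local function decay_ewma(ewma, last_touched_at, value, now)
  local elapsed = now - last_touched_at
  -- the old score's weight decays exponentially with time since the last sample
  local weight = math.exp(-elapsed / DECAY_TIME)
  return ewma * weight + value * (1.0 - weight)
end
```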
I agree this will work well with the same metric as the load shedder (active+queued conns). This is the most direct proxy for concurrent capacity. It provides a very low-latency view of consumption, since it shares state across workers. Response time requires integrating across a longer window to filter noise.
+1 to "I'm not convinced that the fallback to response time results is something we can do here – at least not without some normalization." – and if we go with the load shedder metric, such a normalization doesn't exist.
EWMA is better than round robin when some servers may be slower than others due to request variance, hardware variance, pressure on the host, poor k8s scheduling etc. This is clearest when using response time as the feedback metric, but should still work when using concurrency. A slower server will use higher concurrency for the same throughput. Request variance is handled even better than with response time, due to the low latency of the metric.
One downside is that all upstreams must have the same theoretical capacity provisioned or this balancer could overload low-capacity upstreams sooner. Response time works around that. I don't see this as a blocker though, it's generally assumed in LB algorithms. This can also be worked around by treating active requests and queued requests differently, but we can probably avoid that complication altogether.
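As a rough sketch of this idea, tying back to the earlier 8-of-16-workers example (the function name, the flat weighting of queued requests, and the normalization by worker count are my assumptions, not part of this PR):

```lua
-- Hypothetical sketch: derive a utilization score from the load shedder's
-- view of concurrency, normalized by each upstream's capacity so that
-- low-capacity upstreams aren't overloaded first.
local function utilization_score(active_requests, queued_requests, worker_count)
  -- queued requests could be weighted more heavily than active ones,
  -- but a plain sum keeps the sketch simple
  return (active_requests + queued_requests) / worker_count
end

-- e.g. 8 active + 0 queued out of 16 workers -> 0.5 (50% utilized)
```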
We just had an interesting data point for this PR, I think (correct me please if I'm making wrong assumptions).
We observed different worker utilizations in east-4 vs east-5 during a flash sale, even though the throughput to the clusters was exactly 50-50.
You can see that there was tighter packing in east-5 compared to east-4, which is further supported by how many more overloaded Rails workers there were in east-4.
Datadog link
What are we doing with this PR? I don't think we can just ship it as-is without coming up with a testing and rollout strategy.
Agree with Scott. This would be great to investigate but I'm not sure if I'll have enough cycles any time soon 😬
It's all good @kirs, I just wanted to get an update 👍
Just wanted to drop a few resources related to this that I've seen recently. You could say this is almost becoming a standard in the industry.
Right now the EWMA module in ingress-nginx operates on the assumption that an endpoint that served a request faster is less busy and less utilized. It calculates the EWMA score based on response_time and connect_time.
The hypothesis is that response time is not the only signal we could use for EWMA. After all, a request could take longer because of a slow downstream / SQL query while the app server's CPU was not used at all.
If we tried CPU utilization for calculating the EWMA score, we could potentially waste less CPU across the platform by sending more traffic to under-utilized app servers.
Here's the distribution of utilization of Shopify's endpoints:
You can see that while most of them in the middle are "hot", there are always endpoints that are under-utilized (those at the bottom of the distribution).
Google's SRE resources also argue that packing CPU usage is super important.
From https://landing.google.com/sre/book/chapters/handling-overload.html:
My hypothesis is that the effect of this could be similar to the one in the illustration above.
The idea is based on conversations with @hkdsun. All credits should go to Hormoz :)
Implementation
I suggest that we use the Server-Timing header and pass the util (== utilization) value:
Server-Timing: util;dur=0.0625
It will be up to application developers to calculate and expose the utilization header.
This header is already used at Shopify for other metrics, and IMO it's standard enough to be adopted by other users of ingress-nginx.
I implemented it so that the module uses the Server-Timing header if it's available and falls back to response time if it's not, but I'm open to other ideas.
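Roughly, the signal selection could look like the sketch below; the function name and the exact nginx variables used are assumptions on my side, and the actual change may differ:

```lua
-- Hypothetical sketch of the fallback: prefer the utilization reported via
-- the upstream's Server-Timing header, otherwise fall back to the existing
-- response-time signal.
local function get_feedback_value()
  -- $upstream_http_server_timing exposes the upstream's Server-Timing response header
  local server_timing = ngx.var.upstream_http_server_timing
  if server_timing then
    local util = string.match(server_timing, "util;dur=([%d%.]+)")
    if util then
      return tonumber(util)
    end
  end
  -- fallback: response time plus connect time, roughly the signal used today
  local rtt = tonumber(ngx.var.upstream_response_time) or 0
  local connect = tonumber(ngx.var.upstream_connect_time) or 0
  return rtt + connect
end
```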
For reviews