
[WIP] EWMA based on utilization, not response time #162

Open
wants to merge 1 commit into master
Conversation


@kirs kirs commented Nov 25, 2018

Right now the EWMA module in ingress-nginx operates with the assumption that an endpoint that served a request faster is less busy and less utilized. It calculates the EWMA score based on response_time and connect_time.

The hypothesis is that response time is not the only signal we could use for EWMA. After all, a request could take longer because of a slow downstream or SQL query while the app server's CPU was not used at all.

If we used CPU utilization to calculate the EWMA score, we could potentially waste fewer CPU resources across the platform by sending more traffic to under-utilized app servers.

Here's the distribution of utilization of Shopify's endpoints:

[Screenshot: distribution of utilization across Shopify's endpoints]

You can see that while most of them in the middle are "hot", there are always endpoints that are under-utilized (those at the bottom of the distribution).

There are also Google SRE resources that argue that packing CPU usage tightly is very important.

From https://landing.google.com/sre/book/chapters/handling-overload.html:

This example illustrates how poor in-datacenter load balancing practices artificially limit resource availability: you may be reserving 1,000 CPUs for your service in a given datacenter, but be unable to actually use more than, say, 700 CPUs.

[Illustration from the SRE book chapter on handling overload]

My hypothesis is that the effect of this could be similar to the one in the illustration above.

The idea is based on conversations with @hkdsun. All credits should go to Hormoz :)

Implementation

I suggest that we use the Server-Timing header and pass the util (== utilization) value:

Server-Timing: util;dur=0.0625

It will be up to application developers to calculate and expose the utilization header.

This header is already used at Shopify for other metrics, and IMO it's standard enough to be adopted by other users of ingress-nginx.

I implemented it in a way that the module uses the Server-Timing header if it's available and falls back to response time if it's not, but I'm open to other ideas.
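
To make the fallback concrete, here is a minimal sketch of the scoring logic (not the code in this PR; the helper name `get_server_timing_util`, the 0 to 1 utilization scale, and the way the fallback combines the two timings are assumptions):

```lua
-- Sketch only: prefer the utilization reported via Server-Timing, fall back
-- to the response-time signal when the header is absent.
local function get_server_timing_util(server_timing)
  if not server_timing then
    return nil
  end
  -- extract the value from an entry like "util;dur=0.0625"
  local util = string.match(server_timing, "util;dur=([%d%.]+)")
  return util and tonumber(util) or nil
end

local function ewma_input(server_timing, response_time, connect_time)
  local util = get_server_timing_util(server_timing)
  if util then
    return util  -- utilization reported by the application
  end
  -- fallback: the response-time based signal used today
  return (response_time or 0) + (connect_time or 0)
end
```

For example, `ewma_input("util;dur=0.0625", 0.2, 0.01)` returns 0.0625, while `ewma_input(nil, 0.2, 0.01)` falls back to 0.21.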

For reviewers

  • Is there a historical reason why we used response time and not CPU utilization when we implemented EWMA? @csfrancis
  • Is there a way to make it more configurable for other ingress-nginx users?
  • How do we test this change? I couldn't find any existing tests for updating EWMA score.


@hkdsun hkdsun left a comment


Thanks for starting this, I have a feeling this is indeed a useful optimization. However, I'm not convinced that falling back to response time is something we can do here – at least not without some normalization. Could you explain why that would be correct?

For testing, we should now be able to send a percentage of traffic to the web-perf worker cluster (proxied through the staging ingress)

Review comment on the PR's parsing code (diff context, the tail of the function that builds and returns durationTable):

  end

  return durationTable
end


see duration_parsing.lua, there's a helper for this already

@ElvinEfendi

ElvinEfendi commented Nov 26, 2018

@kirs can you post the DD link for that dashboard? Are you sure you are looking at only Shopify Core app servers per cluster? Here's what I'm seeing for a specific cluster over the past day:

[Screenshot: utilization for a specific cluster over the past day]

You can see the same pattern for the other clusters.

@csfrancis

Is there a historical reason why we used response time and not CPU utilization when we implemented EWMA?

I think it's because it was the easiest thing to get that didn't require any upstream modifications, and it also demonstrated an improvement on the load balancing algorithm we were using at the time (I think it was weighted round robin).

If we tried CPU utilization for calculating EWMA score, we could potentially waste less CPU resources across the platform, by sending more traffic to under-utilized app servers.

Because of Unicorn's single-threaded architecture, I don't think we should be using CPU utilization to calculate EWMA scores. It's easy for a Unicorn worker to be tied up even if CPU utilization is low (if it's talking to a slow downstream resource), which could incorrectly cause more load to be sent to a node.

I think it would be better to use the same signal as the load shedder and look at utilization from the perspective of active workers (as a percentage). So if 8 out of 16 workers are available, we'd have a utilization of 50%. It might be possible to write this as a patch to Unicorn and/or other Ruby application servers so that we could use it everywhere.
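
As a hedged sketch of how that signal could be wired into the header proposed above (the function name and the idea of reusing Server-Timing for it are illustrative; the real reporter would live in the app server, e.g. a Unicorn patch):

```lua
-- Sketch: express the load-shedder style signal (busy workers / total workers)
-- as the Server-Timing value proposed in this PR. Illustrative only.
local function utilization_header(busy_workers, total_workers)
  local util = 1.0
  if total_workers > 0 then
    util = busy_workers / total_workers
  end
  return string.format("Server-Timing: util;dur=%.4f", util)
end

-- 8 of 16 workers busy (i.e. 8 available) -> "Server-Timing: util;dur=0.5000"
print(utilization_header(8, 16))
```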

@kirs
Author

kirs commented Nov 27, 2018

I think it would be better to use the same signal as the load shedder and look at utilization from the perspective of active workers (as a percentage). So if 8 out of 16 workers are available, we'd have a utilization of 50%. It might be possible to write this as a patch to Unicorn and/or other Ruby application servers so that we could use it everywhere.

I see, so that EWMA would send a request to a Unicorn pod with available workers based on the metric? Not based on response time?

@csfrancis

I see, so that EWMA would send a request to a Unicorn pod with available workers based on the metric?

Yeah that's what I was thinking. I think it may be better than response time - I believe the current implementation is trying to infer utilization from the response time (slower average response time indicates higher utilization), but I think we can get a better signal by just measuring the utilization directly through available workers.

I still think that EWMA would be better than round robin, given the high variance in response times of Shopify Core requests. I am curious if/how we may need to change the EWMA decay if we were to make this change. I think the optimal decay would have some relation to the maximum request time, but I'm not sure of the math to determine that.
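
For context on the decay question, a sketch of a time-decayed EWMA update of the kind the ingress-nginx balancer uses (the constant and names are illustrative, not the module's exact values); tuning for a utilization input is essentially a matter of choosing DECAY_TIME:

```lua
-- Sketch of a time-decayed EWMA update. The longer since the last sample,
-- the less weight the previous score carries. DECAY_TIME is the knob whose
-- relation to maximum request time is being asked about above.
local DECAY_TIME = 10  -- seconds, illustrative

local function decay_ewma(ewma, last_touched_at, value, now)
  local td = now - last_touched_at
  if td < 0 then td = 0 end
  local weight = math.exp(-td / DECAY_TIME)
  return ewma * weight + value * (1.0 - weight)
end
```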


@pushrax pushrax left a comment


I agree this will work well with the same metric as the load shedder (active+queued conns). This is the most direct proxy for concurrent capacity. It provides a very low-latency view of consumption, since it shares state across workers. Response time requires integrating across a longer window to filter noise.

+1 to "I'm not convinced that the fallback to response time results is something we can do here – at least not without some normalization." – and if we go with the load shedder metric, such a normalization doesn't exist.

EWMA is better than round robin when some servers may be slower than others due to request variance, hardware variance, pressure on the host, poor k8s scheduling etc. This is clearest when using response time as the feedback metric, but should still work when using concurrency. A slower server will use higher concurrency for the same throughput. Request variance is handled even better than with response time, due to the low latency of the metric.

One downside is that all upstreams must have the same theoretical capacity provisioned or this balancer could overload low-capacity upstreams sooner. Response time works around that. I don't see this as a blocker though, it's generally assumed in LB algorithms. This can also be worked around by treating active requests and queued requests differently, but we can probably avoid that complication altogether.
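
A short aside on why a slower server uses higher concurrency at the same throughput: this is Little's law,

$$ L = \lambda W $$

where $L$ is the average number of in-flight requests (concurrency), $\lambda$ is the throughput, and $W$ is the average time per request. At a fixed throughput, any increase in $W$ shows up immediately as higher concurrency, which is why the concurrency signal reacts faster than a response-time average.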

@hkdsun

hkdsun commented Dec 14, 2018

We just had an interesting data point for this PR, I think (please correct me if I'm making wrong assumptions).

We observed different worker utilizations in east-4 vs east-5 during a flash sale

[Screenshot: worker utilization, east-4 vs east-5, during the flash sale]

Even though the throughput to clusters was exactly 50-50

[Screenshot: request throughput split evenly between the two clusters]

You can see that there was a tighter packing in east-5 compared to east-4

[Screenshot: utilization packing, east-5 vs east-4]

Which is further supported by how many more overloaded Rails workers there were in east-4

[Screenshot: overloaded Rails workers, east-4 vs east-5]

Datadog link

@csfrancis

What are we doing with this PR? I don't think we can just ship it as-is without coming up with a testing and rollout strategy.

@kirs
Author

kirs commented Jan 7, 2019

Agree with Scott. This would be great to investigate but I'm not sure if I'll have enough cycles any time soon 😬

@csfrancis

It's all good @kirs, I just wanted to get an update 👍

@hkdsun

hkdsun commented Sep 23, 2019
