[WIP] EWMA based on utilization, not response time #162
base: master
Conversation
Thanks for starting this, I have a feeling this is indeed a useful optimization. That said, I'm not convinced that the fallback to response time is something we can do here, at least not without some normalization. Could you explain its correctness?
For testing, we should now be able to send a percentage of traffic to the web-perf worker cluster (proxied through the staging ingress).
end

return durationTable
end
See duration_parsing.lua, there's a helper for this already.
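For illustration only, here is a rough sketch of the kind of helper being referenced; the function name and pattern are hypothetical, and the existing helper in duration_parsing.lua may well look different:

```lua
-- Hypothetical sketch: parse a Server-Timing header such as
-- "util;dur=0.0625, db;dur=12.3" into a table of metric -> duration.
-- The existing helper in duration_parsing.lua may differ from this.
local function parse_server_timing(header)
  local durationTable = {}
  if not header then
    return durationTable
  end
  -- each entry looks like "<name>;dur=<number>", separated by commas
  for name, dur in string.gmatch(header, "([%w_]+);dur=([%d%.]+)") do
    durationTable[name] = tonumber(dur)
  end
  return durationTable
end

-- parse_server_timing("util;dur=0.0625").util --> 0.0625
```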
@kirs can you post the DD link for that dashboard? Are you sure you are looking at only Shopify Core app servers per cluster? Here's what I'm seeing for a specific cluster over the past day: You can see the same pattern for the other clusters.
I think it's because it was the easiest thing to get that didn't require any upstream modifications, and it also demonstrated an improvement over the load balancing algorithm we were using at the time (I think it was weighted round robin).
Because of Unicorn's single-threaded architecture, I don't think we should be using CPU utilization to calculate EWMA scores. It's easy for a Unicorn to be tied up even if CPU utilization is low (if it's talking to a slow downstream resource), which could incorrectly cause more load to be sent to a node. I think it would be better to use the same signal as the load shedder and look at utilization from the perspective of active workers (as a percentage). So if 8 out of 16 workers are available, we'd have a utilization of 50%. It might be possible to write this as a patch to Unicorn and/or other Ruby application servers so that we could use it everywhere.
I see, so the EWMA would send a request to a Unicorn pod with available workers based on that metric? Not based on response time?
Yeah, that's what I was thinking. I think it may be better than response time: I believe the current implementation is trying to infer utilization from the response time (slower average response time indicates higher utilization), but I think we can get a better signal by just measuring the utilization directly through available workers. I still think that EWMA would be better than round robin, given the high variance in response times of Shopify Core requests. I am curious if/how we may need to change the EWMA decay if we were to make this change. I think the optimal decay would have some relation to the maximum request time, but I'm not sure of the math to determine that.
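For reference, here is a minimal sketch of a time-decayed EWMA update along the lines of what the upstream module does (the constant and the names are illustrative, not the actual code); the decay constant is the knob that would need revisiting if the input changes from response time to utilization:

```lua
-- Illustrative sketch of a time-decayed EWMA update.
-- DECAY_TIME controls how quickly old samples stop mattering; if the input
-- becomes utilization instead of response time, this is the knob to revisit.
local DECAY_TIME = 10 -- seconds, illustrative value

local function decay_ewma(ewma, last_touched_at, value, now)
  local elapsed = now - last_touched_at
  -- the old score's weight decays exponentially with time since the last sample
  local weight = math.exp(-elapsed / DECAY_TIME)
  return ewma * weight + value * (1.0 - weight)
end
```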
I agree this will work well with the same metric as the load shedder (active+queued conns). This is the most direct proxy for concurrent capacity. It provides a very low-latency view of consumption, since it shares state across workers. Response time requires integrating across a longer window to filter noise.
+1 to "I'm not convinced that the fallback to response time results is something we can do here – at least not without some normalization." – and if we go with the load shedder metric, such a normalization doesn't exist.
EWMA is better than round robin when some servers may be slower than others due to request variance, hardware variance, pressure on the host, poor k8s scheduling etc. This is clearest when using response time as the feedback metric, but should still work when using concurrency. A slower server will use higher concurrency for the same throughput. Request variance is handled even better than with response time, due to the low latency of the metric.
One downside is that all upstreams must have the same theoretical capacity provisioned or this balancer could overload low-capacity upstreams sooner. Response time works around that. I don't see this as a blocker though, it's generally assumed in LB algorithms. This can also be worked around by treating active requests and queued requests differently, but we can probably avoid that complication altogether.
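As a rough sketch of this idea, tying back to the earlier 8-of-16-workers example (the function name, the flat weighting of queued requests, and the normalization by worker count are my assumptions, not part of this PR):

```lua
-- Hypothetical sketch: derive a utilization score from the load shedder's
-- view of concurrency, normalized by each upstream's capacity so that
-- low-capacity upstreams aren't overloaded first.
local function utilization_score(active_requests, queued_requests, worker_count)
  -- queued requests could be weighted more heavily than active ones,
  -- but a plain sum keeps the sketch simple
  return (active_requests + queued_requests) / worker_count
end

-- e.g. 8 active + 0 queued out of 16 workers -> 0.5 (50% utilized)
```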
We just had an interesting data point for this PR, I think (correct me please if I'm making wrong assumptions).
We observed different worker utilizations in east-4 vs east-5 during a flash sale, even though the throughput to the clusters was exactly 50-50.
You can see that there was tighter packing in east-5 compared to east-4, which is further supported by how many more overloaded Rails workers there were in east-4.
Datadog link
What are we doing with this PR? I don't think we can just ship it as-is without coming up with a testing and rollout strategy.
Agree with Scott. This would be great to investigate but I'm not sure if I'll have enough cycles any time soon 😬
It's all good @kirs, I just wanted to get an update 👍
Just wanted to drop a few resources related to this that I've seen recently. You could say this is almost becoming a standard in the industry.
Right now the EWMA module in ingress-nginx operates on the assumption that an endpoint that served a request faster is less busy and less utilized. It calculates the EWMA score based on response_time and connect_time.
The hypothesis is that response time is not the only signal we could use for EWMA. After all, a request could take longer because of a slow downstream / SQL query while the app server's CPU was not used at all.
If we tried CPU utilization for calculating the EWMA score, we could potentially waste less CPU across the platform by sending more traffic to under-utilized app servers.
Here's the distribution of utilization of Shopify's endpoints:
You can see that while most of them in the middle are "hot", there are always endpoints that are under-utilized (those at the bottom of the distribution).
Google's SRE resources also argue that packing CPU usage is super important.
From https://landing.google.com/sre/book/chapters/handling-overload.html:
My hypothesis is that the effect of this could be similar to the one in the illustration above.
The idea is based on conversations with @hkdsun. All credits should go to Hormoz :)
Implementation
I suggest that we use the Server-Timing header and pass the util (== utilization) value:
Server-Timing: util;dur=0.0625
It will be up to application developers to calculate and expose the utilization header.
This header is already used at Shopify for other metrics, and IMO it's standard enough to be adopted by other users of ingress-nginx.
I implemented it so that the module uses the Server-Timing header if it's available and falls back to response time if it's not, but I'm open to other ideas.
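Roughly, the signal selection could look like the sketch below; the function name and the exact nginx variables used are assumptions on my side, and the actual change may differ:

```lua
-- Hypothetical sketch of the fallback: prefer the utilization reported via
-- the upstream's Server-Timing header, otherwise fall back to the existing
-- response-time signal.
local function get_feedback_value()
  -- $upstream_http_server_timing exposes the upstream's Server-Timing response header
  local server_timing = ngx.var.upstream_http_server_timing
  if server_timing then
    local util = string.match(server_timing, "util;dur=([%d%.]+)")
    if util then
      return tonumber(util)
    end
  end
  -- fallback: response time plus connect time, roughly the signal used today
  local rtt = tonumber(ngx.var.upstream_response_time) or 0
  local connect = tonumber(ngx.var.upstream_connect_time) or 0
  return rtt + connect
end
```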
For reviews