Auto-scale DB write capacity based on ingester queue size #735
What do we do around the change-over from the old to the new weekly table? We would need independent queue lengths for each, since some chunks have index entries written to both tables. Instead of independent queue lengths, we could look at the consumed-capacity and error metrics, which are per-table. We only need to scale up tables that are hitting provisioned-throughput errors.
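The per-table alternative described above could be sketched roughly as follows. This is a minimal sketch, not Cortex code: the function name and the idea of feeding it a map of per-table throttle counts (taken from CloudWatch or internal metrics over some recent window) are assumptions for illustration.

```go
package main

import "fmt"

// tablesToScaleUp implements the per-table idea from the comment above:
// rather than one global queue length, look at throttled-write counts per
// table and scale up only the tables that are actually hitting
// provisioned-throughput errors. throttles maps table name -> number of
// throttled writes observed over a recent window (hypothetical input).
func tablesToScaleUp(throttles map[string]int) []string {
	var out []string
	for table, n := range throttles {
		if n > 0 { // any throttling means this table is under-provisioned
			out = append(out, table)
		}
	}
	return out
}

func main() {
	// e.g. during the weekly change-over, only the table being throttled
	// needs more capacity (table names are made up for the example).
	fmt.Println(tablesToScaleUp(map[string]int{"cortex_w2340": 17, "cortex_w2341": 0}))
}
```

This naturally handles the change-over period: both the old and new weekly tables are considered independently, so whichever one is throttling gets scaled.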
Thinking about this some more, the flush queue length is not the whole picture. Chunks enter the queue for different reasons: the chunk is full, it went above the max age (default 12 hours), or it went stale (default 5 minutes). The high-level objective is something like "avoid blowing up in memory", so a bunch of stale chunks that are not replaced do not matter as much as a bunch of full or aged chunks which have been replaced. We could add a metric for the rate at which chunks are created, and care more about a large queue if the creation rate is higher than the flush rate.
A bit more historical background (message written in 2017 in a private repo): it appears that, although the auto-scaler will spot there is a problem, it increases capacity on just one of the tables, leaving the other one as a bottleneck, so the system remains constrained until 10:22Z in the example, around 40 minutes after the increased load began. I don't think there is a way to tell AWS to scale the two together, so maybe we have to write something to do it.
DynamoDB is provisioned at a certain ops/sec level which can be scaled up and down. For the past several months we have been using AWS' auto-scaler (#507), but it doesn't really meet the requirement: it will sometimes scale up after brief peaks in throughput, and hotspotting (#733) can reduce throughput, which provokes it to scale down.
The key thing in Cortex is the flush queue: we should scale up when the queue is building, and can scale down when it drops below some reasonable length (10K?). Note DynamoDB limits how often you can scale down within a 24-hour period; check the docs.
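Combining the queue-length rule above with the creation-rate-vs-flush-rate refinement from the comments, the decision logic might look like this. A minimal sketch, not an implementation: the function name, the 10K floor, and the rate inputs are assumptions; actually applying the decision would call the DynamoDB UpdateTable API, subject to its scale-down limits.

```go
package main

import "fmt"

// scaleDecision sketches the proposed policy: scale up while the flush
// queue is long and building faster than it drains; scale down once it
// falls below a floor; otherwise hold. queueLen is the current flush
// queue length; createRate and flushRate are chunks/sec over a recent
// window (hypothetical metrics).
func scaleDecision(queueLen int, createRate, flushRate float64) string {
	const floor = 10000 // the "reasonable length (10K?)" from the issue
	switch {
	case queueLen > floor && createRate > flushRate:
		return "scale-up" // queue is building and not draining
	case queueLen < floor:
		return "scale-down" // subject to DynamoDB's scale-down limits
	default:
		return "hold" // queue is large but already draining
	}
}

func main() {
	fmt.Println(scaleDecision(50000, 200, 120)) // building -> scale-up
	fmt.Println(scaleDecision(2000, 50, 80))    // drained -> scale-down
	fmt.Println(scaleDecision(50000, 100, 150)) // draining -> hold
}
```

The "hold" branch is what distinguishes this from a naive queue-length threshold: per the comment above, a large queue of stale chunks that is already draining does not justify paying for more write capacity.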
Previous issue: #318
Related: #464
Somewhat related to #665