Auto-scale DB write capacity based on ingester queue size #735
What do we do around the change-over from the old to the new weekly table? We would need independent queue lengths for each, since some chunks have index entries written to both tables. Instead of independent queue lengths, we could look at the consumed-capacity and error metrics, which are per-table. We only need to scale up tables that are hitting provisioned-throughput errors.
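The per-table alternative described above could be sketched roughly as follows. This is a minimal sketch, not Cortex code: the function name and the idea of feeding it a map of per-table throttle counts (taken from CloudWatch or internal metrics over some recent window) are assumptions for illustration.

```go
package main

import "fmt"

// tablesToScaleUp implements the per-table idea from the comment above:
// rather than one global queue length, look at throttled-write counts per
// table and scale up only the tables that are actually hitting
// provisioned-throughput errors. throttles maps table name -> number of
// throttled writes observed over a recent window (hypothetical input).
func tablesToScaleUp(throttles map[string]int) []string {
	var out []string
	for table, n := range throttles {
		if n > 0 { // any throttling means this table is under-provisioned
			out = append(out, table)
		}
	}
	return out
}

func main() {
	// e.g. during the weekly change-over, only the table being throttled
	// needs more capacity (table names are made up for the example).
	fmt.Println(tablesToScaleUp(map[string]int{"cortex_w2340": 17, "cortex_w2341": 0}))
}
```

This naturally handles the change-over period: both the old and new weekly tables are considered independently, so whichever one is throttling gets scaled.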
Thinking about this some more, the flush queue length is not the whole picture. Chunks enter the queue for different reasons: the chunk is full, it went above the max age (default 12 hours), or it went stale (default 5 minutes). The high-level objective is something like "avoid blowing up in memory", so a bunch of stale chunks that are not replaced do not matter as much as a bunch of full or aged chunks which have been replaced. We could add a metric for the rate at which chunks are created, and care more about a large queue if the creation rate is higher than the flush rate.
A bit more historical background (message written in 2017 in a private repo): it appears that, although the auto-scaler will spot there is a problem, it increases capacity on just one of the tables, leaving the other one as a bottleneck, so the system remains constrained until 10:22Z in the example, around 40 minutes after the increased load began. I don't think there is a way to tell AWS to scale the two together, so maybe we have to write something to do it.
DynamoDB is provisioned at a certain ops/sec level which can be scaled up and down. For the past several months we have been using AWS' auto-scaler (#507), but it doesn't really meet the requirement: it will sometimes scale up after brief peaks in throughput, and hotspotting (#733) can reduce throughput, which provokes it to scale down.
The key thing in Cortex is the flush queue: we should scale up when the queue is building, and can scale down when it drops below some reasonable length (10K?). Note DynamoDB limits how often you can scale down within a 24-hour period; check the docs.
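Combining the queue-length rule above with the creation-rate-vs-flush-rate refinement from the comments, the decision logic might look like this. A minimal sketch, not an implementation: the function name, the 10K floor, and the rate inputs are assumptions; actually applying the decision would call the DynamoDB UpdateTable API, subject to its scale-down limits.

```go
package main

import "fmt"

// scaleDecision sketches the proposed policy: scale up while the flush
// queue is long and building faster than it drains; scale down once it
// falls below a floor; otherwise hold. queueLen is the current flush
// queue length; createRate and flushRate are chunks/sec over a recent
// window (hypothetical metrics).
func scaleDecision(queueLen int, createRate, flushRate float64) string {
	const floor = 10000 // the "reasonable length (10K?)" from the issue
	switch {
	case queueLen > floor && createRate > flushRate:
		return "scale-up" // queue is building and not draining
	case queueLen < floor:
		return "scale-down" // subject to DynamoDB's scale-down limits
	default:
		return "hold" // queue is large but already draining
	}
}

func main() {
	fmt.Println(scaleDecision(50000, 200, 120)) // building -> scale-up
	fmt.Println(scaleDecision(2000, 50, 80))    // drained -> scale-down
	fmt.Println(scaleDecision(50000, 100, 150)) // draining -> hold
}
```

The "hold" branch is what distinguishes this from a naive queue-length threshold: per the comment above, a large queue of stale chunks that is already draining does not justify paying for more write capacity.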
Previous issue: #318
Related: #464
Somewhat related to #665