
Autoscaling of the high availability service #2639

Open
NShaforostov opened this issue May 24, 2022 · 3 comments
Labels
kind/enhancement (New feature or request), state/verify (Issues that are already addressed and require validation), sys/core (Issues related to core functionality: API, VM management, ...), sys/gui (Issues related to the web gui)

Comments

@NShaforostov
Collaborator

NShaforostov commented May 24, 2022

Background

As separate Cloud Pipeline deployments may periodically face large workload peaks, it would be useful to implement autoscaling of the system nodes (HA service) - to allow the service to scale up and down according to the actual workload.

Approach

We shall monitor the state of the system API instances (at least their RAM and/or CPU consumption).
The HA service shall have a minimum number of instances to keep running at all times.
If the consumption exceeds some predefined threshold for some period of time - new instances for the system needs shall be launched (i.e. the HA service shall be scaled up).
If the workload subsides, the additional instances shall be stopped (i.e. the HA service shall be scaled down - but not below the predefined minimum number of instances).

I suppose that the described behavior shall be managed by some new system preferences, e.g. (a rough sketch of how they could drive the scaling decision is given after the list):

  • a preference that enables HA autoscaling
  • the minimal number of instances
  • resource thresholds which, when exceeded, trigger an autoscaling action
  • the number of instances added/removed in each autoscaling step
  • the frequency of the resource consumption measurements (time period between two measurements)
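
For illustration only, a minimal python sketch of how such preferences could drive the scaling decision. The preference names and the way consumption is measured are made up for this sketch and are not existing Cloud Pipeline settings:

```python
# Hypothetical sketch: the preference names below are illustrative only,
# not existing Cloud Pipeline system preferences.
PREFERENCES = {
    'ha.autoscale.enabled': True,          # enables HA autoscaling
    'ha.autoscale.min.instances': 2,       # never scale below this number
    'ha.autoscale.cpu.threshold': 80,      # % CPU that triggers scaling up
    'ha.autoscale.scale.step': 1,          # instances added/removed per action
    'ha.autoscale.poll.period.sec': 60,    # time between two measurements
}


def decide_scaling(current_instances, consumption_history):
    """Return the desired number of HA service instances.

    consumption_history is a list of the last few CPU consumption
    measurements (in %), one per polling period.
    """
    if not PREFERENCES['ha.autoscale.enabled']:
        return current_instances
    threshold = PREFERENCES['ha.autoscale.cpu.threshold']
    step = PREFERENCES['ha.autoscale.scale.step']
    minimum = PREFERENCES['ha.autoscale.min.instances']
    # Scale up if every recent measurement exceeded the threshold,
    # scale down if every recent measurement was well below it.
    if all(value > threshold for value in consumption_history):
        return current_instances + step
    if all(value < threshold / 2 for value in consumption_history):
        return max(current_instances - step, minimum)
    return current_instances
```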

Additionally

  1. Each HA service autoscaling action (scaling up or down) should be accompanied by a corresponding email to the admin
  2. Add and show at the GUI (Cluster state page) new labels for the HA service nodes:
    • each label shall show the state of the corresponding running service instance
    • each label shall be colorized according to the current instance consumption, for example: if the consumption is less than 50% - the label shall be green, between 50% and 90% - orange, over 90% - red (see the sketch after this list)
  3. Add a new filter at the GUI (Cluster state page) to show only system service instances:
    • with this filter enabled, only system service instances shall be shown in the nodes list
    • system service instances shall not be displayed when the filter "No run id" is selected
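
For illustration, a possible mapping of instance consumption to the proposed label colors (the thresholds are the ones from item 2 above):

```python
def label_color(consumption_percent):
    """Map the current instance consumption to the proposed label color."""
    if consumption_percent < 50:
        return 'green'
    if consumption_percent <= 90:
        return 'orange'
    return 'red'
```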
@NShaforostov NShaforostov added kind/enhancement New feature or request sys/gui Issues related to the web gui sys/core Issues related to core functionality (API, VM management, ...) labels May 24, 2022
@tcibinan
Contributor

tcibinan commented Jun 1, 2022

Goals

From a technical point of view, we would like to achieve the following goals:

  1. Kubernetes cluster shall be autoscaled based on specific deployment utilization.
  2. Autoscaling shall not depend on other Cloud Pipeline services.
  3. Autoscaling shall be expandable in terms of autoscaling triggers and target deployments.
  4. Autoscaling shall support independent multiple deployment autoscaling.
  5. Autoscaling shall not abort most of the running requests/operations.

Implementation

I suggest using an additional autoscaling service which can horizontally autoscale both kubernetes deployments and kubernetes nodes in order to achieve some predefined target utilization. The following key points give a more in-depth understanding of the approach.

  1. The autoscaling service is an independent kubernetes deployment itself.
  2. An autoscaling service deployment is created for each target deployment.
  3. The autoscaling service configuration resides in a kubernetes configmap as a simple json configuration (see the sketch below).
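
To illustrate point 3, a minimal sketch of how the autoscaling service could read its json configuration using the official kubernetes python client. The configmap name, the namespace and the config.json key are assumptions made for this sketch, not the actual resource names:

```python
import json

from kubernetes import client, config


def load_autoscaler_config(namespace='default',
                           configmap_name='cp-api-srv-autoscaler'):
    """Read the autoscaler json configuration from a kubernetes configmap.

    The configmap name, namespace and the 'config.json' key are assumptions
    made for this sketch.
    """
    config.load_incluster_config()  # the autoscaling service runs inside the cluster
    core_api = client.CoreV1Api()
    configmap = core_api.read_namespaced_config_map(configmap_name, namespace)
    return json.loads(configmap.data['config.json'])
```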

Algorithm

The following autoscaling algorithm can be used by the autoscaling service (a rough sketch of the loop is given after the list).

  1. find the deployment
  2. find the corresponding pods
  3. find the corresponding nodes
    • observe all nodes
    • distinguish static and autoscaled nodes
    • manage only autoscaled nodes
    • ignore autoscaled nodes which have non target pods
  4. check triggers
    • disk pressure statuses of target nodes (ex. target statuses number = 0 disk pressure statuses)
    • ram pressure statuses of target nodes (ex. target statuses number = 0 ram pressure statuses)
    • cpu utilization of target nodes (ex. target utilization = 50 ± 10%: scale up at 60%, scale down at 40%)
    • ram utilization of target nodes (ex. target utilization = 50 ± 10%: scale up at 60%, scale down at 40%)
    • cluster nodes per target pod coefficient (ex. target coefficient = 100 cluster nodes per 1 target pod)
    • target pods per node coefficient (ex. target coefficient = 2 target pods per 1 node)
    • target pod failures per hour coefficient (ex. target coefficient = 3 pod failures per hour)
  5. check limits
    • minimum trigger duration (ex. trigger is active for 1 minute)
    • minimum pods number (ex. 2 pods minimum)
    • maximum pods number (ex. 10 pods maximum)
    • minimum nodes number (ex. 2 nodes minimum)
    • maximum nodes number (ex. 10 nodes maximum)
    • post scale delay (ex. scale no more often than once per 5 minutes)
  6. scale up node if needed
    • launch instance
    • attach node
    • set labels
  7. scale up deployment if needed
  8. scale down node if needed
    • drain node
    • terminate instance
  9. scale down deployment if needed
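
A rough, simplified sketch of this loop using the kubernetes python client. The labels, thresholds, the metric helper and the node/instance operations are placeholders for this sketch, not the actual implementation:

```python
import time

from kubernetes import client, config

TARGET_POD_LABEL = 'cloud-pipeline/cp-api-srv'            # label of the target pods
AUTOSCALED_NODE_LABEL = 'cloud-pipeline/autoscaled-node'  # placeholder node label


def get_average_cpu_utilization(nodes):
    """Placeholder: a real service would query the metrics API here."""
    return 50.0


def autoscaling_loop(deployment='cp-api-srv', namespace='default'):
    config.load_incluster_config()
    apps_api = client.AppsV1Api()
    core_api = client.CoreV1Api()
    while True:
        # 1-3. find the deployment, the corresponding pods and the autoscaled nodes
        target = apps_api.read_namespaced_deployment(deployment, namespace)
        pods = core_api.list_namespaced_pod(namespace, label_selector=TARGET_POD_LABEL)
        nodes = core_api.list_node(label_selector=AUTOSCALED_NODE_LABEL)
        # 4-5. check triggers and limits (only cpu utilization is sketched here)
        utilization = get_average_cpu_utilization(nodes.items)
        replicas = target.spec.replicas
        if utilization > 60 and replicas < 10:
            # 6-7. launch an instance, attach and label the node (cloud specific,
            # omitted here), then scale up the deployment
            scale_deployment(apps_api, deployment, namespace, replicas + 1)
        elif utilization < 40 and replicas > 2:
            # 8-9. drain a spare node and terminate its instance (cloud specific,
            # omitted here), then scale down the deployment
            scale_deployment(apps_api, deployment, namespace, replicas - 1)
        time.sleep(60)  # post scale delay / period between two measurements


def scale_deployment(apps_api, deployment, namespace, replicas):
    apps_api.patch_namespaced_deployment_scale(
        deployment, namespace, {'spec': {'replicas': replicas}})
```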

Configuration

The following settings shall be configured for the autoscaling service to work (an illustrative example is given after the list):

  1. kubernetes deployment to manage (ex. cp-api-srv)
  2. kubernetes labels to manage (ex. cloud-pipeline/cp-api-srv)
  3. triggers to check (ex. cpu utilization = 50%)
  4. limits to consider (ex. from 1 to 5 pods/nodes)
  5. cloud instance to scale (ex. instance type, iam role, security groups, etc.)
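
Purely as an illustration, these settings could be represented in the configmap json along the following lines. The key names are invented for the sketch; only the example values come from the list above:

```python
# Hypothetical configuration layout; the key names are invented for illustration.
autoscaler_config = {
    'deployment': 'cp-api-srv',
    'labels': ['cloud-pipeline/cp-api-srv'],
    'triggers': {'cpu_utilization': 50},
    'limits': {'min_pods': 1, 'max_pods': 5, 'min_nodes': 1, 'max_nodes': 5},
    'instance': {
        'type': '<instance type>',
        'iam_role': '<iam role>',
        'security_groups': ['<security group>'],
    },
}
```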

Questions

  1. Shall the autoscaling be configurable from the Cloud Pipeline GUI?

@sidoruka sidoruka changed the title [DRAFT] Autoscaling of the high availability service Autoscaling of the high availability service Jun 13, 2022
@maryvictol
Collaborator

The following autoscaler parameters were checked (see the combined example after the list):
trigger:

  • cluster_nodes_per_target_replicas,
  • target_replicas_per_target_nodes,
  • cpu_utilization: max, monitoring_period
  • memory_utilization: max, monitoring_period

rules:

  • on_threshold_trigger: extra_replicas, extra_nodes

limit:

  • min_nodes_number,
  • max_nodes_number,
  • min_replicas_number,
  • max_replicas_number,
  • min_scale_interval,
  • min_triggers_duration
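
For reference, the checked parameters combined into a single illustrative configuration. The grouping and parameter names follow the list above; the concrete values are made up:

```python
# Illustrative values only; the parameter names are the checked ones listed above.
autoscaler_parameters = {
    'trigger': {
        'cluster_nodes_per_target_replicas': 100,
        'target_replicas_per_target_nodes': 2,
        'cpu_utilization': {'max': 80, 'monitoring_period': 60},
        'memory_utilization': {'max': 80, 'monitoring_period': 60},
    },
    'rules': {
        'on_threshold_trigger': {'extra_replicas': 1, 'extra_nodes': 1},
    },
    'limit': {
        'min_nodes_number': 1,
        'max_nodes_number': 5,
        'min_replicas_number': 1,
        'max_replicas_number': 5,
        'min_scale_interval': 300,
        'min_triggers_duration': 60,
    },
}
```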

tcibinan added a commit that referenced this issue Aug 16, 2022
Relates to #2639 and fixes changes introduced in #2735.
tcibinan added a commit that referenced this issue Aug 19, 2022
Relates to #2639 and fixes changes introduced in #2735.
@tcibinan
Contributor

Cherry-picked to release/0.16 via 46ba80c, 4eb26db, a19e73f, e60fbcd and 90e593c.

@tcibinan tcibinan added the state/verify Issues that are already addressed and require validation label Aug 19, 2022
tcibinan added a commit that referenced this issue Aug 23, 2022
tcibinan added a commit that referenced this issue Aug 23, 2022
tcibinan added a commit that referenced this issue Oct 17, 2022
Relates to #2639 and fixes changes introduced in #2685.
tcibinan added a commit that referenced this issue Oct 17, 2022
Relates to #2639 and fixes changes introduced in #2685.