Suspend ReplaceUnhealthy process for AWS tikv auto-scaling-group #962
Labels
cloud/aws
Amazon Web Services
Feature Request
Is your feature request related to a problem? Please describe:
The AWS Terraform script uses an auto-scaling group for every component (pd/tikv/tidb/monitor). When an EC2 instance fails its health check, the auto-scaling group replaces it. This is helpful for stateless applications and for applications that store their data on EBS volumes.
However, TiKV pods store their data on the instance store. When an instance is replaced, all data on its instance store is lost, and TiKV has to resync all of that data to the newly added instance. Although TiDB is a distributed database and keeps working when a node fails, the resync is expensive when the dataset is large. Besides, the EC2 instance may return to a healthy state after a reboot.
So for TiKV, it is preferable to disable the auto-scaling group's replace behavior.
According to its documentation, an auto-scaling group's scaling processes can be suspended and resumed, and Terraform supports setting this field on the auto-scaling group.
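A minimal sketch of what this could look like in Terraform, assuming the TiKV group is declared directly as an `aws_autoscaling_group` resource (the actual script may wrap it in a module instead). The resource name, variable names, and sizes below are hypothetical, and the syntax assumes Terraform 0.12 or later; the only line that matters is `suspended_processes`:

```hcl
# Hypothetical inputs, declared only so the fragment is self-contained.
variable "tikv_subnet_ids" {
  type = list(string)
}

variable "tikv_launch_configuration" {
  type = string
}

# Hypothetical TiKV auto-scaling group; names and sizes are placeholders.
resource "aws_autoscaling_group" "tikv" {
  name                 = "example-tikv"
  min_size             = 3
  max_size             = 3
  desired_capacity     = 3
  vpc_zone_identifier  = var.tikv_subnet_ids
  launch_configuration = var.tikv_launch_configuration

  # Health checks still run, but instances marked unhealthy are no longer
  # terminated and replaced, so the instance-store data is left in place.
  suspended_processes = ["ReplaceUnhealthy"]
}
```

With the process suspended, an unhealthy TiKV instance stays in the group until an operator reboots or replaces it by hand, so some separate alerting on instance health is probably still needed.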