Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Signed-off-by: Anand Rajagopal <anrajag@amazon.com>
1 parent
d829d65
commit f5a7201
Showing 2 changed files with 49 additions and 4 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
--- | ||
title: "Ruler High Availability" | ||
linkTitle: "Ruler high availability" | ||
weight: 10 | ||
slug: ruler-high-availability | ||
--- | ||
|
||
This guide explains the concepts behind ruler high availability and when to use this feature. | ||
|
||
## Background | ||
|
||
When rulers are deployed using shuffle sharding, each rule group is evaluated by a single ruler only. All the rulers in | ||
the hash ring will pick the same ruler instance for a given tenant, rule group, and namespace. To learn more about shuffle | ||
sharding, please refer to the [dedicated guide](./shuffle-sharding.md). | ||
|
||
There are several scenarios in which rule groups might not be evaluated. A few of them are described below: | ||
|
||
- **Bad underlying node**<br /> | ||
If the underlying node is unhealthy and is unable to send heartbeats, it might take several minutes for other rulers to mark the ruler as unhealthy in the ring. During this time, no ruler will evaluate the rule groups | ||
that are owned by the ruler running on the unhealthy node. | ||
- **OOM Kills**<br /> | ||
If a ruler gets OOM (Out Of Memory) killed, it has no chance to mark itself as `LEAVING`, and therefore none of the other rulers will attempt to take ownership of the rule groups that were being evaluated | ||
by the ruler that is experiencing OOM kills. | ||
- **Availability zone outage**<br /> | ||
If one AZ becomes unavailable, all the rulers in that AZ might experience a network partition while the hash ring still reflects these rulers as healthy. As in the other scenarios, the rulers in other AZs will | ||
not attempt to take ownership of rule groups being evaluated by pods in the bad AZ. | ||
|
||
In addition to missed rule evaluations, ruler APIs will also return 5xx errors in the scenarios mentioned above. | ||
|
||
## Replication factor | ||
|
||
The hash ring returns a number of instances equal to the replication factor (RF) for a given tenant, rule group, and namespace. For example, if RF=2, the hash ring returns 2 instances. If RF=3, the hash ring returns 3 | ||
instances. If AZ awareness is enabled, the hash ring picks rulers from different AZs. The rulers are picked for each tenant, rule group, and namespace combination. | ||
|
||
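The selection described above can be sketched with a toy hash ring. This is a minimal illustration only, not the Cortex ring implementation: the `token` hashing scheme, the `pick_rulers` function, and the ruler and zone names are all hypothetical, and the real ring uses its own tokens and health checks.

```python
import hashlib
from bisect import bisect_right


def token(key: str) -> int:
    # Stable 32-bit token derived from a key (hypothetical hashing scheme).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


def pick_rulers(rulers, tenant, namespace, group, rf, zone_aware=False):
    """Return up to `rf` ruler names for a rule group, primary first.

    `rulers` maps a ruler name to its availability zone.
    """
    ring = sorted((token(name), name) for name in rulers)
    tokens = [t for t, _ in ring]
    # Walk the ring clockwise, starting after the rule group's own token.
    start = bisect_right(tokens, token(f"{tenant}/{namespace}/{group}"))
    picked, zones = [], set()
    for i in range(len(ring)):
        name = ring[(start + i) % len(ring)][1]
        if zone_aware and rulers[name] in zones:
            continue  # skip rulers in an AZ that is already represented
        picked.append(name)
        zones.add(rulers[name])
        if len(picked) == rf:
            break
    return picked


rulers = {"ruler-1": "az-a", "ruler-2": "az-b", "ruler-3": "az-c", "ruler-4": "az-a"}
replicas = pick_rulers(rulers, "tenant-1", "ns", "group-1", rf=3, zone_aware=True)
```

Because every ruler computes the same tokens from the same inputs, each one arrives at the same ordered list for a given tenant, rule group, and namespace, with the first entry acting as the primary.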
## Enabling high availability for evaluation | ||
|
||
Setting the flag `-ruler.enable-ha-evaluation` to true and setting `ruler.ring.replication-factor` > 1 will enable non-primary rulers (replicas 2..n) to check whether replicas 1..n-1 are healthy. For example, if the replication factor is set | ||
to 2, then the non-primary ruler will check whether the primary is healthy. If the primary is not healthy, the secondary ruler will evaluate the rule group. If the primary ruler for that rule group is healthy, the non-primary ruler | ||
will either drop the ownership or will not take ownership. This check is performed by each ruler when syncing rule groups from storage. This reduces the chances of missing rule group evaluations, and the maximum duration | ||
of missed evaluations is limited to the sync interval of the rule groups. | ||
|
||
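The ownership check performed at sync time can be sketched as follows. This is a hedged illustration under stated assumptions: `should_evaluate` and `is_healthy` are hypothetical names, and the rule that a replica takes over only when every earlier replica is unhealthy is one reading of the "replicas 2..n check 1..n-1" description; the text spells out only the RF=2 case.

```python
def should_evaluate(replicas, me, is_healthy):
    """Decide whether ruler `me` should evaluate a rule group.

    `replicas` is the ordered list returned by the ring (primary first).
    `is_healthy` is a callable reporting whether a ruler is healthy.
    """
    if me not in replicas:
        return False  # this ruler is not a replica for the rule group
    position = replicas.index(me)
    # The primary (position 0) always evaluates. A later replica takes
    # ownership only if no replica ahead of it is healthy (assumption).
    return not any(is_healthy(r) for r in replicas[:position])


replicas = ["ruler-1", "ruler-2"]
unhealthy = {"ruler-1"}
takes_over = should_evaluate(replicas, "ruler-2", lambda r: r not in unhealthy)
```

Since the check runs on every sync, a secondary that took over drops ownership again on the first sync after the primary recovers, which is why the gap in evaluations is bounded by the sync interval.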
## Enabling high availability for API | ||
|
||
Setting the replication factor > 1 will instruct non-primary rulers to store a backup of rule groups. It is important to note that the backup does not contain any state. This allows API calls to be fault-tolerant. Depending | ||
upon |