Increase LivenessCheck timeout
Signed-off-by: Anand Rajagopal <anrajag@amazon.com>
rajagopalanand committed Sep 24, 2024
1 parent d829d65 commit f5a7201
Showing 2 changed files with 49 additions and 4 deletions.
45 changes: 45 additions & 0 deletions docs/guides/ruler-high-availability.md
@@ -0,0 +1,45 @@
---
title: "Ruler High Availability"
linkTitle: "Ruler high availability"
weight: 10
slug: ruler-high-availability
---

This guide explains the concepts behind ruler high availability and when to use this feature.

## Background

When rulers are deployed using shuffle sharding, each rule group is evaluated by a single ruler only. All the rulers in
the hash ring will pick the same ruler instance for a given tenant, rule group, and namespace. To learn more about shuffle
sharding, please refer to the [dedicated guide](./shuffle-sharding.md).

There are several scenarios in which rule groups might not be evaluated. A few of them are described below:

- **Bad underlying node**<br />
If the underlying node is unhealthy and unable to send heartbeats, it might take several minutes for the other rulers to mark that ruler as unhealthy in the ring. During this time, no ruler will evaluate the rule groups
owned by the ruler running on the unhealthy node.
- **OOM Kills**<br />
If a ruler gets OOM (Out Of Memory) killed, it has no chance to mark itself as `LEAVING`, and therefore the other rulers will not attempt to take ownership of the rule groups that were being evaluated
by the ruler experiencing OOM kills.
- **Availability zone outage**<br />
If one AZ becomes unavailable, all the rulers in that AZ might experience a network partition while the hash ring still reflects them as healthy. As in the other scenarios, the rulers in the remaining AZs will
not attempt to take ownership of the rule groups being evaluated by the rulers in the affected AZ.

In addition to missed rule evaluations, the ruler APIs will also return 5xx errors in the scenarios mentioned above.

## Replication factor

For a given tenant, rule group, and namespace, the hash ring returns a number of instances equal to the replication factor. For example, if RF=2, the hash ring returns 2 instances; if RF=3, it returns 3
instances. If AZ awareness is enabled, the hash ring picks rulers from different AZs. The rulers are picked for each tenant, rule group, and namespace combination.
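
To illustrate the idea (a simplified sketch, not the actual Cortex ring implementation), the example below hashes the
tenant, namespace, and rule group name and walks a token ring clockwise until it has collected `rf` rulers, preferring
distinct availability zones. The `ruler` type, token values, and `pickReplicas` function are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ruler struct {
	addr  string
	zone  string
	token uint32
}

// pickReplicas walks the token ring clockwise from the hash of the rule group
// key and returns up to rf rulers, skipping rulers in zones already chosen
// when zone awareness is enabled.
func pickReplicas(ring []ruler, tenant, namespace, group string, rf int, zoneAware bool) []ruler {
	sort.Slice(ring, func(i, j int) bool { return ring[i].token < ring[j].token })

	h := fnv.New32a()
	h.Write([]byte(tenant + "/" + namespace + "/" + group))
	key := h.Sum32()

	// Find the first token >= key, wrapping around the ring.
	start := sort.Search(len(ring), func(i int) bool { return ring[i].token >= key }) % len(ring)

	var picked []ruler
	seenZones := map[string]bool{}
	for i := 0; i < len(ring) && len(picked) < rf; i++ {
		r := ring[(start+i)%len(ring)]
		if zoneAware && seenZones[r.zone] {
			continue // prefer spreading replicas across AZs
		}
		seenZones[r.zone] = true
		picked = append(picked, r)
	}
	return picked
}

func main() {
	ring := []ruler{
		{"ruler-0", "az-a", 100}, {"ruler-1", "az-b", 900},
		{"ruler-2", "az-c", 1700}, {"ruler-3", "az-a", 2500},
	}
	// With RF=2, the same (tenant, namespace, group) always maps to the same two rulers.
	fmt.Println(pickReplicas(ring, "tenant-1", "ns", "alerts;group-1", 2, true))
}
```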

## Enabling high availability for evaluation

Setting the flag `-ruler.enable-ha-evaluation` to true and setting `-ruler.ring.replication-factor` > 1 enables non-primary rulers (replicas 2..n) to check whether the replicas ranked before them (1..n-1) are healthy.
For example, if the replication factor is set to 2, the non-primary ruler checks whether the primary is healthy. If the primary is not healthy, the secondary ruler evaluates the rule group. If the primary ruler for that
rule group is healthy, the non-primary ruler either drops ownership or does not take ownership. This check is performed by each ruler when syncing rule groups from storage. This reduces the chances of missing rule group
evaluations, and the maximum duration of missed evaluations is limited to the sync interval of the rule groups.
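
The sketch below illustrates the ownership decision described above; the production logic lives in
`nonPrimaryInstanceOwnsRuleGroup` in `pkg/ruler/ruler.go` (changed by this commit), while the `ownsRuleGroup` helper,
the `checkLiveness` callback, and the addresses here are hypothetical. A non-primary ruler probes the replicas ranked
before it within the configured health-check timeout (`-ruler.eval-ha-healthcheck-timeout`) and takes ownership only if
none of them respond in time.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// ownsRuleGroup decides whether this non-primary ruler should evaluate a rule
// group: it liveness-checks every replica ranked before it and claims the
// group only if none of them answer within the timeout.
func ownsRuleGroup(ctx context.Context, higherRanked []string, timeout time.Duration,
	checkLiveness func(ctx context.Context, addr string) error) bool {

	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	alive := make(chan struct{}, len(higherRanked))
	for _, addr := range higherRanked {
		go func(addr string) {
			if err := checkLiveness(ctx, addr); err == nil {
				alive <- struct{}{}
			}
		}(addr)
	}

	for range higherRanked {
		select {
		case <-alive:
			// A higher-ranked replica is healthy, so it keeps ownership.
			return false
		case <-ctx.Done():
			// Nobody answered before the configured timeout expired.
			return true
		}
	}
	// No higher-ranked replicas at all: this ruler is effectively the primary.
	return true
}

func main() {
	// Simulate an unreachable primary: the liveness check never succeeds.
	down := func(ctx context.Context, addr string) error {
		<-ctx.Done()
		return ctx.Err()
	}
	owns := ownsRuleGroup(context.Background(), []string{"ruler-primary:9095"}, time.Second, down)
	fmt.Println("secondary takes ownership:", owns) // true after the 1s timeout
}
```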

## Enabling high availability for API

Setting the replication factor > 1 instructs non-primary rulers to store a backup of the rule groups. It is important to note that the backup does not contain any state. This allows API calls to be fault-tolerant. Depending
upon
8 changes: 4 additions & 4 deletions pkg/ruler/ruler.go
@@ -81,8 +81,6 @@ const (
 	unknownHealthFilter string = "unknown"
 	okHealthFilter      string = "ok"
 	errHealthFilter     string = "err"
-
-	livenessCheckTimeout = 100 * time.Millisecond
 )

type DisabledRuleGroupErr struct {
@@ -161,7 +159,8 @@ type Config struct {
 	EnableQueryStats      bool `yaml:"query_stats_enabled"`
 	DisableRuleGroupLabel bool `yaml:"disable_rule_group_label"`
 
-	EnableHAEvaluation bool `yaml:"enable_ha_evaluation"`
+	EnableHAEvaluation       bool          `yaml:"enable_ha_evaluation"`
+	EvalHAHealthCheckTimeout time.Duration `yaml:"eval_ha_health_check_timeout"`
 }

// Validate config and returns error on failure
@@ -238,6 +237,7 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
 	f.BoolVar(&cfg.DisableRuleGroupLabel, "ruler.disable-rule-group-label", false, "Disable the rule_group label on exported metrics")
 
 	f.BoolVar(&cfg.EnableHAEvaluation, "ruler.enable-ha-evaluation", false, "Enable high availability")
+	f.DurationVar(&cfg.EvalHAHealthCheckTimeout, "ruler.eval-ha-healthcheck-timeout", 1*time.Second, "Health check timeout for evaluation HA")
 
 	cfg.RingCheckPeriod = 5 * time.Second
 }
@@ -590,7 +590,7 @@ func (r *Ruler) nonPrimaryInstanceOwnsRuleGroup(g *rulespb.RuleGroupDesc, replic
 	responseChan := make(chan *LivenessCheckResponse, len(jobs))
 
 	ctx := user.InjectOrgID(context.Background(), userID)
-	ctx, cancel := context.WithTimeout(ctx, livenessCheckTimeout)
+	ctx, cancel := context.WithTimeout(ctx, r.cfg.EvalHAHealthCheckTimeout)
 	defer cancel()
 
 	err := concurrency.ForEach(ctx, jobs, len(jobs), func(ctx context.Context, job interface{}) error {
