-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: track replicate queue metrics by allocator action #85844
Conversation
@@ -186,6 +186,78 @@ var ( | |||
Measurement: "Demotions of Voters to Non Voters", | |||
Unit: metric.Unit_COUNT, | |||
} | |||
metaReplicateQueueAddReplicaSuccessCount = metric.Metadata{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you didn't also add rebalance (non)voter?
4bb06ce
to
6df9b71
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @kvoli)
pkg/kv/kvserver/replicate_queue.go
line 189 at r1 (raw file):
Previously, kvoli (Austen) wrote…
Is there a reason you didn't also add rebalance (non)voter?
Is this not covered by AllocatorAdd(Non)Voter
and AllocatorRemove(Non)Voter
? Is this the AllocatorConsiderRebalance
case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @AlexTalks)
-- commits
line 10 at r2:
nit: this won't track rebalancing currently.
pkg/kv/kvserver/replicate_queue.go
line 189 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Is this not covered by
AllocatorAdd(Non)Voter
andAllocatorRemove(Non)Voter
? Is this theAllocatorConsiderRebalance
case?
It's the AllocatorConsiderRebalance
case. It's not covered by either of those - since it calls rq.changeReplicas()
which won't hit those.
Sorry I didn't see the comment, you are already on it.
Code snippet:
// TODO(sarkesian): Consider adding metrics for AllocatorRemoveLearner,
// AllocatorConsiderRebalance, and AllocatorFinalizeAtomicReplicationChange
// allocator actions.
I also think there's something funky going on with the chart_catalog that CI doesn't like.
80d3200
to
94e22b2
Compare
While previously we had metrics within the replicate queue which tracked the number of particular actions processed by the queue based on a unique set of categories, this change adds new metrics for tracking the successes/errors of a replica being processed by the replicate queue, using the allocator action as a method of categorizing these actions. With this categorization, we are able to track success and error counts during rebalancing, upreplicating when we have a dead node, or decommissioning. The categorization makes no distinction between actions relatinv to voter replicas vs non-voter replicas, so they are aggregated across these two types. Release note (ops change): added new metrics: ``` queue.replicate.addreplica.(success|error) queue.replicate.removereplica.(success|error) queue.replicate.replacedeadreplica.(success|error) queue.replicate.removedeadreplica.(success|error) queue.replicate.replacedecommissioningreplica.(success|error) queue.replicate.removedecommissioningreplica.(success|error) ```
94e22b2
to
62b5e8b
Compare
bors r+ |
Build failed (retrying...): |
Build failed (retrying...): |
Build succeeded: |
In cockroachdb#85844, tracking replicate queue success and error metrics by allocator action was introduced, however this change inadvertently included a bug due to missing the `RemoveDead(Non)Voter` allocator action case, in combination with a panic on unexpected actions. This change fixes the missing case as well as the panic. Release justification: Bug fix Release note: None
In cockroachdb#85844, tracking replicate queue success and error metrics by allocator action was introduced, however this change inadvertently included a bug due to missing the `RemoveDead(Non)Voter` allocator action case, in combination with a panic on unexpected actions. This change fixes the missing case as well as the panic. Release justification: Bug fix Release note: None
In cockroachdb#85844, tracking replicate queue success and error metrics by allocator action was introduced, however this change inadvertently included a bug due to missing the `RemoveDead(Non)Voter` allocator action case, in combination with a panic on unexpected actions. This change fixes the missing case as well as the panic. Release justification: Bug fix Release note: None
85704: opt: session setting to enforce queries access only home region rows r=msirek a=msirek Fixes #86228 This commit adds a new session setting, `enforce_home_region`, which causes queries to error out if they need to talk to regions other than the gateway region to answer the query. A home region specifies the region(s) from which consistent reads from a set of rows in a table can be served locally. The home region for a set of rows in a multiregion table is determined differently depending on the type of multiregion table involved: | Locality | Home Region | | -------- | ----------- | | REGIONAL BY ROW | Home region determined by crdb_region column value | | REGIONAL BY TABLE | All rows share a single home region | | GLOBAL | Any region can act as the home region | When `enforce_home_region` is true, and a query has no home region (for example, reading from different home regions in a REGIONAL BY ROW table), error code 42899 (`QueryHasNoHomeRegion`) is returned. When `enforce_home_region` is true, and a query's home region differs from the gateway region, error code 42898 (`QueryNotRunningInHomeRegion`) is returned. The error message, in some instances, provides a useful tip on possible steps to take to allow the query to run entirely in the gateway region, e.g., ``` Query is not running in its home region. Try running the query from region 'ap-southeast-2'. Query has no home region. Query has no home region. Try adding a filter on table.crdb_region and/or on key column (table.id) Query has no home region. Try accessing only tables in multi-region databases with ZONE survivability. Query has no home region. The home region 'us-east-1' of table 'messages_rbt' does not match the home region 'ap-southeast-2' of lookup table 'messages_rbr'. ``` Support for this new session mode is being added in 3 phases. This commit consists of phase 1, which include only simple static checks during query compilation for the following allowed cases: - A scan of a table with `LOCALITY REGIONAL BY TABLE` with primary region matching the gateway region - A scan of a table with `LOCALITY GLOBAL` - A scan of a table with `LOCALITY REGIONAL BY ROW` using only local constraints (e.g. crdb_region = 'ca-central-1') - A scan of a table with `LOCALITY REGIONAL BY ROW` using locality-optimized search. - A locality-optimized lookup join with a `LOCALITY REGIONAL BY ROW` table. Only tables in multiregion databases with ZONE survivability may be scanned without error because with REGION survivability, ranges in a down region may be served non-local to the gateway region, so are not guaranteed to have low latency. Note that locality-optimized search and locality-optimized join are not guaranteed to scan no remote rows, but are still allowed. Release note (sql change): A new session setting, enforce_home_region, is added, which when true causes queries which have no home region or which may scan rows via a database connection outside of the query's home region to error out. Also, only tables in multiregion databases with ZONE survivability may be scanned without error when this setting is true because with REGION survivability, ranges in a down region may be served non-local to the gateway region, so are not guaranteed to have low latency. Release justification: Low risk feature 86844: kvserver: fix replicate queue metrics tracking bug r=AlexTalks a=AlexTalks In #85844, tracking replicate queue success and error metrics by allocator action was introduced, however this change inadvertently included a bug due to missing the `RemoveDead(Non)Voter` allocator action case, in combination with a panic on unexpected actions. This change fixes the missing case as well as the panic. Fixes #86544, fixes #86374, fixes #86107. Release justification: Bug fix Release note: None Co-authored-by: Mark Sirek <sirek@cockroachlabs.com> Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>
While previously we had metrics within the replicate queue which tracked
the number of particular actions processed by the queue based on a
unique set of categories, this change adds new metrics for tracking the
successes/errors of a replica being processed by the replicate queue,
using the allocator action as a method of categorizing these actions.
With this categorization, we are able to track success and error counts
during rebalancing, upreplicating when we have a dead node, or
decommissioning. The categorization makes no distinction between actions
relatinv to voter replicas vs non-voter replicas, so they are aggregated
across these two types.
Release note (ops change): added new metrics: