Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: track replicate queue metrics by allocator action #85844

Merged
merged 1 commit into from
Aug 13, 2022

Conversation

AlexTalks
Copy link
Contributor

@AlexTalks AlexTalks commented Aug 9, 2022

While previously we had metrics within the replicate queue which tracked
the number of particular actions processed by the queue based on a
unique set of categories, this change adds new metrics for tracking the
successes/errors of a replica being processed by the replicate queue,
using the allocator action as a method of categorizing these actions.
With this categorization, we are able to track success and error counts
during rebalancing, upreplicating when we have a dead node, or
decommissioning. The categorization makes no distinction between actions
relatinv to voter replicas vs non-voter replicas, so they are aggregated
across these two types.

Release note (ops change): added new metrics:

queue.replicate.addreplica.(success|error)
queue.replicate.removereplica.(success|error)
queue.replicate.replacedeadreplica.(success|error)
queue.replicate.removedeadreplica.(success|error)
queue.replicate.replacedecommissioningreplica.(success|error)
queue.replicate.removedecommissioningreplica.(success|error)

@AlexTalks AlexTalks requested a review from a team as a code owner August 9, 2022 19:58
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@AlexTalks AlexTalks requested a review from aayushshah15 August 10, 2022 00:51
@@ -186,6 +186,78 @@ var (
Measurement: "Demotions of Voters to Non Voters",
Unit: metric.Unit_COUNT,
}
metaReplicateQueueAddReplicaSuccessCount = metric.Metadata{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you didn't also add rebalance (non)voter?

@AlexTalks AlexTalks force-pushed the rq_metrics_by_action branch from 4bb06ce to 6df9b71 Compare August 12, 2022 00:23
@AlexTalks AlexTalks requested a review from a team August 12, 2022 00:23
Copy link
Contributor Author

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @kvoli)


pkg/kv/kvserver/replicate_queue.go line 189 at r1 (raw file):

Previously, kvoli (Austen) wrote…

Is there a reason you didn't also add rebalance (non)voter?

Is this not covered by AllocatorAdd(Non)Voter and AllocatorRemove(Non)Voter? Is this the AllocatorConsiderRebalance case?

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @AlexTalks)


-- commits line 10 at r2:
nit: this won't track rebalancing currently.


pkg/kv/kvserver/replicate_queue.go line 189 at r1 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Is this not covered by AllocatorAdd(Non)Voter and AllocatorRemove(Non)Voter? Is this the AllocatorConsiderRebalance case?

It's the AllocatorConsiderRebalance case. It's not covered by either of those - since it calls rq.changeReplicas() which won't hit those.

Sorry I didn't see the comment, you are already on it.

Code snippet:

	// TODO(sarkesian): Consider adding metrics for AllocatorRemoveLearner,
	// AllocatorConsiderRebalance, and AllocatorFinalizeAtomicReplicationChange
	// allocator actions.

I also think there's something funky going on with the chart_catalog that CI doesn't like.

@AlexTalks AlexTalks force-pushed the rq_metrics_by_action branch 3 times, most recently from 80d3200 to 94e22b2 Compare August 12, 2022 19:06
While previously we had metrics within the replicate queue which tracked
the number of particular actions processed by the queue based on a
unique set of categories, this change adds new metrics for tracking the
successes/errors of a replica being processed by the replicate queue,
using the allocator action as a method of categorizing these actions.
With this categorization, we are able to track success and error counts
during rebalancing, upreplicating when we have a dead node, or
decommissioning. The categorization makes no distinction between actions
relatinv to voter replicas vs non-voter replicas, so they are aggregated
across these two types.

Release note (ops change): added new metrics:
```
queue.replicate.addreplica.(success|error)
queue.replicate.removereplica.(success|error)
queue.replicate.replacedeadreplica.(success|error)
queue.replicate.removedeadreplica.(success|error)
queue.replicate.replacedecommissioningreplica.(success|error)
queue.replicate.removedecommissioningreplica.(success|error)
```
@AlexTalks AlexTalks force-pushed the rq_metrics_by_action branch from 94e22b2 to 62b5e8b Compare August 12, 2022 19:08
@AlexTalks
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 12, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Aug 13, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Aug 13, 2022

Build succeeded:

@craig craig bot merged commit 158ebe6 into cockroachdb:master Aug 13, 2022
@AlexTalks AlexTalks deleted the rq_metrics_by_action branch August 15, 2022 19:40
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Aug 25, 2022
In cockroachdb#85844, tracking replicate queue success and error metrics by
allocator action was introduced, however this change inadvertently
included a bug due to missing the `RemoveDead(Non)Voter` allocator
action case, in combination with a panic on unexpected actions. This
change fixes the missing case as well as the panic.

Release justification: Bug fix
Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Aug 25, 2022
In cockroachdb#85844, tracking replicate queue success and error metrics by
allocator action was introduced, however this change inadvertently
included a bug due to missing the `RemoveDead(Non)Voter` allocator
action case, in combination with a panic on unexpected actions. This
change fixes the missing case as well as the panic.

Release justification: Bug fix
Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Aug 26, 2022
In cockroachdb#85844, tracking replicate queue success and error metrics by
allocator action was introduced, however this change inadvertently
included a bug due to missing the `RemoveDead(Non)Voter` allocator
action case, in combination with a panic on unexpected actions. This
change fixes the missing case as well as the panic.

Release justification: Bug fix
Release note: None
craig bot pushed a commit that referenced this pull request Aug 26, 2022
85704: opt: session setting to enforce queries access only home region rows r=msirek a=msirek

Fixes #86228

This commit adds a new session setting, `enforce_home_region`, which
causes queries to error out if they need to talk to regions other than
the gateway region to answer the query.

A home region specifies the region(s) from which consistent reads from a
set of rows in a table can be served locally. The home region for a set 
of rows in a multiregion table is determined differently depending on 
the type of multiregion table involved:
| Locality | Home Region |
| -------- | ----------- |
| REGIONAL BY ROW | Home region determined by crdb_region column value |
| REGIONAL BY TABLE | All rows share a single home region |
| GLOBAL | Any region can act as the home region |

When `enforce_home_region` is true, and a query has no home region
(for example, reading from different home regions in a REGIONAL BY ROW
table), error code 42899 (`QueryHasNoHomeRegion`) is returned.
When `enforce_home_region` is true, and a query's home region differs
from the gateway region, error code 42898
(`QueryNotRunningInHomeRegion`) is returned.
The error message, in some instances, provides a useful tip on possible
steps to take to allow the query to run entirely in the gateway region,
e.g.,
```
Query is not running in its home region. Try running the query from region 'ap-southeast-2'.

Query has no home region. Query has no home region.                                                                                
                          Try adding a filter on table.crdb_region and/or on key column (table.id)   
                          
Query has no home region. Try accessing only tables in multi-region databases with ZONE survivability.

Query has no home region. The home region 'us-east-1' of table 'messages_rbt'                                                                                          
                          does not match the home region 'ap-southeast-2' of lookup table 'messages_rbr'.     
```
Support for this new session mode is being added in 3 phases.
This commit consists of phase 1, which include only simple static checks
during query compilation for the following allowed cases:
- A scan of a table with `LOCALITY REGIONAL BY TABLE` with primary  
region matching the gateway region
- A scan of a table with `LOCALITY GLOBAL`
- A scan of a table with `LOCALITY REGIONAL BY ROW` using only local
constraints (e.g. crdb_region = 'ca-central-1')
- A scan of a table with `LOCALITY REGIONAL BY ROW` using
locality-optimized search.
- A locality-optimized lookup join with a `LOCALITY REGIONAL BY ROW` table.

Only tables in multiregion databases with ZONE survivability may be
scanned without error because with REGION survivability, ranges in a
down region may be served non-local to the gateway region, so are not
guaranteed to have low latency.

Note that locality-optimized search and locality-optimized join are not
guaranteed to scan no remote rows, but are still allowed. 

Release note (sql change): A new session setting, enforce_home_region,
is added, which when true causes queries which have no home region or
which may scan rows via a database connection outside of the query's 
home region to error out. Also, only tables in multiregion databases                             
with ZONE survivability may be scanned without error when this setting 
is true because with REGION survivability, ranges in a down region may 
be served non-local to the gateway region, so are not guaranteed to have 
low latency.

Release justification: Low risk feature

86844: kvserver: fix replicate queue metrics tracking bug r=AlexTalks a=AlexTalks

In #85844, tracking replicate queue success and error metrics by
allocator action was introduced, however this change inadvertently
included a bug due to missing the `RemoveDead(Non)Voter` allocator
action case, in combination with a panic on unexpected actions. This
change fixes the missing case as well as the panic.

Fixes #86544, fixes #86374, fixes #86107.

Release justification: Bug fix
Release note: None

Co-authored-by: Mark Sirek <sirek@cockroachlabs.com>
Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants