Tablet throttler: (feature flagged) get remote tablets metrics from Realtime Stats #13018

shlomi-noach · 2023-05-03T06:27:57Z

Description

This is a design change for how the tablet throttler collects metrics of remote tablets. Specifically, how a shard's PRIMARY tablet collects the data from the rest of the shard.

Commonly (and by default) the information the throttler gathers is the replication lag metric. Up till now, the throttler contacted the rest of the shards tablets via HTTP and hit their /throttler/check-self API call.

This PR now makes use of RealtimeStats, which in an already existing mechanism in vitess that uses the stream API, and lets you listen on health-status events that are published by all tablets. The TxThrottler uses RealtimeStats to gather replication lag, and now the tablet throttler, too, with some nuances.

As of this PR:

RealtimeStats include two new fields ThrottlerMetric and ThrottlerMetricError.
It also includes a ThrottlerReplicationLagSeconds which right now is not populated, but I have plan for the future so I added it here. Ignore it.
state_manager checks a throttler's self-metric, and publishes the result in RealtimeStats.
Note that a throttler metric error is distinguishable from just a health error. The throttler having an error with some metrics should not and will not cause the entire HealthCheck to be in error.
All tablets thus push their throttler's collected metrics, periodically, in --health_check_interval interval.
A new feature flag, --feature-throttler-read-realtime-stats (default: false), determines whether the PRIMARY throttler reads metrics by hitting the tablets HTTP API (the old behavior), or whether it listens for health checks and realtime stats by subscribing to the stream API.

Notes:

To use this new functionality you probably want --health_check_interval to be low, as low as 1s.
Rollout bottoms-up. Only enable --feature-throttler-read-realtime-stats on the PRIMARY once all the replicas have this change.

The overall architecture of the tablet throttler remains pretty much the same, I made minimal changes to enable the new functionality. Baby steps. In the future, we can simplify the tablet throttler's architecture (e.g. getting read of the cluster probes, which right now still exist and are dormant, but still serve a purpose).

Related Issue(s)

This is an experimental PR. No linked issue at this time.

Checklist

"Backport to:" labels have been added if this change should be back-ported
Tests were added or are not required
Did the new or modified tests pass consistently locally and on the CI
Documentation was added or is not required

Deployment Notes

…imeStats Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

vitess-bot · 2023-05-03T06:28:00Z

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

Ensure that the Pull Request has a descriptive title.
If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

Is it really necessary to add this flag?
Flag names should be clear and intuitive (as far as possible)
Help text should be descriptive.
Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

Each item in Jobs should be named in order to mark it as required.
If the workflow should be required, the maintainer team should be notified.

Bug fixes

There should be at least one unit or end-to-end test.
The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

There should be some code comments as to why things are implemented the way they are.

New/Existing features

Should be documented, either by modifying the existing documentation or creating new documentation.
New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

Protobuf changes should be wire-compatible.
Changes to _vt tables and RPCs need to be backward compatible.
vtctl command output order should be stable and awk-able.
RPC changes should be compatible with vitess-operator
If a flag is removed, then it should also be removed from VTop, if used there.

harshit-gangal · 2023-05-03T07:19:41Z

How about primary always using StreamHealth API and if it sees throttler metrics it continues using that for that tablet otherwise switches to Http API for that tablet?

My assumption here is that, there is no change in the information received using both paths.

Benefit is that we would not need the flag and switch to newer path will be smoother.

shlomi-noach · 2023-05-03T07:24:10Z

How about primary always using StreamHealth API and if it sees throttler metrics it continues using that for that tablet otherwise switches to Http API for that tablet?

Interesting suggestion. It would have to be on per-tablet basis, because one REPLICA tablet might be updated, another wouldn't. Also, A tablet might be downgraded. So to make this happen we'd need to track "was there a recent tablet health entry for this specific tablet, and if not, we should hit the HTTP API". Does that sound right?

harshit-gangal · 2023-05-03T07:27:17Z

How about primary always using StreamHealth API and if it sees throttler metrics it continues using that for that tablet otherwise switches to Http API for that tablet?

Interesting suggestion. It would have to be on per-tablet basis, because one REPLICA tablet might be updated, another wouldn't. Also, A tablet might be downgraded. So to make this happen we'd need to track "was there a recent tablet heath entry for this specific tablet, and if not, we should hit the HTTP API". Does that sound right?

that's the thought.

This can make code more complex. Is it worth it for this case, that part I'm not sure. But, it just eases the adoption.

shlomi-noach · 2023-05-03T07:27:22Z

Question: after I Subscribe here:

		healthCheck := discovery.NewHealthCheck(ctx, discovery.DefaultHealthCheckRetryDelay, discovery.DefaultHealthCheckTimeout, throttler.ts, throttler.cell, strings.Join(throttler.knownCells, ","))
		healthCheckCh = healthCheck.Subscribe()

What happens if later on, new tablets are added to the topology? Does my subscription include the stats for those new tablets?

harshit-gangal · 2023-05-03T07:29:40Z

NewHealthCheck

There is a topowatcher in there. That keeps loading the cells-tablet view which includes the addition and removal of tablets.

shlomi-noach · 2023-05-03T08:23:16Z

How about primary always using StreamHealth API and if it sees throttler metrics it continues using that for that tablet otherwise switches to Http API for that tablet?

OK this is actually quite easily doable. The only thing is we'd need to add another field to RealtimeStats's proto, to indicate that "the value is present"; because if there's no metric, we'd see a value of 0; but 0 is also a valid metric, so we need to distinguish the scenarios.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach · 2023-05-03T09:31:55Z

I have a local change that removes --feature-throttler-read-realtime-stats. As soon as the primary tablet sees incoming stats for a given probe (ie remote tablet) it stops actively probing that tablet.

I do see a problem with this implementation: it's really important for --health_check_interval to be reasonable. Because if it's 10s or if it's 1s make a ton of difference throttler-resolution -wise. But both will cause the throttler to avoid actively probing the remote tablet.
As result, if you deploy the change, and your --health_check_interval is high, then the throttler has poor granularity and will not work as expected. And, because we remove --feature-throttler-read-realtime-stats, you override the default high-frequency throttler rate.

shlomi-noach · 2023-05-08T09:00:11Z

A flag-free implementation is in #13034

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

hkdsun · 2023-05-08T18:30:01Z

As result, if you deploy the change, and your --health_check_interval is high, then the throttler has poor granularity and will not work as expected. And, because we remove --feature-throttler-read-realtime-stats, you override the default high-frequency throttler rate.

That sounds like a good reason to keep the flag around and have users enable it once they've read the disclaimer of "make sure your --health_check_interval is low enough for tablet throttling to behave well", or produce a warning if misconfigured, or something such.

I guess I'm in the same boat as Shlomi of preferring this PR over the alternative in #13034

github-actions · 2023-06-08T01:35:14Z

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

Push additional commits to the associated branch.
Remove the stale label.
Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

github-actions · 2023-07-15T01:34:22Z

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

Push additional commits to the associated branch.
Remove the stale label.
Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

shlomi-noach · 2023-07-16T06:18:34Z

We're not going to pursue this path. Instead, we will convert throttler's HTTP calls with RPC calls.

shlomi-noach added 4 commits May 2, 2023 10:31

Add ThrottlerReplicationLagSeconds and ThrottlerCustomMetric to Realt…

fa26d8c

…imeStats Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

Populate ThrottlerReplicationLagSeconds in RealtimeStats

e22198b

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

Tablet throttler: get remote tablets metrics from Realtime Stats

10b0c52

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

resolve conflict

dbacc21

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach added Type: Internal Cleanup Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Query Serving Component: TabletManager labels May 3, 2023

shlomi-noach requested a review from hkdsun May 3, 2023 06:27

vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels May 3, 2023

github-actions bot added this to the v17.0.0 milestone May 3, 2023

shlomi-noach added 3 commits May 3, 2023 12:03

ThrottlerMetricCollected

2bd0e49

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

populate ThrottlerMetricCollected

f2fb62d

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

web admin proto

aa8ed41

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach mentioned this pull request May 8, 2023

Tablet throttler: get remote tablets metrics from Realtime Stats , with auto-detection #13034

Closed

4 tasks

resolve conflict

8c93072

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

github-actions bot added the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Jun 8, 2023

frouioui removed this from the v17.0.0 milestone Jun 12, 2023

frouioui added this to the v18.0.0 milestone Jun 12, 2023

github-actions bot removed the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Jun 13, 2023

shlomi-noach mentioned this pull request Jun 18, 2023

Tablet Throttler: v18 features and objectives #13341

Closed

github-actions bot added the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Jul 15, 2023

shlomi-noach closed this Jul 16, 2023

shlomi-noach deleted the throttler-health-metric branch July 16, 2023 06:18

shlomi-noach mentioned this pull request Jul 16, 2023

Tablet throttler: inter-checks via gRPC #13514

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tablet throttler: (feature flagged) get remote tablets metrics from Realtime Stats #13018

Tablet throttler: (feature flagged) get remote tablets metrics from Realtime Stats #13018

shlomi-noach commented May 3, 2023

vitess-bot bot commented May 3, 2023

harshit-gangal commented May 3, 2023 •

edited

Loading

shlomi-noach commented May 3, 2023 •

edited

Loading

harshit-gangal commented May 3, 2023

shlomi-noach commented May 3, 2023 •

edited

Loading

harshit-gangal commented May 3, 2023 •

edited

Loading

shlomi-noach commented May 3, 2023

shlomi-noach commented May 3, 2023

shlomi-noach commented May 8, 2023

hkdsun commented May 8, 2023 •

edited

Loading

github-actions bot commented Jun 8, 2023

github-actions bot commented Jul 15, 2023

shlomi-noach commented Jul 16, 2023

Tablet throttler: (feature flagged) get remote tablets metrics from Realtime Stats #13018

Tablet throttler: (feature flagged) get remote tablets metrics from Realtime Stats #13018

Conversation

shlomi-noach commented May 3, 2023

Description

Related Issue(s)

Checklist

Deployment Notes

vitess-bot bot commented May 3, 2023

Review Checklist

General

If a new flag is being introduced:

If a workflow is added or modified:

Bug fixes

Non-trivial changes

New/Existing features

Backward compatibility

harshit-gangal commented May 3, 2023 • edited Loading

shlomi-noach commented May 3, 2023 • edited Loading

harshit-gangal commented May 3, 2023

shlomi-noach commented May 3, 2023 • edited Loading

harshit-gangal commented May 3, 2023 • edited Loading

shlomi-noach commented May 3, 2023

shlomi-noach commented May 3, 2023

shlomi-noach commented May 8, 2023

hkdsun commented May 8, 2023 • edited Loading

github-actions bot commented Jun 8, 2023

github-actions bot commented Jul 15, 2023

shlomi-noach commented Jul 16, 2023

harshit-gangal commented May 3, 2023 •

edited

Loading

shlomi-noach commented May 3, 2023 •

edited

Loading

shlomi-noach commented May 3, 2023 •

edited

Loading

harshit-gangal commented May 3, 2023 •

edited

Loading

hkdsun commented May 8, 2023 •

edited

Loading