Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tablet throttler: (feature flagged) get remote tablets metrics from Realtime Stats #13018

Closed

Conversation

shlomi-noach
Copy link
Contributor

Description

This is a design change for how the tablet throttler collects metrics of remote tablets. Specifically, how a shard's PRIMARY tablet collects the data from the rest of the shard.

Commonly (and by default) the information the throttler gathers is the replication lag metric. Up till now, the throttler contacted the rest of the shards tablets via HTTP and hit their /throttler/check-self API call.

This PR now makes use of RealtimeStats, which in an already existing mechanism in vitess that uses the stream API, and lets you listen on health-status events that are published by all tablets. The TxThrottler uses RealtimeStats to gather replication lag, and now the tablet throttler, too, with some nuances.

As of this PR:

  • RealtimeStats include two new fields ThrottlerMetric and ThrottlerMetricError.
  • It also includes a ThrottlerReplicationLagSeconds which right now is not populated, but I have plan for the future so I added it here. Ignore it.
  • state_manager checks a throttler's self-metric, and publishes the result in RealtimeStats.
  • Note that a throttler metric error is distinguishable from just a health error. The throttler having an error with some metrics should not and will not cause the entire HealthCheck to be in error.
  • All tablets thus push their throttler's collected metrics, periodically, in --health_check_interval interval.
  • A new feature flag, --feature-throttler-read-realtime-stats (default: false), determines whether the PRIMARY throttler reads metrics by hitting the tablets HTTP API (the old behavior), or whether it listens for health checks and realtime stats by subscribing to the stream API.

Notes:

  • To use this new functionality you probably want --health_check_interval to be low, as low as 1s.
  • Rollout bottoms-up. Only enable --feature-throttler-read-realtime-stats on the PRIMARY once all the replicas have this change.

The overall architecture of the tablet throttler remains pretty much the same, I made minimal changes to enable the new functionality. Baby steps. In the future, we can simplify the tablet throttler's architecture (e.g. getting read of the cluster probes, which right now still exist and are dormant, but still serve a purpose).

Related Issue(s)

This is an experimental PR. No linked issue at this time.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

…imeStats

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@vitess-bot
Copy link
Contributor

vitess-bot bot commented May 3, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels May 3, 2023
@github-actions github-actions bot added this to the v17.0.0 milestone May 3, 2023
@harshit-gangal
Copy link
Member

harshit-gangal commented May 3, 2023

How about primary always using StreamHealth API and if it sees throttler metrics it continues using that for that tablet otherwise switches to Http API for that tablet?

My assumption here is that, there is no change in the information received using both paths.

Benefit is that we would not need the flag and switch to newer path will be smoother.

@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented May 3, 2023

How about primary always using StreamHealth API and if it sees throttler metrics it continues using that for that tablet otherwise switches to Http API for that tablet?

Interesting suggestion. It would have to be on per-tablet basis, because one REPLICA tablet might be updated, another wouldn't. Also, A tablet might be downgraded. So to make this happen we'd need to track "was there a recent tablet health entry for this specific tablet, and if not, we should hit the HTTP API". Does that sound right?

@harshit-gangal
Copy link
Member

How about primary always using StreamHealth API and if it sees throttler metrics it continues using that for that tablet otherwise switches to Http API for that tablet?

Interesting suggestion. It would have to be on per-tablet basis, because one REPLICA tablet might be updated, another wouldn't. Also, A tablet might be downgraded. So to make this happen we'd need to track "was there a recent tablet heath entry for this specific tablet, and if not, we should hit the HTTP API". Does that sound right?

that's the thought.

This can make code more complex. Is it worth it for this case, that part I'm not sure. But, it just eases the adoption.

@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented May 3, 2023

Question: after I Subscribe here:

		healthCheck := discovery.NewHealthCheck(ctx, discovery.DefaultHealthCheckRetryDelay, discovery.DefaultHealthCheckTimeout, throttler.ts, throttler.cell, strings.Join(throttler.knownCells, ","))
		healthCheckCh = healthCheck.Subscribe()

What happens if later on, new tablets are added to the topology? Does my subscription include the stats for those new tablets?

@harshit-gangal
Copy link
Member

harshit-gangal commented May 3, 2023

NewHealthCheck

There is a topowatcher in there. That keeps loading the cells-tablet view which includes the addition and removal of tablets.

@shlomi-noach
Copy link
Contributor Author

How about primary always using StreamHealth API and if it sees throttler metrics it continues using that for that tablet otherwise switches to Http API for that tablet?

OK this is actually quite easily doable. The only thing is we'd need to add another field to RealtimeStats's proto, to indicate that "the value is present"; because if there's no metric, we'd see a value of 0; but 0 is also a valid metric, so we need to distinguish the scenarios.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

I have a local change that removes --feature-throttler-read-realtime-stats. As soon as the primary tablet sees incoming stats for a given probe (ie remote tablet) it stops actively probing that tablet.

I do see a problem with this implementation: it's really important for --health_check_interval to be reasonable. Because if it's 10s or if it's 1s make a ton of difference throttler-resolution -wise. But both will cause the throttler to avoid actively probing the remote tablet.
As result, if you deploy the change, and your --health_check_interval is high, then the throttler has poor granularity and will not work as expected. And, because we remove --feature-throttler-read-realtime-stats, you override the default high-frequency throttler rate.

@shlomi-noach
Copy link
Contributor Author

A flag-free implementation is in #13034

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@hkdsun
Copy link
Member

hkdsun commented May 8, 2023

As result, if you deploy the change, and your --health_check_interval is high, then the throttler has poor granularity and will not work as expected. And, because we remove --feature-throttler-read-realtime-stats, you override the default high-frequency throttler rate.

That sounds like a good reason to keep the flag around and have users enable it once they've read the disclaimer of "make sure your --health_check_interval is low enough for tablet throttling to behave well", or produce a warning if misconfigured, or something such.

I guess I'm in the same boat as Shlomi of preferring this PR over the alternative in #13034

@github-actions
Copy link
Contributor

github-actions bot commented Jun 8, 2023

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

@github-actions github-actions bot added the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Jun 8, 2023
@frouioui frouioui removed this from the v17.0.0 milestone Jun 12, 2023
@frouioui frouioui added this to the v18.0.0 milestone Jun 12, 2023
@github-actions github-actions bot removed the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Jun 13, 2023
@github-actions
Copy link
Contributor

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

@github-actions github-actions bot added the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Jul 15, 2023
@shlomi-noach
Copy link
Contributor Author

We're not going to pursue this path. Instead, we will convert throttler's HTTP calls with RPC calls.

@shlomi-noach shlomi-noach deleted the throttler-health-metric branch July 16, 2023 06:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Query Serving Component: TabletManager NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. Type: Enhancement Logical improvement (somewhere between a bug and feature) Type: Internal Cleanup
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants