Tablet throttler: (feature flagged) get remote tablets metrics from Realtime Stats #13018
…imeStats Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
How about the primary always using the StreamHealth API: if it sees throttler metrics for a tablet, it keeps using that path for that tablet; otherwise it switches to the HTTP API for that tablet? My assumption here is that there is no difference in the information received via the two paths. The benefit is that we would not need the flag, and the switch to the newer path would be smoother. |
Interesting suggestion. It would have to be on per-tablet basis, because one |
That's the thought. It can make the code more complex, and I'm not sure it's worth it in this case, but it does ease adoption. |
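The per-tablet fallback discussed above can be sketched roughly as follows. This is a minimal illustration, not code from the PR: `streamStats`, `metricSource`, and `chooseSource` are hypothetical names, and the `HasThrottlerMetric` marker is an assumption about how a tablet without the new fields could be told apart.

```go
package main

import "fmt"

// streamStats is a hypothetical stand-in for what a tablet publishes over
// the health stream; it is not the Vitess RealtimeStats type.
type streamStats struct {
	HasThrottlerMetric bool    // assumed marker; false for tablets that predate the change
	ThrottlerMetric    float64 // e.g. replication lag in seconds
}

// metricSource says where the PRIMARY reads a given tablet's metric from.
type metricSource int

const (
	sourceHTTP metricSource = iota
	sourceStream
)

// chooseSource decides per tablet: use the stream when the tablet publishes
// throttler metrics, otherwise fall back to the HTTP check for that tablet.
func chooseSource(stats *streamStats) metricSource {
	if stats != nil && stats.HasThrottlerMetric {
		return sourceStream
	}
	return sourceHTTP
}

func main() {
	tablets := map[string]*streamStats{
		"zone1-0000000101": {HasThrottlerMetric: true, ThrottlerMetric: 0.4},
		"zone1-0000000102": {},  // older tablet, no throttler metric in the stream
		"zone1-0000000103": nil, // no stream data received yet
	}
	for alias, stats := range tablets {
		fmt.Println(alias, "uses stream:", chooseSource(stats) == sourceStream)
	}
}
```

Because the decision is made per tablet, a shard with mixed versions would keep working without a flag, at the cost of carrying both code paths.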
Question: after I call

    healthCheck := discovery.NewHealthCheck(ctx, discovery.DefaultHealthCheckRetryDelay, discovery.DefaultHealthCheckTimeout, throttler.ts, throttler.cell, strings.Join(throttler.knownCells, ","))
    healthCheckCh = healthCheck.Subscribe()

what happens if new tablets are later added to the topology? Does my subscription include the stats for those new tablets? |
There is a topowatcher in there. That keeps loading the cells-tablet view which includes the addition and removal of tablets. |
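To illustrate why a subscriber does not need to know the tablet set in advance, here is a small sketch of a consumer that keeps the latest event per tablet. The `tabletHealth` type and the `collect` function are hypothetical simplifications; the channel returned by `Subscribe()` in Vitess carries its own type, which is not reproduced here.

```go
package main

import "fmt"

// tabletHealth is a hypothetical, simplified health event.
type tabletHealth struct {
	Alias      string
	LagSeconds float64
}

// collect drains the channel and keeps the latest event per tablet alias.
// A tablet that was never seen before simply appears as a new map key.
func collect(ch <-chan tabletHealth) map[string]tabletHealth {
	latest := make(map[string]tabletHealth)
	for th := range ch {
		latest[th.Alias] = th
	}
	return latest
}

func main() {
	ch := make(chan tabletHealth, 3)
	ch <- tabletHealth{Alias: "zone1-101", LagSeconds: 0.5}
	ch <- tabletHealth{Alias: "zone1-101", LagSeconds: 1.5} // newer event replaces the older one
	ch <- tabletHealth{Alias: "zone1-102", LagSeconds: 0.2} // tablet added to the topology later
	close(ch)

	latest := collect(ch)
	fmt.Println(len(latest), latest["zone1-101"].LagSeconds)
}
```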
OK this is actually quite easily doable. The only thing is we'd need to add another field to |
I have a local change that removes I do see a problem with this implementation: it's really important for |
A flag-free implementation is in #13034 |
That sounds like a good reason to keep the flag around and have users enable it once they've read the disclaimer of "make sure your I guess I'm in the same boat as Shlomi of preferring this PR over the alternative in #13034 |
This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:
If no action is taken within 7 days, this PR will be closed. |
We're not going to pursue this path. Instead, we will replace the throttler's HTTP calls with RPC calls. |
Description
This is a design change for how the tablet throttler collects metrics of remote tablets. Specifically, how a shard's `PRIMARY` tablet collects the data from the rest of the shard.

Commonly (and by default) the information the throttler gathers is the replication lag metric. Up till now, the throttler contacted the rest of the shard's tablets via HTTP and hit their `/throttler/check-self` API call.

This PR now makes use of `RealtimeStats`, which is an already existing mechanism in `vitess` that uses the stream API and lets you listen to health-status events published by all tablets. The `TxThrottler` uses `RealtimeStats` to gather replication lag, and now the tablet throttler does too, with some nuances.

As of this PR:
- `RealtimeStats` includes two new fields, `ThrottlerMetric` and `ThrottlerMetricError`.
- `ThrottlerReplicationLagSeconds`, which right now is not populated, but I have plans for it in the future so I added it here. Ignore it.
- `state_manager` checks a throttler's self-metric, and publishes the result in `RealtimeStats`.
- … `HealthCheck` to be in error.
- … `--health_check_interval` interval.
- `--feature-throttler-read-realtime-stats` (default: `false`) determines whether the `PRIMARY` throttler reads metrics by hitting the tablets' HTTP API (the old behavior), or whether it listens for health checks and realtime stats by subscribing to the stream API.

Notes:
- … `--health_check_interval` to be low, as low as `1s`.
- … `--feature-throttler-read-realtime-stats` on the `PRIMARY` once all the replicas have this change.

The overall architecture of the tablet throttler remains pretty much the same; I made minimal changes to enable the new functionality. Baby steps. In the future, we can simplify the tablet throttler's architecture (e.g. getting rid of the cluster probes, which right now still exist and are dormant, but still serve a purpose).
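As a rough illustration of how a `PRIMARY` might consume the two published fields, the sketch below turns a set of per-replica records into a throttle decision. Only the field names `ThrottlerMetric` and `ThrottlerMetricError` come from the description above; their types, the `publishedStats` struct, the threshold, and the choice to throttle on any error are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// publishedStats mirrors only the two field names mentioned in the
// description; the field types are assumptions.
type publishedStats struct {
	ThrottlerMetric      float64 // e.g. replication lag in seconds
	ThrottlerMetricError string  // empty when the self-check succeeded
}

// readMetric converts a published record into a value or an error.
func readMetric(s publishedStats) (float64, error) {
	if s.ThrottlerMetricError != "" {
		return 0, errors.New(s.ThrottlerMetricError)
	}
	return s.ThrottlerMetric, nil
}

// shouldThrottle reports true when any replica reports an error or a metric
// above the threshold. Treating errors as "throttle" is an assumption.
func shouldThrottle(replicas []publishedStats, threshold float64) bool {
	for _, s := range replicas {
		value, err := readMetric(s)
		if err != nil || value > threshold {
			return true
		}
	}
	return false
}

func main() {
	replicas := []publishedStats{
		{ThrottlerMetric: 0.3},
		{ThrottlerMetric: 2.5},
	}
	fmt.Println(shouldThrottle(replicas, 1.0))
}
```

Since the records arrive on the health stream rather than on demand, the freshness of such a decision depends on how often tablets publish, which is why the notes above call for a low `--health_check_interval`.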
Related Issue(s)
This is an experimental PR. No linked issue at this time.
Checklist
Deployment Notes