-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tablet throttler: be explicit about client app name, exempt some apps from checks and heartbeat renewals #13195
Conversation
…r when they've been 'check'ed Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…hey at all want to involve the throttler. Some lightweight clients, such as the schema tracker or the binlog watcher, or messager, do not need the throttler, and since some of these clients are _always on_, we also do not _want_ them to continuously approach the throttler. One side effect of always engaging with the throttler is the infinite renewal of on-demand heartbeats Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
… in vstreamer.Engine. The throttler exempts specific apps from checks and will not renew heartbeats leases for those apps Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
If a new flag is being introduced:
If a workflow is added or modified:
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me. The existing tests should cover things.
I only had minor questions and comments. Let me know what you think. I can quickly review again and approve as needed.
@@ -400,11 +398,11 @@ func TestSchemaChange(t *testing.T) { | |||
// to be strictly higher than started_timestamp | |||
assert.GreaterOrEqual(t, lastThrottledTimestamp, startedTimestamp) | |||
component := row.AsString("component_throttled", "") | |||
assert.Contains(t, []string{string(vreplication.VCopierComponentName), string(vreplication.VPlayerComponentName)}, component) | |||
assert.Contains(t, []string{throttlerapp.VCopierName, throttlerapp.VPlayerName}, component) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it might be worth making a new type that's an alias for string so that what the string represents is even more explicit. This make it clear to future code writers that we're not using arbitrary strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I initially did that but then that affected so many lines of code, casting into and out of string, that I decided to hold off. I can still do that for this PR if you think it's worhtwhile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave it up to you. I think it's probably worth it, but we can always do it later too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 37a7c2f. I'm not sure how I feel about it, what do you think?
@@ -1768,7 +1768,7 @@ export MYSQL_PWD | |||
my ($self, %args) = @_; | |||
|
|||
return sub { | |||
if (head("http://localhost:{{VTTABLET_PORT}}/throttler/check?app={{THROTTLER_ONLINE_DDL_APP}}:pt-osc:{{MIGRATION_UUID}}&p=low")) { | |||
if (head("http://localhost:{{VTTABLET_PORT}}/throttler/check?app={{THROTTLER_ONLINE_DDL_APP}}:{{THROTTLER_PT_OSC_APP}}:{{MIGRATION_UUID}}&p=low")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that we didn't support pt-osc? Might be a good opportunity to clean up the code if so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's labeled as "experimental". I'd rather not do everything in this PR. We can deprecate pt-osc later.
// recentCheckTickerValue is an ever increasing number, incrementing once per second. | ||
recentCheckTickerValue int64 | ||
// recentCheckValue is set to match or exceed recentCheckTickerValue whenever a "check" was made (other than by the throttler itself). | ||
// when recentCheckValue < recentCheckTickerValue that means there hasn't been a recent check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO 1 second ago is recent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, and that's what the code considers as "recent". 1s
. It's just that we have a 1 second granularity, and so the actual range can go as high as 1.999
.
if checkResult.RecentlyChecked { | ||
// We have just probed a tablet, and it reported back that someone just recently "check"ed it. | ||
// We therefore renew the heartbeats lease. | ||
go throttler.heartbeatWriter.RequestHeartbeats() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut tells me there's the potential for a race here so that we don't request heartbeats for 1+ seconds which could throttle things and impact production traffic unnecessarily. Let's say that the throttler is checked once a second, but it's always just slightly behind the second tick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but heartbeats are leased for some 5s
-10s
, and so there is no race. Leasing heartbeats to 1s
is unadvisable. Now that you mention it, I should probably enforce a reasonable lower limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, makes sense. Thanks! Agreed on enforcing reasonable low and high limits on the on-demand heartbeat lease times.
// This check was made by someone other than the throttler itself, i.e. this came from online-ddl or vreplication or other. | ||
// We mark the fact that someone just made a check. If this is a REPLICA or RDONLY tables, this will be reported back | ||
// to the PRIMARY so that it knows it must renew the heartbeat lease. | ||
atomic.StoreInt64(&throttler.recentCheckValue, 1+atomic.LoadInt64(&throttler.recentCheckTickerValue)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so we add 1 to it. I think this at least largely addresses my earlier noted concern around timing.
Threshold float64 `json:"Threshold"` | ||
Error error `json:"-"` | ||
Message string `json:"Message"` | ||
RecentlyChecked bool `json:"RecentlyChecked"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have no synchronization around reading/writing this value, do we? If it's needed then we could make this an atomic.Bool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed. We create a new CheckResult for each check.
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! ❤️
…io#13195 Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
* Table throttler: --throttler-config-via-topo now defaults to 'true' Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * add deprecation message Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * endtoend tests: remove '--enable-lag-throttler' and use 'UpdateThrottlerConfig' everywhere Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * always use vtctldclient Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * use cluster.VtctldClientProcess Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * disable --throttler-config-via-topo in old throttler tests Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * Remove --throttler-config-via-topo where used, since it now defaults 'true' Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * fix vreplication cluster setup, waiting for throttler config to apply Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * changelog Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * extend throttler threshold Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * a bit more verbose Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * fixed CLI test Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * remove old '--enable-lag-throttler' flag, introduce '--heartbeat_on_demand_duration' Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * more log info in throttler.Open() Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * more logging Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * Revert to --heartbeat_enable Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * Protect throttler config change application with initMutex And in e2e test update the throttler config on the keyspace when it's created. Only wait for the new tablets in a shard to have the throttler enabled when adding a Shard. Signed-off-by: Matt Lord <mattalord@gmail.com> * More CI testing Signed-off-by: Matt Lord <mattalord@gmail.com> * CI testing cont Signed-off-by: Matt Lord <mattalord@gmail.com> * Yes... Signed-off-by: Matt Lord <mattalord@gmail.com> * Somebody doesn't like force pushes so msg here Signed-off-by: Matt Lord <mattalord@gmail.com> * Increase on-demand heartbeat duration from 10s to 1m Signed-off-by: Matt Lord <mattalord@gmail.com> * Use only on-demand heartbeats everywhere Signed-off-by: Matt Lord <mattalord@gmail.com> * Use same throttler config everywhere Signed-off-by: Matt Lord <mattalord@gmail.com> * Update all keyspaces and don't fail test on missing JSON keys Signed-off-by: Matt Lord <mattalord@gmail.com> * Use constant heartbeats in vrepl e2e tests Until #13175 is fixed. Signed-off-by: Matt Lord <mattalord@gmail.com> * Increase workflow command timeout Signed-off-by: Matt Lord <mattalord@gmail.com> * Don't wait for throttler on non-serving primaries Signed-off-by: Matt Lord <mattalord@gmail.com> * #13175 is fixed, therefore re-instating on-deman heartbeats Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * Added ToC Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * Tweak comment and kick CI Signed-off-by: Matt Lord <mattalord@gmail.com> * Treat isOpen as the ready/running signal. Also align all initMutex usage. Signed-off-by: Matt Lord <mattalord@gmail.com> * Re-adjust comment Signed-off-by: Matt Lord <mattalord@gmail.com> * Adjust CheckIsReady() to match OnlineDDL's expectation/usage This was only using IsReady() before, now it's using IsOpen() and IsReady(). Signed-off-by: Matt Lord <mattalord@gmail.com> * Get rid of log messages from SrvKeyspaceWatcher when no node/key Signed-off-by: Matt Lord <mattalord@gmail.com> * More corrections/tweaks Signed-off-by: Matt Lord <mattalord@gmail.com> * Use more convenient/clear new IsRunning function Signed-off-by: Matt Lord <mattalord@gmail.com> * Revert "Use more convenient/clear new IsRunning function" This reverts commit 9aef276 as this change was not correct. Signed-off-by: Matt Lord <mattalord@gmail.com> * Further fix correct use of IsOpen(), IsRunning(), IsEnabled() Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * throttler.throttledApps cannot be nil Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * Remove --enable_lag_throttler flag Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * Deprecate --throttler_config_via_topo Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * remove throttler mitigation code, as the problem was solved in #13195 Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * deperecate throttler config flags Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * Removed tabletmanager_throttler and tabletmanager_throttler_custom_config tests Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * changelog Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * remove EnableThrottler() call Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * restore default value Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * update threshold Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * update flags desc Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * using atomic.Bool Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * Update changelog/18.0/18.0.0/summary.md Co-authored-by: Matt Lord <mattalord@gmail.com> Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * use MarkDeprecated Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * do not expect flags in vttablet --help Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * remove --throttler-config-via-topo from examples scripts Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> --------- Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> Signed-off-by: Matt Lord <mattalord@gmail.com> Co-authored-by: Matt Lord <mattalord@gmail.com>
Description
An extension to #13177. This PR includes everything in #13177 and then adds more logic. I intentionally split the two apart so we can discuss the two changes separately.
This PR is an alternative approach to #13187. It has more changes, but I prefer the approach in this PR over #13187
#13177 says: "if the replica tablet's throttler is checked, then as result the primary tablet's throttler should request more heartbeats".
However, there are some clients that:
The particular clients are these three members of
tabletserver
:For now, we consider all other clients as "need to use the throttler".
With this PR, we formalize the "app name": the name by which a client identifies itself to the throttler.
Generally speaking, this is an arbitrary string, and we've been using these names semi-formally thus far. Some names are
"online-ddl"
,"vreplication"
,"gh-ost"
etc. And command likeALTER VITESS_MIGRATION THROTTLE ALL
tell the throttler to explicitly throttler calls made by specific apps ("online-ddl"
in this example).What's formalized in this PR is:
throttler/app.go
. The names are constants, and now used throughout the code includingendtoend
tests."rowstreamer"
, and the external connector uses the name"external-connector"
.tabletserver
's Engine's throttler. This throttler is used toVStream
rows, and is shared by multiple clients. These particular three clients are of special interest:What's so special about these three? When enabled, they're always running. Always checking. So these three, as others, clearly identify themselves by name when using the `tabletserver's Engine's throttler.
So what we get is:
PRIMARY
or onREPLICA
, ensure to renew the on-demand heartbeat leaseRelated Issue(s)
Checklist
Deployment Notes