Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tablet throttler: be explicit about client app name, exempt some apps from checks and heartbeat renewals #13195

Conversation

shlomi-noach
Copy link
Contributor

Description

An extension to #13177. This PR includes everything in #13177 and then adds more logic. I intentionally split the two apart so we can discuss the two changes separately.

This PR is an alternative approach to #13187. It has more changes, but I prefer the approach in this PR over #13187

#13177 says: "if the replica tablet's throttler is checked, then as result the primary tablet's throttler should request more heartbeats".

However, there are some clients that:

  1. Engage with the throttler all the time
  2. Do not really need the throttler's work
  3. Thereby needlessly cause a scenario where the primary throttler leases heartbeats continuously and infinitely. Which works against the idea of "on-demand" heartbeats.

The particular clients are these three members of tabletserver:

  • schema tracker
  • binlog watcher
  • messager

For now, we consider all other clients as "need to use the throttler".

With this PR, we formalize the "app name": the name by which a client identifies itself to the throttler.

Generally speaking, this is an arbitrary string, and we've been using these names semi-formally thus far. Some names are "online-ddl", "vreplication", "gh-ost" etc. And command like ALTER VITESS_MIGRATION THROTTLE ALL tell the throttler to explicitly throttler calls made by specific apps ("online-ddl" in this example).

What's formalized in this PR is:

  • We move all names into a dedicated, single place, throttler/app.go. The names are constants, and now used throughout the code including endtoend tests.
  • We are much more granular with names. Again, names are arbitrary. For example, rowstreamer now uses the name "rowstreamer", and the external connector uses the name "external-connector".
  • Specifically, and critical to this PR, we break down uses of tabletserver's Engine's throttler. This throttler is used to VStream rows, and is shared by multiple clients. These particular three clients are of special interest:
    • schema tracker
    • binlog watcher
    • messager
      What's so special about these three? When enabled, they're always running. Always checking. So these three, as others, clearly identify themselves by name when using the `tabletserver's Engine's throttler.
  • Next, the throttler now cares about the client's name. When asked to check, it first looks up the client app's name. Some names are now exempt from checks, and will not even cause heartbeat renewal. The three aforementioned apps are exactly those apps.

So what we get is:

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

…r when they've been 'check'ed

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…hey at all want to involve the throttler.

Some lightweight clients, such as the schema tracker or the binlog watcher, or messager, do not need the throttler, and since some of these clients are _always on_, we also
do not _want_ them to continuously approach the throttler. One side effect of always engaging with the throttler is the infinite renewal of on-demand heartbeats

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
… in vstreamer.Engine. The throttler exempts specific apps from checks and will not renew heartbeats leases for those apps

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels May 30, 2023
@vitess-bot
Copy link
Contributor

vitess-bot bot commented May 30, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. The existing tests should cover things.

I only had minor questions and comments. Let me know what you think. I can quickly review again and approve as needed.

@@ -400,11 +398,11 @@ func TestSchemaChange(t *testing.T) {
// to be strictly higher than started_timestamp
assert.GreaterOrEqual(t, lastThrottledTimestamp, startedTimestamp)
component := row.AsString("component_throttled", "")
assert.Contains(t, []string{string(vreplication.VCopierComponentName), string(vreplication.VPlayerComponentName)}, component)
assert.Contains(t, []string{throttlerapp.VCopierName, throttlerapp.VPlayerName}, component)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it might be worth making a new type that's an alias for string so that what the string represents is even more explicit. This make it clear to future code writers that we're not using arbitrary strings.

Copy link
Contributor Author

@shlomi-noach shlomi-noach May 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I initially did that but then that affected so many lines of code, casting into and out of string, that I decided to hold off. I can still do that for this PR if you think it's worhtwhile.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave it up to you. I think it's probably worth it, but we can always do it later too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 37a7c2f. I'm not sure how I feel about it, what do you think?

@@ -1768,7 +1768,7 @@ export MYSQL_PWD
my ($self, %args) = @_;

return sub {
if (head("http://localhost:{{VTTABLET_PORT}}/throttler/check?app={{THROTTLER_ONLINE_DDL_APP}}:pt-osc:{{MIGRATION_UUID}}&p=low")) {
if (head("http://localhost:{{VTTABLET_PORT}}/throttler/check?app={{THROTTLER_ONLINE_DDL_APP}}:{{THROTTLER_PT_OSC_APP}}:{{MIGRATION_UUID}}&p=low")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that we didn't support pt-osc? Might be a good opportunity to clean up the code if so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's labeled as "experimental". I'd rather not do everything in this PR. We can deprecate pt-osc later.

// recentCheckTickerValue is an ever increasing number, incrementing once per second.
recentCheckTickerValue int64
// recentCheckValue is set to match or exceed recentCheckTickerValue whenever a "check" was made (other than by the throttler itself).
// when recentCheckValue < recentCheckTickerValue that means there hasn't been a recent check.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO 1 second ago is recent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, and that's what the code considers as "recent". 1s. It's just that we have a 1 second granularity, and so the actual range can go as high as 1.999.

Comment on lines +718 to +722
if checkResult.RecentlyChecked {
// We have just probed a tablet, and it reported back that someone just recently "check"ed it.
// We therefore renew the heartbeats lease.
go throttler.heartbeatWriter.RequestHeartbeats()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut tells me there's the potential for a race here so that we don't request heartbeats for 1+ seconds which could throttle things and impact production traffic unnecessarily. Let's say that the throttler is checked once a second, but it's always just slightly behind the second tick.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but heartbeats are leased for some 5s-10s, and so there is no race. Leasing heartbeats to 1s is unadvisable. Now that you mention it, I should probably enforce a reasonable lower limit.

Copy link
Contributor

@mattlord mattlord May 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, makes sense. Thanks! Agreed on enforcing reasonable low and high limits on the on-demand heartbeat lease times.

// This check was made by someone other than the throttler itself, i.e. this came from online-ddl or vreplication or other.
// We mark the fact that someone just made a check. If this is a REPLICA or RDONLY tables, this will be reported back
// to the PRIMARY so that it knows it must renew the heartbeat lease.
atomic.StoreInt64(&throttler.recentCheckValue, 1+atomic.LoadInt64(&throttler.recentCheckTickerValue))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so we add 1 to it. I think this at least largely addresses my earlier noted concern around timing.

Threshold float64 `json:"Threshold"`
Error error `json:"-"`
Message string `json:"Message"`
RecentlyChecked bool `json:"RecentlyChecked"`
Copy link
Contributor

@mattlord mattlord May 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have no synchronization around reading/writing this value, do we? If it's needed then we could make this an atomic.Bool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed. We create a new CheckResult for each check.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@mattlord mattlord added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: TabletManager and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request labels May 31, 2023
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! ❤️

@shlomi-noach shlomi-noach removed the NeedsWebsiteDocsUpdate What it says label Jun 1, 2023
@shlomi-noach shlomi-noach merged commit 374b94c into vitessio:main Jun 1, 2023
@shlomi-noach shlomi-noach deleted the throttler-replica-propagate-check-selective-throttling-by-name branch June 1, 2023 04:09
shlomi-noach added a commit to planetscale/vitess that referenced this pull request Jun 5, 2023
…io#13195

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
shlomi-noach added a commit that referenced this pull request Jul 13, 2023
* Table throttler: --throttler-config-via-topo now defaults to 'true'

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* add deprecation message

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* endtoend tests: remove '--enable-lag-throttler' and use 'UpdateThrottlerConfig' everywhere

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* always use vtctldclient

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* use cluster.VtctldClientProcess

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* disable --throttler-config-via-topo in old throttler tests

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* Remove --throttler-config-via-topo where used, since it now defaults 'true'

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* fix vreplication cluster setup, waiting for throttler config to apply

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* changelog

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* extend throttler threshold

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* a bit more verbose

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* fixed CLI test

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* remove old '--enable-lag-throttler' flag, introduce '--heartbeat_on_demand_duration'

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* more log info in throttler.Open()

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* more logging

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* Revert to --heartbeat_enable

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* Protect throttler config change application with initMutex

And in e2e test update the throttler config on the keyspace
when it's created. Only wait for the new tablets in a shard
to have the throttler enabled when adding a Shard.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* More CI testing

Signed-off-by: Matt Lord <mattalord@gmail.com>

* CI testing cont

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Yes...

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Somebody doesn't like force pushes so msg here

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Increase on-demand heartbeat duration from 10s to 1m

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Use only on-demand heartbeats everywhere

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Use same throttler config everywhere

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Update all keyspaces and don't fail test on missing JSON keys

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Use constant heartbeats in vrepl e2e tests

Until #13175 is
fixed.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Increase workflow command timeout

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Don't wait for throttler on non-serving primaries

Signed-off-by: Matt Lord <mattalord@gmail.com>

* #13175 is fixed, therefore re-instating on-deman heartbeats

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* Added ToC

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* Tweak comment and kick CI

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Treat isOpen as the ready/running signal.

Also align all initMutex usage.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Re-adjust comment

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Adjust CheckIsReady() to match OnlineDDL's expectation/usage

This was only using IsReady() before, now it's using
IsOpen() and IsReady().

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Get rid of log messages from SrvKeyspaceWatcher when no node/key

Signed-off-by: Matt Lord <mattalord@gmail.com>

* More corrections/tweaks

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Use more convenient/clear new IsRunning function

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Revert "Use more convenient/clear new IsRunning function"

This reverts commit 9aef276 as this
change was not correct.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Further fix correct use of IsOpen(), IsRunning(), IsEnabled()

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* throttler.throttledApps cannot be nil

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* Remove --enable_lag_throttler flag

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* Deprecate --throttler_config_via_topo

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* remove throttler mitigation code, as the problem was solved in #13195

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* deperecate throttler config flags

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* Removed tabletmanager_throttler and tabletmanager_throttler_custom_config tests

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* changelog

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* remove EnableThrottler() call

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* restore default value

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* update threshold

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* update flags desc

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* using atomic.Bool

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* Update changelog/18.0/18.0.0/summary.md

Co-authored-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* use MarkDeprecated

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* do not expect flags in vttablet --help

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* remove --throttler-config-via-topo from examples scripts

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

---------

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Co-authored-by: Matt Lord <mattalord@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: TabletManager Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants