Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic tablet throttler config: enable/disable, set metrics query/threshold #11604

Merged

Conversation

shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Oct 30, 2022

Description

A different implementation for dynamic throttler config from the one described in #11316

We have decided to implement dynamic throttler config in the following way:

  • The tablet throttler can now be enabled/disabled dynamically. Similarly, the metrics query and threshold can be changed dynamically.
  • Changes to the config are made with a new vtctldclient UpdateThrottlerConfig command.
  • The throttler configuration is in topo, not in a backend _vt table.
  • Control via vtgate is postponed and to be re-evaluated if we want such control.
  • New show vitess_throttler status query returns per-tablet throttler state.
  • New behavior is flag protected under --throttler-config-via-topo

Discussion & details.

The main deviation from #11316 is that we do not use a _vt backend table to store the throttler's config, and instead store it in topo. There are multiple reasons to that:

  1. We want the throttler configuration to apply per keyspace, not per shard
  2. And certainly not per tablet
  3. Storing state in _vt means replica tablets are susceptible to replication lag, which introduces a dependency loop
  4. And it would not be possible for replica tablets to store their own state anyway
  5. topo seems a logical place to set this kind of configuration
  6. We are able to utilize topo listeners/callback to simplify the propagation of information to the tablets.

We utilize local topos ; changes to configuration will apply to all cell-topos of a keyspace.

We do require global topo to be available if you want to make a change to the throttler. This is because we need global topo to tell us where to find per-cell topos.

vtctldclient UpdateThrottlerConfig

You indicate the specific configuration changes you make to the throttler, like so:

vtctldclient UpdateThrottlerConfig [flags] <keyspace>

Examples:

# disable throttler; all throttler checks will return with "200 OK"
$ vtctldclient UpdateThrottlerConfig --disable commerce

# enable throttler; checks are responded with appropriate status per current metrics
$ vtctldclient UpdateThrottlerConfig --enable commerce

# Both enable and set threshold in same command. Since no query is indicated, we assume the default check for replication lag
$ vtctldclient UpdateThrottlerConfig --enable --threshold 5.0 commerce

# Change threshold. Does not affect enabled/disabled state of the throttler
$ vtctldclient UpdateThrottlerConfig --threshold 1.5 commerce

# Use a custom query
$ vtctldclient UpdateThrottlerConfig --custom_query "show global status like 'threads_running'" --check_as_check_self --threshold 50 commerce

# Restore default query and threshold
$ vtctldclient UpdateThrottlerConfig --custom_query "" --check_as_check_shard --threshold 1.5 commerce

Any changes made, are sent to all tablets of given keyspace, in all cells and all shards.

  • If the change is to enable/disable the throttler, then all throttlers respond near instantly.
  • For query/threshold changes, the throttlers apply the changes within a few seconds (10sec at current configuration).

Configuration and backwards compatibility

Today, the throttler is controlled per-tablet via vttablet command line flags:

  • enable_lag_throttler
  • throttle_threshold
  • throttle_metrics_threshold (used when metrics query is defined, overrides the above, and that's confusing)
  • throttle_metrics_query
  • throttle_check_as_check_self

The above five flags are consolidated into four in the new ThrottlerConfig proto:

  message ThrottlerConfig {
    // Enabled indicates that the throttler is actually checking state for requests. When disabled, it automatically returns 200 OK for all checks
    bool enabled = 1;

    // Threshold is the threshold for either the default check (heartbeat lag) or custom check
    double threshold = 2;

    // CustomQuery is an optional query that overrides the default check query
    string custom_query = 3;

    // CheckAsCheckSelf indicates whether a throttler /check request should behave like a /check-self
    bool check_as_check_self = 4;
  }

For backwards compatibility, the existing vttablet flags are still accepted, but will be deprecated in the future. A new vttablet flag, --throttler-config-via-topo, indicates that a SrvKeyspace_ThrottlerConfig configuration (i.e. configuration stored in topo) overrides the above flags. The way to transition into the new setup is to first run your vitess cluster with existing configuration, untouched. Then, populate topo with the new config via vtctldclient UpdateThrottlerConfig as described above. Then, add --throttler-config-via-topo and restart tablets.

show vitess_throttler status

The command show vitess_throttler status retrieves throttler status from tablets in all cells and shards. To clarify, the command does not read anything from topo. The command represents how the tablets are actually running the throttler. Is it enabled? Disabled? What's the threshold?

At this time the query is only sent to PRIMARY tablets, but we will follow up and make it run on all tablets.

mysql> show vitess_throttler status;
+-------+---------+-----------+
| shard | enabled | threshold |
+-------+---------+-----------+
| 0     |       1 |         5 |
+-------+---------+-----------+

Tests

The main test in this PR is the new go/test/endtoend/tabletmanager/throttler_topo/throttler_test.go, which is tested in a new CI workflow/shard called tabletmanager_throttler_topo.

This new test runs the throttler, changes configuration dynamically, enables, disables, changes threshold, changes the metrics query, etc. etc.

This PR also has a minor effect on on-demand heatbeats to ensure they get an initial "kick" upon startup, and both tabletmanager_throttler and tabletmanager_throttler_custom_config are adjusted accordingly. In the future, we will delete those two tests/workflows/shards, and keep only tabletmanager_throttler_topo.

Related Issue(s)

#11316

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…ards compatibility)

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
… STATUS

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! My last concern has been addressed. Made one minor comment about the vtctld help output. I'll let the other reviewers double check their respective areas. Thank you for working on this! I think this greatly improves the throttling feature! ❤️

// UpdateThrottlerConfig makes a UpdateThrottlerConfig gRPC call to a vtctld.
UpdateThrottlerConfig = &cobra.Command{
Use: "UpdateThrottlerConfig [--enable|--disable] [--threshold=<float64>] [--custom-query=<query>] [--check-as-check-self|--check-as-check-shard] <keyspace>",
Short: "Rebuilds the cell-specific SrvVSchema from the global VSchema objects in the provided cells (or all cells if none provided).",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies that you can specify cells but currently you cannot. It also doesn't rebuild/refresh so much as update the config in the topo which is then picked up by the watchers. Looks like we can mostly copy the help output from vtctl: "Update the table throttler configuration for all cells and tablets of a given keyspace"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This is just an overlooked copy+paste. Updated the comment.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

@ajm188 is this looking good on your side?

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@deepthi deepthi added the release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) label Nov 16, 2022
@deepthi
Copy link
Member

deepthi commented Nov 16, 2022

@shlomi-noach can you

  • update the description to use vtctldclient examples instead of vtctlclient?
  • add this to the release notes summary?

Rest LGTM

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

update the description to use vtctldclient examples instead of vtctlclient?

done

add this to the release notes summary?

Added release notes

@ajm188
Copy link
Contributor

ajm188 commented Nov 17, 2022

@ajm188 is this looking good on your side?

ya, approving formally for completeness! great stuff

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach removed release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) NeedsWebsiteDocsUpdate What it says labels Nov 20, 2022
@shlomi-noach
Copy link
Contributor Author

Still seeing "Code owner review required". Given @mattlord @ajm188 and @deepthi approved the PR, I'm gonna just go ahead and merge when ready.

@shlomi-noach
Copy link
Contributor Author

huh! I used to have the superpower to force merge a PR, I seem to not have it. Will solicit more approvals

@shlomi-noach shlomi-noach merged commit 9fa7784 into vitessio:main Nov 20, 2022
@shlomi-noach shlomi-noach deleted the vitess-throttler-dynamic-config branch November 20, 2022 19:23
Ayogua

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Cluster management Component: Query Serving Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants