Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spanconfig/sqlwatcher: hide factory pattern, speed up tests, surface no-op checkpoints periodically #73171

Merged
merged 4 commits into from
Nov 30, 2021

Conversation

irfansharif
Copy link
Contributor

@irfansharif irfansharif commented Nov 25, 2021

See individual commits.


spanconfig/sqlwatcher: speed up tests

Using a more aggressive closed timestamp target duration brings the runtime for
TestSQLWatcherReactsToUpdate down from 45s+ to single digits. To speed it up,
we also just re-use the same SQLWatcher instead of creating a new one for every
test case.

Other tests in the package benefit from a similar treatment. Using the default
target duration in fact masked a buggy test; in a future commit we end up
rewrite that test so it's skipped for now.

spanconfig/sqlwatcher: hide the factory pattern

The only mutable state across concurrent WatchForSQLUpdate calls was the
internal buffer, which does not need to hang off the surrounding struct. This
obviates the need for the factory pattern we were using -- callers can set up
multiple SQLWatchers concurrently as is (see TestSQLWatcherMultiple).

This PR first hides the factory under the package boundary; in a later commit
it shed it altogether. This has the benefit of reducing the number of symbols
in pkg/spanconfig and making it symmetric with the other spanconfig
dependencies typically found together (KVAccessor, SQLTranslator). It's also
every-so-slightly less verbose to use in the upcoming spanconfig.Reconciler
(#71994).

spanconfig/sqlwatcher: checkpoint, even without updates

If system.{descriptor,zones} is static, we would previously discard
all checkpoint events. This is not what we want -- we'd still like to
know how caught up we are, with ongoing updates or otherwise.

MVCC GC, after all, can still happen; if our last checkpoint was when
the last update occurred, we may end up doing a lot of unnecessary work
when finding out our last checkpoint was GC-ed from underneath us. Lack
of periodic checkpoints also makes tests difficult to write -- you'd
have to induce a benign update to flush out all earlier ones. This
latent flakiness was uncovered after speeding up some existing tests in
an earlier commit.

NB: For incremental (possibly noop) checkpoints, we need to ensure that
the sqlwatcher's buffer is initialized with a low watermark. When
flushing events, we take the most conservative timestamp. If not
initialized to a high water, this might be 0 -- violating the
sqlwatcher's monotonically increasing checkpoint ts invariant.

Release note: None

Using a more aggressive closed timestamp target duration brings the
runtime for TestSQLWatcherReactsToUpdate down from 45s+ to single
digits. To speed it up, we also just re-use the same sqlwatcher instead
of creating a new one for every test case.

Other tests in the package benefit from a similar treatment. Using the
default target duration in fact masked a buggy test; in a future commit
we end up rewrite that test so it's skipped for now.

Release note: None
@irfansharif irfansharif requested review from a team as code owners November 25, 2021 18:29
@irfansharif irfansharif requested review from samiskin and tbg and removed request for a team November 25, 2021 18:29
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@irfansharif irfansharif requested review from nvanbenschoten, arulajmani and ajwerner and removed request for samiskin and tbg November 25, 2021 18:29
@irfansharif irfansharif force-pushed the 211125.collapse-watcher-factory branch from dea62dd to 76d236d Compare November 25, 2021 18:33
The only mutable state across concurrent WatchForSQLUpdate calls was the
internal buffer, which does not need to hang off the surrounding struct.
This obviates the need for the factory pattern we were using -- callers
can set up multiple SQLWatchers concurrently as is (see
TestSQLWatcherMultiple).

This commit simply hides the factory under the package boundary; in a
future commit we'll shed it altogether. This has the benefit of reducing
the number of symbols in pkg/spanconfig and making it symmetric with the
other spanconfig dependencies typically found together (KVAccessor,
SQLTranslator). It's also every-so-slightly less verbose to use in the
upcoming spanconfig.Reconciler (cockroachdb#71994).

Release note: None
A previous commit obviated the need for a factory altogether, but went
only as far hiding it within the package boundary. This commit gets rid
of it entirely.

Release note: None
@irfansharif irfansharif force-pushed the 211125.collapse-watcher-factory branch from 76d236d to 6d738b9 Compare November 25, 2021 20:00
@irfansharif irfansharif changed the title spanconfig/sqlwatcher: hide the factory pattern and speed up tests spanconfig/sqlwatcher: hide factory pattern, speed up tests, surface no-op checkpoints periodically Nov 25, 2021
@irfansharif irfansharif force-pushed the 211125.collapse-watcher-factory branch from 6d738b9 to f8ae233 Compare November 25, 2021 20:02
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, 10 of 10 files at r2, 1 of 1 files at r3, 6 of 6 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @irfansharif)


pkg/spanconfig/spanconfigsqlwatcher/sqlwatcher.go, line 165 at r4 (raw file):

				return err
			}
			if len(events) == 0 && !checkpointNoops.ShouldProcess(timeutil.Now()) {

Would it be straightforward to write a test for this?


pkg/spanconfig/spanconfigsqlwatcher/sqlwatcher_test.go, line 371 at r1 (raw file):

				return
			default:
				time.Sleep(100 * time.Millisecond)

Why was this needed?


pkg/spanconfig/spanconfigsqlwatcher/sqlwatcher_test.go, line 320 at r2 (raw file):

	}

	wg.Wait()

nit: consider pulling this call into the branch that includes the goroutines it's waiting on.

If `system.{descriptor,zones}` is static, we would previously discard
all checkpoint events. This is not what we want -- we'd still like to
know how caught up we are, with ongoing updates or otherwise.

MVCC GC, after all, can still happen; if our last checkpoint was when
the last update occurred, we may end up doing a lot of unnecessary work
when finding out our last checkpoint was GC-ed from underneath us. Lack
of periodic checkpoints also makes tests difficult to write -- you'd
have to induce a benign update to flush out all earlier ones. This
latent flakiness was uncovered after speeding up some existing tests in
an earlier commit.

NB: For incremental (possibly noop) checkpoints, we need to ensure that
the sqlwatcher's buffer is initialized with a low watermark. When
flushing events, we take the most conservative timestamp. If not
initialized to a high water, this might be 0 -- violating the
sqlwatcher's monotonically increasing checkpoint ts invariant.

Release note: None
@irfansharif irfansharif force-pushed the 211125.collapse-watcher-factory branch from f8ae233 to d00baeb Compare November 30, 2021 12:52
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Nathan! Merging on green.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)


pkg/spanconfig/spanconfigsqlwatcher/sqlwatcher.go, line 165 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Would it be straightforward to write a test for this?

It would be, done.


pkg/spanconfig/spanconfigsqlwatcher/sqlwatcher_test.go, line 371 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why was this needed?

We were busy looping before. The wait isn't "needed" for anything, but I don't think busy looping is what we wanted here either.

@irfansharif
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Nov 30, 2021

Build succeeded:

@craig craig bot merged commit 3d2e539 into cockroachdb:master Nov 30, 2021
@irfansharif irfansharif deleted the 211125.collapse-watcher-factory branch November 30, 2021 15:33
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 3, 2021
We introduce two testing knobs that we'll use in future commits:

- SQLWatcherCheckpointNoopsEveryDurationOverride, if set, will override
  how often the SQLWatcher checkpoints no-ops (follow up to cockroachdb#73171)

- ExcludeDroppedDescriptorsFromLookup will control whether the
  SQLTranslator ignores dropped descriptors. We'll also use this knob
  for the Reconciler (cockroachdb#71994), for the same purpose. Ignoring dropped
  descriptors in tests is a convenient (and faster!) alternative to
  waiting for the descriptor to actually get GC-ed by the GC job. It
  also makes it easier to order events transactionally -- it's
  sufficient for a checkpoint to cross the timestamp when the table is
  dropped, instead of waiting for the GC job to pick it up post TTL.
  This latter form is used in our current sqltranslator tests where we
  sleep arbitrarily:

    # Sleep for 5 seconds, which is more than the TTL on db.t1, so that
    # the gc job can delete the descriptor.
    sleep duration=5

  This is unfortunate and makes for slower tests than necessary. We can
  get rid of it in a subsequent commit.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 3, 2021
We introduce two testing knobs that we'll use in future commits:

- SQLWatcherCheckpointNoopsEveryDurationOverride, if set, will override
  how often the SQLWatcher checkpoints no-ops (follow up to cockroachdb#73171)

- ExcludeDroppedDescriptorsFromLookup will control whether the
  SQLTranslator ignores dropped descriptors. We'll also use this knob
  for the Reconciler (cockroachdb#71994), for the same purpose. Ignoring dropped
  descriptors in tests is a convenient (and faster!) alternative to
  waiting for the descriptor to actually get GC-ed by the GC job. It
  also makes it easier to order events transactionally -- it's
  sufficient for a checkpoint to cross the timestamp when the table is
  dropped, instead of waiting for the GC job to pick it up post TTL.
  This latter form is used in our current sqltranslator tests where we
  sleep arbitrarily:

    # Sleep for 5 seconds, which is more than the TTL on db.t1, so that
    # the gc job can delete the descriptor.
    sleep duration=5

  This is unfortunate and makes for slower tests than necessary. We can
  get rid of it in a subsequent commit.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 6, 2021
We introduce two testing knobs that we'll use in future commits:

- SQLWatcherCheckpointNoopsEveryDurationOverride, if set, will override
  how often the SQLWatcher checkpoints no-ops (follow up to cockroachdb#73171)

- ExcludeDroppedDescriptorsFromLookup will control whether the
  SQLTranslator ignores dropped descriptors. We'll also use this knob
  for the Reconciler (cockroachdb#71994), for the same purpose. Ignoring dropped
  descriptors in tests is a convenient (and faster!) alternative to
  waiting for the descriptor to actually get GC-ed by the GC job. It
  also makes it easier to order events transactionally -- it's
  sufficient for a checkpoint to cross the timestamp when the table is
  dropped, instead of waiting for the GC job to pick it up post TTL.
  This latter form is used in our current sqltranslator tests where we
  sleep arbitrarily:

    # Sleep for 5 seconds, which is more than the TTL on db.t1, so that
    # the gc job can delete the descriptor.
    sleep duration=5

  This is unfortunate and makes for slower tests than necessary. We can
  get rid of it in a subsequent commit.

Release note: None
craig bot pushed a commit that referenced this pull request Dec 6, 2021
73416: spanconfig/{sqltranslator,testcluster}: support multi-tenant tests r=irfansharif a=irfansharif

**spanconfig/{sqltranslator,testcluster}: support multi-tenant tests**

We introduce a nimble testing framework for multi-tenant span config
tests, and use it to power tests for spanconfig.SQLTranslator when run
on behalf of secondary tenants. This framework will see further addition
when testing spanconfig.Reconciler -- we'll maintain a handle on each
tenant's reconciler, kvaccessor, and use both to determine what
span config mutations are made after SQL changes.

This commit also leverages the ExcludeDroppedDescriptorsFromLookup
testing knob we introduced in an earlier commit -- it obviates the need
for arbitrary sleeps (which are both fragile and slow). It also packages
a PrintSpanConfigDiffedAgainstDefaults helper that we'll use in a future
PR (#71994).


**sqltranslatorccl: carve standalone package for tests**

Rename-only commit. We'll use subpackages within ccl/spanconfigccl when
testing other components (spanconfig.Reconciler; [#71994](#71994)).


**spanconfig/{watcher,translator}: add testing knobs**

We introduce two testing knobs that we'll use in future commits:

- SQLWatcherCheckpointNoopsEveryDurationOverride, if set, will override
  how often the SQLWatcher checkpoints no-ops (follow up to [#73171](#73171))

- ExcludeDroppedDescriptorsFromLookup will control whether the
  SQLTranslator ignores dropped descriptors. We'll also use this knob
  for the Reconciler ([#71994](#71994)), for the same purpose. Ignoring dropped
  descriptors in tests is a convenient (and faster!) alternative to
  waiting for the descriptor to actually get GC-ed by the GC job. It
  also makes it easier to order events transactionally -- it's
  sufficient for a checkpoint to cross the timestamp when the table is
  dropped, instead of waiting for the GC job to pick it up post TTL.
  This latter form is used in our current sqltranslator tests where we
  sleep arbitrarily:

      # Sleep for 5 seconds, which is more than the TTL on db.t1, so that
      # the gc job can delete the descriptor.
      sleep duration=5

  This is unfortunate and makes for slower tests than necessary. We can
  get rid of it in a subsequent commit.

Release note: None



Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants