
spanconfig: introduce spanconfig.Reconciler #71994

Merged
merged 3 commits into cockroachdb:master from 211026.sql-reconciler
Dec 15, 2021

Conversation

Contributor

@irfansharif irfansharif commented Oct 26, 2021

spanconfig: introduce spanconfig.Reconciler

Reconciler is responsible for reconciling a tenant's zone configs (SQL
construct) with the cluster's span configs (KV construct). It's the
central engine for the span configs infrastructure; a single Reconciler
instance is active for every tenant in the system.

type Reconciler interface {
  // Reconcile starts the incremental reconciliation process from
  // the given checkpoint. If it does not find MVCC history going
  // far back enough[1], it falls back to a scan of all
  // descriptors and zone configs before being able to do more
  // incremental work. The provided callback is invoked with
  // timestamps that can be safely checkpointed. A future
  // Reconciliation attempt can make use of this timestamp to
  // reduce the amount of necessary work (provided the MVCC
  // history is still available).
  //
  // [1]: It's possible for system.{zones,descriptor} to have been
  //      GC-ed away; think suspended tenants.
  Reconcile(
    ctx context.Context,
    checkpoint hlc.Timestamp,
    callback func(checkpoint hlc.Timestamp) error,
  ) error
}

Let's walk through what it does. At a high-level, we maintain an
in-memory data structure that's up-to-date with the contents of the KV
(at least the subset of spans we have access to, i.e. the keyspace
carved out for our tenant ID). We watch for changes to SQL state
(descriptors, zone configs), translate the SQL updates to the flattened
span+config form, "diff" the updates against our data structure to see
if there are any changes we need to inform KV of. If so, we do, and
ensure that our data structure is kept up-to-date. We continue watching
for future updates and repeat as necessary.
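
As a toy illustration of the "diff" step above (this is not the actual Reconciler code; the span keys, config strings, and helper names are made up for the sketch), the snippet below shows a single SQL update, already translated to span+config form, being compared against an in-memory mirror of the KV state, with a KV write emitted only when something actually changed:

```go
package main

import "fmt"

// kvWrite is the (toy) operation we'd send to KV: either an upsert of a
// span config, or a deletion of whatever config is stored over the span.
type kvWrite struct {
	span, conf string
	del        bool
}

// diffAndApply keeps kvState (our in-memory mirror of KV: span -> config)
// up to date and reports the write KV needs to see, if any.
func diffAndApply(kvState map[string]string, span, conf string, dropped bool) (kvWrite, bool) {
	existing, ok := kvState[span]
	switch {
	case dropped && ok: // object deleted in SQL; delete its span config in KV
		delete(kvState, span)
		return kvWrite{span: span, del: true}, true
	case !dropped && (!ok || existing != conf): // new or changed config
		kvState[span] = conf
		return kvWrite{span: span, conf: conf}, true
	default: // no-op update; nothing to tell KV
		return kvWrite{}, false
	}
}

func main() {
	kvState := map[string]string{"/Tenant/10/Table/50": "num_replicas=3"}
	if w, ok := diffAndApply(kvState, "/Tenant/10/Table/50", "num_replicas=5", false); ok {
		fmt.Println("upsert:", w) // config changed; inform KV
	}
	if _, ok := diffAndApply(kvState, "/Tenant/10/Table/50", "num_replicas=5", false); !ok {
		fmt.Println("no-op") // same translation again; nothing to do
	}
}
```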

There's only a single instance of the Reconciler running for a given
tenant at a given point in time (mutual exclusion/leasing is provided by
the jobs subsystem). We needn't worry about contending writers, or the
KV state being changed from underneath us. What we do have to worry
about, however, is suspended tenants not reconciling while suspended.
It's possible for a suspended tenant's SQL state to be GC-ed away at
older MVCC timestamps; when watching for changes, we could fail to
observe tables/indexes/partitions getting deleted. Left as is, this
would result in us never issuing corresponding deletion requests for the
dropped span configs -- we'd be leaving orphaned span configs lying
around (taking up storage space and creating pointless empty ranges). A
"full reconciliation pass" is our attempt to find all these extraneous
entries in KV and to delete them.

We can use our span config data structure here too, one that's
pre-populated with the contents of KV. We translate the entire SQL state
into constituent spans and configs, diff against our data structure to
generate KV updates that we then apply. We follow this with clearing out
all these spans in our data structure, leaving behind all extraneous
entries to be found in KV -- entries we can then simply issue deletes
for.
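
As a toy illustration of that last step (made-up spans and configs, not the actual implementation -- real entries are keyed by key ranges, not plain strings), the sketch below diffs the translated SQL state against a store seeded with KV contents, clears out each span it accounts for, and treats whatever is left over as an orphaned entry to delete:

```go
package main

import "fmt"

func main() {
	// In-memory store seeded with the current contents of KV (span -> config).
	kvContents := map[string]string{
		"/Tenant/10/Table/50": "num_replicas=3",
		"/Tenant/10/Table/51": "num_replicas=3",
		"/Tenant/10/Table/52": "num_replicas=3", // table 52 was dropped while the pod was suspended
	}
	// The entire SQL state (descriptors + zone configs), translated into the
	// flattened span+config form.
	sqlState := map[string]string{
		"/Tenant/10/Table/50": "num_replicas=3",
		"/Tenant/10/Table/51": "num_replicas=5", // zone config changed
	}

	var toUpsert, toDelete []string
	for sp, conf := range sqlState {
		if kvContents[sp] != conf {
			toUpsert = append(toUpsert, sp) // KV is stale (or missing) for this span
		}
		delete(kvContents, sp) // clear out the span we've accounted for
	}
	for sp := range kvContents {
		toDelete = append(toDelete, sp) // leftovers are extraneous entries in KV
	}
	fmt.Println("upsert:", toUpsert) // [/Tenant/10/Table/51]
	fmt.Println("delete:", toDelete) // [/Tenant/10/Table/52]
}
```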

----

server,kvaccessor: record span configs during tenant creation/gc

For newly created tenants, we want to ensure hard splits on tenant
boundaries. The source of truth for split points in the span configs
subsystem is the contents of system.span_configurations. To ensure
hard splits, we insert a single key record at the tenant prefix. In a
future commit we'll introduce the spanconfig.Reconciler process, which
runs per tenant and governs all config entries under each tenant's
purview. This has implications for this initial record we're talking
about (this record might get cleaned up for e.g.); we'll explore it in
tests for the Reconciler itself.
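
For illustration, here's a minimal sketch of what that single-key seed span could look like, using the keys/roachpb helpers; whether the commit constructs it exactly this way (and which config it installs over it) is an assumption here, not something this sketch asserts:

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// seedSpanForTenant returns a single-key span at the tenant's prefix.
// Installing a span config record over it forces a range split at the
// tenant boundary, even before the tenant's Reconciler has ever run.
func seedSpanForTenant(id roachpb.TenantID) roachpb.Span {
	prefix := keys.MakeTenantPrefix(id)
	return roachpb.Span{Key: prefix, EndKey: prefix.Next()}
}

func main() {
	fmt.Println(seedSpanForTenant(roachpb.MakeTenantID(10))) // a one-key span starting at /Tenant/10
}
```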

Creating a single key record is easy enough -- we could've written
directly to system.span_configurations. When a tenant is GC-ed
however, we need to clear out all tenant state transactionally. To that
end we plumb in a txn-scoped KVAccessor into the planner where
crdb_internal.destroy_tenant is executed. This lets us easily delete
all abandoned tenant span config records.

Note: We get rid of spanconfig.experimental_kvaccessor.enabled. Access
to the span configs infrastructure is already sufficiently gated through
COCKROACH_EXPERIMENTAL_SPAN_CONFIGS. Now that
crdb_internal.create_tenant attempts to write through the KVAccessor,
it's cumbersome to have to enable the setting manually in every
multi-tenant test (increasingly the default) that exercises some part
of the span configs infrastructure.

This commit introduces the need for a migration -- for existing clusters
with secondary tenants, when upgrading we need to install this initial
record at the tenant prefix for all extant tenants (and make sure to
continue doing so for subsequent newly created tenants). This is to
preserve the hard-split-on-tenant-boundary invariant we wish to provide.
It's possible for an upgraded multi-tenant cluster to have dormant sql
pods that have never reconciled. If KV switches over to the span configs
subsystem, splitting only on the contents of
system.span_configurations, we'll fail to split on all tenant
boundaries. We'll introduce this migration in a future PR (before
enabling span configs by default).

Release note: None


Only the last two commits here are of interest (first one is from #73531).

@cockroach-teamcity
Member

This change is Reviewable

irfansharif added a commit to irfansharif/cockroach that referenced this pull request Nov 25, 2021
We first introduced spanconfig.StoreWriter in cockroachdb#70287. Here we extend the
interface to accept a batch of updates instead of just one.

    type StoreWriter interface {
        Apply(ctx context.Context, dryrun bool, updates ...Update) (
            deleted []roachpb.Span, added []roachpb.SpanConfigEntry,
        )
    }

The implementation is subtle -- we're not processing one update at a
time. The semantics we're after is applying a batch of updates
atomically on a consistent snapshot of the underlying store. This comes
up in the upcoming spanconfig.Reconciler (cockroachdb#71994) -- there, following a
zone config/descriptor change, we want to update KV state in a single
request/RTT instead of an RTT per descendant table. The intended
usage is better captured in the aforementioned PR; let's now talk
about what this entails for the data structure.

To apply a single update, we want to find all overlapping spans and
clear out just the intersections. If the update is adding a new span
config, we'll also want to add the corresponding store entry after. We
do this by deleting all overlapping spans in their entirety and
re-adding the non-overlapping segments. Pseudo-code:

  for entry in store.overlapping(update.span):
      union, intersection = union(update.span, entry), intersection(update.span, entry)
      pre  = span{union.start_key, intersection.start_key}
      post = span{intersection.end_key, union.end_key}

      delete {span=entry.span, conf=entry.conf}
      if entry.contains(update.span.start_key):
          # First entry overlapping with update.
          add {span=pre, conf=entry.conf} if non-empty
      if entry.contains(update.span.end_key):
          # Last entry overlapping with update.
          add {span=post, conf=entry.conf} if non-empty

  if adding:
      add {span=update.span, conf=update.conf} # add ourselves

When extending to a set of updates, things are more involved. Let's
assume that the updates are non-overlapping and sorted by start key. As
before, we want to delete overlapping entries in their entirety and
re-add the non-overlapping segments. With multiple updates, it's
possible that a segment being re-added will overlap another update.  If
processing one update at a time in sorted order, we want to only re-add
the gap between the consecutive updates.

    keyspace         a  b  c  d  e  f  g  h  i  j
    existing state      [--------X--------)
    updates          [--A--)           [--B--)

When processing [a,c):A, after deleting [b,h):X, it would be incorrect
to re-add [c,h):X since we're also looking to apply [g,i):B. Instead of
re-adding the trailing segment right away, we carry it forward and
process it when iterating over the second, possibly overlapping update.
In our example, when iterating over [g,i):B we can subtract the overlap
from [c,h):X and only re-add [c,g):X.

It's also possible for the segment to extend past the second update. In
the example below, when processing [d,f):B and having [b,h):X carried
over, we want to re-add [c,d):X and carry forward [f,h):X to the update
after (i.e. [g,i):C).

    keyspace         a  b  c  d  e  f  g  h  i  j
    existing state      [--------X--------)
    updates          [--A--)  [--B--)  [--C--)

One final note: we're iterating through the updates without actually
applying any mutations. Going back to our first example, when processing
[g,i):B, retrieving the set of overlapping spans would (again) retrieve
[b,h):X -- an entry we've already encountered when processing [a,c):A.
Re-adding non-overlapping segments naively would re-add [b,g):X -- an
entry that overlaps with our last update [a,c):A. When retrieving
overlapping entries, we need to exclude any that overlap with the
segment that was carried over. Pseudo-code:

  carry-over = <empty>
  for update in updates:
      carried-over, carry-over = carry-over, <empty>
      if update.overlap(carried-over):
          # Fill in the gap between consecutive updates.
          add {span=span{carried-over.start_key, update.start_key}, conf=carried-over.conf}
          # Consider the trailing span after update; carry it forward if non-empty.
          carry-over = {span=span{update.end_key, carried-over.end_key}, conf=carried-over.conf}
      else:
          add {span=carried-over.span, conf=carried-over.conf} if non-empty

      for entry in store.overlapping(update.span):
          if entry.overlap(carried-over):
              continue # already handled via the carried-over segment

          union, intersection = union(update.span, entry), intersection(update.span, entry)
          pre  = span{union.start_key, intersection.start_key}
          post = span{intersection.end_key, union.end_key}

          delete {span=entry.span, conf=entry.conf}
          if entry.contains(update.span.start_key):
              # First entry overlapping with update.
              add {span=pre, conf=entry.conf} if non-empty
          if entry.contains(update.span.end_key):
              # Last entry overlapping with update.
              carry-over = {span=post, conf=entry.conf}

      if adding:
          add {span=update.span, conf=update.conf} # add ourselves

  add {span=carry-over.span, conf=carry-over.conf} if non-empty
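
To tie the pseudo-code together, here's a self-contained toy model of the batched apply in Go -- string keys and simplified types, not the actual spanconfig.StoreWriter implementation. It computes the deleted/added diff for the first example above (existing [b,h):X, updates [a,c):A and [g,i):B):

```go
package main

import "fmt"

type span struct{ start, end string } // half-open interval [start, end)

func (s span) empty() bool            { return s.start >= s.end }
func (s span) contains(k string) bool { return s.start <= k && k < s.end }
func (s span) overlaps(o span) bool   { return s.start < o.end && o.start < s.end }

type entry struct {
	sp   span
	conf string
}

type update struct {
	entry
	deletion bool
}

// apply diffs a batch of sorted, non-overlapping updates against a snapshot
// of the store, carrying trailing segments forward across updates.
func apply(store []entry, updates []update) (deleted, added []entry) {
	var carryOver entry
	for _, u := range updates {
		carriedOver := carryOver
		carryOver = entry{}
		if !carriedOver.sp.empty() {
			if carriedOver.sp.overlaps(u.sp) {
				// Fill in the gap between consecutive updates, and keep
				// carrying the trailing segment (if any) forward.
				if gap := (span{carriedOver.sp.start, u.sp.start}); !gap.empty() {
					added = append(added, entry{gap, carriedOver.conf})
				}
				if post := (span{u.sp.end, carriedOver.sp.end}); !post.empty() {
					carryOver = entry{post, carriedOver.conf}
				}
			} else {
				added = append(added, carriedOver)
			}
		}
		for _, e := range store {
			if !e.sp.overlaps(u.sp) || e.sp.overlaps(carriedOver.sp) {
				continue // non-overlapping, or already handled via the carried-over segment
			}
			deleted = append(deleted, e)
			if pre := (span{e.sp.start, u.sp.start}); e.sp.contains(u.sp.start) && !pre.empty() {
				added = append(added, entry{pre, e.conf}) // first entry overlapping with update
			}
			if post := (span{u.sp.end, e.sp.end}); e.sp.contains(u.sp.end) && !post.empty() {
				carryOver = entry{post, e.conf} // last entry overlapping with update; carry forward
			}
		}
		if !u.deletion {
			added = append(added, u.entry) // add ourselves
		}
	}
	if !carryOver.sp.empty() {
		added = append(added, carryOver)
	}
	return deleted, added
}

func main() {
	store := []entry{{span{"b", "h"}, "X"}}
	updates := []update{
		{entry{span{"a", "c"}, "A"}, false},
		{entry{span{"g", "i"}, "B"}, false},
	}
	deleted, added := apply(store, updates)
	fmt.Println("deleted:", deleted) // [{{b h} X}]
	fmt.Println("added:  ", added)   // [{{a c} A} {{c g} X} {{g i} B}]
}
```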

We've extended the randomized testing suite to generate batches of
updates at a time. We've also added a few illustrated datadriven tests.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Nov 25, 2021
The only mutable state across concurrent WatchForSQLUpdate calls was the
internal buffer, which does not need to hang off the surrounding struct.
This obviates the need for the factory pattern we were using -- callers
can set up multiple SQLWatchers concurrently as is (see
TestSQLWatcherMultiple).

This commit simply hides the factory under the package boundary; in a
future commit we'll shed it altogether. This has the benefit of reducing
the number of symbols in pkg/spanconfig and making it symmetric with the
other spanconfig dependencies typically found together (KVAccessor,
SQLTranslator). It's also ever-so-slightly less verbose to use in the
upcoming spanconfig.Reconciler (cockroachdb#71994).

Release note: None
@irfansharif irfansharif force-pushed the 211026.sql-reconciler branch 16 times, most recently from 0eda1a2 to e5e0da5 on November 29, 2021 20:49
Member

@nvanbenschoten nvanbenschoten left a comment


Reviewed 9 of 35 files at r22, 27 of 27 files at r23, 8 of 8 files at r24, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)


pkg/migration/migrations/seed_tenant_span_configs.go, line 82 at r23 (raw file):

				// This tenant already has span config entries. It was either
				// already migrated (migrations need to be idempotent) or it was
				// created after PreSeedTenantSpanConfigs was activated. THere's

s/THere's/There's/


pkg/spanconfig/spanconfig.go, line 48 at r20 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Leave a note about what happens if someone tries to chains calls to this like kvaccessor.WithTxn(txn1).WithTxn(txn2).

Is this worth supporting? Could doing so mask very subtle bugs? Consider a log.Fatal in this case instead.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

Previously, irfansharif (irfan sharif) wrote…

It's waiting for a lease to expire somewhere, right?

No, it's transactionally checking against the job table to make sure that no job exists, and then (and only then) transactionally creating the new job. I'm reading the code snippet linked above and the library method linked below, though I may be misunderstanding something. Does our use of txns against the jobs table not give us the true mutual exclusion we're wanting?

func RunningJobExists(

That depends on what removes jobs from the table. If a node begins running a job and then crashes or otherwise becomes unresponsive, how do we recover and ensure that someone else can adopt the job? I assume there's a time-based lease at some level here or none of this would be fault-tolerant.

Maybe that's what the sqlliveness table stores. I see reference to a crdb_internal.sql_liveness_is_alive builtin function, which is checking a lease expiration. server.sqlliveness.ttl and server.sqlliveness.heartbeat imply a 40 second lease that's extended every 5 seconds.

We don't need to bottom out on this in this PR, but every new job that assumes mutual exclusion should be having these discussions at some point.


pkg/spanconfig/spanconfigtestutils/recorder.go, line 76 at r17 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Done (in a stand-alone commit). Called this constructor "Deletion" instead, so it reads store.Apply(..., spanconfig.Deletion(sp)).

Thanks, I think this turned out well. It's much easier to read now.


pkg/sql/tenant.go, line 127 at r20 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Good catch, split the migration across two versions. The first will start allowing newly created tenants to install fresh span configs, the second will install configs for tenants that don't have any. Added appropriate tests, etc.

Is that sufficient to avoid the race? If these are minor versions, do we have a guarantee that the fast node that runs both migrations isn't two minor versions ahead of the old node?

What's the hazard with removing this version check entirely? Is it that the spanconfig table might not exist? If so, maybe we should be using transactions to avoid the race. Could we transactionally check whether the table exists yet, and perform the insert if it does? And if it doesn't, we know that the migration to seed the table hasn't run yet, so we rely on that.


pkg/sql/tenant.go, line 128 at r23 (raw file):

	if !execCfg.Settings.Version.IsActive(ctx, clusterversion.PreSeedTenantSpanConfigs) {
		return err

Should this be return nil?


pkg/sql/tenant.go, line 440 at r23 (raw file):

		scKVAccessor := execCfg.SpanConfigKVAccessor.WithTxn(txn)
		entries, err := scKVAccessor.GetSpanConfigEntriesFor(ctx, []roachpb.Span{tenantSpan})

What will happen here if the spanconfigs table hasn't been created yet?

Contributor Author

@irfansharif irfansharif left a comment


PTAL -- should've addressed everything.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)


pkg/spanconfig/spanconfig.go, line 48 at r20 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this worth supporting? Could doing so mask very subtle bugs? Consider a log.Fatal in this case instead.

Done.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

I assume there's a time-based lease at some level here or none of this would be fault-tolerant.

// Periodically check if the span config reconciliation job exists and start
// it if it doesn't.
timer := timeutil.NewTimer()
defer timer.Stop()
triggerJobCheck()
for {
	timer.Reset(checkReconciliationJobInterval.Get(&m.settings.SV))

It's simpler, the span config manager (runs for every sql pod) simply polls the jobs table periodically to make sure that there's at least one job that's running, and if not, creating it. Perhaps @arulajmani or @ajwerner can correct me if I'm wrong, but if we're holding pkg/jobs right in the spanconfig manager, and creating jobs in the way described above, I believe we already have guaranteed mutual exclusion. If not, I'd love to learn how -- it's a guarantee we're relying on in the Reconciler.


pkg/spanconfig/spanconfigtestutils/recorder.go, line 76 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Thanks, I think this turned out well. It's much easier to read now.

Thanks for the suggestion! I agree.


pkg/sql/tenant.go, line 127 at r20 (raw file):

Is that sufficient to avoid the race? If these are minor versions, do we have a guarantee that the fast node that runs both migrations isn't two minor versions ahead of the old node?

I believe it's sufficient. The migrations infrastructure provides two guarantees:

  • All cluster versions, and their associated migrations, are stepped through in order
  • Before bumping the cluster version X, all nodes will have already observed cluster version bump X-1. This includes newly added nodes, etc. See here.

These guarantees let you string together arbitrarily complex migrations.

Is it that the spanconfig table might not exist?

Given the SpanConfigurationsTable migration precedes both these migrations, the table is guaranteed to exist.

What's the hazard with removing this version check entirely?

I don't think there's much of a hazard -- the system table already exists (was baked into 21.2). That said, I think it's also fine to gate new functionality (writing of these records) with an explicit cluster version instead of allowing it to happen unconditionally if create_tenant is executed against a 22.1 node in a mixed-version cluster. It makes it slightly easier to reason about.


pkg/sql/tenant.go, line 128 at r23 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this be return nil?

Done.


pkg/sql/tenant.go, line 440 at r23 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What will happen here if the spanconfigs table hasn't been created yet?

That's not possible, see comment above about the guarantees provided by the migrations infrastructure.

Collaborator

@arulajmani arulajmani left a comment


Reviewed 11 of 41 files at r14, 69 of 69 files at r16, 34 of 69 files at r17, 13 of 58 files at r18, 35 of 35 files at r22, 1 of 8 files at r24.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)


pkg/spanconfig/spanconfig.go, line 56 at r17 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Done, went with the unified interface. It does however add a wart -- this interface doesn't really apply to the Connector (just panics in the implementation). I'm happy to change it out to the usual factory pattern if you'd like.

Mild preference to switching this to using the usual factory pattern. Like you mentioned, there's the wart with the Connector and it'll also sidestep Nathan's concern elsewhere about chaining WithTxns.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 121 at r24 (raw file):

// checkpoint. For changes to, say, RANGE DEFAULT, the RPC request proto is
// proportional to the number of schema objects.
func (r *Reconciler) Reconcile(

Did you consider having fullReconciler and incrementalReconciler implement the Reconciler interface and have the contents of the job do what this function does instead?


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 360 at r24 (raw file):

			}

			toDelete, toUpsert := r.storeWithKVContents.Apply(ctx, false /* dryrun */, updates...)

This means that the incrementalReconciler is single use, right? For example, you could've applied updates to the store but the RPC below could fail. I don't think this is a problem with the way things are structured currently, but maybe consider making this more explicit by either passing in the store to this function or making a factory for this struct.

@arulajmani
Collaborator

@irfansharif I'm not sure why, but reviewable isn't being great to me on this PR. Even for this second batch of comments, it looks like the line numbers are getting lost in the UI, but you can see them in the comment. I don't have that many, but sorry for the trouble here.

@irfansharif irfansharif force-pushed the 211026.sql-reconciler branch 2 times, most recently from 68aee41 to 59d2b43 on December 13, 2021 21:53
Contributor Author

@irfansharif irfansharif left a comment


TFTR!

reviewable isn't being great to me on this PR [..] sorry for the trouble here

Nothing to be sorry about! Thanks for the review. Perhaps the native github reviews work better?

Mild preference to switching this to using the usual factory pattern.

I also like the factory pattern better, is it alright to do it in a follow on? I even had it in an earlier revision, it's a very small change. I keep catching painful (only because of slow UI build times, so not actually that painful) skews because of the new cluster versions and I'd like to start enabling this stuff on master before everyone's gone for the holidays.

Did you consider having fullReconciler and incrementalReconciler implement the Reconciler interface and have the contents of the job do what this function does instead?

Not really, I think I prefer the current breakdown, especially with it being as decomposed as it is into the various stages. Part of the reason is that we're using the full reconciliation process to also cheaply construct a spanconfig.Store with kvaccessor contents. I also like keeping the job as dumb as it is, focused only on retries, and keeping more business logic here where it's easier to test/inject knobs/etc. See below.

This means that the incrementalReconciler is single use, right? For example, you could've applied updates to the store but the RPC below could fail.

Even the fullReconciler is. This is what originally motivated the dryrun option, thinking I'd do diffs := store.apply(..., dryrun=true); issue-rpcs(diffs); store.apply(..., dryrun=false), but it felt unnecessary in the end. In the happy case things will succeed. In the unhappy case, it's more than fine to retry at a higher level (span config job/manager), and preferable IMO because it keeps the retry/recovery logic out of the reconciliation code and at the level where it's going to exist in some form anyway (if the timestamps being watched are already gc-ed for e.g., the jobs need to start reconciliation all over).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)

Member

@nvanbenschoten nvanbenschoten left a comment


Reviewed 2 of 11 files at r25, 8 of 8 files at r26, 6 of 42 files at r27, 30 of 33 files at r28, 8 of 8 files at r29, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)


pkg/spanconfig/spanconfigkvaccessor/kvaccessor.go, line 66 at r29 (raw file):

func (k *KVAccessor) WithTxn(ctx context.Context, txn *kv.Txn) spanconfig.KVAccessor {
	if k.optionalTxn != nil {
		log.Fatalf(ctx, "KVAccessor already scoped to txn (was .WithTxn(...) chained multiple times?")

nit: missing closing paren.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I assume there's a time-based lease at some level here or none of this would be fault-tolerant.

// Periodically check if the span config reconciliation job exists and start
// it if it doesn't.
timer := timeutil.NewTimer()
defer timer.Stop()
triggerJobCheck()
for {
	timer.Reset(checkReconciliationJobInterval.Get(&m.settings.SV))

It's simpler, the span config manager (runs for every sql pod) simply polls the jobs table periodically to make sure that there's at least one job that's running, and if not, creating it. Perhaps @arulajmani or @ajwerner can correct me if I'm wrong, but if we're holding pkg/jobs right in the spanconfig manager, and creating jobs in the way described above, I believe we already have guaranteed mutual exclusion. If not, I'd love to learn how -- it's a guarantee we're relying on in the Reconciler.

I think it might be helpful to ask the inverse question. What happens if the node running the job fails? Someone else must take over, right? Similarly, what happens if the node running the job hits a minute-long GC pause?


pkg/sql/tenant.go, line 127 at r20 (raw file):

Before bumping the cluster version X, all nodes will have already observed cluster version bump X-1. This includes newly added nodes, etc. See here.

Got it, thanks for the link. I was under the impression that this was only true of major version numbers, but that must be an outdated understanding.


pkg/sql/tenant.go, line 440 at r23 (raw file):

Previously, irfansharif (irfan sharif) wrote…

That's not possible, see comment above about the guarantees provided by the migrations infrastructure.

But we don't have a version gate in this code path. I must be missing something.

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think it might be helpful to ask the inverse question. What happens if the node running the job fails? Someone else must take over, right? Similarly, what happens if the node running the job hits a minute-long GC pause?

The only true claim of mutual exclusion we have is when we update the job checkpoint, and even then, only if we properly use the job object passed into the resumer. It does seem possible that under the right sort of scenario, the current job may fail or the current pod may lose its lease and then a new pod may start executing. Currently none of the job leasing code considers synchrony assumptions.

Here are the safeguards:

  • When a job instance goes to update its status, it ensures that the current lease is the lease under which it was resumed.
  • When an underlying sqlliveness session is observed to be expired by another instance, that session's record will be deleted.
  • Job registry objects will periodically revoke leases from jobs which are leased by sessions which are known to not be live.

The above mechanisms provide mutual exclusion for updates to the records themselves.

Now, there's also the fact that sql pods will kill themselves if they think that they are expired. Sadly, that doesn't actually provide us with any real protection. Firstly, the process may not detect it is expired until another node has already removed it; session expiration occurs in the MVCC time domain and not the clock time domain. Secondly, I don't think we're that principled about detecting the expiration promptly while making an outstanding request to extend the session. Lastly, the process of receiving from a timer and acting on it is not instantaneous.

So, if we want to make deep claims about mutual exclusion we'll need something deeper than what we've got.


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 87 at r17 (raw file):

Quoted 4 lines of code…
// [1]: #73399 proposes a new KV request type that would let us more easily
//      ascertain when a tenant's sql statements have been reconciled in KV,
//      which we could then use to only suspend pods after all outstanding work
//      ("reconciliation") has been done.

I may be misreading your note. The intention of the issue was not really to more easily ascertain when a tenant's sql statements had been reconciled, but rather to more rapidly trigger reconciliation. The reconciliation state itself will be represented by a single row in KV, so the caller can easily either watch that row or poll it to wait for reconciliation. The more pressing desire is to make reconciliation happen promptly after a write which causes changes. If we could do that, then it'd be reasonable to wait for reconciliation before returning to the client.

Contributor Author

@irfansharif irfansharif left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

Previously, ajwerner wrote…

The only true claim of mutual exclusion we have is when we update the job checkpoint, and even then, only if we properly use the job object passed into the resumer. It does seem possible that under the right sort of scenario, the current job may fail or the current pod may lose its lease and then a new pod may start executing. Currently none of the job leasing code considers synchrony assumptions.

Here are the safeguards:

  • When a job instance goes to update its status, it ensures that the current lease is the lease under which it was resumed.
  • When an underlying sqlliveness session is observed to be expired by another instance, that session's record will be deleted.
  • Job registry objects will periodically revoke leases from jobs which are leased by sessions which are known to not be live.

The above mechanisms provide mutual exclusion for updates to the records themselves.

Now, there's also the fact that sql pods will kill themselves if they think that they are expired. Sadly, that doesn't actually provide us with any real protection. Firstly, the process may not detect it is expired until another node has already removed it; session expiration occurs in the MVCC time domain and not the clock time domain. Secondly, I don't think we're that principled about detecting the expiration promptly while making an outstanding request to extend the session. Lastly, the process of receiving from a timer and acting on it is not instantaneous.

So, if we want to make deep claims about mutual exclusion we'll need something deeper than what we've got.

Oh! I was under the mistaken assumption that job resumption entailed touching the job record itself transactionally, and so did the lease expiration (don't ask me how, I see it doesn't make sense). Filed #73789 to make sure it's addressed pre-22.1 (+cc @arulajmani).


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 87 at r17 (raw file):

Previously, ajwerner wrote…
// [1]: #73399 proposes a new KV request type that would let us more easily
//      ascertain when a tenant's sql statements have been reconciled in KV,
//      which we could then use to only suspend pods after all outstanding work
//      ("reconciliation") has been done.

I may be misreading your note. The intention of the issue was not really to more easily ascertain when a tenant's sql statements had been reconciled, but rather to more rapidly trigger reconciliation. The reconciliation state itself will be represented by a single row in KV, so the caller can easily either watch that row or poll it to wait for reconciliation. The more pressing desire is to make reconciliation happen promptly after a write which causes changes. If we could do that, then it'd be reasonable to wait for reconciliation before returning to the client.

Got it, updated. Elsewhere @nvanbenschoten mentioned possibly wanting to make pod suspension a more explicit protocol in the database. A request like the one proposed in #73399 could help us there too: we could block all new sql txns temporarily right as we're about to suspend, ensure we've reconciled, thus suspending only pods with no outstanding work.


pkg/sql/tenant.go, line 127 at r20 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Before bumping the cluster version X, all nodes will have already observed cluster version bump X-1. This includes newly added nodes, etc. See here.

Got it, thanks for the link. I was under the impression that this was only true of major version numbers, but that must be an outdated understanding.

Yea, it's something we codified with the long-running migrations project.


pkg/sql/tenant.go, line 440 at r23 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

But we don't have a version gate in this code path. I must be missing something.

Doh, didn't notice that was for the GCTenantRecord code path -- added.

@irfansharif
Contributor Author

(Bump.)

Member

@nvanbenschoten nvanbenschoten left a comment


:lgtm:

Reviewed 30 of 33 files at r31, 8 of 8 files at r32, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Oh! I was under the mistaken assumption that job resumption entailed touching the job record itself transactionally, and so did the lease expiration (don't ask me how, I see it doesn't make sense). Filed #73789 to make sure it's addressed pre-22.1 (+cc @arulajmani).

Thanks for filing. Let's continue the discussion over in #73789 so it doesn't hold up this PR any longer.

Collaborator

@arulajmani arulajmani left a comment


I also like the factory pattern better, is it alright to do it in a follow on? I even had it in an earlier revision, it's a very small change. I keep catching painful (only because of slow UI build times, so not actually that painful) skews because of the new cluster versions and I'd like to start enabling this stuff on master before everyone's gone for the holidays.

Sounds good, fine by me!

Even the fullReconciler is. This is what originally motivated the dryrun option, thinking I'd do diffs := store.apply(..., dryrun=true); issue-rpcs(diffs); store.apply(..., dryrun=false), but it felt unnecessary in the end. In the happy case things will succeed. In the unhappy case, it's more than fine to retry at a higher level (span config job/manager), and preferable IMO because it keeps the retry/recovery logic out of the reconciliation code and at the level where it's going to exist in some form anyway (if the timestamps being watched are already gc-ed for e.g., the jobs need to start reconciliation all over).

I'm fine with retrying at a higher level, but we should add a comment on the {incremental,full}Reconcilers that these things are meant to be single use. Better yet, maybe we can be defensive about this in code by either using a factory to construct these or checking on some state to ensure {incremental,full}Reconciler.reconcile is only ever called once.

For newly created tenants, we want to ensure hard splits on tenant
boundaries. The source of truth for split points in the span configs
subsystem is the contents of system.span_configurations. To ensure
hard splits, we insert a single key record at the tenant prefix. In a
future commit we'll introduce the spanconfig.Reconciler process, which
runs per tenant and governs all config entries under each tenant's
purview. This has implications for this initial record we're talking
about (this record might get cleaned up for e.g.); we'll explore it in
tests for the Reconciler itself.

Creating a single key record is easy enough -- we could've written
directly to system.span_configurations. When a tenant is GC-ed
however, we need to clear out all tenant state transactionally. To that
end we plumb in a txn-scoped KVAccessor into the planner where
crdb_internal.destroy_tenant is executed. This lets us easily delete
all abandoned tenant span config records.

Note: We get rid of spanconfig.experimental_kvaccessor.enabled. Access
to spanconfigs infrastructure is already sufficiently gated through
the env var. Now that crdb_internal.create_tenant attempts to write
through the KVAccessor, it's cumbersome to have to enable the setting
manually in every multi-tenant test (increasingly the default) that
exercises some part of the span configs infrastructure.

---

This commit also needs a migration -- for existing clusters with
secondary tenants, when upgrading we need to install this initial record
at the tenant prefix for all extant tenants (and make sure to continue
doing so for subsequent newly created tenants). This is to preserve the
hard-split-on-tenant-boundary invariant we wish to provide. It's
possible for an upgraded multi-tenant cluster to have dormant sql pods
that have never reconciled. If KV switches over to the span configs
subsystem, splitting only on the contents of system.span_configurations,
we'll fail to split on all tenant boundaries. To this end we introduce
clusterversion.SeedTenantSpanConfigs, which allows us to seed span
config data for secondary tenants. The associated migration seeds
entries for existing tenants.

Release note: None
By using a type definition instead of a stand-alone type, we can reduce
the amount of code needed to convert from a SpanConfigEntry to an Update
(something we frequently do). It also helps make it less error prone.
While here, we introduce two helpful constructors for the two kinds of
updates we're typically interested in -- additions and deletions.
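
As a rough sketch of what this looks like (the field names and constructor signatures here are assumptions for illustration, not necessarily what the commit ships verbatim):

```go
package spanconfig // sketch only

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// Update is a type definition (rather than a new struct wrapping
// SpanConfigEntry), so converting between the two is a plain type conversion.
type Update roachpb.SpanConfigEntry

// Addition returns an update that upserts the given config over the span.
func Addition(sp roachpb.Span, conf roachpb.SpanConfig) Update {
	return Update{Span: sp, Config: conf}
}

// Deletion returns an update that deletes whatever config is stored over the
// span (represented here, by assumption, as an entry with an empty config).
func Deletion(sp roachpb.Span) Update {
	return Update{Span: sp}
}
```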

Release note: None
@irfansharif
Contributor Author

but we should add a comment on the {incremental,full}Reconcilers that these things are meant to be single use.

Done. Thanks for all the reviews, this code was fun/exciting to write and iterate on -- looking forward to it flailing! I'm going to close #67679 and break out the remaining small items into individual quick-win issues.

bors r+

@craig
Contributor

craig bot commented Dec 15, 2021

Build succeeded:

@craig craig bot merged commit 5e70321 into cockroachdb:master Dec 15, 2021
@irfansharif irfansharif deleted the 211026.sql-reconciler branch December 15, 2021 19:40
craig bot pushed a commit that referenced this pull request Dec 16, 2021
73746: spanconfig: visualize differences with gossiped SystemConfigSpan r=irfansharif a=irfansharif

**spanconfig: visualize differences with gossiped SystemConfigSpan**

We introduce a test-only package that compares the span configs
infrastructure against the gossiped SystemConfigSpan it was designed to
replace. The datadriven test spits out a visual "diff" of the two
subsystems, indicating exactly where they differ (or not!) in terms of
implied range boundaries and the configurations to apply. It does so by
having both systems run side-by-side and scanning their contents. These
range boundaries and configurations have second-order effects (various
actions are executed on behalf of ranges by internal queues), but we
don't bother with any of that in the test.

Of particular note is the behavior of each system in the presence of
secondary tenants that create arbitrary schemas and zone configs. This
after all is what motivated the new infrastructure. Here's an example,
as a diff, that shows how the system config span is unable to split
within an active tenant's boundary unlike the new thing:

```diff
configs version=current offset=37 limit=5
----
...
/Tenant/10                                 range default
/Tenant/11/Table/3                         range default
/Tenant/11/Table/4                         range default
/Tenant/11/Table/5                         range default
/Tenant/11/Table/6                         range default
...

configs version=legacy offset=43
----
...
/Tenant/10                                 range default
/Tenant/11                                 range default

diff offset=31 limit=8
----
--- gossiped system config span (legacy)
+++ span config infrastructure (current)
...
@@ -43,4 +37,37 @@
 /Table/47                                  range system
 /Tenant/10                                 range default
-/Tenant/11                                 range default
+/Tenant/11/Table/3                         range default
+/Tenant/11/Table/4                         range default
+/Tenant/11/Table/5                         range default
+/Tenant/11/Table/6                         range default
...
```
This commit also gets rid of the ShadowReader. We were using it
previously to manually diagnose differences between the two subsystems.
With the improved testing, we can get rid of the painter's tape.

---

**spanconfig: rm benign differences with SystemConfigSpan**

In the previous commit we demonstrated visually the differences between
the gossip-backed system config span and the new infrastructure. This
commit gets rid of the benign differences. We could also just not do
anything here, but it's only a minor change.

Release note: None

---

Only the last two commits are of interest here, the earlier ones are from #71994.



Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>