Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: Add migration for interleaved intents to separated #66445

Merged
merged 3 commits into from
Aug 13, 2021

Conversation

itsbilal
Copy link
Contributor

@itsbilal itsbilal commented Jun 14, 2021

Adds a long running migration, SeparatedIntentsMigration,
that does the following steps:

  1. Iterate through all range descriptors using a RangeIterator
  2. Calls a new ranged write command, Barrier, that is a no-op
    during evaluation but waits on any other writes on that range
    to finish before returning
  3. Call ScanInterleavedIntents, a new ranged read command, to
    return interleaved intents encountered over key range
  4. Iterate over returned intents then push all those txns and
    resolve any intents returned.

The barrier in step 2 ensures that no future writes after that
point will write interleaved intents. The actual disabling
of the setting to write interleaved intents happens in the
next commit.

Part of #41720.

Release note: None.

@itsbilal itsbilal requested a review from a team as a code owner June 14, 2021 19:23
@itsbilal itsbilal self-assigned this Jun 14, 2021
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@itsbilal
Copy link
Contributor Author

There are still some TODOs here, primarily 1) removing the separatedIntentsEnabled setting, and 2) rate limiting. I'm going to add on a second commit for 1) shortly.

But since this is a pretty involved change, I wanted to get this out early so I can get any opinions on alternative ways to approach this, or things I might have overlooked. Thanks!

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very excited for this!

Reviewed 24 of 24 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @sumeerbhola, and @tbg)


pkg/kv/kvserver/replica.go, line 1875 at r1 (raw file):

interlevaed

interleaved


pkg/kv/kvserver/replica.go, line 1934 at r1 (raw file):

		}

		txnIntents[meta.Txn.ID] = append(txnIntents[meta.Txn.ID], keyAndIntent{

I think we should limit the number of intents we pull into memory at once, to avoid OOMs.


pkg/kv/kvserver/replica.go, line 1950 at r1 (raw file):

		h := roachpb.Header{
			Timestamp:    r.Clock().Now(),
			UserPriority: roachpb.NormalUserPriority,

I think we want to push with MaxUserPriority so that we always win pushes (i.e. never block).


pkg/kv/kvserver/replica_send.go, line 137 at r1 (raw file):

		// the result.
		if mlt := br.Responses[0].GetMigrateLockTable(); mlt != nil {
			err := r.migrateLockTable(ctx, ba.Requests[0].GetMigrateLockTable(), mlt)

I'm curious why we're handling this request here using this engine snapshot registry. There are a few other options that seem more straightforward to me, such as doing the migration during request evaluation or in executeReadOnlyBatch, or collecting the intents in the command and returning them in the response (instead of holding onto the engine snapshot). We could even just do a high-priority scan to resolve all the intents, which I think would have the same effect, although perhaps less efficiently unless we're clever about WriteIntentError retries.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 201 at r1 (raw file):

// separatedIntentsMigration is the below-raft part of the migration for
// interleaved to separated intents. It is a no-op as the only purpose of
// running the Migrate command here is to clear out replicas with

...with what?


pkg/migration/migrations/separated_intents.go, line 73 at r1 (raw file):

			select {
			case r, ok := <-m.requests:
				if !ok && r.end == nil && r.start == nil {

Checking r.endand r.start isn't necessary here -- if !ok then r will always be a zero value.

@itsbilal itsbilal force-pushed the separated-intents-migration branch from 9dce6ed to 746e71b Compare June 16, 2021 18:15
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! still reading -- flushing some comments

Reviewed 8 of 24 files at r1, 6 of 8 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @itsbilal, @sumeerbhola, and @tbg)


pkg/kv/kvserver/replica.go, line 1934 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I think we should limit the number of intents we pull into memory at once, to avoid OOMs.

+1


pkg/kv/kvserver/replica.go, line 1950 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I think we want to push with MaxUserPriority so that we always win pushes (i.e. never block).

The downside is it could impact user-facing traffic. It may be better to debug a stuck migration versus having users complain about upgrading to 21.2 causing production issues. I would suggest getting some more opinions.


pkg/kv/kvserver/replica.go, line 1909 at r2 (raw file):

			return err
		}
		if !key.IsMVCCKey() {

should this ever be a non-MVCC key. That would mean we are somehow iterating over the local range key space of the lock table. May be worth returning an error instead.


pkg/kv/kvserver/replica.go, line 1920 at r2 (raw file):

			//
			// TODO(bilal): Explore seeking here in case there are keys with lots of
			// versioned values.

I think we need a benchmark for this migration.


pkg/kv/kvserver/replica_send.go, line 137 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I'm curious why we're handling this request here using this engine snapshot registry. There are a few other options that seem more straightforward to me, such as doing the migration during request evaluation or in executeReadOnlyBatch, or collecting the intents in the command and returning them in the response (instead of holding onto the engine snapshot). We could even just do a high-priority scan to resolve all the intents, which I think would have the same effect, although perhaps less efficiently unless we're clever about WriteIntentError retries.

I think this is being done here since this is the first place we pop out of executeBatchWithConcurrencyRetries which means for this read-only command we've released latches. So this potentially slow migration will not hinder any other operations like range splits, merges etc.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 58 at r2 (raw file):

	latchSpans, lockSpans *spanset.SpanSet,
) {
	latchSpans.AddNonMVCC(spanset.SpanReadOnly, roachpb.Span{Key: keys.RangeDescriptorKey(rs.GetStartKey())})

can you add a code comment on why this doesn't need to latch the span of the request. I am assuming that is because we don't care about concurrent writes since they could not be adding interleaved intents.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 69 at r2 (raw file):

// created as part of the MigrateLockTable command. A singleton instance
// of this type is stored as lockTableSnapshotRegistry.
type lockTableSnapshotRegistryType struct {

mild preference for lockTableMigrationSnapshotRegistryType since we aren't really using this snapshot to read the lock table key space.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 79 at r2 (raw file):

}

// GetAndRemoveLockTableSnapshot grabs o lock on the lock table snapshot

s/o/a/


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 147 at r2 (raw file):

	ms := cArgs.Stats

	if ms != nil && ms.ContainsEstimates == 0 && ms.IntentCount == 0 {

We're willing to trust stats now?


pkg/migration/migrations/migrations.go, line 72 at r2 (raw file):

		"move over all intents to separate lock table",
		toCV(clusterversion.SeparatedIntentsMigration),
		separatedIntentsMigration),

how will we know when this is complete?


pkg/migration/migrations/separated_intents.go, line 42 at r2 (raw file):

// migrateLockTableRequest represents one request for a migrateLockTable
// command. This will correspond to one range at the time of running the
// IterateRangeDescriptors command. As long as

sentence is incomplete.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @itsbilal, @sumeerbhola, and @tbg)


pkg/migration/migrations/separated_intents.go, line 113 at r2 (raw file):

}

func separatedIntentsMigration(

who will run this function and which node will it be run on?


pkg/migration/migrations/separated_intents.go, line 158 at r2 (raw file):

		}
		numMigratedRanges += len(descriptors)
		log.Infof(ctx, "[batch %d/??] started lock table migrations for %d ranges", batchIdx, numMigratedRanges)

how would one figure out if migration was stuck on some ranges? Would one look in the cluster logs?

@itsbilal itsbilal force-pushed the separated-intents-migration branch from 746e71b to 790f1b8 Compare June 16, 2021 21:22
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs! Didn't reply to all the comments yet - but pushing out some of the larger changes so far while I get to the others.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @itsbilal, @sumeerbhola, and @tbg)


pkg/kv/kvserver/replica.go, line 1875 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…
interlevaed

interleaved

Done.


pkg/kv/kvserver/replica.go, line 1934 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I think we should limit the number of intents we pull into memory at once, to avoid OOMs.

Done. Added a 100-intent limit.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 201 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

...with what?

.. interleaved intents 😂 . Done.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 79 at r2 (raw file):

Previously, sumeerbhola wrote…

s/o/a/

Done.


pkg/migration/migrations/separated_intents.go, line 42 at r2 (raw file):

Previously, sumeerbhola wrote…

sentence is incomplete.

Done.


pkg/migration/migrations/separated_intents.go, line 113 at r2 (raw file):

Previously, sumeerbhola wrote…

who will run this function and which node will it be run on?

This function will run on one node in the cluster. If it doesn't finish successfully, it could be re-run on any other node on the cluster. The migration framework takes care of this.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @itsbilal, @sumeerbhola, and @tbg)


pkg/migration/migrations/separated_intents.go, line 113 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

This function will run on one node in the cluster. If it doesn't finish successfully, it could be re-run on any other node on the cluster. The migration framework takes care of this.

is there any checkpointing provided by the migration framework or it needs to start from the beginning?

I realize we currently have the IntentCount==0 fast path that may make restarting from the beginning cheap enough. But if we do have to read all the data in the cluster, it would be beneficial to not have to repeat work.

@itsbilal itsbilal force-pushed the separated-intents-migration branch from 790f1b8 to 92f1df0 Compare June 17, 2021 20:59
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @itsbilal, @sumeerbhola, and @tbg)


pkg/kv/kvserver/replica.go, line 1909 at r2 (raw file):

Previously, sumeerbhola wrote…

should this ever be a non-MVCC key. That would mean we are somehow iterating over the local range key space of the lock table. May be worth returning an error instead.

That'll always be local then, right? I guess it's worth returning an error here then. Done.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 58 at r2 (raw file):

Previously, sumeerbhola wrote…

can you add a code comment on why this doesn't need to latch the span of the request. I am assuming that is because we don't care about concurrent writes since they could not be adding interleaved intents.

Done.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 69 at r2 (raw file):

Previously, sumeerbhola wrote…

mild preference for lockTableMigrationSnapshotRegistryType since we aren't really using this snapshot to read the lock table key space.

Done.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 147 at r2 (raw file):

Previously, sumeerbhola wrote…

We're willing to trust stats now?

@tbg made this suggestion, wondering what his thoughts here would be. My understanding here was that intent count would be consistent.


pkg/migration/migrations/migrations.go, line 72 at r2 (raw file):

Previously, sumeerbhola wrote…

how will we know when this is complete?

This will end up in the migrations system table, as that tracks completion.


pkg/migration/migrations/separated_intents.go, line 73 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Checking r.endand r.start isn't necessary here -- if !ok then r will always be a zero value.

I thought !ok just means "no more values after this one", but on reading the golang spec, you're right. I don't need to check for the latter. Done.

https://golang.org/ref/spec#Receive_operator


pkg/migration/migrations/separated_intents.go, line 113 at r2 (raw file):

Previously, sumeerbhola wrote…

is there any checkpointing provided by the migration framework or it needs to start from the beginning?

I realize we currently have the IntentCount==0 fast path that may make restarting from the beginning cheap enough. But if we do have to read all the data in the cluster, it would be beneficial to not have to repeat work.

It'll need to start from the beginning. We could store progress somewhere ourselves, but that could be quite a lifting in itself. I'm leaning hard on the fast-paths letting us skip many ranges we've already migrated.


pkg/migration/migrations/separated_intents.go, line 158 at r2 (raw file):

Previously, sumeerbhola wrote…

how would one figure out if migration was stuck on some ranges? Would one look in the cluster logs?

I should probably add more verbose logging to help with this, because I can see that being pretty opaque right now. Do you think a log line for every range we start and end a migration on would be too verbose and useless? A better way would be to spin up a separate goroutine that tracks and regularly logs "waiting for migration to complete on X ranges" and if X is low enough, it prints out their range IDs / boundaries at time of seed.

@itsbilal itsbilal force-pushed the separated-intents-migration branch 2 times, most recently from 82eb5c1 to bd53d0e Compare June 18, 2021 18:03
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 8 files at r2, 11 of 11 files at r3, 24 of 24 files at r4, 4 of 4 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @sumeerbhola, and @tbg)


pkg/kv/kvserver/replica.go, line 1934 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done. Added a 100-intent limit.

Thanks! Is it worth adding a byte limit as well, in case we encounter large keys? That would allow us to optimistically use a higher intent count limit as well, which should help with throughput in the typical case of small keys.


pkg/kv/kvserver/replica.go, line 1950 at r1 (raw file):

Previously, sumeerbhola wrote…

The downside is it could impact user-facing traffic. It may be better to debug a stuck migration versus having users complain about upgrading to 21.2 causing production issues. I would suggest getting some more opinions.

This is a good point, I don't know how aggressive we want to be here. If we are worried about impacting workloads, I would think we should use PUSH_TOUCH instead, which will never abort other txns. PUSH_ABORT with NormalUserPriority would have a 50% chance of aborting conflicting txns, I believe.


pkg/kv/kvserver/replica.go, line 1903 at r3 (raw file):

	for resumeKey.Key != nil {
		iter := snap.NewEngineIterator(storage.IterOptions{

Is it necessary/desirable to reopen the iterator for each batch? I would imagine it'd be more performant to keep it open, and the cost should be negligible (?).


pkg/kv/kvserver/replica_send.go, line 137 at r1 (raw file):

Previously, sumeerbhola wrote…

I think this is being done here since this is the first place we pop out of executeBatchWithConcurrencyRetries which means for this read-only command we've released latches. So this potentially slow migration will not hinder any other operations like range splits, merges etc.

Right. We could in principle run the request without taking out any latches, which is essentially what we're doing here as well, but the mutexes would hold up splits/merges.

As a straw man, could we not instead run a request that gathers a batch of interleaved intents and returns them in the response, using resume spans until done? We could then resolve those intents either here or in the migration job itself, but without having to use this global Pebble snapshot registry.


pkg/migration/migrations/separated_intents.go, line 158 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I should probably add more verbose logging to help with this, because I can see that being pretty opaque right now. Do you think a log line for every range we start and end a migration on would be too verbose and useless? A better way would be to spin up a separate goroutine that tracks and regularly logs "waiting for migration to complete on X ranges" and if X is low enough, it prints out their range IDs / boundaries at time of seed.

I don't have a strong opinion either way (would maybe take a cue from other similar migrations), but it would be useful to see which ranges are currently being processed with some frequency, to debug whether it's stuck and what it's stuck on.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, how do we synchronize with ongoing transactions? Is the migration only run after the upgrade is finalized across the entire cluster?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @sumeerbhola, and @tbg)

@tbg tbg requested a review from sumeerbhola June 21, 2021 14:26
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have only given this a high-level look and wrote down some of the particulars of the synchronization that we need.

Seeing how complex this migration is, I'd appreciate a much more verbose description of the PR so that I can point out potential holes in the approach and make suggestions without giving this a low-level review yet. What's here is too vague for me to fill in the blanks. Thank you!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)


pkg/kv/db.go, line 718 at r5 (raw file):

// local map. At send time, the replica(s) conduct a migration to convert all
// interleaved intents to separated intents. This command is called with the
// assumption that no more interleaved intents will be written after this point.

This is an invariant, but is it upheld? I know it is approximately true because the cluster version that enables separate intents is made active across all nodes before the migration that calls this method starts. But it is, at least in theory, still racy. There might be a goroutine, hanging, that just checked for the separated intents and came back wanting to write an inline intent (but hasn't done so yet). The migration to push out the "always write inline intents" comes along, but it's too late. MigrateLockTable runs. And now the goroutine unwedges and writes an intent.

Maybe this scenario isn't possible, but this PR needs to document in an appropriate place why this can't be possible.

If I'm reading the code correctly, the determination of inline vs separated intents is made here:

pb.wrappedIntentWriter =
wrapIntentWriter(context.Background(), pb, settings, false /* isLongLived */)

and this is during evaluation (called from here):

// evaluateWriteBatchWrapper is a wrapper on top of evaluateBatch() which deals
// with filling out result.LogicalOpLog.
func (r *Replica) evaluateWriteBatchWrapper(
ctx context.Context,
idKey kvserverbase.CmdIDKey,
rec batcheval.EvalContext,
ms *enginepb.MVCCStats,
ba *roachpb.BatchRequest,
lul hlc.Timestamp,
latchSpans, lockSpans *spanset.SpanSet,
) (storage.Batch, *roachpb.BatchResponse, result.Result, *roachpb.Error) {
batch, opLogger := r.newBatchedEngine(latchSpans, lockSpans)
br, res, pErr := evaluateBatch(ctx, idKey, batch, rec, ms, ba, lul, false /* readOnly */)
if pErr == nil {
if opLogger != nil {
res.LogicalOpLog = &kvserverpb.LogicalOpLog{
Ops: opLogger.LogicalOps(),
}
}
}
return batch, br, res, pErr
}

This is good, as evaluation is under the protection of replica latches. To make this airtight, we need:

  1. a "barrier" that makes sure that "new" operations will never write separated intents.
  2. a way to flush out the "old" side of the barrier, i.e. ops that may have inline intents pending.

The first part is indeed provided by a cluster setting (when active, all nodes have been informed that all new operations must use separated intents, and need to ignore the cluster setting)

But there could be some command hanging out for ages in evaluation that is still writing an inline intent. We need to make sure it is gone. One simple way to do this would be to declare a read at infinity over the entire keyspan (i.e. on each range). This would in effect block on all evaluations that are inflight, meaning that once it's through, nobody is contemplating inline intents anymore (and any inline intents have materialized on the state machines, at least on the leader, though not durably - we need to sync everyone's stores once; the migrations infra does this in its stabilization wrappers, need to make sure we hit that path here).

To avoid the pipeline stall that comes with a scan at infinity, we can also sleep for some time, take the local node's current hlc reading, and then do a read at that timestamp. I think that should also work, though now that we have non-blocking txns we need to sleep a little longer. It may not be worth doing this dance and just do the original "read at infinity" strategy.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 147 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

@tbg made this suggestion, wondering what his thoughts here would be. My understanding here was that intent count would be consistent.

At this point I would avoid trusting the stats, see this comment: https://reviewable.io/reviews/cockroachdb/cockroach/66268#-MblNrQB9zzcli3MQoaW. The stakes are too high here.

Doing the full scan unconditionally does make this migration a lot more expensive, but it also makes it more predictable - when it runs in our tests it will have largely the same cost that it will have in our customers' deployments. This is our first "true" long-running migration, and we need to test it out appropriately. This will be easier if it doesn't have a fast path.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we synchronize with ongoing transactions? Is the migration only run after the upgrade is finalized across the entire cluster?

Correct, the migration only runs after all nodes in the cluster have locked out older versions. When we introduced the migration framework, we changed to use only even-number versions for named version. The way the migration/version bump works is that it first bumps to the odd version preceding the migration version and proves that that has been set as the cluster version before it proceeds to perform the migration and then set increment the version again. There are some hoops we jump through to make sure that no nodes which are not running the proper version can run.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)


pkg/migration/migrations/separated_intents.go, line 113 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

It'll need to start from the beginning. We could store progress somewhere ourselves, but that could be quite a lifting in itself. I'm leaning hard on the fast-paths letting us skip many ranges we've already migrated.

The jobs frameworks offers checkpointing and checkpoint state storage. It hasn't been plumbed through to the migration jobs. It would not be terribly hard to do so. Happy to guide that if you're interested. It feels like an optimization which could be done as follow-up, no? If we deem it important, then let's file an issue?


pkg/migration/migrations/separated_intents.go, line 175 at r3 (raw file):

	log.Infof(ctx, "[batch %d/%d] migrated lock table for %d ranges", batchIdx, batchIdx, numMigratedRanges)

	// Part 2 - Issue no-op Migrate commands to all ranges. This has the only

Would it be cleaner if this was a separate migration at the next version? You'd get checkpointing of these two parts for free. Also, why no parallelism here?

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @erikgrinaker, @itsbilal, @sumeerbhola, and @tbg)


pkg/kv/db.go, line 718 at r5 (raw file):

This would in effect block on all evaluations that are inflight, meaning that once it's through, nobody is contemplating inline intents anymore (and any inline intents have materialized on the state machines, at least on the leader, though not durably - we need to sync everyone's stores once; the migrations infra does this in its stabilization wrappers, need to make sure we hit that path here

Doesn't the non-sync optimization in replica_application_state_machine.go also rely on the fact that if the node fails no future leaseholder will execute BatchRequests before applying all the raft committed entries to its state machine?

To avoid the pipeline stall that comes with a scan at infinity, we can also sleep for some time, take the local node's current hlc reading, and then do a read at that timestamp. I think that should also work, though now that we have non-blocking txns we need to sleep a little longer. It may not be worth doing this dance and just do the original "read at infinity" strategy.

The simpler thing, that would avoid any chance of a pipeline stall, would be to add a WaitForWrites method on spanlatch.Manager that only grabs a snapshot and waits for write latches on the span (without adding its own read latches). The messiness here would be arranging for this to be called -- I don't think we want to plumb this special case through the general framework of RequestSequencer.SequenceReq. We could special case this in executeBatchWithConcurrencyRetries, prior to the for loop, since a MigrateLockTableRequest will be in a BatchRequest by itself.


pkg/kv/kvserver/replica.go, line 1950 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This is a good point, I don't know how aggressive we want to be here. If we are worried about impacting workloads, I would think we should use PUSH_TOUCH instead, which will never abort other txns. PUSH_ABORT with NormalUserPriority would have a 50% chance of aborting conflicting txns, I believe.

makes sense -- lets do PUSH_TOUCH.
@tbg do you have an opinion?


pkg/kv/kvserver/replica.go, line 1903 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Is it necessary/desirable to reopen the iterator for each batch? I would imagine it'd be more performant to keep it open, and the cost should be negligible (?).

I agree -- pebbleSnapshot does not reuse iterators under the cover.


pkg/kv/kvserver/replica.go, line 1921 at r5 (raw file):

		intentCount := 0

		for ; valid && err == nil; valid, err = iter.NextEngineKey() {

I think we need to add support for some throttling here in case the scan starts affecting user-facing traffic. Just pausing the migration would be insufficient since we'd want to resume it eventually. This would allow an admin to adjust a cluster setting to slow down the work being done for each range.


pkg/migration/migrations/separated_intents.go, line 113 at r2 (raw file):

Previously, ajwerner wrote…

The jobs frameworks offers checkpointing and checkpoint state storage. It hasn't been plumbed through to the migration jobs. It would not be terribly hard to do so. Happy to guide that if you're interested. It feels like an optimization which could be done as follow-up, no? If we deem it important, then let's file an issue?

I think the fast-path is nixed now, so we do need checkpointing -- not in this PR, but in a followup.
I assume we would get progress reporting in terms of how many ranges have completed, as part of that checkpointing.


pkg/migration/migrations/separated_intents.go, line 34 at r5 (raw file):

// happens when the destination replica(s) are sending replies back to the
// original node.
const concurrentMigrateLockTableRequests = 4

Could we get unlucky and all 4 could be ranges with the same leaseholder?
Ideally we should arrange it such that we run up to k (k=1 default) range per node in the cluster. That way we also scale the number of concurrent ranges being migrated to the cluster size. Fine to leave it as a TODO for a followup PR.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @itsbilal, @sumeerbhola, and @tbg)


pkg/migration/migrations/separated_intents.go, line 113 at r2 (raw file):

Previously, sumeerbhola wrote…

I think the fast-path is nixed now, so we do need checkpointing -- not in this PR, but in a followup.
I assume we would get progress reporting in terms of how many ranges have completed, as part of that checkpointing.

It'll be up to the caller of the checkpoint code to indicate % completion. The checkpoint state itself can be whatever, though I think completed spans is likely the most natural choice. I'll ready the plumbing for a subsequent PR to adopt.


pkg/migration/migrations/separated_intents.go, line 158 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I don't have a strong opinion either way (would maybe take a cue from other similar migrations), but it would be useful to see which ranges are currently being processed with some frequency, to debug whether it's stuck and what it's stuck on.

Slow logging?

@itsbilal itsbilal force-pushed the separated-intents-migration branch from b4fde72 to 36c0455 Compare June 23, 2021 20:34
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @itsbilal, @sumeerbhola, and @tbg)


pkg/kv/db.go, line 718 at r5 (raw file):

The first part is indeed provided by a cluster setting (when active, all nodes have been informed that all new operations must use separated intents, and need to ignore the cluster setting)

To clarify, when you say cluster setting, you're referring to the active cluster version, right? The second commit in this PR actually removes the cluster setting that even allows the toggling of separated intents at runtime; as long as all nodes can read separated intents, all writes will write separated intents.

I might be simplifying this as I'm thinking of this from a storage lens, but if it really does come down to a race between the finalization of the cluster version and an intentDemuxWriter writing an interleaved intent as it only saw the old cluster version,, can we just remove the caching of cluster versions in the writer if it's older than SeparatedIntentsMigration and check the cluster setting on every write? This is expensive (and I don't even have an intuitive sense of how expensive it would be), but it would be a transient state. And even then it might not be airtight enough for our needs.

I'll poke around the code a bit more (thanks Sumeer for the pointers) and get back to this very shortly.


pkg/kv/kvserver/replica.go, line 1934 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Thanks! Is it worth adding a byte limit as well, in case we encounter large keys? That would allow us to optimistically use a higher intent count limit as well, which should help with throughput in the typical case of small keys.

Done. Would be good to get your thoughts on the two constants that drive it.


pkg/kv/kvserver/replica.go, line 1920 at r2 (raw file):

Previously, sumeerbhola wrote…

I think we need a benchmark for this migration.

Done.


pkg/kv/kvserver/replica.go, line 1903 at r3 (raw file):

Previously, sumeerbhola wrote…

I agree -- pebbleSnapshot does not reuse iterators under the cover.

Done.


pkg/kv/kvserver/replica.go, line 1921 at r5 (raw file):

Previously, sumeerbhola wrote…

I think we need to add support for some throttling here in case the scan starts affecting user-facing traffic. Just pausing the migration would be insufficient since we'd want to resume it eventually. This would allow an admin to adjust a cluster setting to slow down the work being done for each range.

For this, are you thinking of just a rate.NewLimiter keyed by a cluster setting that's, say, kv.separated_intent_migration.max_scan_rate ? We could occasionally fetch the cluster setting so that we pick up changes from the user, and reset the limiter if it has changed.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 147 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

At this point I would avoid trusting the stats, see this comment: https://reviewable.io/reviews/cockroachdb/cockroach/66268#-MblNrQB9zzcli3MQoaW. The stakes are too high here.

Doing the full scan unconditionally does make this migration a lot more expensive, but it also makes it more predictable - when it runs in our tests it will have largely the same cost that it will have in our customers' deployments. This is our first "true" long-running migration, and we need to test it out appropriately. This will be easier if it doesn't have a fast path.

Sounds good. Removing the stats-based fast path, thanks!


pkg/migration/migrations/separated_intents.go, line 158 at r2 (raw file):

Previously, ajwerner wrote…

Slow logging?

Done. Added a new status logging goroutine that runs every 5 seconds.


pkg/migration/migrations/separated_intents.go, line 175 at r3 (raw file):

Previously, ajwerner wrote…

Would it be cleaner if this was a separate migration at the next version? You'd get checkpointing of these two parts for free. Also, why no parallelism here?

Done. Lack of parallelism mimics other migrations - I imagine this will be a relatively fast invocation?

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 37 of 37 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @itsbilal, @sumeerbhola, and @tbg)


pkg/kv/kvserver/replica.go, line 1934 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done. Would be good to get your thoughts on the two constants that drive it.

Thanks! I think we can be more aggressive about this. For ClearRange, for example, we pull 1000 intents or 1MB at a time. The GC queue is being changed to pull 5000 intents or 1 MB at a time (but also limits the number of txns to push to 100).

I think something like 1000 intents and 1 MB would make sense here.

@itsbilal itsbilal force-pushed the separated-intents-migration branch from 36c0455 to 0da66e2 Compare June 24, 2021 20:33
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @itsbilal, @sumeerbhola, and @tbg)


pkg/kv/db.go, line 718 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

The first part is indeed provided by a cluster setting (when active, all nodes have been informed that all new operations must use separated intents, and need to ignore the cluster setting)

To clarify, when you say cluster setting, you're referring to the active cluster version, right? The second commit in this PR actually removes the cluster setting that even allows the toggling of separated intents at runtime; as long as all nodes can read separated intents, all writes will write separated intents.

I might be simplifying this as I'm thinking of this from a storage lens, but if it really does come down to a race between the finalization of the cluster version and an intentDemuxWriter writing an interleaved intent as it only saw the old cluster version,, can we just remove the caching of cluster versions in the writer if it's older than SeparatedIntentsMigration and check the cluster setting on every write? This is expensive (and I don't even have an intuitive sense of how expensive it would be), but it would be a transient state. And even then it might not be airtight enough for our needs.

I'll poke around the code a bit more (thanks Sumeer for the pointers) and get back to this very shortly.

Scratch what I said yesterday. I implemented something that's pretty similar to what Sumeer is describing - a WaitWithoutAcquiring that waits on conflicting latches without adding itself, and we've defined a read latch at infinity for MigrateLockTable so it'll block on all existing writes. I'm new to most of this code so please take a close look in case I missed something. Thanks!


pkg/kv/kvserver/replica.go, line 1934 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Thanks! I think we can be more aggressive about this. For ClearRange, for example, we pull 1000 intents or 1MB at a time. The GC queue is being changed to pull 5000 intents or 1 MB at a time (but also limits the number of txns to push to 100).

I think something like 1000 intents and 1 MB would make sense here.

Done. Thanks!

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 46 files at r35, 2 of 27 files at r36.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @nvanbenschoten, and @sumeerbhola)


pkg/storage/intent_reader_writer.go, line 59 at r36 (raw file):

		// Be resilient to the version not yet being initialized.
		idw.clusterVersionIsRecentEnoughCached = !idw.settings.Version.ActiveVersionOrEmpty(ctx).Less(
			clusterversion.ByKey(clusterversion.SeparatedIntents))

if the version is not initialized for some reason, clusterVersionIsRecentEnoughCached will say false, which is not distinguishable from the case where the cluster is running 20.2. Isn't there a risk that we will write interleaved intents? And could this happen at a node, after the migration has been run?

Can anyone be running a cluster with 21.2 nodes and 20.2 nodes? I thought that a mixed version cluster is only permitted to run 2 consecutive major versions (is that true, and do we enforce it?). If yes, should we remove all the code that uses the cluster version in deciding whether to write separated intents?

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 18 files at r33.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @nvanbenschoten, and @sumeerbhola)


pkg/storage/intent_reader_writer.go, line 59 at r36 (raw file):

Previously, sumeerbhola wrote…

if the version is not initialized for some reason, clusterVersionIsRecentEnoughCached will say false, which is not distinguishable from the case where the cluster is running 20.2. Isn't there a risk that we will write interleaved intents? And could this happen at a node, after the migration has been run?

Can anyone be running a cluster with 21.2 nodes and 20.2 nodes? I thought that a mixed version cluster is only permitted to run 2 consecutive major versions (is that true, and do we enforce it?). If yes, should we remove all the code that uses the cluster version in deciding whether to write separated intents?

Yes please to removing this cluster version key: #66544

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 18 files at r33.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @nvanbenschoten, and @sumeerbhola)


pkg/keys/keys.go, line 560 at r36 (raw file):

// Logically, the keys are arranged as follows:
//
// k1 /local/k1/KeyMin ... /local/k1/KeyMax k1\x00 /local/k1/x00/KeyMin ...

I don't understand this example. What is KeyMin here? Is it roachpb.KeyMin which is the empty string. I am assuming / represents the \x00\x01 separator added at the end by EncodeBytesAscending which is what we call in MakeRangeKeyPrefix.

I guess what I am struggling with is "range-local keys exist conceptually in the space between regular keys". And more specifically the range local keys are to the right of the regular key and that is because of the suffix in these range local keys, yes?

And if my original range is [k0, k1) and I am calling this with /local/k1/KeyMax, why would it not be always fine to return k1 (and not k1\x00)? I couldn't quite figure out based on the callers but I am guessing this is because the caller is later converting the returned RKey back to a range-local key and using it to do an engine scan. But then the upper bound returned here is not necessarily tight. Consider:
k1 /local/k1/KeyMin /local/k1/bar /local/k1/foo /local/k1/KeyMax
and we call this function with /local/k1/bar, and get back k1\x00. If we later convert this back to /local/k1\x00/ and do a SeekLT we will pickup /local/k1/foo which was outside the original upper bound. What am I missing?

It would be useful to specify in a code comment the reason for existence of this AddrUpperBound and if the existing use case is different from the new one, perhaps we should have a separate function rather than reuse the same.

@itsbilal itsbilal force-pushed the separated-intents-migration branch from 4eb9d28 to 8263f6e Compare August 11, 2021 20:29
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @sumeerbhola from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @nvanbenschoten, and @sumeerbhola)


pkg/keys/keys.go, line 560 at r36 (raw file):

Previously, sumeerbhola wrote…

I don't understand this example. What is KeyMin here? Is it roachpb.KeyMin which is the empty string. I am assuming / represents the \x00\x01 separator added at the end by EncodeBytesAscending which is what we call in MakeRangeKeyPrefix.

I guess what I am struggling with is "range-local keys exist conceptually in the space between regular keys". And more specifically the range local keys are to the right of the regular key and that is because of the suffix in these range local keys, yes?

And if my original range is [k0, k1) and I am calling this with /local/k1/KeyMax, why would it not be always fine to return k1 (and not k1\x00)? I couldn't quite figure out based on the callers but I am guessing this is because the caller is later converting the returned RKey back to a range-local key and using it to do an engine scan. But then the upper bound returned here is not necessarily tight. Consider:
k1 /local/k1/KeyMin /local/k1/bar /local/k1/foo /local/k1/KeyMax
and we call this function with /local/k1/bar, and get back k1\x00. If we later convert this back to /local/k1\x00/ and do a SeekLT we will pickup /local/k1/foo which was outside the original upper bound. What am I missing?

It would be useful to specify in a code comment the reason for existence of this AddrUpperBound and if the existing use case is different from the new one, perhaps we should have a separate function rather than reuse the same.

The key point is that AddrUpperBound isn't being used to mutate the request's Header (which has raw Keys, not RKeys in it). So no, the caller isn't converting the RKey back to a range-local key; that transformation is inherently impossible to get right. The RKey is only being used to look up the range descriptor and to evict it from the cache if necessary.

The issue I was facing was that if you address a request to the range-local bounds of a range, this method used to mutate the upper bound's RKey just enough to trip up an assertion in the range descriptor cache which would think that the request was out of bounds of the range it was trying to send it to.


pkg/storage/intent_reader_writer.go, line 59 at r36 (raw file):

Previously, ajwerner wrote…

Yes please to removing this cluster version key: #66544

Done. Removed all the code to handle pre-SeparatedIntents versions. It's all in the third commit.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting very close. I'm impressed by the stamina.

Reviewed 2 of 18 files at r33, 43 of 46 files at r38, 37 of 37 files at r39.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @nvanbenschoten, and @sumeerbhola)


pkg/roachpb/api.go, line 276 at r39 (raw file):

// combine() allows responses from individual ranges to be aggregated
// into a single one.
type combinable interface {

Do we need to make both of our new requests combinable? What would happen if a range split below a BarrierRequest? We'd want to learn about the maximum time of the barrier, right? And then what would happen if a range split below a ScanInterleavedIntentsRequest? We'd want to combine the intents, right?

Should we find a way to test these cases?

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good! just some minor comments,

Reviewed 1 of 18 files at r33, 40 of 46 files at r38, 31 of 37 files at r39.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/batcheval/cmd_barrier.go, line 34 at r39 (raw file):

	// Barrier is special-cased in the concurrency manager to *not* actually
	// grab these latches. Instead, any conflicting latches with these are waited
	// on, but new latches aren't inserted.

Can you add to the comment here
// This will conflict with all writes and reads over the span, regardless of timestamp. Note that this guarantees that concurrent writes will be flushed out before this request is evaluated, but there is no guarantee regarding flushing out of concurrent reads since they could be getting evaluated on a follower. We don't currently need any guarantees regarding concurrent reads, so this is acceptable.


pkg/kv/kvserver/concurrency/testdata/concurrency_manager/barrier, line 167 at r39 (raw file):


# -------------------------------------------------------------
# Barrier requests do not block future writes or reads.

do not block any reads or writes, yes?
maybe add both past and future ones below.


pkg/kv/kvserver/spanlatch/manager_test.go, line 596 at r39 (raw file):

	lg1 := m.AcquireOptimistic(spans("d", "f", write, zeroTS))
	require.True(t, m.CheckOptimisticNoConflicts(lg1, spans("d", "f", write, zeroTS)))
	lg1, err := m.WaitUntilAcquired(context.Background(), lg1)

can't these 3 calls be replaced by m.Acquire()?


pkg/migration/migrations/migrations.go, line 72 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

This will end up in the migrations system table, as that tracks completion.

Will #66268 need to read from the system table in order to enable the fast path for ranged intent resolution?


pkg/migration/migrations/separated_intents.go, line 235 at r9 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

It's the same code as all the other migrations under this framework - each batch is a batch of range descriptors passed in by IterateRangeDescrriptors. We don't know the final batch count until the very end, it's why we substitute ?? for the denominator until the migration finishes. Should I add a comment about this?

No need for a code comment.


pkg/migration/migrations/separated_intents.go, line 335 at r39 (raw file):

					fmt.Fprintf(&ranges, ", ")
				}
				fmt.Fprintf(&ranges, "%s", roachpb.RangeID(rangeID))

Is it easy to find historical range ids and figure out what key range they corresponded to? I am wondering if we see a problem and the range id no longer exists due to split/merge whether we will easily be able to figure out which key range to look at.


pkg/migration/migrations/separated_intents.go, line 341 at r39 (raw file):

				log.Infof(ctx, "currently migrating lock table on ranges %s", ranges.String())
			}

can we also log how many ranges have been successfully migrated?de


pkg/migration/migrations/testdata/separated_intents, line 1 at r39 (raw file):


nice test!


pkg/migration/migrations/testdata/separated_intents, line 143 at r39 (raw file):

----

# Case where one request is fired per intent. There are 15 keys across 2 ranges,

one scanInterleavedIntents request ...

@itsbilal itsbilal force-pushed the separated-intents-migration branch from 8263f6e to d613fce Compare August 12, 2021 16:00
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Dismissed @sumeerbhola from 5 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/batcheval/cmd_barrier.go, line 34 at r39 (raw file):

Previously, sumeerbhola wrote…

Can you add to the comment here
// This will conflict with all writes and reads over the span, regardless of timestamp. Note that this guarantees that concurrent writes will be flushed out before this request is evaluated, but there is no guarantee regarding flushing out of concurrent reads since they could be getting evaluated on a follower. We don't currently need any guarantees regarding concurrent reads, so this is acceptable.

Done.


pkg/kv/kvserver/concurrency/testdata/concurrency_manager/barrier, line 167 at r39 (raw file):

Previously, sumeerbhola wrote…

do not block any reads or writes, yes?
maybe add both past and future ones below.

Done.


pkg/kv/kvserver/spanlatch/manager_test.go, line 596 at r39 (raw file):

Previously, sumeerbhola wrote…

can't these 3 calls be replaced by m.Acquire()?

Yes, but that skips the CheckOptimisticNoConflicts check. Just seems safer / follows the same convention as other tests which I copied this from.


pkg/migration/migrations/migrations.go, line 72 at r2 (raw file):

Previously, sumeerbhola wrote…

Will #66268 need to read from the system table in order to enable the fast path for ranged intent resolution?

I believe it can just look at the cluster version. The cluster version won't move to PostSeparatedIntentMigration until the migration is complete. At least that's how it appears in the tests. Maybe @ajwerner can add more clarity.


pkg/migration/migrations/separated_intents.go, line 335 at r39 (raw file):

Previously, sumeerbhola wrote…

Is it easy to find historical range ids and figure out what key range they corresponded to? I am wondering if we see a problem and the range id no longer exists due to split/merge whether we will easily be able to figure out which key range to look at.

Not sure about this. Maybe we could also log range boundaries around here.


pkg/migration/migrations/separated_intents.go, line 341 at r39 (raw file):

Previously, sumeerbhola wrote…

can we also log how many ranges have been successfully migrated?de

Added.


pkg/migration/migrations/testdata/separated_intents, line 1 at r39 (raw file):

Previously, sumeerbhola wrote…

nice test!

thanks!


pkg/migration/migrations/testdata/separated_intents, line 143 at r39 (raw file):

Previously, sumeerbhola wrote…

one scanInterleavedIntents request ...

Done.


pkg/roachpb/api.go, line 276 at r39 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need to make both of our new requests combinable? What would happen if a range split below a BarrierRequest? We'd want to learn about the maximum time of the barrier, right? And then what would happen if a range split below a ScanInterleavedIntentsRequest? We'd want to combine the intents, right?

Should we find a way to test these cases?

Good point. I made both requests combinable. We probably want to make this testable, but I also don't wanna delay this PR more than necessary - can I punt split/combine testing to a follow-up PR?

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 43 of 46 files at r41, 42 of 42 files at r42.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @nvanbenschoten, and @sumeerbhola)


pkg/roachpb/api.go, line 276 at r39 (raw file):

can I punt split/combine testing to a follow-up PR?

Yes, absolutely.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 46 files at r29, 8 of 46 files at r41.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/spanlatch/manager_test.go, line 596 at r39 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Yes, but that skips the CheckOptimisticNoConflicts check. Just seems safer / follows the same convention as other tests which I copied this from.

Those 3 calls are roughly equivalent to calling Acquire, and there shouldn't be any possibility of conflict since this is the first thing the test does. You don't have to change it -- this is a nit.


pkg/migration/migrations/separated_intents.go, line 113 at r2 (raw file):

Previously, ajwerner wrote…

It'll be up to the caller of the checkpoint code to indicate % completion. The checkpoint state itself can be whatever, though I think completed spans is likely the most natural choice. I'll ready the plumbing for a subsequent PR to adopt.

Is there checkpointing now, or is that a TODO for later?

@itsbilal itsbilal force-pushed the separated-intents-migration branch from d613fce to e5b3f76 Compare August 12, 2021 18:28
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for all the reviews! My last push addresses the case where we were skipping range-local keys of timeseries ranges. Added a test for it.

A couple outstanding TODOs for future PRs:

  1. Checkpointing of migration state
  2. Better end-to-end testing (similar to TestSeparatedIntentsMigration) but of migration of range-local keys
  3. Better testing of range splits under Barrier/ScanInterleavedIntents requests.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @ajwerner, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/spanlatch/manager_test.go, line 596 at r39 (raw file):

Previously, sumeerbhola wrote…

Those 3 calls are roughly equivalent to calling Acquire, and there shouldn't be any possibility of conflict since this is the first thing the test does. You don't have to change it -- this is a nit.

Done.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @itsbilal, @nvanbenschoten, and @sumeerbhola)


pkg/migration/migrations/migrations.go, line 72 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I believe it can just look at the cluster version. The cluster version won't move to PostSeparatedIntentMigration until the migration is complete. At least that's how it appears in the tests. Maybe @ajwerner can add more clarity.

The version is stored in memory. It's just an atomic read to check for activation. The upgrade protocol makes RPCs to all of the nodes to make sure that they have both persisted the new version to disk and updated their in-memory state. All new nodes will also see the new version.


pkg/migration/migrations/separated_intents.go, line 113 at r2 (raw file):

Previously, sumeerbhola wrote…

Is there checkpointing now, or is that a TODO for later?

Feel free to reach out for guidance on hooking this up.


pkg/migration/migrations/separated_intents.go, line 263 at r31 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

You mean a total count of all ranges in the channels as well as being processed right now? The processed count will always be <=4 as that's the concurrency limit, and it should be pretty easy to observe that by looking at the list. To account for the queue I'd have to add an atomic, but it's doable.

No, the total that have been processed ever by this job instance.


pkg/migration/migrations/separated_intents.go, line 335 at r39 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Not sure about this. Maybe we could also log range boundaries around here.

Every time we make any replication change, we write that change out to system.eventlog as part of the same transaction. In that way we have a history. We retain history for 30 days.

@itsbilal itsbilal force-pushed the separated-intents-migration branch from e5b3f76 to 0a000dc Compare August 12, 2021 20:20
@itsbilal
Copy link
Contributor Author

Merging this before any more conflicts. Thanks all!

bors r=nvanbenschoten

@craig
Copy link
Contributor

craig bot commented Aug 12, 2021

Merge conflict.

Currently, AddrUpperBound calls Next() on the upper bound
RKey if the Key corresponding to it was a local key.
This is to try to include all local keys under the last
global key that's in the range. However if the local
key was the corresponding range key prefix, all local
keys will be excluded in the range, and so will be
the matching global key, so special-case that here
and not call Next().

Release note: None.
Adds a long running migration, SeparatedIntentsMigration,
that does the following steps:

1) Iterate through all range descriptors using a RangeIterator
2) Calls a new ranged write command, Barrier, that is a no-op
  during evaluation but waits on any other writes on that range
  to finish before returning
3) Call ScanInterleavedIntents, a new ranged read command, to
  return interleaved intents encountered over key range
4) Iterate over returned intents then push all those txns and
  resolve any intents returned.

The barrier in step 2 ensures that no future writes after that
point will write interleaved intents. The actual disabling
of the setting to write interleaved intents happens in the
next commit.

Part of cockroachdb#41720.

Release note: None.
This change moves the SeparatedIntentsEnabled setting to
a testing knob, away from the cluster setting. It also
removes all logic to handle pre-SeparatedIntents cluster
versions, as that was part of the 20.2 release and 21.2
does not need to accommodate that distinction. We always
assume we're in a new-enough cluster version that can read
separated intents.

The only way interleaved intents can
be written now is when the `DisableSeparatedIntents` testing
knob is true.

Release note: None.
@itsbilal itsbilal force-pushed the separated-intents-migration branch from 0a000dc to e780937 Compare August 12, 2021 23:25
@itsbilal
Copy link
Contributor Author

🙏

bors r=nvanbenschoten

@craig
Copy link
Contributor

craig bot commented Aug 13, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants