Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: introduce an "ignore list" for seqnums in MVCC reads #41612

Closed
knz opened this issue Oct 15, 2019 · 22 comments · Fixed by #42582
Closed

storage: introduce an "ignore list" for seqnums in MVCC reads #41612

knz opened this issue Oct 15, 2019 · 22 comments · Fixed by #42582
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@knz
Copy link
Contributor

knz commented Oct 15, 2019

Required for SQL savepoints, as discussed in #41569.
To support partial txn rollbacks (savepoint rollbacks) we need to skip over values written in the past that are associated with rolled back seqnums.

Today, the MVCC read logic is already equipped with logic to skip all values written after a specific seqnum stored in the meta txn proto (Sequence).

We want to extend the read logic to also skip over values written at seqnums part of an "ignore list":

  • the txn meta proto is extended with a new field "ignore list". This contains a set (ordered if need be) of seqnum ranges to skip over
  • during MVCC reads, when looking up a value "at" a particular seqnum, we extract the list of all writes before that seqnum, then, to find the most recent write, we iterate over that list in reverse order and stop at the first write that has a seqnum not part of the ignore list.
  • during intent resolution the same logic should be applied.

This logic should be available for both the rocksdb and pebble engines.

@knz knz added A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Oct 15, 2019
@knz
Copy link
Contributor Author

knz commented Oct 15, 2019

cc @andreimatei @nvanbenschoten

@knz
Copy link
Contributor Author

knz commented Oct 15, 2019

Discussed with @petermattis - since @itsbilal is currently rewriting the mvcc code he's the best person to look at this (or, conversely it would be disruptive to his work for anyone else to look at it)

@knz
Copy link
Contributor Author

knz commented Oct 15, 2019

The ideal timeline would be to enable early testing of these semantics before the end of november. Please reach out to @andreimatei or myself to discuss milestones.

@petermattis
Copy link
Collaborator

Note that the the seqnums being talked about here are TxnMeta.Sequence. These are distinct from the RocksDB and Pebble seqnums.

@knz
Copy link
Contributor Author

knz commented Oct 15, 2019

Good point. Bilal and I were just understanding that by looking at the code.

@knz
Copy link
Contributor Author

knz commented Oct 30, 2019

Discussing implementation with @nvanbenschoten . The code currently uses std::upper_bound. This won't do as-is because we need to exclude the intents from the ignored range.

First idea was to apply upper_bound on top of a filtered sub-set of the intents.

Nathan suggests iterated binary search:

Search for the latest sequence number, if it's in an excluded range, search again up to that iterator for the starting seq num of that excluded range, etc.

@knz
Copy link
Contributor Author

knz commented Oct 30, 2019

Discussion with Nathan and @petermattis:

Looking at c++ code generated from protobufs.
For example we have a repeated SequencedIntent intent_history that currently gets compiled to RepeatedPtrField<MVCCMetadata_SequencedIntent> which contains many instances of MVCCMetadata_SequencedIntent*.
I'd imagine we could use a RepeatedField<> instead (something more like std::vector<MVCCMetadata_SequencedIntent> ). Why the use of RepeatedPtrField and is that something we can / want to optimize?
(This becomes relevant with savepoint rollbacks, and the introduction of an IgnoredSeqNums array which will be an array of start, end bounds)

Peter's answer:

I believe that can be optimized. Not sure if it is worthwhile right now given the focus on Pebble.
I think the effect would be pretty small unless the intent history starts getting really large.

my opinion (knz): even if there's just one entry, that's 2 heap allocations instead of just 1, for every single intent in the system.

@petermattis
Copy link
Collaborator

my opinion (knz): even if there's just one entry, that's 2 heap allocations instead of just 1, for every single intent in the system.

If you want to reduce heap allocations in the C++ proto code, I believe the right answer is to use arenas. I'll be moderately surprised if this turns out to be worthwhile.

@knz
Copy link
Contributor Author

knz commented Nov 4, 2019

Investigating this further: I'm hitting a snag (perhaps two).

First finding is that there are two copies of the TxnMeta object I should probably care about:

  • one is stored inside the intent's MVCCMetadata.
  • another is available to the caller of MVCC scans/gets, and passed to DBIterator via a simplified DBTxn.

Now there are two problems.

  1. for MVCC scans/gets. To deal with savepoint rollbacks, we need to provide the latest (up to date) list of ignored seqnum ranges to the MVCC reader. This means that the ignore list must be passed to the mvccScanner via DBTxn - we can't pick it up from the just-decoded MVCCMetadata (which, on the read path, may be out of date).

    But then, what of the cost of providing this slice to C++ alongside the rest on every instantiation of a DBIterator? Isn't that the kind of overhead that DBTxn was meant to avoid?

  2. for intent resolution. I understand (but I am not sure) that something needs to store the latest ignore list somewhere so that intent resolution / cleanup can find it.

    ISTM that we're not going to go and populate the latest ignore list on the MVCCMetadata of every intent ever laid down by the txn. But then, where else?

    (My uncertainty stems from the fact I haven't yet found the code responsible for cleaning up intents.)

@petermattis
Copy link
Collaborator

First finding is that there are two copies of the TxnMeta object I should probably care about:

This matches my understanding.

But then, what of the cost of providing this slice to C++ alongside the rest on every instantiation of a DBIterator? Isn't that the kind of overhead that DBTxn was meant to avoid?

This slice will be empty for most scans, right? I'd hope the overhead of passing is non-existent if the slice is empty. Note that passing a read-only slice of primitive values to C++ has near zero overhead as the C++ side can read the Go memory. The challenge is if the slice has objects which contain pointers.

(My uncertainty stems from the fact I haven't yet found the code responsible for cleaning up intents.)

Is MVCCResolveWriteIntent what you're looking for?

@nvanbenschoten
Copy link
Member

This means that the ignore list must be passed to the mvccScanner via DBTxn - we can't pick it up from the just-decoded MVCCMetadata (which, on the read path, may be out of date).

This is correct. The txn in the MVCCMetadata won't be used here.

This slice will be empty for most scans, right?

+1

Is MVCCResolveWriteIntent what you're looking for?

Yes, and note that the ignore list will be provided in Intent.Txn.

@knz
Copy link
Contributor Author

knz commented Nov 4, 2019

Yes, and note that the ignore list will be provided in Intent.Txn.

Is that up to date with the latest txn record at that point?

@petermattis
Copy link
Collaborator

Is that up to date with the latest txn record at that point?

Probably not. Intent.Txn was written at some prior time. The actual txn record for the txn which laid down the intent might be different. I might be misunderstanding your question.

@knz
Copy link
Contributor Author

knz commented Nov 4, 2019

MVCCResolveWriteIntent needs to forget any value that was written at one of the ignored seqnums.

If the latest list of ignored seqnums is not available, or not up-to-date, this function may erroneously preserve a written value that should have been rolled back via a savepoint rollback.

In other words, intent resolution needs to operate on the latest list of ignored seqnums. I do not know (yet) how to guarantee this.

@nvanbenschoten
Copy link
Member

Intent is the parameter passed to MVCCResolveWriteIntent, not what's read from RocksDB. It's always up-to-date with the state of the transaction.

@knz
Copy link
Contributor Author

knz commented Nov 4, 2019

thanks that was helpful

@knz
Copy link
Contributor Author

knz commented Nov 4, 2019

Note that passing a read-only slice of primitive values to C++ has near zero overhead as the C++ side can read the Go memory.

I see this to be intuitively true, but there's some plumbing I need to figure out.

  1. in the code I only see this being done with byte arrays, not with arrays of other primitive types.
  2. what of read-only slices of protobuf-defined structs containing only primitive values. I assume it should be fine then too, but again I don't see examples yet.

What I'm going to do now is cobble something together that will look ugly and misguided to your educated eyes, and then you're going to tell me how to fix it.

@petermattis
Copy link
Collaborator

what of read-only slices of protobuf-defined structs containing only primitive values. I assume it should be fine then too, but again I don't see examples yet.

I'm not sure if there are any examples of this, but there shouldn't be a problem with doing so. You basically will pass a pointer and a length.

What I'm going to do now is cobble something together that will look ugly and misguided to your educated eyes, and then you're going to tell me how to fix it.

Ha! Perfect! That works for me.

@knz
Copy link
Contributor Author

knz commented Nov 4, 2019

done #42152

@knz
Copy link
Contributor Author

knz commented Nov 11, 2019

(@nvanbenschoten can you check the following)

Coming back to the work specification: the RFC also calls for detaching the read sequence from the write sequence.

Initially, I thought that this req was only a matter of the TxnCoordSender populating the sequence field differently for read and write requests. However, that was mistaken. Both seqnums are needed for every write request and handled in MVCC:

  • obviously for CPut, which does both
  • also for any write request that populates the intent history: the value in the intent history must be the previous value as per the read seqnum, not the current write seqnum.

Additionally, to ensure replayability (idempotence) the following invariant must hold: every write request at a given write seqnum must always be issued with the same read seqnum. This won't be a problem in TxnCoordSender but it must become part of the API spec.

@nvanbenschoten
Copy link
Member

obviously for CPut, which does both

Is considering the CPut's read sequence to be 1 less than its write sequence insufficient? This should be what we do today. I can't imagine a case where we'd want to do anything other than this.

also for any write request that populates the intent history: the value in the intent history must be the previous value as per the read seqnum, not the current write seqnum.

Why is this? I'd expect it to be the previous value as per that value's write seq num (as it is today).

As is, I'm still not convinced that "the RFC also calls for detaching the read sequence from the write sequence" is true.

@knz
Copy link
Contributor Author

knz commented Nov 11, 2019

As is, I'm still not convinced that "the RFC also calls for detaching the read sequence from the write sequence" is true.

Let's hope you're right. I would also prefer this to be true as it is simpler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
4 participants