
storage: Performance degradation caused by kv tombstones #17229

Open
a-robinson opened this issue Jul 26, 2017 · 66 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. docs-todo S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. T-storage Storage Team
Comments

@a-robinson
Contributor

a-robinson commented Jul 26, 2017

If a workload deletes or updates rows frequently and also scans over those rows, the tombstones left behind by our MVCC cause the scans to get slower and slower over time. This is surprising to users and can render workloads that are perfectly reasonable on a new database impractically slow on a database that's been under load for 24 hours.

This was hit by a user whose workload involved two table scans in a three-way join in #17075, and has received some previous discussion on our forum. Consider this a tracking issue for the user-visible problem.
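
To make the mechanism concrete: an MVCC scan has to step over every version and deletion tombstone in its key span even though it only returns the live rows, so scan cost tracks total versions rather than visible rows. The toy model below (plain Go, every name invented for illustration; this is not CockroachDB code) shows a queue-style insert-then-delete workload whose table is logically empty yet still forces a scan over every tombstone.

```go
package main

import "fmt"

// versionedKV is a toy stand-in for an MVCC key/value pair: a user key,
// a timestamp, and whether this version is a deletion tombstone.
type versionedKV struct {
    key     string
    ts      int
    deleted bool
}

// scanLive walks every version (keys ascending, timestamps descending per
// key, as a real MVCC iterator would) and returns the number of live rows
// plus the number of key/value pairs it had to step over to find them.
func scanLive(kvs []versionedKV) (liveRows, stepped int) {
    seen := map[string]bool{}
    for _, kv := range kvs {
        stepped++
        if seen[kv.key] {
            continue // older version of a key we already resolved
        }
        seen[kv.key] = true
        if !kv.deleted {
            liveRows++
        }
    }
    return liveRows, stepped
}

func main() {
    // A queue-like workload: each key is written once and deleted shortly after.
    var kvs []versionedKV
    for i := 0; i < 10000; i++ {
        k := fmt.Sprintf("item-%05d", i)
        kvs = append(kvs, versionedKV{key: k, ts: 2, deleted: true}) // tombstone (newest)
        kvs = append(kvs, versionedKV{key: k, ts: 1})                // original insert
    }
    live, stepped := scanLive(kvs)
    fmt.Printf("live rows: %d, keys iterated: %d\n", live, stepped) // 0 live, 20000 iterated
}
```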

Jira issue: CRDB-6045

@tbg
Member

tbg commented Jul 26, 2017

#16252 as well, and I remember there was another instance (but I can't find it now). Suffice it to say it's a relatively frequent problem.

@petermattis
Collaborator

An idea mentioned by @tschottdorf is to move historical data to a separate column family. Concretely, we could have a background process that periodically moved old versions and deletion tombstones to a separate column family. This could be done independently on each replica as the logical data would remain unchanged. Each replica would maintain a timestamp that would control whether the separate column family needs to be accessed or not.

@bdarnell bdarnell added this to the 1.3 milestone Sep 8, 2017
@bdarnell
Contributor

bdarnell commented Sep 8, 2017

I'm tagging this issue for 1.3 because it keeps coming up and I want to make sure it gets into the next prioritization cycle. (If we could squeeze one more thing into 1.2, this would be a candidate)

I'd also like to propose index-level zone configs as a potential solution for this. Normally, the solution to slow reads is to add an index, but that doesn't work here because the index itself is getting filled up with garbage. An index with lower GC retention would work. We'd just need to ensure that the query planner takes this into account and won't use an index that may have been GC'd at the query's timestamp.
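
To illustrate that planner constraint (a hypothetical sketch; none of these names exist in the codebase): an index with an aggressively low GC TTL can only serve reads at timestamps still above its GC threshold, so the planner would have to skip it for older historical reads.

```go
package main

import "fmt"

// indexMeta is an illustrative stand-in for what the planner would need to
// know about an index that has its own, shorter GC TTL.
type indexMeta struct {
    name          string
    gcThresholdTS int // data below this timestamp may already be GC'd
}

// usableAt reports whether the index can serve a read at readTS. An index
// with aggressive GC must be skipped for historical (AS OF SYSTEM TIME)
// reads older than its threshold.
func (idx indexMeta) usableAt(readTS int) bool { return readTS >= idx.gcThresholdTS }

func main() {
    idx := indexMeta{name: "t@queue_idx", gcThresholdTS: 100}
    fmt.Println(idx.usableAt(150)) // true: current reads can use the index
    fmt.Println(idx.usableAt(50))  // false: planner must fall back to another index
}
```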

@petermattis
Collaborator

There might be other optimizations possible here. Index-level zone configs could possibly fall out of the partitioning work scheduled for 1.2.

Cc @danhhz and @benesch

@benesch
Contributor

benesch commented Sep 8, 2017

As part of the partitioning work, I expect we'll update our usage of zone configs internally to be based on byte ranges instead of database/table IDs. Targeting indexes falls out naturally from that work. I imagine the hard part is actually the user interface; the echo 'SOME YAML JUNK' | ./cockroach zone -f - CLI gets clunkier and clunkier with every knob we add.

I heard @BramGruneir might have a zone config redesign in the pipeline?

@petermattis
Collaborator

./config zone set -f - myTable@myIndex seems natural.

@BramGruneir is interested in providing SQL statements to set and get zone configs. I'd imagine this would be some sort of ALTER TABLE syntax.

@benesch
Contributor

benesch commented Sep 8, 2017

What if your index is partitioned, though? Then you might end up with

$ ./config zone set -f - db.tbl@idx1.partition1

and there'd be four potential zone configs that could take effect (db, tbl, idx1, partition1). We'll need to decide if they inherit or not. If they don't inherit it's simpler to understand what's happening—the first zone config that exists is used verbatim—but without inheritance updating a table-wide setting (say, replication factor) requires updating every partition zone config for every index. That could mean manually running several dozen cockroach zone commands that install zone configs that only differ by a few bytes.

Anyway, my point is that if partitioning doesn't motivate us to redesign the zone config interface, then partitioning + index zone configs should definitely motivate us.

@petermattis
Collaborator

but without inheritance updating a table-wide setting (say, replication factor) requires updating every partition zone config for every index. That could mean manually running several dozen cockroach zone commands that install zone configs that only differ by a few bytes.

cockroach zone set could be extended to set a specific field (e.g. the replication factor) on all zones matching a pattern. But I see your point that some rethinking is likely in order here: we'll either need to extend the tools for manipulating zones or extend the zone config semantics.

@petermattis petermattis added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Dec 4, 2017
@petermattis
Collaborator

@spencerkimball This was the issue I mentioned earlier.

@vivekmenezes
Contributor

It looks like SELECT COUNT(*) can take 1400ms on an empty table, even after lowering the TTL and waiting several hours past it.

@spencerkimball
Member

spencerkimball commented Dec 7, 2017 via email

@bdarnell
Contributor

In the report of 1400ms for an empty table (or nearly empty: the stats I have show 106KB of live data), there were 25 ranges (with all the live data in the last one). 17 of them were empty; the other 8 were awaiting GC with 32MB of non-live data.

So I think in this case, the problem is not a bunch of empty ranges, it's that the GC queue appears to have fallen behind and hasn't deleted all this data that is past its TTL. We've made some improvements here recently (this test was using version 1.1.3; at least some of the GC queue improvements will be in 1.1.4).

In the longer term, though, the ever-growing number of ranges due to the lack of merging is the bigger problem for this workload. Insertions to this table (with a unique_id primary key) are producing splits that are five minutes apart.

@petermattis
Collaborator

We've also encountered a workload recently where a small set of rows is updated very frequently leaving thousands or tens of thousands of versions per key per day. Any scan of that table is going to have bad performance. Setting a lower GC TTL will help, but not completely mitigate the problem. And we can't set TTL too low without affecting backups (i.e. backups need to happen more frequently than the TTL).

I've been thinking more about using a separate column family for old versions. Rather than having a queue which moves old versions to a separate column family, we could do this maintenance incrementally during MVCC operations. For instance, if an MVCC put creates a new version, we could do a quick check to see how many versions exist and, if there are more than N (e.g. 5 or 10), move a chunk of versions to the separate column family. The MVCC read operations would need a separate iterator for this column family, but I think this could be structured such that it is only accessed if you're doing a historical read.
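
A rough sketch of that write-path heuristic, under assumed names (this is not the real MVCCPut code, and the in-memory maps stand in for the engine): after writing a new version, check how many versions the key has accumulated and, past a threshold N, move the oldest chunk into an archive keyspace.

```go
package main

import "fmt"

const maxActiveVersions = 5 // the threshold N from the comment; tunable

// store is a toy engine keyed by user key, holding version timestamps newest-first.
type store struct {
    active  map[string][]int // versions in the active keyspace
    archive map[string][]int // versions moved to the archive keyspace
}

// put writes a new version and, if the key has accumulated more than
// maxActiveVersions, shifts the excess oldest versions into the archive.
func (s *store) put(key string, ts int) {
    s.active[key] = append([]int{ts}, s.active[key]...)
    if n := len(s.active[key]); n > maxActiveVersions {
        // Move the oldest versions in one chunk to amortize the cost.
        old := s.active[key][maxActiveVersions:]
        s.active[key] = s.active[key][:maxActiveVersions]
        s.archive[key] = append(old, s.archive[key]...)
    }
}

func main() {
    s := &store{active: map[string][]int{}, archive: map[string][]int{}}
    for ts := 1; ts <= 12; ts++ {
        s.put("hot-row", ts)
    }
    fmt.Println("active:", s.active["hot-row"])   // newest 5 versions
    fmt.Println("archive:", s.archive["hot-row"]) // everything older
}
```

The per-key threshold keeps the common case (few versions) free of extra work, which is the amortization argument above.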

@tbg
Member

tbg commented Dec 11, 2017

@petermattis I have thought about this too, but was worried about the efficiency of this approach. It's the natural way to implement flexible GC policies, such as "keep only X versions" (however useful that is in practice I don't know, but it's essentially the semantics you're suggesting) or the familiar "keep versions needed for reads at NOW()-TTLSeconds". This latter one may have to deal with lots of versions still, so we'd likely have to store (or at least cache, and that will often thrash) the number of versions for every key.

One special case is that in which we only keep the latest version (i.e. no time travel, or a TTL of zero). In that case, get{,Reverse}ScanMeta sees all the relevant information already. I think that's worthy of a prototype implementation because an interesting issue that needs to be addressed is that the GCThreshold would naively have to be updated with every write, and that creates a single bottleneck. However, these updates commute, so the bottleneck can actually be avoided, and we're already shipping GCThreshold as a side-effect through Raft (so that the only change here would be that we synthesize the write downstream of Raft instead of putting it into the batch).
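
To make the commutativity argument concrete, here is a minimal sketch (toy types, not the actual GCThreshold plumbing): because the threshold only ever ratchets forward, replicas can fold in bumps in whatever order they arrive and still converge, which is what allows synthesizing the write downstream of Raft without a serialization bottleneck.

```go
package main

import "fmt"

// forwardThreshold models a GC threshold that only ratchets forward.
// Applying a set of bumps in any order yields the same final value.
type forwardThreshold struct{ ts int }

func (t *forwardThreshold) bump(ts int) {
    if ts > t.ts {
        t.ts = ts
    }
}

func main() {
    a, b := &forwardThreshold{}, &forwardThreshold{}
    for _, ts := range []int{3, 7, 5} {
        a.bump(ts)
    }
    for _, ts := range []int{5, 3, 7} { // same bumps, different order
        b.bump(ts)
    }
    fmt.Println(a.ts == b.ts, a.ts) // true 7
}
```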

A step up from that is moving into an old column family and multiplexing reads accordingly (which isn't completely trivial).

It gets really interesting when you throw #16132 into the mix -- you could have two column families that get rotated every TTL, and get to throw away the older one wholesale every time. After that, the GC queue would mostly be in charge of pushing old transactions and resolving their intents.

@petermattis
Collaborator

@tschottdorf We'd definitely have to benchmark the efficiency of using two column families. On the surface, it would primarily affect updates. An update that was previously the addition of a single new key would translate into the addition of a new key, the deletion of a historic version, and the insertion of the historic version in the secondary column family. With a little sophistication, we could do this in batches to amortize the cost. Reads become interesting, but I'm assuming the common case is reading the most recent timestamp.

Trying to throw #16132 into the mix seems challenging. I don't see any obvious way to achieve that, as you can't just rotate the current column family to be the historic one when it still contains the most recent values.

@tbg tbg self-assigned this Jan 12, 2018
@tbg
Member

tbg commented Jan 12, 2018

Tentatively self-assigning for some prototyping. Concretely, I want to not even bother with column families right now, but it should be relatively straightforward to make MVCCPut rewrite older keys (which match some criterion) into a hidden parallel keyspace that MVCCScan is aware of (and I just reviewed some PRs on that, so I should manage).

That should give us the correct perf numbers for the (common) read best case in which the archive keyspace is never accessed. For write workloads that need to move to the archive and reads that need to access it, I see no reason to believe that column families will make things faster, and so the benchmark numbers there should be useful as well.

I'm not sure that we need column families unless we want to configure the archive differently or drop whole column families quickly.

@petermattis, any obvious flaws in the above plan?

@tbg
Member

tbg commented Jan 13, 2018

Two more observations (mostly out of scope for first prototype):

  1. triggering a rewrite of old versions of a key when this key is written is the easy case. The harder case is when there is a "minefield" of tombstones as the queue workload tends to create. There, none of the writes will trigger a rewrite, so it has to be done on the scans. Keeping track of the number of versions eligible for a rewrite shouldn't be hard, as we're forced to visit them anyway (that's the main problem). Then we should be able to rewrite these locally (as long as it's atomic or at least in the right order), without going through Raft or serializing with in-flight writes, because old versions never change (assuming we teach the consistency checker and are somewhat cognizant of the effect of interleaving with GC). Followers (who don't see reads happening) will need to be instructed to trigger a similar process (easy).
  2. For the consistency checker to avoid doing lots of seeks, it will want to consume the archive and actual keyspace in blocks, as opposed to jumping back and forth between them for every key. It also needs to treat archived versions exactly the same as regular versions (as we don't want to go through Raft to rewrite, so replicas may diverge here). That probably just means stripping the archive keyspace prefix before hashing and not actually using a hash function (but something that's commutative, think sum mod p; see the sketch below). This weakens the consistency checker, but I assume it can be made strong enough (for example, sum mod p becomes stronger if you change p between checks). Perhaps it's easier and not too expensive to merge-iterate through the archive and the recent keyspace, though.
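
A sketch of the order-insensitive checksum idea from point 2 (illustrative only; the archive prefix and hash choice are invented, and the real consistency checker works quite differently): strip the archive prefix before hashing each version so that a version contributes the same value wherever it lives, then combine the per-kv hashes with a commutative operation such as addition mod p.

```go
package main

import (
    "fmt"
    "hash/fnv"
    "strings"
)

const archivePrefix = "/archive" // hypothetical store-local prefix
const p = 1<<61 - 1              // a large prime for the commutative sum

// kvHash hashes a single version after stripping the archive prefix, so a
// version hashes the same whether it lives in the active or passive keyspace.
func kvHash(key string, ts int, value string) uint64 {
    key = strings.TrimPrefix(key, archivePrefix)
    h := fnv.New64a()
    fmt.Fprintf(h, "%s/%d/%s", key, ts, value)
    return h.Sum64() % p
}

// checksum combines per-kv hashes commutatively, so replicas that archived
// different subsets of old versions still agree on the result.
func checksum(hashes []uint64) uint64 {
    var sum uint64
    for _, h := range hashes {
        sum = (sum + h) % p
    }
    return sum
}

func main() {
    // Replica 1 kept the old version in the active keyspace; replica 2 archived it.
    r1 := []uint64{kvHash("/table/1/a", 5, "new"), kvHash("/table/1/a", 1, "old")}
    r2 := []uint64{kvHash("/archive/table/1/a", 1, "old"), kvHash("/table/1/a", 5, "new")}
    fmt.Println(checksum(r1) == checksum(r2)) // true
}
```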

@petermattis
Collaborator

I'm not sure that we need column families unless we want to configure the archive differently or drop whole column families quickly.

That's a good point. We really just need a separate keyspace.

@petermattis, any obvious flaws in the above plan?

Depending on where you implement the parallel keyspace logic, you probably won't be able to snapshot/rebalance so you'd be limited to a single-node cluster. That seems fine for a prototype and to explore the problem space.

triggering a rewrite of old versions of a key when this key is written is the easy case. The harder case is when there is a "minefield" of tombstones as the queue workload tends to create. There, none of the writes will trigger a rewrite, so it has to be done on the scans.

For a queue, could the deletion tombstone be placed in the "archive" keyspace? That would leave the "live" keyspace containing only the latest version of visible values.

For the consistency checker to avoid doing lots of seeks, it will want to consume the archive and actual keyspace in blocks, as opposed to jumping back and forth between them for every key.

I've been imagining we'd implement this down in the C++ code and have some sort of "merged iterator". The upper layers of the code could be completely unaware of what was happening.
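
The real implementation would live next to RocksDB in C++, but the shape of such a merged iterator is easy to sketch (shown in Go with toy types; nothing here is the actual storage API): surface the union of the active and archive keyspaces as a single sorted stream so the layers above see one keyspace.

```go
package main

import "fmt"

// mergedIter yields keys from two already-sorted keyspaces (active and
// archive) as one sorted stream, the way a merged iterator over two column
// families or key prefixes would. Types are illustrative, not storage code.
type mergedIter struct {
    active, archive []string
    i, j            int
}

func (m *mergedIter) valid() bool { return m.i < len(m.active) || m.j < len(m.archive) }

func (m *mergedIter) next() string {
    switch {
    case m.i >= len(m.active):
        m.j++
        return m.archive[m.j-1]
    case m.j >= len(m.archive):
        m.i++
        return m.active[m.i-1]
    case m.active[m.i] <= m.archive[m.j]:
        m.i++
        return m.active[m.i-1]
    default:
        m.j++
        return m.archive[m.j-1]
    }
}

func main() {
    it := &mergedIter{
        active:  []string{"a@5", "c@9"},        // live versions
        archive: []string{"a@1", "b@2", "c@3"}, // older versions and tombstones
    }
    for it.valid() {
        fmt.Println(it.next()) // a@1 a@5 b@2 c@3 c@9 (lexicographic on these toy keys)
    }
}
```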

@tbg
Member

tbg commented Jan 13, 2018

For a queue, could the deletion tombstone be placed in the "archive" keyspace? That would leave the "live" keyspace containing only the latest version of visible values.

We can do that. The downside is that it penalizes all writes as those have to check the archive.

I've been imagining we'd implement this down in the C++ code and have some sort of "merged iterator". The upper layers of the code could be completely unaware of what was happening.

If you think that that can be made efficient enough, that seems preferable to me too.

@petermattis
Collaborator

We can do that. The downside is that it penalizes all writes as those have to check the archive.

Good point. That kind of stinks.

If you think that that can be made efficient enough, that seems preferable to me too.

Do you have an alternative in mind? Doing this in C++ seems good both for efficiency and to keep most of the system unaware of what is going on.

@tbg
Member

tbg commented Jan 13, 2018

Do you have an alternative in mind? Doing this in C++ seems good both for efficiency and to keep most of the system unaware of what is going on.

No, I'm perfectly happy doing this in C++.

@petermattis
Collaborator

@tschottdorf To clarify: I think the reading side can be done entirely in C++. The writing side will probably have to be done in Go because all of our logic for MVCC writes is in Go.

Anyways, I'm eager to see what you prototype. I think we need to address the performance degradation from large numbers of versions in 2.1.

@tbg
Member

tbg commented Jan 13, 2018

@petermattis we're on the same page.

@tbg
Member

tbg commented Jan 18, 2018

Some more concrete thoughts on this.

Riffing on the terminology from Ressi, the key space would be (logically) divided into an active and a passive generation (potentially simulating a later separate column family). Practically (to reduce migration concerns) this means that the active generation is the keyspace we use now (i.e. userspace keys are mapped directly into MVCC-encoded keys), and the passive generation has some store-local key prefix (i.e. we prepend a store-local key prefix to the "active" sibling key).

The basic strategy is straightforward: we want the newest data in the active generation, and have most old versions and deletion tombstones in the passive one. If we additionally have a criterion using which a read can decide when it is safe to skip the passive generation, we get the desired benefits if the criterion is usually satisfied for reads at current timestamps.

A kv pair is called "live" if it is the most recent version of that key, unless it is a deletion tombstone, in which case it is never live. In other words, where a read at timestamp infinity would see a value, the corresponding kv pair is live.

Writes always go to the active generation. Live data is always in the active generation. Non-live kv pairs may live in the passive generation (but requiring that they always are is likely too strict a requirement, see below).

When does a kv pair that is non-live migrate from the active to the passive generation? One straightforward idea is to amortize this cost over the write that makes the version non-live: when a new key is written, the writer also moves the previous version (which it reads anyway) into the passive generation (with an important tweak for tombstones, see below).

As for the "read criterion", a simple one to decide whether to check the passive generation on a read is to keep track of a maximum timestamp[1] at which any of the keys in the passive generation would be readable. For example, if a kv pair at t=1 is overwritten at t=5, then we would have to bump the max timestamp to t=4 (for a read at 4 would see the value at t=1, a read at 5 wouldn't). If the new write at t=5 is a deletion tombstone, this is the same, but we also get to move the deletion tombstone itself into the passive generation[2].
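
A minimal sketch of that read criterion, with invented names (not engine code): track the highest timestamp at which anything in the passive generation is still visible; an overwrite at time t bumps that bound to t-1, and a read consults the passive generation only when its timestamp is at or below the bound.

```go
package main

import "fmt"

// passiveBound tracks the maximum timestamp at which any kv pair in the
// passive generation would still be visible to a reader.
type passiveBound struct{ maxTS int }

// onArchive is called when a version is moved to the passive generation
// because a newer write at overwriteTS shadowed it. Readers at timestamps
// strictly below overwriteTS could still see the archived version.
func (b *passiveBound) onArchive(overwriteTS int) {
    if overwriteTS-1 > b.maxTS {
        b.maxTS = overwriteTS - 1
    }
}

// needsPassive reports whether a read at readTS must also look at the
// passive generation.
func (b *passiveBound) needsPassive(readTS int) bool { return readTS <= b.maxTS }

func main() {
    var b passiveBound
    // A value written at t=1 is overwritten at t=5 and moved to the archive.
    b.onArchive(5)
    fmt.Println(b.needsPassive(4)) // true: a read at t=4 sees the t=1 version
    fmt.Println(b.needsPassive(7)) // false: recent reads skip the archive entirely
}
```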

This seems straightforward, but note that every (overwriting) write updates the max timestamp to the current time. This means that a majority of writes won't get the benefit of skipping the passive generation, and we're back fairly close to square one (though something has been won -- in the absence of overwrites, everything works as expected).

To address this, the strategy that comes to mind is to not move versions immediately after they are shadowed. Instead, defer this until a few versions accumulate (or until some versions are at least T seconds old) and then rewrite. This makes it quite unattractive to amortize the work with the writes themselves, as we're likely not willing to a) burden the write with the large overhead of reading and rewriting potentially many versions or b) keep more metadata for each write.

Instead, what can be done is to trigger the conversion through reads. Reading is where these non-live versions really cause pain, so a read is naturally good at discovering that a cleanup is necessary. This has some desirable properties (and some less desirable ones):

  • No overhead when nothing is to be done
  • Better potential use of batching (important when considering columnarization)
  • nothing will be moved into the old generation until it's read, so a COUNT(*) after months of insert-delete will be expensive. (This can be mitigated by making our regular maintenance operations trigger this cleanup as well; concretely, I'm looking at the consistency checker. The time scales are very different, though.)
  • more background jobs = potentially more surprises and perf blips

Interestingly, there is a nice way to at least prototype this: let the scan construct a WriteBatch that, when applied, would rewrite the offending kv pairs. Whenever a WriteBatch is returned from a Scan (with appropriate metadata about the new max timestamp), simply apply it to the underlying engine after returning to the client. (To make this 100% correct, some synchronization is needed between bumping the max_timestamp and in-flight reads still using the previous one, but that's not required for prototyping.) A simple policy is to move values only when that wouldn't bump max_timestamp to within (say) 5s of the current time.

[2]: are there any issues with this regarding newer writes that could write at t<=5 now? I don't think so, because of the timestamp cache, assuming that gets populated on a deletion -- which it may currently not.
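
For completeness, a sketch of the scan-driven rewrite described above (toy data structures; a real prototype would emit an engine WriteBatch keyed under the store-local archive prefix): the scan collects the shadowed versions it stepped over, applies the policy of not bumping the max timestamp too close to the present, and hands the resulting moves back to the caller to apply after replying to the client.

```go
package main

import (
    "fmt"
    "sort"
)

// move describes relocating one shadowed version from the active to the
// passive keyspace; a real prototype would encode these in a WriteBatch.
type move struct {
    key string
    ts  int
}

// scanResult carries the rows plus the deferred cleanup work and the
// proposed bump of the passive generation's max readable timestamp.
type scanResult struct {
    rows     []string
    moves    []move
    newMaxTS int
}

// scan returns live rows and, for keys whose shadowed versions are old
// enough, the moves that would relocate them. cutoff is "now minus 5s" in
// the policy above: versions are moved only if the resulting readability
// bound (conservatively newestTS-1) is at or below it.
func scan(versions map[string][]int, values map[string]string, cutoff int) scanResult {
    var res scanResult
    keys := make([]string, 0, len(versions))
    for k := range versions {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    for _, key := range keys {
        tss := versions[key] // newest-first timestamps for this key
        if v, ok := values[key]; ok {
            res.rows = append(res.rows, v)
        }
        bound := tss[0] - 1
        if len(tss) > 1 && bound <= cutoff {
            for _, ts := range tss[1:] {
                res.moves = append(res.moves, move{key: key, ts: ts})
            }
            if bound > res.newMaxTS {
                res.newMaxTS = bound
            }
        }
    }
    return res
}

func main() {
    versions := map[string][]int{
        "a": {9, 2, 1}, // newest write is recent: leave its old versions alone
        "c": {4, 2, 1}, // newest write is old: its shadowed versions can move
    }
    values := map[string]string{"a": "live-a", "c": "live-c"}
    res := scan(versions, values, 5)
    fmt.Println("rows:", res.rows)                                // [live-a live-c]
    fmt.Println("moves:", res.moves, "new max ts:", res.newMaxTS) // moves c@2, c@1; max ts 3
    // The caller applies the moves to the engine after replying to the
    // client, then publishes the new max timestamp.
}
```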

@tbg
Member

tbg commented Jan 18, 2018

PS my starting point for this will likely be a cmd/workload loadgen (and thus fixture) that clearly exhibits this problem.

@petermattis
Collaborator

Nice writeup!

Interestingly, there is a nice way to at least prototype this: let the scan construct a WriteBatch that, when applied, would rewrite the offending kv pairs. Whenever a WriteBatch is returned from a Scan (with appropriate metadata about the new max timestamp), simply apply it to the underlying engine after returning to the client.

Would you send that WriteBatch to the other replicas? Or would you let each replica migrate keys from active to passive individually? Because the data is logically the same, I don't think we need to synchronize the movement on the replicas, though not doing so might cause performance blips when the leaseholder changes.

Ah, re-reading I see you're thinking about this in the context of a prototype. Doing something easy for a prototype seems appropriate.

PS my starting point for this will likely be a cmd/workload loadgen (and thus fixture) that clearly exhibits this problem.

The MVCCScan benchmark also clearly shows the problem with large numbers of versions, though perhaps that is too low of a level to work at.

@tbg tbg added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Sep 27, 2018
@tbg tbg added S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. and removed S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Jan 8, 2019
@andreimatei
Contributor

We should have something about this in the output of EXPLAIN ANALYZE. @asubiotto, do you want it?
There's talk above about an attempt to get tracing info about this out of RocksDB, but Tobi said it was slow for some reason.

@asubiotto
Contributor

@andreimatei it was slow to get tracing info? Adding this information to EXPLAIN ANALYZE is not a problem, we just need some way to access it from the kvFetcher.

@github-actions

github-actions bot commented Jun 9, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@jordanlewis
Member

We're in the process of adding this information to EXPLAIN ANALYZE, finally: #64503.

@jordanlewis
Member

cc @sumeerbhola, not sure if this one's useful to keep around any longer but just FYI.

@sumeerbhola
Collaborator

There has been some thinking specifically on separating older versions inside Pebble, for the purpose of not degrading reads with a large number of versions, but it isn't captured in an issue. I'll take the TODO to create a separate issue for it. Though if we could do compaction-time GC (#57260), it isn't clear whether this would be needed.

@a-robinson
Contributor Author

Has any thought been given to using a metric along the lines of the MVCC step count from #64503 as a signal into the MVCC GC queue's scoring mechanism?

I had an issue recently where there were a lot of rows with primary keys of the form foo:<UUID> being created and deleted relatively quickly, and then a lot of queries along the lines of SELECT * FROM t WHERE key > 'foo:' AND key < 'foo;' LIMIT 1.

Those SELECTs got very expensive because each had to scan over tens of thousands of garbage MVCC entries, but MVCC GC never kicked in automatically on the range: even though the overwhelming majority of these entries were older than the gc.ttlseconds threshold and were burning a ton of CPU on these queries, the garbage keys represented only a very small fraction of the data in the range, so the range never got a particularly high MVCC GC score.

This particular instance is also obviously fixable with better application schema/query design, which I'll take care of, but it does seem like the amount of MVCC garbage being encountered by queries on a range could be a very valuable signal to the GC queue. That way, the garbage that's actually being encountered most often and wasting the most resources could be prioritized first.
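
A hypothetical illustration of that scoring idea (the function, weights, and numbers are all made up; this is not the mvccGCQueue's actual formula): blend the existing garbage-fraction signal with a measure of how much garbage reads are actually stepping over, so a small-but-hot pile of tombstones can still push a range over the queue's threshold.

```go
package main

import "fmt"

// gcScore is a made-up blend of two signals: the fraction of a range's data
// that is garbage, and the rate at which reads step over garbage versions
// (e.g. derived from the MVCC step counts in #64503).
func gcScore(garbageBytes, totalBytes, garbageStepsPerSec float64) float64 {
    fraction := garbageBytes / totalBytes
    // Weight chosen arbitrarily for illustration: heavy read traffic over
    // garbage should be able to prioritize a range even when the garbage
    // fraction alone would not.
    return fraction + garbageStepsPerSec/10000
}

func main() {
    // A 256 MiB range with only 4 MiB of garbage scores low on fraction alone...
    fmt.Printf("idle range: %.2f\n", gcScore(4<<20, 256<<20, 0))
    // ...but the same range with scans stepping over ~50k garbage keys/sec
    // would be prioritized.
    fmt.Printf("hot range:  %.2f\n", gcScore(4<<20, 256<<20, 50000))
}
```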

On the other hand, if cockroachdb/pebble#1170 actually lands in the near future, then I suppose this will be a moot point.


And on a less important but related note, I found that manually triggering an MVCC GC run on the range wasn't enough to reduce the CPU usage, because the MVCC garbage was still in lower levels of the LSM. I had to also trigger a manual LSM compaction to actually speed up the queries. I assume that cockroachdb/pebble#918 would have helped with this latter point.

@tbg
Member

tbg commented Mar 14, 2023

Thanks @a-robinson! I filed #98561 to track your idea.

@sumeerbhola
Collaborator

On the other hand, if cockroachdb/pebble#1170 actually lands in the near future, then I suppose this will be a moot point.

This is currently expected in v23.1, but is mainly useful for scans where keys are small compared to the values. Any rough estimates for the key and value sizes in the index being scanned in your query?

@a-robinson
Contributor Author

Any rough estimates for the key and value sizes in the index being scanned in your query?

~128 byte keys (as represented in the KV/storage layer) and ~512 bytes for each value.

@benesch
Contributor

benesch commented Mar 28, 2023

I strongly suspect that we just ran into this problem at @MaterializeInc. Whenever we scale up our production cluster (in this case from 6 to 12 nodes), one or two nodes will take a disproportionate share of the CPU load, despite small (10m) GC thresholds. I don't have hard proof, but CPU profiles of the affected nodes show a surprising amount of time in storage.mvccScanToBytes relative to unaffected nodes.

Internal support ticket for the curious: https://support.cockroachlabs.com/hc/en-us/requests/16470

And on a less important but related note, I found that manually triggering an MVCC GC run on the range wasn't enough to reduce the CPU usage, because the MVCC garbage was still in lower levels of the LSM. I had to also trigger a manual LSM compaction to actually speed up the queries. I assume that cockroachdb/pebble#918 would have helped with this latter point.

I wanted to give this a try to validate the theory, but it seems there's no way to trigger this on a Cockroach Cloud cluster?

@tbg
Member

tbg commented Mar 30, 2023

@benesch drive by comment, once #99726 is picked up it will be a bit easier to "conclusively" see whether there are more tombstone accesses on one store vs another.

Also, if you can get a stmt trace that you suspect is slow because of tombstones, it will have iterator stats printed into it which contain similar information today already.

@benesch
Contributor

benesch commented Apr 10, 2023

Awesome, thanks @tbg! We're running another load test in a month that I expect to tickle this. Will report back if we manage to get a repro.
