storage: add limits to skipped data iteration #68467
Conversation
This PR needs #68102 to land first because it only works once we have the ability to stop on an arbitrary key mid-export. I will update this one once that is ready. Regardless of its readiness, this functionality is distinct and can be reviewed independently, and I don't want to put it in the stack of changes in the previous PR. Upd: rebased over the necessary change on master already.
Force-pushed 4faae5c to a28ee45
Force-pushed 6fffb7e to b513076
Force-pushed 7bf1107 to e18e0ff
@dt David, do you think having an iteration limit should be a tenant/caller responsibility or a cluster responsibility? We have a similar limit, maxIntentCount, which limits how many intents can be collected during export to provide some protection against memory overuse; it is defined as a cluster setting because it protects servers from OOMing. For iteration limits that throttle export resource usage on a different dimension, should we stick to the same approach, or rather let the caller pass the limit explicitly? IIRC we wanted to give CPU-constrained clients the ability to spread out export requests, but it looks like this is more of a cluster-level than a request-level limit.
Force-pushed 46bcac8 to 1edc5dd
Reviewed 5 of 10 files at r2, 5 of 5 files at r4, 1 of 5 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @dt, and @sumeerbhola)
pkg/storage/engine.go, line 362 at r4 (raw file):
// ExportOptions contains options provided to export operation.
type ExportOptions struct {
// StartKey determines the start of exported interval (inclusive).
nit: ... start of the exported ...
pkg/storage/engine.go, line 363 at r4 (raw file):
type ExportOptions struct {
// StartKey determines the start of exported interval (inclusive).
// StartKey.Timestamp is either empty which represent starting from potential
nit: represents starting from a potential ...
pkg/storage/engine.go, line 365 at r4 (raw file):
// StartKey.Timestamp is either empty which represent starting from potential
// intent and continuing to versions or non-empty, which represents starting
// from particular version.
nit: from a particular version.
pkg/storage/engine.go, line 367 at r4 (raw file):
// from particular version.
StartKey MVCCKey
// EndKey determines end of exported interval (exclusive).
nit: determines the end ...
pkg/storage/engine.go, line 369 at r4 (raw file):
// EndKey determines end of exported interval (exclusive).
EndKey roachpb.Key
// StartTS and EndTS determine exported time range as (startTS, endTS]
nit: missing period.
pkg/storage/engine.go, line 387 at r4 (raw file):
// If StopMidKey is false, once function reaches targetSize it would continue
// adding all versions until it reaches next key or end of range. If true, it
// would stop immediately when targetSize is reached and return a next versions
nit: ... the next versions ...
pkg/storage/mvcc_incremental_iterator.go, line 130 at r5 (raw file):
// Resume key is not necessarily a valid iteration key as we could stop in between
// eligible keys.
MaxAllowedIterations int64
I'm wary of introducing resource control settings that cannot be easily understood in terms of real resources.
maxIntentCount was not quite a real resource but could be understood as a limit on the return size.
Ideally, we would want to limit the cpu time and IO time spent in executing an operation that scans data. This would also fit in well with the current admission control which functions better when the execution of a BatchRequest
has a bounded size. But we don't have cpu information from golang and I can't see when we would have IO time information either.
But I think this could still be something more tangible that we can use not just for exports, but also for other scans and gets (requests for non-background operations could use a higher limit). We have some counts in pebble.IteratorStats
but they have multiple dimensions like step/seek, forward/reverse and it would be nicer to have a single dimension that was more relatable. I am thinking number of ssblock bytes "read" (we'd count all the bytes for an ssblock when loading it into the iterator) would be a good metric. If the setting puts a 100MB limit on it, it means something real. We don't expose this value via IteratorStats
, or track it in the low-level sstable iterators but it can be added. One downside is that it does not count the bytes iterated in the memtable -- I don't think that matters in production settings at all since the memtable is tiny relative to the rest of the store.
@jbowens @itsbilal for other opinions.
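To make the proposed metric concrete, here is a purely illustrative Go fragment of what a block-bytes budget could look like. None of these names exist in pebble or CockroachDB, and the hook point into the sstable iterators is hypothetical; this is only a sketch of the accounting being suggested.

package sketch

// blockBytesBudget is a hypothetical accounting of "ssblock bytes read":
// every time an sstable block is loaded into the iterator, its full length is
// charged against a byte limit (e.g. 100MB from a cluster setting).
type blockBytesBudget struct {
	loaded int64
	limit  int64
}

// onBlockLoad would be invoked by the (hypothetical) low-level block loader;
// it reports whether the budget has been exceeded. Memtable bytes are not
// counted, matching the caveat in the comment above.
func (b *blockBytesBudget) onBlockLoad(blockLen int) bool {
	b.loaded += int64(blockLen)
	return b.limit > 0 && b.loaded > b.limit
}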
Force-pushed 1edc5dd to 816a7a0
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, @jbowens, and @sumeerbhola)
pkg/storage/mvcc_incremental_iterator.go, line 130 at r5 (raw file):
Previously, sumeerbhola wrote…
I like the idea of limiting on block-bytes loaded, but as far as a KV-level work limit goes, I think capping the number of keys compared for inclusion in the returned result seems reasonable; we already have limits on the number of keys returned by KV, so limiting the number of keys that KV will examine for return, even if they aren't ultimately returned, doesn't seem like too much of a conceptual stretch?
Ultimately I think both limits would be desirable -- limiting block bytes loaded helps cap the IO and storage CPU footprint of the request, but if, for example, we're in some very compressed, valueless index-key blocks, I could see wanting to limit the number of iterations, not just loaded block bytes, to limit KV's cpu time too.
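For contrast, a minimal sketch of the key-count variant being discussed here, in the spirit of the MaxAllowedIterations field this PR originally added; the names below are illustrative, not the actual implementation.

package sketch

// iterationLimiter caps the number of keys examined, whether or not they end
// up in the returned result. Illustrative only.
type iterationLimiter struct {
	examined int64
	max      int64 // 0 means no limit
}

// tick is called once per key the iterator looks at and reports true once the
// examination budget is exhausted.
func (l *iterationLimiter) tick() bool {
	if l.max <= 0 {
		return false
	}
	l.examined++
	return l.examined > l.max
}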
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, @jbowens, and @sumeerbhola)
pkg/storage/mvcc_incremental_iterator.go, line 130 at r5 (raw file):
Previously, dt (David Taylor) wrote…
My 2p on what I see as our goal here.
We want to limit the resources used by export requests to minimize the impact on higher priority requests. The most obvious resources here are CPU and IO, but also the time that we hold range locks. I'm not sure what the memory impact is at the moment. Depending on range content we could have many small values that we need to go through, which incurs extra CPU load when skipping data, or large payloads, where we hit higher IO use.
We also need a way to expose that to MVCCIncrementalIterator, which is currently agnostic of the underlying storage and delegates work to storage iterators.
The current approach of just counting how many times we stepped avoids adding extra methods to iterators, but it is a bit ugly and only acts as a proxy for the underlying complexity.
A better approach may be to get a "counters" object when constructing the iterator in reader.NewMVCCIterator() and then just ask that object whether we have reached the target?
The second aspect that I don't particularly like is that we explicitly pass limits, which couples the caller with the iterator. A better way could be to have some reading "profile" that would be configured separately and could carry as many resource limits as needed, so that we don't have to change the whole call stack if we find a way to expose something more useful.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @sumeerbhola)
pkg/storage/mvcc_incremental_iterator.go, line 130 at r5 (raw file):
The most obvious resources are CPU and IO in this case, but also a time that we held range locks.
How about we measure walltime. Unless the system is very overloaded, it should be a good proxy for cpu+io time.
The second aspect that I don't particularly like is that we explicitly pass limits which couples caller with iterator while a better way could be to have some reading "profile" that would be configured separately and could have as many resource limits as we currently have so that we don't have to change the whole call stack if we find a way to expose something more useful.
This seems worthwhile. We could add something like a ResourceTimeLimiter struct, which for now would only do walltime, and pass it through MVCCIncrementalIteratOptions/MVCCScanOptions/MVCCGetOptions (like we do with the mon.BoundAccount in the latter two). This would be called every N loop iterations to amortize the cost of fetching the "resource time", to check if some resource time had been exceeded. Later we could add other dimensions to the ResourceTimeLimiter.
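A minimal Go sketch of that proposal, assuming wall time only and the amortized clock check described above; the ResourceTimeLimiter name and the every-N-iterations idea come from the comment, while the field and method names here are illustrative.

package sketch

import "time"

// clockCheckEveryNIterations caps how often the clock is actually read; the
// value mirrors the constant later added in the PR.
const clockCheckEveryNIterations = 100

// ResourceTimeLimiter tracks only wall time for now; other dimensions could
// be added later, as suggested above.
type ResourceTimeLimiter struct {
	deadline   time.Time
	iterations int
}

func NewResourceTimeLimiter(budget time.Duration) *ResourceTimeLimiter {
	return &ResourceTimeLimiter{deadline: time.Now().Add(budget)}
}

// Exhausted is called once per loop iteration. The clock is consulted only
// once every clockCheckEveryNIterations calls, which amortizes its cost and
// means the iterator always advances a number of steps before it can be cut
// short.
func (l *ResourceTimeLimiter) Exhausted() bool {
	l.iterations++
	if l.iterations%clockCheckEveryNIterations != 0 {
		return false
	}
	return time.Now().After(l.deadline)
}

The modulo check doubles as the forward-progress mitigation mentioned in the following comment: the first N-1 calls never consult the clock, so a request can never be cut short before it has made any progress.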
Force-pushed 21365be to bfceb08
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @sumeerbhola)
pkg/storage/mvcc_incremental_iterator.go, line 130 at r5 (raw file):
Previously, sumeerbhola wrote…
I gave it a go and it looks ok, I think. There's still an issue with ensuring that we always advance, but it is currently mitigated by throttling the checks so that we always move N times before the first check is done.
Maybe we could pass the current key to Exhausted() so that it could check, or have this check in the iterator itself, which would require saving the start key and comparing the current key to the start once we hit the limit.
If there were other checks, the limiter might need a view of the underlying iterator stats injected, or the whole creation sequence turned upside down.
Force-pushed 0b3478e to 5c1e92b
Reviewed 3 of 11 files at r10, 2 of 7 files at r12, 1 of 532 files at r14, 2 of 529 files at r15, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @dt, @itsbilal, @jbowens, and @sumeerbhola)
pkg/storage/mvcc_incremental_iterator.go, line 41 at r15 (raw file):
//
// If iteration resource limit is requested, iterator would return an error as
// soon as limit is reached. The error will contain a resume key which could be
why would this be outside the requested span?
pkg/storage/mvcc_incremental_iterator.go, line 132 at r15 (raw file):
// ClockCheckEveryNIterations defines for how many iterations we could cache
// current time when performing iteration wall clock time limiting.
const ClockCheckEveryNIterations = 100
can this be package private?
pkg/storage/mvcc_incremental_iterator.go, line 240 at r15 (raw file):
// StartKey must also be populated with resume key. This is needed to ensure progress
// for cases when initial seek would exhaust resources and that subsequent call would
// restart from further position.
I didn't understand this comment. Is this trying to say that we don't want to stop until we are past startKey? Given this is a SimpleMVCCIterator, the first call on this needs to be SeekGE. We can keep a notFirstCallToSeekGE bool in MVCCIncrementalIterator and pass it to advance().
Something like
func (...) SeekGE(...) {
...
ignoreLimiter := !i.notFirstCallToSeekGE
i.notFirstCallToSeekGE = true
i.advance(ignoreLimiter)
}
I think this is better than doing additional key comparisons, and doesn't require additional state in options.
pkg/storage/mvcc_incremental_iterator.go, line 242 at r15 (raw file):
// restart from further position. // Note that resume key is not necessarily a valid iteration key as we could stop in // between eligible keys because of timestamp range limits.
because of resource limits.
Force-pushed ac7f201 to b25cbef
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @itsbilal, @jbowens, and @sumeerbhola)
pkg/storage/mvcc_incremental_iterator.go, line 41 at r15 (raw file):
Previously, sumeerbhola wrote…
why would this be outside the requested span?
If we don't use TBI, then when we do Next on the underlying iter it may move to an mvcckey that is out of bounds for the iterator. We then do advance, which will keep iterating until it hits the next valid mvcckey, but the resource limiter could cut it short, so we end up with a key that is not eligible for inclusion in the results.
With TBI it should always be within limits, I think.
pkg/storage/mvcc_incremental_iterator.go, line 132 at r15 (raw file):
Previously, sumeerbhola wrote…
can this be package private?
Done.
pkg/storage/mvcc_incremental_iterator.go, line 240 at r15 (raw file):
Previously, sumeerbhola wrote…
I like that, it looks cleaner at least on the outside.
pkg/storage/mvcc_incremental_iterator.go, line 242 at r15 (raw file):
Previously, sumeerbhola wrote…
because of resource limits.
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @itsbilal, @jbowens, and @sumeerbhola)
pkg/storage/mvcc_incremental_iterator.go, line 41 at r15 (raw file):
Previously, aliher1911 (Oleg) wrote…
I was thinking about it, and it could be fixed by adding checks before the error is raised and adjusting the key in the following way: if the timestamp is lower than requested, move to the next key and use the latest requested timestamp; if the timestamp is higher, move to the highest requested timestamp without changing the key; and if the key moved outside of the requested range as a result of the adjustment, don't err and finish gracefully.
That would duplicate the logic we have within the loop. Do you think the added complexity is justified for consistency?
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @dt, @itsbilal, @jbowens, and @sumeerbhola)
pkg/storage/mvcc_incremental_iterator.go, line 41 at r15 (raw file):
If we don't use TBI then when we do Next on underlying iter it may move to the mvcckey that is out of bounds for iterator.
We set an upper bound for the iter, yes? So I am not sure why that would happen.
We then do advance which will start iterating until it hits next valid mvcckey, but resource limiter could cut it short so we end up with the key that is not eligible for inclusion into results.
But I think I agree that the general structure of this change is error prone. The MVCCIncrementalIterator is maintaining some invariants that are true at the end of each seek and next* function. By checking for resource limits in the middle of these functions after some work is done, we are inviting complexity regarding those invariants. An easy fix would be to lift the resource limit checking to the start of each seek and next* function and avoid doing the checking for the first seek, to ensure forward progress.
That brings me to a related concern: the code in pebble.go isn't necessarily trivial since it has to deal with the fact that MVCCIncrementalIterator can complain about resource limits at any call to next*. There are 2 additional users of MVCCIncrementalIterator, in catchup_scan.go and mvcc.go. The former is probably not one that will use resource limits, but the latter eventually could. Can you look at the code to see if it would be simpler to lift the resource limit checking into the callers? With the current approach we can't avoid modifying the callsites, so why not give them full control. That way, if they don't want the resource checking to happen before some particular next call because it is harder to maintain some resumption invariant, they can do so.
Let me know if you want to discuss synchronously.
Force-pushed aa30b60 to 0d78564
Summarizing: we had an offline conversation with @sumeerbhola and he made fair points that the time bound iterator should reduce the severity of effects when we need to skip over large amounts of data. Without such a requirement, the code that tracks resources can be pulled out a level above so that the export itself can check how long we have been iterating and stop if needed. The second consideration is that mvcc_incremental_iterator maintains some invariants regarding its state and what it returns, and breaking out of the loop violates them. Having such a loosely defined component is not good, and it would impact planned work to rewrite it in the near future. Based on that discussion I made changes to pull the code out to the pebble part.
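A rough sketch of the caller-side shape this implies: the export loop (pebble.go in this PR) owns the limit check and hands back a resume key, while MVCCIncrementalIterator stays unchanged. The types below are simplified stand-ins, not the real storage interfaces.

package sketch

import "bytes"

// Simplified stand-ins for the real storage types; illustrative only.
type mvccKey struct {
	key []byte
	ts  int64
}

type incrementalIter interface {
	SeekGE(mvccKey)
	Next()
	Valid() (bool, error)
	UnsafeKey() mvccKey
}

// limiter matches the wall-time limiter sketched earlier.
type limiter interface{ Exhausted() bool }

// exportChunk keeps the resource check in the export loop rather than inside
// the iterator, so the iterator's invariants stay intact; when the budget
// runs out mid-chunk, a resume key is returned for the next request.
func exportChunk(it incrementalIter, start mvccKey, end []byte, lim limiter) (resume mvccKey, done bool, err error) {
	for it.SeekGE(start); ; it.Next() {
		if ok, err := it.Valid(); err != nil || !ok {
			return mvccKey{}, true, err
		}
		k := it.UnsafeKey()
		if bytes.Compare(k.key, end) >= 0 {
			return mvccKey{}, true, nil
		}
		// Checked between iterations; with the amortized clock check the
		// first iterations always run, which guarantees forward progress.
		if lim != nil && lim.Exhausted() {
			return mvccKey{key: append([]byte(nil), k.key...), ts: k.ts}, false, nil
		}
		// ... filter the version by (StartTS, EndTS] and append it to the SST ...
	}
}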
Thanks for summarizing. Adding to the considerations in the previous comment: the clients of
Generally looks good. A few minor comments
Force-pushed 0d78564 to ada0880
lgtm
Force-pushed ada0880 to 351bba6
Number of arguments to ExportMVCCToSst is too large. This diff moves them into a struct to improve readability. Release note: None Release justification:
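For reference, an abridged sketch of the resulting options struct, reconstructed from the doc comments quoted in the review above plus the limiter this PR adds; the real struct has more fields, and exact names and types may differ slightly.

// Abridged; reconstructed from the doc comments quoted in the review above.
type ExportOptions struct {
	// StartKey determines the start of the exported interval (inclusive).
	// StartKey.Timestamp is either empty, which represents starting from a
	// potential intent and continuing to versions, or non-empty, which
	// represents starting from a particular version.
	StartKey MVCCKey
	// EndKey determines the end of the exported interval (exclusive).
	EndKey roachpb.Key
	// StartTS and EndTS determine the exported time range as (StartTS, EndTS].
	StartTS, EndTS hlc.Timestamp
	// TargetSize bounds the size of a single exported chunk. If StopMidKey is
	// true, the export may stop between versions of a key once TargetSize is
	// reached and return a resume key.
	TargetSize uint64
	StopMidKey bool
	// ResourceLimiter (added by this PR) lets the export stop early and
	// return a resume span once its wall-time budget is exhausted.
	ResourceLimiter ResourceLimiter
}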
Previously, when exporting data from pebble, the iterator could spend unbounded time skipping entries regardless of export size limits. This is becoming a problem for resource-constrained clusters, where low-priority requests like exports, which are used by backups, interfere with high-priority workloads. If we want to throttle backups we need to be able to limit how many underlying operations we perform per request. This change adds an optional iteration limit to export. Once the limit is reached, export will end its current chunk and return a resume span even if the desired size is not reached. The current limiter uses wall clock time to stop iteration. Release note: None
Export requests could iterate over an unbounded amount of data in storage. This diff adds the kv.bulk_sst.max_request_time hidden cluster setting to limit how long an export can run irrespective of how much data is actually exported. Release note: None
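A hedged sketch of how the setting named in that commit message could be registered; the registration signature shown is approximate for the settings package of that era, the default value is a placeholder, and the hidden-setting and request wiring are omitted.

package sketch

import "github.com/cockroachdb/cockroach/pkg/settings"

// Approximate sketch only; the exact settings API and how the value is wired
// into ExportRequest evaluation may differ from the actual change.
var maxRequestTime = settings.RegisterDurationSetting(
	"kv.bulk_sst.max_request_time",
	"limits how long a single export request may spend iterating over data, "+
		"returning a resume span once the budget is exhausted (0 disables the limit)",
	0,
)

At evaluation time, the duration would be read from the cluster settings and turned into the wall-clock limiter that is passed to the export via its options.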
Force-pushed 351bba6 to e608905
bors r=sumeerbhola
Build succeeded:
Previously, when iterating the engine using MVCCIncrementalIterator, the
caller could skip large amounts of non-matching data, which would result in
"unbounded" resource usage.
This is becoming a problem for resource-constrained clusters, where low
priority requests like exports, which are used by backups, interfere with
high priority workloads. If we want to throttle backups we need to be able
to limit how many underlying operations we perform per request.
This change adds an optional iteration limit to the iterator. Once the
limit is reached, the iterator will return an error. The error provides a
resume key to continue iteration in the next request.
Release note: None
Fixes #68234