sql: missing memory accounting for internal executor (+crash w/ crdb_internal.jobs) #48595

knz · 2020-05-08T17:15:08Z

(NB: this issue is both applicable to the 20.1 tree which materializes the vtables lazily, and the previous approach)

Describe the problem

In PR #45758 new code was added to account for memory of the materialized rows in crdb_internal.jobs.

However the calls to the memory monitor API is performed after all the rows have been loaded in RAM already.

If there was not enough RAM to start with, the node crashes with an OOM error before the point where the monitor API is called.

This issue affects all uses of the internal executor, not just crdb_internal.jobs.

This is why the OOM crashes upon large system jobs tables are still happening despite #45758 being merged.

Expected resolution

The proper approach is to enhance the InternalExecutor to ~~use a SQL memory monitor when filling its bufferedCommandResult~~ stream results to its caller, which is not done currently.

knz · 2020-05-08T17:16:12Z

cc @jordanlewis for triage

cc @andreimatei: maybe you can make an architectural suggestion as to where to find a good monitor inside (r *bufferedCommandResult) AddRow()

andreimatei · 2020-05-08T17:25:03Z

What I would encourage would be making the InternalExecutor stream its results back instead of buffering them (in that bufferedCommandResult). Like, stream them out on a channel, perhaps with the channel's buffering controlled by the caller, or even through actually implementing a tiny database/sql driver and then using the standard go Rows object.
I know @spaskob was looking at using a channel somehow a while ago.

knz · 2020-05-08T17:27:16Z

Ok a channel is also good. In any case, this fix must be lightweight enough as to be backportable, so I wouldn't recommend a large architectural change.

knz · 2020-05-08T17:36:33Z

@dt and I had a separate discussion: as explained in #46367 it would be good if the solution here was not based on memory accounting, and instead use streaming as suggested by Andrei. This ensures that we don't force more memory errors onto users when the solution to this gets backported.

ajwerner · 2020-05-08T17:37:53Z

Your expected resolution: memory accounting in the internal executor would be great. It just didn't seem easy at the time. Not only are the results handed back from the InternalExecutor neither bufferred nor memory-accounted, the execution of the InternalExecutor isn't memory accounted at all.

We knew that the memory accounting was bogus when we added #45758. The idea was that it was a modest improvement for two reasons. At the time we also buffered deserialized copies of the data in RAM, additionally, the hope was that if we got to that point, at least we might prevent a different, concurrent request from pushing us over the limit. It was not a great patch but the hope was that it was better than nothing.

knz · 2020-05-08T17:38:37Z

It was not a great patch but the hope was that it was better than nothing.

I saw your comment in the code. I agree the patch was better than nothing. Unfortunately our CC SREs also discovered it was not sufficient. Hence this followup.

spaskob · 2020-05-08T22:45:27Z

This is the current state on my work on this: #48615
It turned out to be a lot of work so I have abandoned it for quite some in favor of 20.1 release work.
Let me know if it is worth revisiting.

knz · 2020-05-09T16:10:43Z

Yes I appreciate that it is a lot of work and the problem is somewhat complex to solve. However, the impact is high and would be a time/money saver for CC. This is definitely worth the investment. I encourage you to discuss this with other TLs/PMs to ensure it gets budgeted/scheduled appropriately.

59330: sql: make internal executor streaming r=yuzefovich a=yuzefovich This commit updates the internal executor to operate in a streaming fashion by refactoring its internal logic to implement an iterator pattern. A new method `QueryInternalEx` (and its counterpart `QueryInternal`) is introduced (both not used currently) while all existing methods of `InternalExecutor` interface are implemented using the new iterator logic. The communication between the iterator goroutine (the receiver) and the connExecutor goroutine (the sender) is done via a buffered (of 32 size in non-test setting) channel. The channel is closed when the connExecutor goroutine exits its run() loop. Care needs to be taken when closing the iterator - we need to make sure to close the stmtBuf (so that there are no more commands for the connExecutor goroutine to execute) and then we need to unblockingly drain the channel (since the connExecutor goroutine might be blocked on adding a row to the channel). After that we have to wait for the connExecutor goroutine to exit so that we can finish the tracing span. For convenience purposes, if the iterator is fully exhausted, it will get closed automatically. Addresses: #48595. Release note: None 60023: kv: wait for replica initialization in closed timestamp tests r=nvanbenschoten a=nvanbenschoten Fixes #60288. This commit updates tests in closed_timestamp_test.go to wait for replica initialization before returning replicas in `replsForRange`. This was causing flakiness on master. I bisected it back to c44b357. It appears that that change allowed for calls directly to `Replica.Send` (testing only) to make it deeper into replica code and eventually hit a panic in `Replica.checkSpanInRangeRLocked`. This commit fixes this in two ways: 1. it updates the closed timestamp tests to wait for replica initialization before calling directly into `Replica.Send`. Tests shouldn't be calling `Replica.Send` on uninitialized replicas. 2. it adds extra protection in `Replica.checkExecutionCanProceed` to throw an error if the replica is no initialized. This isn't strictly necessary, but it's cheap and limits the blast radius of getting this wrong elsewhere. Release note: None 60282: delegate: gate estimated_row_count on SHOW TABLES behind cluster setting r=rafiss a=otan Refs: #58189 Release note (sql change): Introduced a cluster setting sql.show_tables.estimated_row_count.enabled, which defaults to true. If false, estimated_row_count will not display on SHOW TABLES which improves performance. 60285: geo/wkt: add support for parsing multilinestrings with Z and M dims r=otan a=andyyang890 This patch extends the capabilities of the WKT parser to include parsing of multilinestrings with Z and M dimensions. Refs: #53091 Release note: None 60305: sql: use kv_node_status to determine node liveness for multiregion r=arulajmani a=otan This was a relic before the decision was made -- node liveness was detected by gossip instead. This changes the code to use SHOW REGIONS FROM CLUSTER, which aggregates from kv_node_status. Release note: None Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Oliver Tan <otan@cockroachlabs.com> Co-authored-by: Andy Yang <ayang@cockroachlabs.com>

60428: sqlutil: remove Query and QueryEx from InternalExecutor interface r=yuzefovich a=yuzefovich All usages of `sqlutil.InternalExecutor.Query` and `sqlutil.InternalExecutor.QueryEx` have been refactored: most of the changes are taking advantage of the iterator API, in a couple of places we now use `Exec` and `ExecEx` methods since in those places we only need the number of affected rows. In a few places where we still need to buffer all rows, the caller currently performs the buffering. In the follow-up commit `QueryBuffered` and `QueryBufferedEx` will be added into the interface (essentially renaming `Query` and `QueryEx` to make it explicit that the buffering occurs), and the custom logic will be removed. Addresses: #48595. Release note: None 60443: logictest: deflake crdb_internal.gossip_liveness test r=otan a=rafiss The regex needs to also match `0,0` Release note: None Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>

60572: sqlutil: remove QueryWithCols in favor of QueryRowExWithCols r=yuzefovich a=yuzefovich This commit refactors all usages of `sqlutil.InternalExecutor.QueryWithCols` method in favor of newly added `QueryRowExWithCols` since in almost all places where the former was used, we expected exactly one row. The only noticeable change is the refactor of `jobScheduler.executeSchedules` where we now use the iterator pattern. The difference there is that now it is possible to execute some schedules before encountering an error on the iterator (previously, we would buffer up all rows first), but this change is acceptable because each schedule is executed under savepoint to guard against errors in schedule planning/execution. Addresses: #48595. Release note: None Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>

60593: colexec: make external sorter respect memory limit better r=yuzefovich a=yuzefovich **colexec: register memory used by dequeued batches from partitioned queue** Previously, we forgot to perform the memory accounting of the batches that are dequeued from the partitions in the external sort (which could be substantial when we're merging multiple partitions at once and the tuples are wide) and in the hash based partitioner. This is now fixed. Additionally, this commit retains references to some internal operators in the external sort in order to reuse the memory under the dequeued batches (this will be beneficial if we perform repeated merging). Also, this commit fixes an issue with repeated re-initializing of the disk-backed operators in the disk spiller if the latter has been reset (the problem would lead to redundant allocations and not reusing of the available memory). Slight complication with accounting was because of the fact that we were using the same allocator for all usages. This would be quite wrong because in the merge phase we have two distinct memory usage with different lifecycles - the memory under the dequeued batches is kept (and reused later) whereas the memory under the output batch of the ordered synchronizer is released. We now correctly handle these lifecycles by using separate allocators. Release note (bug fix): CockroachDB previously didn't account for some RAM used when disk-spilling operations (like sorts and hash joins) were using the temporary storage in the vectorized execution engine. This could result in OOM crashes, especially when the rows are large in size. **colexec: make external sorter respect memory limit better** This commit improves in how the external sorter manages its available RAM. There are two different main usages that overlap because we are keeping the references to both at all times: 1. during the spilling/sorting phase, we use a single in-memory sorter 2. during the merging phase, we use the ordered synchronizer that reads one batch from each of the partitions and also allocates an output batch. Previously, we would give the whole memory limit to the in-memory sorter in 1. which resulted in the external sorter using at least 2x of its memory limit. This is now fixed by giving only a half to the in-memory sorter. The handling of 2. was even worse - we didn't have any logic that would limit the number of active partitions based on the memory footprint. If the batches are large (say 1GB in size), during the merge phase we would be using on the order of 16GB of RAM (number 16 would be determined based on the number of file descriptors). Additionally, we would give the whole memory limit to the output batch too. This misbehavior is also now fixed by tracking the maximum size of a single batch in each active partition and computing the actual maximum number of partitions to have using those sizes. Fixes: #60017. Release note: None 60604: sql: remove QueryWithCols method from the internal executor r=yuzefovich a=yuzefovich Previous commit removed this method from the interface, and this commit follows up to remove the method entirely. This is done in a similar fashion - by changing to using `QueryRowExWithCols` (when at most one row is expected) and to using the iterator API (avoiding the buffering of rows in all cases). Addresses: #48595. Release note: None Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>

60693: sql: audit all usages of QueryEx to use iterator pattern r=yuzefovich a=yuzefovich This commit audits the usage of `QueryEx` method of the internal executor in the following manner: - if the caller only needed to execute the statement, `ExecEx` is now used - if the query can return at most one row, then `QueryRowEx` is now used - if the caller can be refactored to use the iterator pattern, it is done so. As a result, almost all usages have been refactored (most notably the virtual `crdb_internal.jobs` table now uses the iterator pattern, thus aleviating OOM concerns - the ad-hoc memory accounting logic has been removed). `QueryEx` has been renamed to `QueryBufferedEx` to highlight that the full buffering occurs, and it was added to `sqlutil.InternalExecutor` interface. The method is now used only in two places. Addresses: #48595. Release justification: bug fix. Release note (bug fix): `crdb_internal.jobs` virtual table is now populated in a paginated fashion, thus, alleviating memory related concerns when previously we could encounter OOM crash. Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>

knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. A-sql-executor SQL txn logic labels May 8, 2020

This was referenced May 8, 2020

cli/zip: do not retrieve crdb_internal.jobs in v19.2 and v20.1 #48486

Closed

cli/zip: avoid collecting crdb_internal.jobs in clusters prior to v20.2 #48488

Closed

knz added backport-19.2.x backport-20.1 labels May 8, 2020

knz mentioned this issue May 8, 2020

sql,bulk: missing memory accounting in stable releases #46367

Closed

asubiotto assigned jordanlewis May 19, 2020

asubiotto assigned yuzefovich and unassigned jordanlewis Sep 30, 2020

yuzefovich mentioned this issue Oct 25, 2020

sql: some users of the InternalExecutor could benefit from results streaming / mem accounting #25302

Closed

This was referenced Jan 24, 2021

sql: make internal executor streaming #59330

Merged

sql: audit all usages of Query to use iterator pattern #59339

Merged

yuzefovich mentioned this issue Feb 10, 2021

sqlutil: remove Query and QueryEx from InternalExecutor interface #60428

Merged

yuzefovich mentioned this issue Feb 14, 2021

sqlutil: remove QueryWithCols in favor of QueryRowExWithCols #60572

Merged

yuzefovich mentioned this issue Feb 16, 2021

sql: remove QueryWithCols method from the internal executor #60604

Merged

yuzefovich mentioned this issue Feb 17, 2021

sql: audit all usages of QueryEx to use iterator pattern #60693

Merged

craig bot closed this as completed in 4ab29d4 Mar 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: missing memory accounting for internal executor (+crash w/ crdb_internal.jobs) #48595

sql: missing memory accounting for internal executor (+crash w/ crdb_internal.jobs) #48595

knz commented May 8, 2020 •

edited

Loading

knz commented May 8, 2020 •

edited

Loading

andreimatei commented May 8, 2020

knz commented May 8, 2020

knz commented May 8, 2020 •

edited

Loading

ajwerner commented May 8, 2020

knz commented May 8, 2020

spaskob commented May 8, 2020

knz commented May 9, 2020 •

edited

Loading

sql: missing memory accounting for internal executor (+crash w/ crdb_internal.jobs) #48595

sql: missing memory accounting for internal executor (+crash w/ crdb_internal.jobs) #48595

Comments

knz commented May 8, 2020 • edited Loading

knz commented May 8, 2020 • edited Loading

andreimatei commented May 8, 2020

knz commented May 8, 2020

knz commented May 8, 2020 • edited Loading

ajwerner commented May 8, 2020

knz commented May 8, 2020

spaskob commented May 8, 2020

knz commented May 9, 2020 • edited Loading

knz commented May 8, 2020 •

edited

Loading

knz commented May 8, 2020 •

edited

Loading

knz commented May 8, 2020 •

edited

Loading

knz commented May 9, 2020 •

edited

Loading