Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: missing memory accounting for internal executor (+crash w/ crdb_internal.jobs) #48595

Closed
knz opened this issue May 8, 2020 · 8 comments · Fixed by #59339
Closed

sql: missing memory accounting for internal executor (+crash w/ crdb_internal.jobs) #48595

knz opened this issue May 8, 2020 · 8 comments · Fixed by #59339
Assignees
Labels
A-sql-executor SQL txn logic backport-19.2.x C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.

Comments

@knz
Copy link
Contributor

knz commented May 8, 2020

(NB: this issue is both applicable to the 20.1 tree which materializes the vtables lazily, and the previous approach)

Describe the problem

In PR #45758 new code was added to account for memory of the materialized rows in crdb_internal.jobs.

However the calls to the memory monitor API is performed after all the rows have been loaded in RAM already.

If there was not enough RAM to start with, the node crashes with an OOM error before the point where the monitor API is called.

This issue affects all uses of the internal executor, not just crdb_internal.jobs.

This is why the OOM crashes upon large system jobs tables are still happening despite #45758 being merged.

Expected resolution

The proper approach is to enhance the InternalExecutor to use a SQL memory monitor when filling its bufferedCommandResult stream results to its caller, which is not done currently.

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. A-sql-executor SQL txn logic labels May 8, 2020
@knz
Copy link
Contributor Author

knz commented May 8, 2020

cc @jordanlewis for triage

cc @andreimatei: maybe you can make an architectural suggestion as to where to find a good monitor inside (r *bufferedCommandResult) AddRow()

@andreimatei
Copy link
Contributor

What I would encourage would be making the InternalExecutor stream its results back instead of buffering them (in that bufferedCommandResult). Like, stream them out on a channel, perhaps with the channel's buffering controlled by the caller, or even through actually implementing a tiny database/sql driver and then using the standard go Rows object.
I know @spaskob was looking at using a channel somehow a while ago.

@knz
Copy link
Contributor Author

knz commented May 8, 2020

Ok a channel is also good. In any case, this fix must be lightweight enough as to be backportable, so I wouldn't recommend a large architectural change.

@knz
Copy link
Contributor Author

knz commented May 8, 2020

@dt and I had a separate discussion: as explained in #46367 it would be good if the solution here was not based on memory accounting, and instead use streaming as suggested by Andrei. This ensures that we don't force more memory errors onto users when the solution to this gets backported.

@ajwerner
Copy link
Contributor

ajwerner commented May 8, 2020

Your expected resolution: memory accounting in the internal executor would be great. It just didn't seem easy at the time. Not only are the results handed back from the InternalExecutor neither bufferred nor memory-accounted, the execution of the InternalExecutor isn't memory accounted at all.

We knew that the memory accounting was bogus when we added #45758. The idea was that it was a modest improvement for two reasons. At the time we also buffered deserialized copies of the data in RAM, additionally, the hope was that if we got to that point, at least we might prevent a different, concurrent request from pushing us over the limit. It was not a great patch but the hope was that it was better than nothing.

@knz
Copy link
Contributor Author

knz commented May 8, 2020

It was not a great patch but the hope was that it was better than nothing.

I saw your comment in the code. I agree the patch was better than nothing. Unfortunately our CC SREs also discovered it was not sufficient. Hence this followup.

@spaskob
Copy link
Contributor

spaskob commented May 8, 2020

This is the current state on my work on this: #48615
It turned out to be a lot of work so I have abandoned it for quite some in favor of 20.1 release work.
Let me know if it is worth revisiting.

@knz
Copy link
Contributor Author

knz commented May 9, 2020

Yes I appreciate that it is a lot of work and the problem is somewhat complex to solve. However, the impact is high and would be a time/money saver for CC. This is definitely worth the investment. I encourage you to discuss this with other TLs/PMs to ensure it gets budgeted/scheduled appropriately.

craig bot pushed a commit that referenced this issue Feb 10, 2021
59330: sql: make internal executor streaming r=yuzefovich a=yuzefovich

This commit updates the internal executor to operate in a streaming
fashion by refactoring its internal logic to implement an iterator
pattern. A new method `QueryInternalEx` (and its counterpart
`QueryInternal`) is introduced (both not used currently) while all
existing methods of `InternalExecutor` interface are implemented
using the new iterator logic.

The communication between the iterator goroutine (the receiver) and the
connExecutor goroutine (the sender) is done via a buffered (of 32 size
in non-test setting) channel. The channel is closed when the
connExecutor goroutine exits its run() loop.

Care needs to be taken when closing the iterator - we need to make sure
to close the stmtBuf (so that there are no more commands for the
connExecutor goroutine to execute) and then we need to unblockingly
drain the channel (since the connExecutor goroutine might be blocked on
adding a row to the channel). After that we have to wait for the
connExecutor goroutine to exit so that we can finish the tracing span.
For convenience purposes, if the iterator is fully exhausted, it will
get closed automatically.

Addresses: #48595.

Release note: None

60023: kv: wait for replica initialization in closed timestamp tests r=nvanbenschoten a=nvanbenschoten

Fixes #60288.

This commit updates tests in closed_timestamp_test.go to wait for
replica initialization before returning replicas in `replsForRange`.
This was causing flakiness on master. I bisected it back to c44b357. It
appears that that change allowed for calls directly to `Replica.Send`
(testing only) to make it deeper into replica code and eventually hit a
panic in `Replica.checkSpanInRangeRLocked`.

This commit fixes this in two ways:
1. it updates the closed timestamp tests to wait for replica
   initialization before calling directly into `Replica.Send`. Tests
   shouldn't be calling `Replica.Send` on uninitialized replicas.
2. it adds extra protection in `Replica.checkExecutionCanProceed` to throw
   an error if the replica is no initialized. This isn't strictly
   necessary, but it's cheap and limits the blast radius of getting this
   wrong elsewhere.

Release note: None

60282: delegate: gate estimated_row_count on SHOW TABLES behind cluster setting r=rafiss a=otan

 Refs: #58189

Release note (sql change): Introduced a cluster setting
sql.show_tables.estimated_row_count.enabled, which defaults to true. If
false, estimated_row_count will not display on SHOW TABLES which
improves performance.

60285: geo/wkt: add support for parsing multilinestrings with Z and M dims r=otan a=andyyang890

This patch extends the capabilities of the WKT parser to include
parsing of multilinestrings with Z and M dimensions.

Refs: #53091

Release note: None

60305: sql: use kv_node_status to determine node liveness for multiregion r=arulajmani a=otan

This was a relic before the decision was made -- node liveness was
detected by gossip instead. This changes the code to use SHOW REGIONS
FROM CLUSTER, which aggregates from kv_node_status.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Co-authored-by: Andy Yang <ayang@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Feb 10, 2021
60428: sqlutil: remove Query and QueryEx from InternalExecutor interface r=yuzefovich a=yuzefovich

All usages of `sqlutil.InternalExecutor.Query` and
`sqlutil.InternalExecutor.QueryEx` have been refactored: most of
the changes are taking advantage of the iterator API, in a couple
of places we now use `Exec` and `ExecEx` methods since in those
places we only need the number of affected rows.

In a few places where we still need to buffer all rows, the caller
currently performs the buffering. In the follow-up commit
`QueryBuffered` and `QueryBufferedEx` will be added into the interface
(essentially renaming `Query` and `QueryEx` to make it explicit that the
buffering occurs), and the custom logic will be removed.

Addresses: #48595.

Release note: None

60443: logictest: deflake crdb_internal.gossip_liveness test r=otan a=rafiss

The regex needs to also match `0,0`

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Feb 16, 2021
60572: sqlutil: remove QueryWithCols in favor of QueryRowExWithCols r=yuzefovich a=yuzefovich

This commit refactors all usages of
`sqlutil.InternalExecutor.QueryWithCols` method in favor of newly added
`QueryRowExWithCols` since in almost all places where the former was
used, we expected exactly one row. The only noticeable change is the
refactor of `jobScheduler.executeSchedules` where we now use the
iterator pattern. The difference there is that now it is possible to
execute some schedules before encountering an error on the iterator
(previously, we would buffer up all rows first), but this change is
acceptable because each schedule is executed under savepoint to guard
against errors in schedule planning/execution.

Addresses: #48595.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Feb 17, 2021
60593: colexec: make external sorter respect memory limit better r=yuzefovich a=yuzefovich

**colexec: register memory used by dequeued batches from partitioned queue**

Previously, we forgot to perform the memory accounting of the batches
that are dequeued from the partitions in the external sort (which could
be substantial when we're merging multiple partitions at once and the
tuples are wide) and in the hash based partitioner. This is now fixed.

Additionally, this commit retains references to some internal operators
in the external sort in order to reuse the memory under the dequeued
batches (this will be beneficial if we perform repeated merging).

Also, this commit fixes an issue with repeated re-initializing of the
disk-backed operators in the disk spiller if the latter has been reset
(the problem would lead to redundant allocations and not reusing of the
available memory).

Slight complication with accounting was because of the fact that we were
using the same allocator for all usages. This would be quite wrong
because in the merge phase we have two distinct memory usage with
different lifecycles - the memory under the dequeued batches is kept
(and reused later) whereas the memory under the output batch of the
ordered synchronizer is released. We now correctly handle these
lifecycles by using separate allocators.

Release note (bug fix): CockroachDB previously didn't account for some
RAM used when disk-spilling operations (like sorts and hash joins) were
using the temporary storage in the vectorized execution engine. This
could result in OOM crashes, especially when the rows are large in size.

**colexec: make external sorter respect memory limit better**

This commit improves in how the external sorter manages its available
RAM. There are two different main usages that overlap because we are
keeping the references to both at all times:
1. during the spilling/sorting phase, we use a single in-memory sorter
2. during the merging phase, we use the ordered synchronizer that reads
one batch from each of the partitions and also allocates an output
batch.

Previously, we would give the whole memory limit to the in-memory sorter
in 1. which resulted in the external sorter using at least 2x of its
memory limit. This is now fixed by giving only a half to the in-memory
sorter.

The handling of 2. was even worse - we didn't have any logic that would
limit the number of active partitions based on the memory footprint. If
the batches are large (say 1GB in size), during the merge phase we would
be using on the order of 16GB of RAM (number 16 would be determined
based on the number of file descriptors). Additionally, we would give
the whole memory limit to the output batch too.

This misbehavior is also now fixed by tracking the maximum size of
a single batch in each active partition and computing the actual maximum
number of partitions to have using those sizes.

Fixes: #60017.

Release note: None

60604: sql: remove QueryWithCols method from the internal executor r=yuzefovich a=yuzefovich

Previous commit removed this method from the interface, and this commit
follows up to remove the method entirely. This is done in a similar
fashion - by changing to using `QueryRowExWithCols` (when at most one
row is expected) and to using the iterator API (avoiding the buffering
of rows in all cases).

Addresses: #48595.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Feb 26, 2021
60693: sql: audit all usages of QueryEx to use iterator pattern r=yuzefovich a=yuzefovich

This commit audits the usage of `QueryEx` method of the internal
executor in the following manner:
- if the caller only needed to execute the statement, `ExecEx` is now
used
- if the query can return at most one row, then `QueryRowEx` is now used
- if the caller can be refactored to use the iterator pattern, it is
done so.

As a result, almost all usages have been refactored (most notably the
virtual `crdb_internal.jobs` table now uses the iterator pattern, thus
aleviating OOM concerns - the ad-hoc memory accounting logic has been
removed).

`QueryEx` has been renamed to `QueryBufferedEx` to highlight that the
full buffering occurs, and it was added to `sqlutil.InternalExecutor`
interface. The method is now used only in two places.

Addresses: #48595.

Release justification: bug fix.

Release note (bug fix): `crdb_internal.jobs` virtual table is now
populated in a paginated fashion, thus, alleviating memory related
concerns when previously we could encounter OOM crash.

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@craig craig bot closed this as completed in 4ab29d4 Mar 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-executor SQL txn logic backport-19.2.x C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants