sql: missing memory accounting for internal executor (+crash w/ crdb_internal.jobs) #48595
Comments
cc @jordanlewis for triage. cc @andreimatei: maybe you can make an architectural suggestion as to where to find a good monitor inside …
What I would encourage would be making the …
OK, a channel is also good. In any case, this fix must be lightweight enough to be backportable, so I wouldn't recommend a large architectural change.
Your expected resolution (memory accounting in the internal executor) would be great. It just didn't seem easy at the time. Not only are the results handed back from the … We knew that the memory accounting was bogus when we added #45758. The idea was that it was a modest improvement, for two reasons: at the time we also buffered deserialized copies of the data in RAM, and additionally, the hope was that if we got to that point, at least we might prevent a different, concurrent request from pushing us over the limit. It was not a great patch, but the hope was that it was better than nothing.
I saw your comment in the code. I agree the patch was better than nothing. Unfortunately, our CC SREs also discovered it was not sufficient; hence this follow-up.
This is the current state of my work on this: #48615
Yes, I appreciate that it is a lot of work and that the problem is somewhat complex to solve. However, the impact is high, and a fix would be a time/money saver for CC. This is definitely worth the investment. I encourage you to discuss this with other TLs/PMs to ensure it gets budgeted/scheduled appropriately.
59330: sql: make internal executor streaming r=yuzefovich a=yuzefovich

This commit updates the internal executor to operate in a streaming fashion by refactoring its internal logic to implement an iterator pattern. A new method `QueryInternalEx` (and its counterpart `QueryInternal`) is introduced (both not used currently), while all existing methods of the `InternalExecutor` interface are implemented using the new iterator logic. The communication between the iterator goroutine (the receiver) and the connExecutor goroutine (the sender) is done via a buffered channel (of size 32 in non-test settings). The channel is closed when the connExecutor goroutine exits its run() loop. Care needs to be taken when closing the iterator: we need to make sure to close the stmtBuf (so that there are no more commands for the connExecutor goroutine to execute), and then we need to drain the channel without blocking (since the connExecutor goroutine might be blocked on adding a row to the channel). After that we have to wait for the connExecutor goroutine to exit so that we can finish the tracing span. For convenience, if the iterator is fully exhausted, it will get closed automatically. Addresses: #48595. Release note: None

60023: kv: wait for replica initialization in closed timestamp tests r=nvanbenschoten a=nvanbenschoten

Fixes #60288. This commit updates tests in closed_timestamp_test.go to wait for replica initialization before returning replicas in `replsForRange`. This was causing flakiness on master. I bisected it back to c44b357. It appears that that change allowed calls directly to `Replica.Send` (testing only) to make it deeper into replica code and eventually hit a panic in `Replica.checkSpanInRangeRLocked`. This commit fixes this in two ways:
1. it updates the closed timestamp tests to wait for replica initialization before calling directly into `Replica.Send`. Tests shouldn't be calling `Replica.Send` on uninitialized replicas.
2. it adds extra protection in `Replica.checkExecutionCanProceed` to throw an error if the replica is not initialized. This isn't strictly necessary, but it's cheap and limits the blast radius of getting this wrong elsewhere.
Release note: None

60282: delegate: gate estimated_row_count on SHOW TABLES behind cluster setting r=rafiss a=otan

Refs: #58189. Release note (sql change): Introduced a cluster setting sql.show_tables.estimated_row_count.enabled, which defaults to true. If false, estimated_row_count will not display on SHOW TABLES, which improves performance.

60285: geo/wkt: add support for parsing multilinestrings with Z and M dims r=otan a=andyyang890

This patch extends the capabilities of the WKT parser to include parsing of multilinestrings with Z and M dimensions. Refs: #53091. Release note: None

60305: sql: use kv_node_status to determine node liveness for multiregion r=arulajmani a=otan

This was a relic from before the decision was made: node liveness was previously detected by gossip instead. This changes the code to use SHOW REGIONS FROM CLUSTER, which aggregates from kv_node_status. Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Oliver Tan <otan@cockroachlabs.com> Co-authored-by: Andy Yang <ayang@cockroachlabs.com>
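The channel-based iterator described in the 59330 summary can be sketched roughly as below. This is a simplified, hypothetical model, not the actual CockroachDB code: `rowIterator` stands in for the real iterator, an `int` per row stands in for a row of data, and the producer goroutine stands in for the connExecutor. The key point it illustrates is the close-and-drain protocol: on `Close`, the consumer signals the producer to stop, drains the channel so the producer is never left blocked on a send, and then waits for the producer to exit.

```go
package main

import (
	"fmt"
	"sync"
)

// rowIterator is a hypothetical sketch of the iterator pattern: a producer
// goroutine sends rows over a buffered channel, and Close drains the channel
// so the producer is never left blocked on a send.
type rowIterator struct {
	rows chan int
	done sync.WaitGroup
	stop chan struct{}
}

func newRowIterator(n int) *rowIterator {
	it := &rowIterator{
		rows: make(chan int, 32), // buffered, as in the PR description
		stop: make(chan struct{}),
	}
	it.done.Add(1)
	go func() {
		defer it.done.Done()
		defer close(it.rows) // closed when the producer exits its run loop
		for i := 0; i < n; i++ {
			select {
			case it.rows <- i:
			case <-it.stop:
				return
			}
		}
	}()
	return it
}

// Next returns the next row, and false once the iterator is exhausted.
func (it *rowIterator) Next() (int, bool) {
	row, ok := <-it.rows
	return row, ok
}

// Close signals the producer to stop, drains any rows it may still be
// sending, and waits for it to exit (e.g. to finish a tracing span).
func (it *rowIterator) Close() {
	close(it.stop)
	for range it.rows { // drain until the producer closes the channel
	}
	it.done.Wait()
}

func main() {
	it := newRowIterator(100)
	sum := 0
	for i := 0; i < 3; i++ {
		row, ok := it.Next()
		if !ok {
			break
		}
		sum += row
	}
	it.Close() // safe to close early; the producer goroutine won't leak
	fmt.Println(sum) // prints 3 (rows 0, 1, 2)
}
```

The drain step matters because the producer may be blocked mid-send when `Close` is called; receiving until the channel is closed guarantees it can always make progress and exit.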
60428: sqlutil: remove Query and QueryEx from InternalExecutor interface r=yuzefovich a=yuzefovich

All usages of `sqlutil.InternalExecutor.Query` and `sqlutil.InternalExecutor.QueryEx` have been refactored: most of the changes take advantage of the iterator API; in a couple of places we now use the `Exec` and `ExecEx` methods, since in those places we only need the number of affected rows. In a few places where we still need to buffer all rows, the caller currently performs the buffering. In the follow-up commit, `QueryBuffered` and `QueryBufferedEx` will be added to the interface (essentially renaming `Query` and `QueryEx` to make it explicit that buffering occurs), and the custom logic will be removed. Addresses: #48595. Release note: None

60443: logictest: deflake crdb_internal.gossip_liveness test r=otan a=rafiss

The regex needs to also match `0,0`. Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
60572: sqlutil: remove QueryWithCols in favor of QueryRowExWithCols r=yuzefovich a=yuzefovich

This commit refactors all usages of the `sqlutil.InternalExecutor.QueryWithCols` method in favor of the newly added `QueryRowExWithCols`, since in almost all places where the former was used, we expected exactly one row. The only noticeable change is the refactor of `jobScheduler.executeSchedules`, where we now use the iterator pattern. The difference there is that it is now possible to execute some schedules before encountering an error on the iterator (previously, we would buffer up all rows first), but this change is acceptable because each schedule is executed under a savepoint to guard against errors in schedule planning/execution. Addresses: #48595. Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
60593: colexec: make external sorter respect memory limit better r=yuzefovich a=yuzefovich

**colexec: register memory used by dequeued batches from partitioned queue**

Previously, we forgot to perform memory accounting for the batches that are dequeued from the partitions in the external sort (which could be substantial when we're merging multiple partitions at once and the tuples are wide) and in the hash-based partitioner. This is now fixed. Additionally, this commit retains references to some internal operators in the external sort in order to reuse the memory under the dequeued batches (this will be beneficial if we perform repeated merging). Also, this commit fixes an issue with repeated re-initialization of the disk-backed operators in the disk spiller when the latter has been reset (the problem would lead to redundant allocations and failure to reuse the available memory).

A slight complication with the accounting arose because we were using the same allocator for all usages. This would be quite wrong, because in the merge phase we have two distinct memory usages with different lifecycles: the memory under the dequeued batches is kept (and reused later), whereas the memory under the output batch of the ordered synchronizer is released. We now correctly handle these lifecycles by using separate allocators.

Release note (bug fix): CockroachDB previously didn't account for some RAM used when disk-spilling operations (like sorts and hash joins) were using the temporary storage in the vectorized execution engine. This could result in OOM crashes, especially when the rows are large.

**colexec: make external sorter respect memory limit better**

This commit improves how the external sorter manages its available RAM. There are two main usages that overlap because we keep references to both at all times:
1. during the spilling/sorting phase, we use a single in-memory sorter;
2. during the merging phase, we use the ordered synchronizer, which reads one batch from each of the partitions and also allocates an output batch.

Previously, we would give the whole memory limit to the in-memory sorter in 1., which resulted in the external sorter using at least 2x its memory limit. This is now fixed by giving only half to the in-memory sorter. The handling of 2. was even worse: we didn't have any logic that would limit the number of active partitions based on their memory footprint. If the batches are large (say 1GB in size), during the merge phase we would use on the order of 16GB of RAM (the number 16 being determined by the number of file descriptors). Additionally, we would give the whole memory limit to the output batch too. This misbehavior is also now fixed, by tracking the maximum size of a single batch in each active partition and computing the actual maximum number of partitions from those sizes. Fixes: #60017. Release note: None

60604: sql: remove QueryWithCols method from the internal executor r=yuzefovich a=yuzefovich

The previous commit removed this method from the interface, and this commit follows up to remove the method entirely. This is done in a similar fashion: by changing to `QueryRowExWithCols` (when at most one row is expected) and to the iterator API (avoiding the buffering of rows in all cases). Addresses: #48595. Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
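The partition-limiting idea from the second commit above can be sketched as follows. `maxPartitionsToMerge` is a hypothetical helper, not CockroachDB's actual code; it illustrates capping the number of partitions merged at once so that one maximum-sized batch per partition, plus the output batch, fits within the memory limit.

```go
package main

import "fmt"

// maxPartitionsToMerge is an illustrative sketch: given the maximum batch
// size observed in each active partition, it returns how many partitions
// can be merged at once such that one batch per partition plus an output
// batch fits within memLimit.
func maxPartitionsToMerge(maxBatchSizes []int64, outputBatchSize, memLimit int64) int {
	budget := memLimit - outputBatchSize // reserve room for the output batch
	n := 0
	var used int64
	for _, size := range maxBatchSizes {
		if used+size > budget {
			break
		}
		used += size
		n++
	}
	if n < 2 && len(maxBatchSizes) >= 2 {
		// A merge needs at least two inputs; in this degenerate case a real
		// implementation would have to proceed anyway (possibly exceeding
		// the limit) or spill further.
		n = 2
	}
	return n
}

func main() {
	// Per-partition maximum batch sizes: 1MiB, 1MiB, 4MiB, 8MiB,
	// with a 1MiB output batch and an 8MiB limit.
	sizes := []int64{1 << 20, 1 << 20, 4 << 20, 8 << 20}
	fmt.Println(maxPartitionsToMerge(sizes, 1<<20, 8<<20)) // prints 3
}
```

Without such a cap, the merge phase's footprint scales with the file-descriptor-derived fan-in (the "16 partitions x 1GB batches" scenario described above) rather than with the configured memory limit.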
60693: sql: audit all usages of QueryEx to use iterator pattern r=yuzefovich a=yuzefovich

This commit audits the usage of the `QueryEx` method of the internal executor in the following manner:
- if the caller only needed to execute the statement, `ExecEx` is now used;
- if the query can return at most one row, `QueryRowEx` is now used;
- if the caller can be refactored to use the iterator pattern, that is done.

As a result, almost all usages have been refactored (most notably, the virtual `crdb_internal.jobs` table now uses the iterator pattern, thus alleviating OOM concerns; the ad-hoc memory accounting logic has been removed). `QueryEx` has been renamed to `QueryBufferedEx` to highlight that full buffering occurs, and it was added to the `sqlutil.InternalExecutor` interface. The method is now used in only two places. Addresses: #48595. Release justification: bug fix. Release note (bug fix): The `crdb_internal.jobs` virtual table is now populated in a paginated fashion, alleviating the memory-related concerns where previously we could encounter an OOM crash.

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
(NB: this issue applies both to the 20.1 tree, which materializes the vtables lazily, and to the previous approach.)
Describe the problem
In PR #45758 new code was added to account for memory of the materialized rows in crdb_internal.jobs.
However, the calls to the memory monitor API are performed only after all the rows have already been loaded into RAM.
If there was not enough RAM to start with, the node crashes with an OOM error before the point where the monitor API is called.
This issue affects all uses of the internal executor, not just crdb_internal.jobs.
This is why OOM crashes on large system jobs tables are still happening despite #45758 being merged.
Expected resolution
The proper approach is to enhance the InternalExecutor to use a SQL memory monitor when filling its bufferedCommandResult / streaming results to its caller, which is not done currently.
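A minimal sketch of that resolution, with hypothetical names (`memAccount`, `bufferRows`) standing in for CockroachDB's actual memory monitor API: the account is grown *before* each row is retained, so an oversized result fails the query cleanly instead of crashing the node with an OOM after the fact.

```go
package main

import (
	"errors"
	"fmt"
)

// memAccount is a hypothetical stand-in for a SQL memory monitor account.
type memAccount struct {
	used, limit int64
}

// Grow reserves n bytes, failing if the reservation would exceed the limit.
func (a *memAccount) Grow(n int64) error {
	if a.used+n > a.limit {
		return errors.New("memory budget exceeded")
	}
	a.used += n
	return nil
}

type row []byte

// bufferRows accounts for each row before retaining it, rather than
// accounting only after everything has already been loaded into RAM.
func bufferRows(acct *memAccount, src []row) ([]row, error) {
	var buf []row
	for _, r := range src {
		if err := acct.Grow(int64(len(r))); err != nil {
			return nil, err // fail the query instead of OOMing the node
		}
		buf = append(buf, r)
	}
	return buf, nil
}

func main() {
	acct := &memAccount{limit: 10}
	_, err := bufferRows(acct, []row{make(row, 4), make(row, 4), make(row, 4)})
	fmt.Println(err) // the third row would push usage to 12 > 10
}
```

The contrast with the pre-#48595 behavior is the ordering: accounting per row during buffering bounds peak usage, whereas accounting after materialization (as in #45758) can only detect the overshoot once the memory has already been allocated.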