cli/zip: avoid collecting crdb_internal.jobs in clusters prior to v20.2 #48488
Conversation
The `crdb_internal.jobs` vtable materializes in RAM. When there are many rows due to frequent backup jobs, this can cause the node where the data is collected to fail with OOM errors. In v20.2, this failure mode is eliminated by preventing RAM materialization of vtables; however, this optimization is not available in prior versions.

This patch changes `zip` to detect the version of the node performing the collection and skip over `crdb_internal.jobs` when v19.x or v20.1 is detected.

It's not possible to add automated testing of this functionality in the code, because all test clusters run under the fictional version `v999.0.0`. However, I manually tested using v19.1 and v20.1 and confirmed the table is skipped.

Release note (bug fix): Running `cockroach debug zip` no longer runs the risk of growing the memory usage of the remote node to dangerous levels when there are many job entries in `system.jobs`. This bug has been present since `debug zip` was first introduced.
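For illustration, the version-gated skip described above might look roughly like the following sketch. The table list and helper names are hypothetical, not the actual `debug zip` implementation:

```go
// Minimal sketch of the skip logic; names and table list are hypothetical.
package main

import "fmt"

var tables = []string{
	"crdb_internal.gossip_nodes",
	"crdb_internal.jobs", // materialized in RAM on servers before v20.2
	"system.jobs",        // streamed, so safe to collect on any version
}

// collect dumps each table, skipping crdb_internal.jobs when the server
// predates the v20.2 vtable streaming optimization.
func collect(serverPre202 bool) {
	for _, tbl := range tables {
		if tbl == "crdb_internal.jobs" && serverPre202 {
			fmt.Printf("skipping %s on pre-v20.2 server\n", tbl)
			continue
		}
		fmt.Printf("collecting %s\n", tbl)
	}
}

func main() {
	collect(true) // e.g. the detected server version is v20.1
}
```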
Do we even support using a cli that doesn't match the server version? Seems like another mixed-version story that we don't need. LGTM though.
Yes, it's supported, in both directions even: a v20.1 CLI client with a v19.2 server, and a v19.2 client with a v20.1 server. This is not enforced by unit tests, but we fix bugs when they are encountered. Also, all the client-server RPCs are designed to ensure this works.
@dt would you be comfortable with only retrieving `system.jobs`?
Doesn't system.jobs have proto-encoded blobs in it? I'd rather not have to deal with that. I routinely check the jobs just to get an idea of what's going on.
yeah, it is, and AFAIK the cli's printing of those binary fields in 19.x is actually unparsable.
I thought we backported spas' fix to dump the field in hex. If not I will do it. |
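As a point of reference, hex-dumping a binary field so the output stays text-safe is straightforward. Here is a minimal sketch in Go, with a made-up payload standing in for a proto-encoded jobs blob:

```go
package main

import (
	"encoding/hex"
	"fmt"
)

func main() {
	// Made-up bytes standing in for a proto-encoded jobs payload.
	payload := []byte{0x0a, 0x03, 0x66, 0x6f, 0x6f}
	// Hex-encode instead of printing raw bytes, which may contain
	// control characters that make the dump unparsable.
	fmt.Println(hex.EncodeToString(payload)) // prints: 0a03666f6f
}
```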
My PR as-is makes you have to deal with it (it removes the vtable from the dump). If you want to have the vtable, I'd need to extend the PR to first query the table's size server-side, and only request it if it is known to be under some reasonable row count limit. Would that work?
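Something like the following sketch is what that extension could look like: hypothetical code against `database/sql`, not the PR's actual implementation, using a count of `system.jobs` (which streams) as a cheap proxy for the size of `crdb_internal.jobs`; the threshold is arbitrary:

```go
package zipcli

import (
	"database/sql"
	"fmt"
	// A SQL driver (e.g. lib/pq) would be registered in real code.
)

// maybeDumpJobsVTable dumps crdb_internal.jobs only when the job count
// is small enough to materialize safely. Hypothetical sketch.
func maybeDumpJobsVTable(db *sql.DB) error {
	const maxRows = 10000 // arbitrary safety threshold

	var n int
	// system.jobs is streamed server-side, so counting it is safe
	// even when the jobs table is large.
	if err := db.QueryRow("SELECT count(*) FROM system.jobs").Scan(&n); err != nil {
		return err
	}
	if n > maxRows {
		fmt.Printf("skipping crdb_internal.jobs: %d rows exceeds limit %d\n", n, maxRows)
		return nil
	}
	// Small enough to materialize safely; dump it.
	rows, err := db.Query("SELECT * FROM crdb_internal.jobs")
	if err != nil {
		return err
	}
	defer rows.Close()
	// ... write rows into the zip archive ...
	return nil
}
```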
I'm fine if the table is missing when the cli version does not match the server, on the assumption that it will be relatively rare.
That's not what this does. The table gets skipped from any cluster running v19.x or v20.1. I intend to backport this change. This means we won't get it from any cluster running those versions from now on.
I feel like this is pessimistic -- we're assuming that nodes have insufficient memory to serve this query, but this seems like a big hammer to bring in when we don't even know if we have a nail. Also, if this is OOM'ing the node, so would `SHOW JOBS`.
@jordanlewis can you confirm that the SHOW JOBS / crdb_internal.jobs memory accounting has been added and backported to 20.1, 19.2, and 19.1? If it has, then this PR is not necessary (and I can close it).
I traced the root cause to a SQL issue that hasn't been fixed yet: #48595. I think that unless we prioritize a fix for that issue, this PR (or something similar) should be merged. The stability risk is real.
Suspending work on this, as #49148 seems to be making progress.
Fixes #48486

The `crdb_internal.jobs` vtable materializes in RAM. When there are many rows due to frequent backup jobs, this can cause the node where the data is collected to fail with OOM errors.

In v20.2, this failure mode is eliminated by preventing RAM materialization of vtables. However, this optimization is not available in prior versions.

This patch changes `zip` to detect the version of the node performing the collection and skip over `crdb_internal.jobs` when v19.x or v20.1 is detected. The table `system.jobs` is still collected, as its data is properly streamed and not materialized in RAM.

It's not possible to add automated testing of this functionality in the code, because all test clusters run under the fictional version `v999.0.0`. However, I manually tested using v19.1 and v20.1 and confirmed the table is skipped.

Release note (bug fix): Running `cockroach debug zip` no longer runs the risk of growing the memory usage of the remote node to dangerous levels when there are many job entries in `system.jobs`. This bug has been present since `debug zip` was first introduced.
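As a worked example of the version cutoff described above, the comparison might be written as follows. This is a hand-rolled sketch, not CockroachDB's actual version library; it also shows why the fictional `v999.0.0` used by test clusters sails past the check, making automated testing of the skip impractical:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// materializesVTablesInRAM reports whether the given server version
// predates the v20.2 streaming optimization for vtables.
func materializesVTablesInRAM(version string) (bool, error) {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("cannot parse version %q", version)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, err
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return false, err
	}
	// Anything before v20.2 materializes vtables in RAM.
	return major < 20 || (major == 20 && minor < 2), nil
}

func main() {
	for _, v := range []string{"v19.1.5", "v20.1.1", "v20.2.0", "v999.0.0"} {
		old, _ := materializesVTablesInRAM(v)
		// v999.0.0 reports false: test clusters never exercise the skip.
		fmt.Printf("%s: skip crdb_internal.jobs = %v\n", v, old)
	}
}
```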