jobs, observability: make SHOW JOBS non-blocking #76690

Closed
shermanCRL opened this issue Feb 16, 2022 · 7 comments
Labels
A-jobs C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-jobs

Comments

@shermanCRL
Contributor

shermanCRL commented Feb 16, 2022

Describe the problem

The SHOW JOBS query doesn’t return if a lock is held on any row in the jobs table. This is perceived as the jobs system itself struggling -- which may or may not be correct.

In any case, the SHOW JOBS query hanging is a common complaint from users, and is often escalated as a concern for the overall jobs system. It is expensive in terms of support.

To Reproduce

This typically happens when a long-running transaction is held open, such as backup planning. SHOW JOBS is a table scan, so any open transaction that holds a lock on a row in the jobs table will block it.

Expected behavior

This issue is narrowly about finding ways to prevent the SHOW JOBS query from hanging, which is an observability problem.

Some ideas:

  • SHOW JOBS could use AS OF SYSTEM TIME (AOST) with an offset of (say) 5 or 10 seconds, to reduce the likelihood of hitting a locked record/open transaction
  • A “materialized view”, or cached version of SHOW JOBS, which never blocks. The trade-off would be some tolerance of staleness.
  • Encourage use of SHOW CHANGEFEED JOBS and create similar sugar (like SHOW BACKUP JOBS). These filters would at least reduce the chance of hitting a locked row, and would solve, say, backups interfering with observing changefeeds.
    • Even better, I’d like to see SHOW CHANGEFEEDS (drop the JOBS keyword)
  • Implement SKIP LOCKED (sql: support FOR {UPDATE,SHARE} {SKIP LOCKED,NOWAIT} #40476)
  • ...and use it in jobs (jobs: job adoption can block on intents #62734)
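
As a rough illustration of the cached-view idea in the list above, here is a minimal Python sketch of a snapshot cache that serves the last successful result together with its age. Everything in it is an assumption made for illustration: `JobsSnapshotCache`, `fetch_jobs`, and the row shape are hypothetical names, not CockroachDB APIs, and `fetch_jobs` stands in for whatever query would actually read the jobs table.

```python
import threading
import time
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]  # hypothetical row shape, e.g. {"job_id": 1, "status": "running"}


class JobsSnapshotCache:
    """Hypothetical stale-tolerant cache for a jobs listing.

    refresh() may block for as long as the underlying query blocks;
    read() only ever returns the last successful snapshot and its age.
    """

    def __init__(
        self,
        fetch_jobs: Callable[[], List[Row]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_jobs = fetch_jobs  # assumed to run the (possibly blocking) query
        self._clock = clock
        self._lock = threading.Lock()
        self._snapshot: List[Row] = []
        self._fetched_at: Optional[float] = None

    def refresh(self) -> None:
        # Run the slow query outside the lock so readers are never held up by it.
        rows = list(self._fetch_jobs())
        now = self._clock()
        with self._lock:
            self._snapshot = rows
            self._fetched_at = now

    def read(self) -> Tuple[List[Row], Optional[float]]:
        # Returns (rows, age_in_seconds); age is None until the first refresh succeeds.
        with self._lock:
            if self._fetched_at is None:
                return [], None
            return list(self._snapshot), self._clock() - self._fetched_at
```

In this sketch a background worker would call `refresh()` on a timer, and the display path would call `read()` and show the age next to the results, so that staleness is visible rather than hidden.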

Environment:

  • CockroachDB version v21.2
@ajwerner
Contributor

Another, somewhat controversial proposal would be to expose weaker isolation.

One thing that affects SHOW JOB rather than SHOW JOBS is that we don't have any virtual indexes, so rendering a single job page will encounter contention on all jobs pages.

Another note is that what pisses people off is stuff not loading. One inbuilt thing we have to push locks and permit reads is the pusher that sits under a rangefeed. We've long talked about leveraging rangefeeds with jobs for various things. If we did that, then magically we'd see writing transactions get pushed. That sort of catch-all process seems appealing to me in lieu of some other way to enforce that we don't run long-running transactions against the jobs table.

This all then comes back to other issues where the jobs data storage encourages bad transactions. If, from an API perspective, we didn't really allow for any combination of updating a job with other work, we'd not see these long-running transactions. Eliminating all the existing use cases is, of course, work.

@ajwerner
Contributor

I don't have a clear vision on how SKIP LOCKED helps here. Certainly the SHOW JOBS query doesn't want to SKIP LOCKED. What are the queries we'd use that for?

@shermanCRL
Contributor Author

shermanCRL commented Feb 16, 2022

> I don't have a clear vision on how SKIP LOCKED helps here. Certainly the SHOW JOBS query doesn't want to SKIP LOCKED. What are the queries we'd use that for?

Ah, I was under the impression this would be helpful, but you’re saying it’s not desirable semantics for SHOW JOBS. Good point.

@ajwerner
Contributor

I would love to see an analysis of what long-running transactions we know to exist. Yes, there are systemic solutions here, and they may be valuable, but the root cause to some extent is the individual transactions. I feel like one possible source of problems might be transactions which work hard to avoid writing to the jobs table until the end (which seems like a good strategy), but then experience restarts and hold the lock across those restarts. If that's an issue, maybe we can add some functionality to invalidate certain locks upon restart. The rows in the jobs table shouldn't be at risk of getting overwritten after a restart.

@shermanCRL
Contributor Author

@ajwerner When a primary index row is locked, does it imply that the corresponding secondary index row would be locked? I ask this because I wonder if querying with an index hint would be helpful -- the index would act as a materialized view.

@ajwerner
Contributor

Generally, yes, it does. Put differently, if the secondary index contains any rows in any column families which are locked, the secondary index will be locked.

@shermanCRL
Contributor Author

shermanCRL commented Feb 23, 2022

I don’t think any of these ideas (strictly about the observability/SELECT side) has legs, so closing this issue. We’ll continue on other efforts, focused on the core issue of contention: #73133
