[batch] Compact And Drop Records from `job_group_inst_coll_cancellable_resources` #14645

ehigham · 2024-07-30T22:39:09Z

Records in the job_group_inst_coll_cancellable_resources table are dead once a
job group completes. We already compact records when a job group is cancelled.
We are yet to do this for finished job groups. See the linked issue for a more
detailed motivation.

This change adds two background tasks:

1. finds uncompacted groups of records for finished job groups and
   compacts them by summing across the token field.
2. finds compacted records for finished job groups and deletes them if
   all associated resources are 0.

The results of both tasks converge to a fixed point where the only remaining
records are for those job groups that are unfinished, cancelled or have
resources outstanding.
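
To make the fixed point concrete, here is a minimal sketch of the two tasks. It is not the code in this PR: it uses SQLite and a plain `GROUP BY` instead of MySQL and lateral joins, it ignores cancellation entirely, and the table `cancellable_resources`, the column `n_jobs` and the function names are hypothetical stand-ins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_groups (
  id INTEGER PRIMARY KEY,
  time_completed INTEGER              -- NULL while the job group is unfinished
);
-- Hypothetical, simplified stand-in for job_group_inst_coll_cancellable_resources.
CREATE TABLE cancellable_resources (
  job_group_id INTEGER NOT NULL,
  inst_coll TEXT NOT NULL,
  token INTEGER NOT NULL,
  n_jobs INTEGER NOT NULL,            -- per-token delta in this sketch
  PRIMARY KEY (job_group_id, inst_coll, token)
);
INSERT INTO job_groups VALUES (1, 100), (2, 100), (3, NULL);
INSERT INTO cancellable_resources VALUES
  (1, 'pool', 0, 3), (1, 'pool', 1, -1), (1, 'pool', 2, -2),  -- finished, sums to 0
  (2, 'pool', 0, 2), (2, 'pool', 1, 1),                       -- finished, sums to 3
  (3, 'pool', 0, 1), (3, 'pool', 1, 1);                       -- unfinished
""")


def compact_finished(conn, limit=1000):
    """Task 1: collapse the token rows of finished job groups into one row."""
    groups = conn.execute("""
        SELECT R.job_group_id, R.inst_coll, SUM(R.n_jobs)
        FROM cancellable_resources AS R
        JOIN job_groups AS G ON G.id = R.job_group_id
        WHERE G.time_completed IS NOT NULL
        GROUP BY R.job_group_id, R.inst_coll
        HAVING COUNT(*) > 1
        LIMIT ?
    """, (limit,)).fetchall()
    with conn:  # one transaction for the whole batch
        for job_group_id, inst_coll, total in groups:
            conn.execute(
                "DELETE FROM cancellable_resources"
                " WHERE job_group_id = ? AND inst_coll = ?",
                (job_group_id, inst_coll))
            conn.execute(
                "INSERT INTO cancellable_resources VALUES (?, ?, 0, ?)",
                (job_group_id, inst_coll, total))
    return len(groups)


def delete_drained(conn, limit=1000):
    """Task 2: delete compacted rows of finished job groups that sum to 0."""
    with conn:
        cur = conn.execute("""
            DELETE FROM cancellable_resources
            WHERE rowid IN (
                SELECT R.rowid
                FROM cancellable_resources AS R
                JOIN job_groups AS G ON G.id = R.job_group_id
                WHERE G.time_completed IS NOT NULL
                  AND R.n_jobs = 0
                  AND NOT EXISTS (      -- only rows that are already compacted
                      SELECT 1 FROM cancellable_resources AS O
                      WHERE O.job_group_id = R.job_group_id
                        AND O.inst_coll = R.inst_coll
                        AND O.token <> R.token)
                LIMIT ?
            )
        """, (limit,))
    return cur.rowcount


# Run both tasks until neither changes anything.
while compact_finished(conn) + delete_drained(conn) > 0:
    pass

remaining = conn.execute(
    "SELECT * FROM cancellable_resources ORDER BY job_group_id, token"
).fetchall()
print(remaining)
```

In this toy data set, job group 1 disappears, job group 2 is left with a single compacted row because it still has resources outstanding, and the unfinished job group 3 is untouched.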

I've taken care to optimise the underlying SQL queries as best as I can. Both
make heavy use of lateral joins to avoid exploding the intermediate results;
the natural implementation of each is prohibitively expensive.

I've tested these tasks in a dev deploy where I created a number of batches and
observed that records from this table have indeed been compacted and destroyed
on completion. It's not immediately obvious to me how to automate testing for
these. AFAICT, we lack any automated integration testing for these background
tasks.

Resolves: #14623

ehigham · 2024-07-31T18:38:32Z

batch/batch/driver/main.py

+) AS R ON TRUE
+WHERE G.time_completed IS NOT NULL
+  AND C.id IS NULL
+LIMIT 1000;


Calling out the limit on both queries. It seems other queries also limit to 1000, but I'm not sure where this number comes from. Without compacting, the query to find compacted rows takes forever as it scans through a large chunk of the db. On the other hand, there are millions of rows, so reducing this number would make the background task take longer to churn through records. Suggestions?

FWIW, my 2-week-old prod snapshot has 173561655 rows in job_group_inst_coll_cancellable_resources and 8567769 job_groups. Assuming (incorrectly) instant execution, it'll take 100 days to churn through the db.
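
For anyone who wants to play with the trade-off, a back-of-the-envelope sketch follows. The comment above does not state how often the background task runs, so `interval_seconds` is an assumption, as is treating each row returned under the `LIMIT` as one record processed; `drain_time_days` is a hypothetical helper, not something in the codebase.

```python
import math


def drain_time_days(n_records, batch_size, interval_seconds):
    """Days needed to process n_records at batch_size records per run,
    assuming each run completes instantly and runs start every
    interval_seconds."""
    iterations = math.ceil(n_records / batch_size)
    return iterations * interval_seconds / 86_400


ROWS = 173_561_655     # rows in the prod snapshot quoted above
ASSUMED_INTERVAL = 60  # seconds between runs; an assumption, not from the PR

for limit in (1_000, 10_000, 100_000):
    days = drain_time_days(ROWS, limit, ASSUMED_INTERVAL)
    print(f"LIMIT {limit:>7}: {days:8.1f} days")
```

The estimate scales linearly in both directions: a 10x larger limit or a 10x shorter interval cuts the drain time by 10x, provided the query itself stays fast at that size.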

chrisvittal

Ok. I'm convinced that the SQL works as advertised. I'd love to see an easy query to just delete all the unnecessary records from job_group_inst_coll_cancellable_resources, but I certainly haven't spent enough time understanding the stored procedures and triggers that cause this table to be updated and what invariants hold for it.

Guess that's a future project that will only be relevant if this table grows faster than we can delete records from it.

ehigham · 2024-08-05T19:11:11Z

Updated the queries to return job groups that do not have an ancestor or self job group that has been cancelled. This logic now mirrors that of delete_prev_cancelled_job_group_cancellable_resources_records, only in anti-join form.
The previous query returned those job groups that do not have a cancellation record for themselves or a descendant job group.
SQL is hard.
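
A small sketch of the difference between the two anti-joins, in case it helps reviewers. It runs against SQLite with hypothetical, simplified tables (`self_and_ancestors` and `cancelled` are stand-ins, and the real tables carry more key columns), so treat it as an illustration of the `LEFT JOIN ... WHERE C.id IS NULL` pattern rather than the query in this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_groups (id INTEGER PRIMARY KEY, time_completed INTEGER);
-- One row per (job group, itself or one of its ancestors).
CREATE TABLE self_and_ancestors (job_group_id INTEGER, ancestor_id INTEGER);
CREATE TABLE cancelled (id INTEGER PRIMARY KEY);

-- Two trees: 1 -> 2 -> 3 and 4 -> 5. All finished; only 2 was cancelled.
INSERT INTO job_groups VALUES (1, 100), (2, 100), (3, 100), (4, 100), (5, 100);
INSERT INTO self_and_ancestors VALUES
  (1, 1), (2, 2), (2, 1), (3, 3), (3, 2), (3, 1), (4, 4), (5, 5), (5, 4);
INSERT INTO cancelled VALUES (2);
""")

# Anti-join template: keep finished job groups with no matching
# cancellation record. {key} is the side matched against G.id and
# {other} is the side checked for cancellation.
ANTI_JOIN = """
SELECT G.id
FROM job_groups AS G
LEFT JOIN (
    SELECT A.{key} AS job_group_id, X.id AS id
    FROM self_and_ancestors AS A
    JOIN cancelled AS X ON X.id = A.{other}
) AS C ON C.job_group_id = G.id
WHERE G.time_completed IS NOT NULL
  AND C.id IS NULL
ORDER BY G.id
"""


def finished_not_cancelled(key, other):
    rows = conn.execute(ANTI_JOIN.format(key=key, other=other))
    return [job_group_id for (job_group_id,) in rows]


# Updated logic: exclude a job group if it or any ANCESTOR is cancelled.
updated = finished_not_cancelled(key="job_group_id", other="ancestor_id")

# Previous logic: exclude a job group if it or any DESCENDANT is cancelled.
previous = finished_not_cancelled(key="ancestor_id", other="job_group_id")

print("updated: ", updated)
print("previous:", previous)
```

With job group 2 cancelled, the updated form returns 1, 4 and 5, while the previous form returns 3, 4 and 5: it wrongly kept 3, whose ancestor was cancelled, and wrongly dropped 1, which merely has a cancelled descendant.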

chrisvittal

newest SQL update LGTM

ehigham · 2024-08-09T20:43:03Z

@daniel-goldstein @jigold I believe I've implemented this faithfully to the issue but I'm not confident about any fallout if I've got something wrong. Would you mind taking a look (sorry to drag you into this)?

I've grepped through the codebase as @daniel-goldstein suggested and AFAICT, these records are unused after a job group finishes.
Why do we need them after they've finished? When would a job group terminate with non-zero resources?
Thanks so much for your help!

daniel-goldstein · 2024-08-09T21:23:44Z

> Why do we need them after they've finished? When would a job group terminate with non-zero resources?

I can take a look at this on Monday probably but AFAIK we don't and it wouldn't, hence the copious amounts of garbage.

Fixes #14660 by using the graphQL API to query github directly. Replaces our current parallel interpretation of reviews into a review decision, which is brittle if we ever change review requirements in github again.

Tested by manually updating the live CI to use the test batch generated image. Results:

- Review decisions correctly fetched from github, not based on CI's parallel interpretation of individual reviews: ![image](https://github.com/user-attachments/assets/67c03aa9-000a-44e7-91aa-3a42d04238dc)
- No merge candidate was being incorrectly nominated (in particular, #14645 is now considered pending, rather than approved, which is what we are currently, incorrectly, calculating)

ehigham added 12 commits July 30, 2024 18:35

[batch] Compact And Drop Records from `job_group_inst_coll_cancellabl…

fd1b7eb

…e_resources` Resolves: hail-is#14623

dont need to ORDER BY anymore

31d74b8

delete unused records

b89f5be

hack less

494a9a5

fix nargin error

95e8295

dont pass generator to db methods

0d10677

join lateral wins again

634c10b

formatting

9cdc9d7

sql syntax error

21e2421

sql syntax error

7a3a33c

sql syntax error

e012bb0

sql syntax error

8b40080

ehigham marked this pull request as ready for review July 31, 2024 18:07

ehigham requested review from patrick-schultz and chrisvittal July 31, 2024 18:34

ehigham assigned patrick-schultz and chrisvittal Jul 31, 2024

ehigham commented Jul 31, 2024

remove unapplicable feature flag

af92cd6

ehigham force-pushed the ehigham/14623-compact-job_group_inst_coll_cancellable_resources branch from 339c9fe to af92cd6 Compare July 31, 2024 19:09

remove unused arg

5ed4355

chrisvittal approved these changes Aug 5, 2024

ehigham added the WIP label Aug 5, 2024

query for non-cancelled job groups

7d576f8

ehigham removed the WIP label Aug 5, 2024

ehigham requested a review from chrisvittal August 5, 2024 19:07

chrisvittal approved these changes Aug 5, 2024

cjllanwarne mentioned this pull request Aug 21, 2024

[CI] Use github's graphQL to query review state directly #14661

Merged
