
cylc.rundb: replace state dump #1827

Merged: 1 commit merged into cylc:master on Jul 27, 2016

Conversation

@matthewrmshin (Contributor) commented May 3, 2016

This change replaces state dump functionality with the runtime DB.

Runtime DB changes (a schema sketch follows this list):

  • New table for suite parameters, e.g. run mode, initial/final cycle points.
    • And a similar table to store suite parameters at check points.
  • New table for current tasks in task pool, with their "spawned" status.
    • And a similar table to store task pool status at check points.
  • New table to store broadcast states at check points.
  • Remove duplicated columns from task_states and task_events tables.
    The information can all be found under task_jobs.
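
For illustration, here is a minimal sketch of the kind of SQLite schema these bullet points imply. All table and column names are assumptions made for the sketch, not the exact cylc schema:

# Minimal sketch of the tables described above. Table and column
# names are illustrative assumptions, not the exact cylc schema.
import sqlite3

conn = sqlite3.connect("cylc-suite-private.db")
conn.executescript("""
    -- Suite parameters, e.g. run mode, initial/final cycle points.
    CREATE TABLE IF NOT EXISTS suite_params(
        key TEXT PRIMARY KEY, value TEXT);
    -- Current tasks in the task pool, with their "spawned" status.
    CREATE TABLE IF NOT EXISTS task_pool(
        cycle TEXT, name TEXT, spawned INTEGER, status TEXT,
        PRIMARY KEY(cycle, name));
    -- One row per check point, recording when and why it was taken.
    CREATE TABLE IF NOT EXISTS checkpoint_id(
        id INTEGER PRIMARY KEY, time TEXT, event TEXT);
    -- The same shapes again, keyed by checkpoint id, plus broadcast
    -- states, to record suite state at check points.
    CREATE TABLE IF NOT EXISTS suite_params_checkpoints(
        id INTEGER, key TEXT, value TEXT);
    CREATE TABLE IF NOT EXISTS task_pool_checkpoints(
        id INTEGER, cycle TEXT, name TEXT, spawned INTEGER, status TEXT);
    CREATE TABLE IF NOT EXISTS broadcast_states_checkpoints(
        id INTEGER, point TEXT, namespace TEXT, key TEXT, value TEXT);
""")
conn.commit()
conn.close()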

Restart now loads previous states and other information from the runtime DB only.

• Where relevant, an old DB + state file will be upgraded automatically on restart.

cylc ls-checkpoints: new command to list the task pool checkpoints (a query sketch follows this list):

  • Latest
  • Restarts
  • Before and after reloads.
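
A rough sketch of the kind of query such a listing could be built on, reusing the assumed checkpoint_id table from the schema sketch above; this is hypothetical, not the actual implementation of the command:

# Hypothetical sketch only: list checkpoints by id, event and time.
# The checkpoint_id table and its columns are assumptions.
import sqlite3

conn = sqlite3.connect("cylc-suite-private.db")
for id_, event, time_ in conn.execute(
        "SELECT id, event, time FROM checkpoint_id ORDER BY id"):
    print("%s\t%s\t%s" % (id_, event, time_))
conn.close()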

Closes #421.

See also:

This change breaks backward compatibility on restart (as it no longer reads/writes state dumps), so I'd like to start a conversation on what old features and/or backward compatibility we want to retain (or not).

@matthewrmshin self-assigned this May 3, 2016
@matthewrmshin added this to the soon milestone May 3, 2016
@arjclark (Contributor) commented May 3, 2016

How do we roll back to a previous suite state in your new system?

@matthewrmshin (Contributor, Author):

How do we roll back to a previous suite state in your new system?

It cannot be done at the moment. I want to discuss the main usages of this before putting this functionality back in.

@hjoliver (Member) commented May 3, 2016

I want to discuss the main usages of this before putting this functionality back in.

The current rolling archive of state dumps is probably of little use - by default it's very short, and it isn't easy to figure out which to use if not the most recent (however, I'll ask our operations team if they ever use it). But I think a more deliberate suite check-pointing system might be really useful, e.g. during suite development, which typically involves a lot of re-running - see #1735.

@arjclark (Contributor) commented May 4, 2016

It cannot be done at the moment. I want to discuss the main usages of this before putting this functionality back in.

Main use to date (to my knowledge) has been in operations where items of graphing have gone missing following a restart or reload. The state dumps have allowed us to 1) easily diff the contents of the task pool before and after a reload/restart, and 2) roll back when something has messed up, rather than sitting there trying to manually insert and reset states for a not insignificant number of tasks.

@matthewrmshin (Contributor, Author):

It cannot be done at the moment. I want to discuss the main usages of this before putting this functionality back in.

Main use to date (to my knowledge) has been in operations where items of graphing have gone missing following a restart or reload. The state dumps have allowed us to 1) easily diff the contents of the task pool before and after a reload/restart, and 2) roll back when something has messed up, rather than sitting there trying to manually insert and reset states for a not insignificant number of tasks.

Both of these are better solved by #1735 (thanks to @hjoliver for raising this) instead of the rolling archive:

  1. We can store the task pool before/after a restart/reload. This should allow us to diff the task pool properly. The rolling archive is only effective for diff if the user remembers to hold the suite on restart or reload.
  2. The rolling archive may be too far gone by the time the problem is spotted.

I think it is easy to extend the task_pool table (or add a new task pool snapshot? table) to store snapshots of the task pool at various points. (It is probably not too difficult to implement some housekeeping mechanism for the table as well.) We can then have a command to list possible restart points (and to dump the content of the task pool at each restart point).
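
A rough illustration of the housekeeping idea, assuming the checkpoint tables sketched earlier and a policy of keeping only the N most recent snapshots:

# Sketch of a housekeeping pass over the assumed checkpoint tables:
# drop all but the N most recent checkpoints.
import sqlite3

MAX_CHECKPOINTS = 10

def housekeep(db_path, max_checkpoints=MAX_CHECKPOINTS):
    conn = sqlite3.connect(db_path)
    keep = "SELECT id FROM checkpoint_id ORDER BY id DESC LIMIT ?"
    for table in ("task_pool_checkpoints",
                  "broadcast_states_checkpoints",
                  "suite_params_checkpoints",
                  "checkpoint_id"):
        conn.execute(
            "DELETE FROM %s WHERE id NOT IN (%s)" % (table, keep),
            (max_checkpoints,))
    conn.commit()
    conn.close()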

@arjclark (Contributor) commented May 4, 2016

Both of these are better solved by #1735 (thanks to @hjoliver for raising this) instead of the rolling archive:

  1. We can store the task pool before/after a restart/reload. This should allow us to diff the task pool properly. The rolling archive is only effective for diff if the user remembers to hold the suite on restart or reload.

So long as @hjoliver's proposal and/or point 1 are in the same release, that's fine. I don't want to end up in a position where we've lost the state dumps and have no way of rolling back/checkpointing for a given cylc version.

(point 2 is rarely the case in my experience of our operational problem suites - it's usually pretty clear something's gone horrendously wrong fairly quickly and the task becomes determining what's happened and how to recover)

I think it is easy to extend the task_pool table (or add a new task pool snapshot? table) to store snapshots of the task pool at various points. (It is probably not too difficult to implement some housekeeping mechanism for the table as well.) We can then have a command to list possible restart points (and to dump the content of the task pool at each restart point).

Sounds good to me - as a basic starting point I'd snapshot at initialisation of the suite/restart/warmstart and on (planned) shutdowns, otherwise just using wherever the suite was last up to.

@hjoliver (Member) commented May 5, 2016

Operations here reports they only ever (and infrequently at that) use the "most recent non-corrupted state dump", i.e. (presumably) the most-recent-but-one file in the unlikely event that the suite was killed in the middle of writing the most recent one. So that's fine - the atomic nature of the DB means it should never be corrupted!

@matthewrmshin (Contributor, Author) commented May 6, 2016

"most recent non-corrupted state dump"

(The logic since #825 should mean that the default state file (which is a symbolic link to the latest state file that has been fully written and fsync'ed) is never corrupted.)

I think the only (useful?) functionality missing from this change is the ability to easily inspect the last 10 state dumps (or whatever the state dump rolling archive length is set to in global.rc) as well as the state dumps at previous restarts.

The former is probably not useful. For the latter, I'll add a table to snapshot the task pool and a table to snapshot broadcast states on restart (and possibly before/after reload).
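
For context, the never-corrupted property described above comes from a write-then-swap idiom along these lines; this is a sketch with assumed file names, not the actual #825 code:

# Sketch of the write/fsync/symlink-swap idiom: the default name is
# only repointed once the new file is fully written and flushed, so
# a reader following the symlink never sees a partial file.
import os

def dump_state(content, directory, dump_name):
    path = os.path.join(directory, dump_name)
    with open(path, "w") as handle:
        handle.write(content)
        handle.flush()
        os.fsync(handle.fileno())
    link = os.path.join(directory, "state")  # assumed default name
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(dump_name, tmp_link)
    os.rename(tmp_link, link)  # atomic replacement on POSIX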

@matthewrmshin force-pushed the rundb-state-dump branch 5 times, most recently from cef549e to 495b2cd on May 11, 2016 07:49
@matthewrmshin force-pushed the rundb-state-dump branch 6 times, most recently from 90e1fec to 465f141 on May 12, 2016 11:37
@matthewrmshin force-pushed the rundb-state-dump branch 2 times, most recently from 5bac391 to ad6bd30 on May 12, 2016 13:21
self.pub_dao.add_delete_item(table_name, where_args)
# Record suite parameters and tasks in pool
# Record any broadcast settings to be dumped out
for obj in self, BroadcastServer.get_inst():
A Contributor commented on the diff:

Can you (or do you want me to) do some profiling with a busy suite for this routine? It was a bit of a bottleneck before.

@matthewrmshin (Contributor, Author) replied:

Running my 100-task busy suite for 30 cycles in my environment, process_queued_db_ops has a cumulative run time of 67.2s. This is faster than master, where the equivalent is 63.6s plus a further 10.8s on state file dumping (74.4s in total).
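
For reference, cumulative timings like these can be gathered with the standard library profiler; a sketch, where run_suite() is a hypothetical stand-in for the suite entry point:

# Sketch: profile a run and report cumulative time for the routine
# discussed above. run_suite() is a hypothetical stand-in.
import cProfile
import pstats

cProfile.run("run_suite()", "suite.prof")
stats = pstats.Stats("suite.prof")
stats.sort_stats("cumulative")
stats.print_stats("process_queued_db_ops")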

@benfitzpatrick (Contributor) commented:

Lots of comments but I really like the change.

@benfitzpatrick (Contributor) commented:

I think the most important thing is to profile it against a busy suite.

@matthewrmshin force-pushed the rundb-state-dump branch 2 times, most recently from 1feeb7b to 00badf1 on July 13, 2016 13:02
@matthewrmshin (Contributor, Author) commented:

@benfitzpatrick most comments addressed or fixed. I'll do some profiling.

@matthewrmshin force-pushed the rundb-state-dump branch 3 times, most recently from ec2c162 to 2ae9a80 on July 22, 2016 13:39
@matthewrmshin (Contributor, Author) commented:

Branch squashed and rebased. Profiling using my 100-task busy suite suggests little or no change in performance. The squashed commit message:

New tables for:
* Suite parameters, e.g. run mode, initial/final cycle points.
* Current tasks in task pool, with their "spawned" status.
* Snapshots of above and broadcast states.
Restart loads previous states from run DB.
* Will upgrade DB with old state dump file automatically.
Remove duplicated columns from task_states and task_events tables.
Remove state dump functionality.
cylc ls-checkpoints: new command to list the task pool checkpoints.
Rename cylc-suite.db
* cylc-suite-private.db <- state/cylc-suite.db
* cylc-suite-public.db <- cylc-suite.db
* cylc-suite.db now a symbolic link to cylc-suite-public.db
@matthewrmshin (Contributor, Author) commented:

Branch rebased again.

@benfitzpatrick modified the milestones: next release, soon Jul 27, 2016
@benfitzpatrick merged commit 3199aa5 into cylc:master Jul 27, 2016
@matthewrmshin deleted the rundb-state-dump branch July 27, 2016 07:58
matthewrmshin added a commit to matthewrmshin/rose that referenced this pull request Aug 16, 2016
arjclark added a commit to metomi/rose that referenced this pull request Aug 16, 2016
matthewrmshin added a commit to matthewrmshin/rose that referenced this pull request Aug 22, 2016
A missing `break` means that we were still querying the `task_events`
table instead of the `task_jobs` table for a set of unique job hosts.
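
A minimal illustration of the bug being fixed (the two table names come from the message above; the query, column name, and helper are illustrative assumptions): without the break, the loop falls through to the legacy table even when the new one has already answered.

# Minimal sketch of the missing-break bug; the query and column
# names are illustrative assumptions.
def unique_job_hosts(conn):
    hosts = set()
    for table in ("task_jobs", "task_events"):  # prefer the new table
        try:
            for (host,) in conn.execute(
                    "SELECT DISTINCT user_at_host FROM " + table):
                hosts.add(host)
        except Exception:  # e.g. table absent in an old database
            continue
        break  # this was the missing break
    return hosts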