feat: recoverable processing + merging #646

Merged (12 commits, Mar 29, 2022)
Conversation

@andrzejnovak (Collaborator)

We discussed in the past how bad it is when a big job crashes and many hours of wall and core time are lost. Here's a draft of how we could go about solving that. (Ignore some of the merge logic; I used the FutureHolder class from my job-merging PR because this needs a bit finer control than the _futures_handler generator, but it's not necessarily connected.)

The workflow could look like this:

```python
run_instance = processor.Runner(...)

# preprocess only: build the chunk list, then save it (e.g. in a pickle) for reference
filemeta = run_instance(samples, "Events", processor_instance=MyZPeak(), prepro_only=True)

# passing existing pre-made chunks to the Runner instance skips the prepro step
output, processed, metrics = run_instance(filemeta, "Events", processor_instance=MyZPeak())
# WorkItem is attached to the accumulator the same way metrics are;
# if an exception is raised, report it, but return the futures that already finished

# resubmit whatever did not complete
missing = set(filemeta) - set(processed)
output, processed, metrics = run_instance(list(missing), "Events", processor_instance=MyZPeak())
```

@nsmith-

@lgray (Collaborator) commented Mar 7, 2022

(screenshot)

@nsmith- (Member) left a review:

Some early comments, mostly I'm concerned about object lifetime. One of the first problems we had was holding onto futures longer than necessary and blowing up our memory, as discussed in #97

(3 review threads on coffea/processor/executor.py, resolved)
@andrzejnovak (Collaborator, Author)

The magic sauce seems to be a secondary executor (either spawned automatically or passed in) to run the merge jobs in parallel.

I am not sure where we landed on having rich as a dependency, but rich progress bars make the code with multiple bars a bit neater https://github.com/andrzejnovak/coffea/pull/1/files
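The batched-merge idea can be sketched with a plain `concurrent.futures` pool; `merge_pair` and `batched_merge` below are hypothetical helpers standing in for coffea's accumulator addition, not the PR's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def merge_pair(a, b):
    # coffea accumulators support addition; plain dicts of counts behave the same
    return {k: a.get(k, 0) + b.get(k, 0) for k in set(a) | set(b)}

def batched_merge(partials, pool):
    """Reduce a list of partial results pairwise, one round at a time,
    submitting each pair-merge to a secondary executor."""
    results = list(partials)
    while len(results) > 1:
        futures = [
            pool.submit(merge_pair, results[i], results[i + 1])
            for i in range(0, len(results) - 1, 2)
        ]
        leftover = [results[-1]] if len(results) % 2 else []
        results = [f.result() for f in futures] + leftover
    return results[0]

with ThreadPoolExecutor(max_workers=2) as pool:
    total = batched_merge([{"a": 1}, {"a": 2, "b": 1}, {"b": 3}], pool)
# total == {"a": 3, "b": 4}
```

The point of the pairwise rounds is that merges run concurrently with each other instead of serially in the main process.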

@NJManganelli (Collaborator)

pip made rich a vendored dependency in 22.0; was it pulled in transitively via that?

@andrzejnovak andrzejnovak changed the title feat: recoverable processing feat: recoverable processing + merging Mar 14, 2022
@andrzejnovak (Collaborator, Author)

Actually this seems quite a bit faster this time around: two runs each, with a merge pool of 2 workers, which can probably run anywhere.

(benchmark screenshots)

@lgray (Collaborator) commented Mar 14, 2022

Can you replicate the work to be done a little bit and see how the memory scales?

@andrzejnovak (Collaborator, Author)

(screenshot)

Same files though, so I'm guessing they are cached somehow and that's why the time doesn't scale 10x. Watching htop while this runs, the memory consumption looks consistent throughout. I'm running a slightly modified MyZPeak example with four 60-bin axes, so the accumulator should not be entirely negligible.

@lgray (Collaborator) commented Mar 14, 2022

Be careful with tests on Windows: forking and multiprocessing are strange there.

@andrzejnovak (Collaborator, Author)

Alright, I guess that makes it reviewable now. Specifically, I'd like to hear your thoughts on the added API, and in particular on how the error state and the processed items are returned back up by the executor(s). This is currently passed as a tuple, which is not very elegant. I assume the reason the returns are shaped like e.g. (accumulator, metrics), as opposed to just returning the wrapped_out dict, is that somewhere it has to be immutable?

@lgray (Collaborator) commented Mar 15, 2022

We should test this on a many-core (64+) machine and check the scaling at the extreme as well.

@lgray (Collaborator) commented Mar 15, 2022

We'll still run into some nasty issues with big histograms... hmm.

@andrzejnovak (Collaborator, Author)

> We should test this on a many-core machine and check the scaling in extremum as well.

I can test this on a node with up to 80 cores, but there's a lot of RAM available, so it won't necessarily stress the code.

The next scale-up is obviously the parsl executor (either with Futures or another parsl DFK doing the merging), which should maybe be a separate PR.

@lgray (Collaborator) commented Mar 15, 2022

Sure - the point is to observe trends rather than really stress anything.

@andrzejnovak (Collaborator, Author)

Alright, this crept up in scope a bit.

  • New features (for the futures/parsl executors)
    • recoverable - optionally return the chunks processed so far along with the error in case of failure, instead of raising directly
    • merging - implement merging in batches (in the main process, in the main executor, or in a separate executor)
      • On small test jobs at least as fast as master, or faster
      • Will particularly help when many small chunks are processed or many workers are available
    • rich progress bars
  • API changes - breaking
    • Convert sets to lists in metrics (it's annoying that they are not JSON-dumpable by default)
    • Executors return (out, 0), or (out, error) where recoverable is implemented
  • API changes
    • Calling a Runner is standardized - instead of a variable-length tuple, it returns a dict
      • isolated into Runner().run(...)
      • the current Runner()() remains unchanged - under the hood it calls run()
      • deprecate and switch in the future?

Ready to review. We can expand the merge logic to dask and overall use of rich progress bars in future PRs.
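The set-to-list metrics change above is easy to motivate with the standard library alone; `metrics` below is a made-up example dict, not the real executor metrics:

```python
import json

# a metrics dict with a set-valued entry, loosely like the executor metrics
metrics = {"columns": {"pt", "eta"}, "entries": 100}

try:
    json.dumps(metrics)
except TypeError:
    pass  # "Object of type set is not JSON serializable"

# converting set-valued entries to sorted lists makes the dict dumpable
dumpable = {k: sorted(v) if isinstance(v, set) else v for k, v in metrics.items()}
blob = json.dumps(dumpable, sort_keys=True)
# blob == '{"columns": ["eta", "pt"], "entries": 100}'
```

Sorting also makes the serialized metrics deterministic across runs, which sets are not.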

@nsmith- (Member) left a review:

As an overall comment, if you can sprinkle some typing in some of the new function signatures that would be helpful.
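For illustration, the kind of annotations being requested might look like this on a hypothetical signature; names like `run_chunks` and `WorkItem` are placeholders, not the actual coffea API:

```python
from typing import Callable, List, Optional, Tuple

# Illustrative only: typed version of a function loosely shaped like the
# executor entry points in this PR ("WorkItem" is a forward reference).
def run_chunks(
    chunks: List["WorkItem"],
    process: Callable[["WorkItem"], dict],
    recoverable: bool = True,
) -> Tuple[dict, Optional[BaseException]]:
    """Process chunks and return (accumulator, error-or-None)."""
    ...
```

Even stub annotations like these document the (out, error) contract at the signature level.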

(review threads on coffea/processor/executor.py and tests/test_processor.py, resolved)
@andrzejnovak (Collaborator, Author)

Instead of passing "prepro" to run, I tried to factor it out into a separate function, which makes more sense, but having to comply with the dynamic chunking there makes it a bit awkward (it relies on both the filemeta and the chunk generator, so run needs to be able to take a fileset as input).
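The split being described could be sketched roughly as follows; `preprocess` and `run` are hypothetical stand-ins (with fixed-size rather than dynamic chunking) for the factored-out step:

```python
def preprocess(fileset, chunksize):
    """Split each file into fixed-size chunks up front (dynamic chunking
    elided); yields (filename, start, stop) work items."""
    for name, nevents in fileset.items():
        for start in range(0, nevents, chunksize):
            yield (name, start, min(start + chunksize, nevents))

def run(chunks, process):
    """Consume pre-made chunks directly, skipping preprocessing entirely."""
    return [process(chunk) for chunk in chunks]

chunks = list(preprocess({"f1.root": 10, "f2.root": 5}, 4))
# chunks: [("f1.root", 0, 4), ("f1.root", 4, 8), ("f1.root", 8, 10),
#          ("f2.root", 0, 4), ("f2.root", 4, 5)]
sizes = run(chunks, lambda c: c[2] - c[1])
# sum(sizes) == 15: every event is covered exactly once
```

Because the chunk list is a plain value, it can be pickled between the two calls, which is what makes the recovery workflow at the top of the thread possible.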

@andrzejnovak (Collaborator, Author)

@nsmith- Cleaned up the commits and also added rich.progress for Iterative. I'd say this is ready

@lgray (Collaborator) commented Mar 28, 2022

Since we can't capture it very well in CI, have we tried this PR at scale with all the executors?

@andrzejnovak (Collaborator, Author)

> Since we can't capture it very well in CI, have we tried this PR at scale with all the executors?

At a reasonable scale with the futures executor, and with parsl on the full 2016 production of my analysis about three times. The speed there depends on how busy the disk is, but in each run it was as fast as or faster than master.

@lgray (Collaborator) commented Mar 29, 2022

@nsmith- you happy here?

@nsmith- (Member) left a review:

Just a small typing thing; not making it required, but it would be nice.

(2 review threads on coffea/processor/executor.py, resolved)
@andrzejnovak (Collaborator, Author)

@nsmith- seems we're good to go

@lgray lgray merged commit d235b82 into CoffeaTeam:master Mar 29, 2022