feat: recoverable processing + merging #646
Conversation
(force-pushed from bbede57 to 306552e)
Some early comments; mostly I'm concerned about object lifetime. One of the first problems we had was holding onto futures longer than necessary and blowing up our memory, as discussed in #97
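The lifetime concern above can be illustrated with a small sketch (not the PR's actual code): consuming futures as they finish and dropping the references immediately, so completed results don't accumulate in memory.

```python
# Hypothetical sketch of the pattern discussed in #97: release each
# future as soon as its result is consumed, instead of keeping the
# whole list of futures alive until the end.
import concurrent.futures

def accumulate_results(executor, fn, work_items):
    futures = {executor.submit(fn, item) for item in work_items}
    total = 0
    try:
        while futures:
            done, futures = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED
            )
            for fut in done:
                total += fut.result()  # consume, then drop the reference
    finally:
        for fut in futures:
            fut.cancel()
    return total
```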
The magic sauce seems to be a secondary executor (either spawned automatically or passed in) to run the merge jobs in parallel. I am not sure where we landed on having rich as a dependency, but rich progress bars make the code with multiple bars a bit neater: https://github.com/andrzejnovak/coffea/pull/1/files
pip made rich a vendored dependency for 22.0, so was it transitively via that?
Can you replicate the work to be done a little bit and see how the memory scales?
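The secondary-executor idea mentioned above can be sketched roughly like this; all names here are illustrative stand-ins, not the PR's API: one pool processes chunks while a second pool merges the partial results pairwise.

```python
# Illustrative sketch of a "secondary executor" for merging: the
# processing pool and the merge pool are independent, so merges can
# run in parallel with each other. Not the PR's actual implementation.
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_and_merge(process, merge, chunks, workers=4, merge_workers=2):
    with ThreadPoolExecutor(workers) as proc_pool, \
         ThreadPoolExecutor(merge_workers) as merge_pool:
        ready = [f.result() for f in as_completed(
            proc_pool.submit(process, c) for c in chunks)]
        # pairwise-merge partial results until one remains
        while len(ready) > 1:
            pairs = [(ready[i], ready[i + 1])
                     for i in range(0, len(ready) - 1, 2)]
            leftover = [ready[-1]] if len(ready) % 2 else []
            ready = [f.result() for f in as_completed(
                merge_pool.submit(merge, a, b) for a, b in pairs)] + leftover
    return ready[0]
```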
(force-pushed from 26bf2e8 to 25c3ca5)
Be careful with tests on Windows - forking and multiprocessing is strange there.
(force-pushed from 332d8cb to 368de32)
Alright, I guess that makes it reviewable now. Specifically, I'd like to hear your thoughts on the added API, and particularly on how the error state/processed items are returned back up by the executor(s). This is currently passed as a tuple, but it's not very elegant. I assume the reason the returns are formed like e.g.
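One way to make the return more self-describing than a bare tuple is a small NamedTuple; the field names here are illustrative, not the PR's actual API.

```python
# Hypothetical alternative to returning a bare tuple from the executor:
# a NamedTuple carrying the accumulated output, the chunks that
# completed, and the exceptions captured from failed chunks.
from typing import Any, List, NamedTuple

class RunResult(NamedTuple):
    output: Any                  # accumulated result of successful chunks
    processed: List[Any]         # chunks that completed
    exceptions: List[Exception]  # errors captured from failed chunks

def run_chunks(fn, chunks):
    output, processed, exceptions = 0, [], []
    for chunk in chunks:
        try:
            output += fn(chunk)
            processed.append(chunk)
        except Exception as err:
            exceptions.append(err)
    return RunResult(output, processed, exceptions)
```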
We should test this on a many-core (64+) machine and check the scaling at the extreme as well.
We'll still run into some nasty issues with big histograms... hmm.
I can test this on a node with up to 80 cores, but there's a lot of RAM available, so it's not necessarily stressing the code. The next scale up is obviously the parsl executor (either with Futures or another parsl DFK doing the merging), which should maybe be a separate PR.
Sure - the point is to observe trends rather than really stress anything.
(force-pushed from 6fd2fc1 to 6d43743)
Alright, this crept up in scope a bit.
Ready to review. We can expand the merge logic to dask, and the overall use of rich progress bars, in future PRs.
As an overall comment, if you can sprinkle some typing in some of the new function signatures that would be helpful.
Instead of passing "prepro" to run, I tried to factor it out into a separate function, which makes more sense, but having to comply with the dynamic chunking there makes it a bit awkward (relies on both filemeta and chunk generator, so
@nsmith- Cleaned up the commits and also added rich.progress for Iterative. I'd say this is ready.
Since we can't capture it very well in CI, have we tried this PR at scale with all the executors?
Reasonable scale for futures; full 2016 production of my analysis with parsl about 3 times. The speed there depends on how busy the disk is, but in each run it was faster than or as fast as master.
@nsmith- you happy here?
Just a small typing thing; not making it required, but it would be nice.
@nsmith- seems we're good to go
We discussed in the past how it's bad when a big job crashes and many hours and core hours can be lost. Here's a draft of how we could go about solving that (ignore some of the merge logic; I used the `FutureHolder` class from my job merging PR because this needs a bit finer control than the `_futures_handler` generator, but it's not necessarily connected). Workflow can look like this.
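A rough sketch of the recoverable workflow being described: run the chunks, keep the partial accumulator on failure, and resubmit only the chunks that did not complete. The function and its return shape are hypothetical stand-ins, not the PR's executor API.

```python
# Hedged sketch of recoverable processing: transient failures don't
# lose the work already accumulated; only failed chunks are retried.
def recoverable_run(fn, chunks, max_retries=2):
    pending, total = list(chunks), 0
    for _ in range(max_retries + 1):
        failed = []
        for chunk in pending:
            try:
                total += fn(chunk)
            except Exception:
                failed.append(chunk)
        pending = failed  # resubmit only what did not complete
        if not pending:
            break
    return total, pending
```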
@nsmith-