-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor with multiprocessing #75
Conversation
if opt: | ||
if not quiet: | ||
print('Generating git commit-graph... If you wish, this file is deletable later at .git/objects/info') | ||
repo.git.execute(['git', 'commit-graph', 'write', '--changed-paths']) # repo.git.commit_graph('write --changed-paths') doesn't work for some reason |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
git blame learned to used git commit-graph a while back, it speeds up git blame by 20-40%
Cool – I think this looks good to me! Let me try it on a few repos. |
I realize separately I should set this up to run Github Actions... I'll do that at some point this week |
i'm ok merging this if you're ready for it! |
I’ve noticed some memory issues last night by when I tried to run it on the Linux kernel repo, maybe hold off on it for a day and I’ll see if I can come up with a solution. Memory use went up to 8GB when the parallel processes were started up, not sure why yet |
Should be good to go now |
Looking at the code now it feels like there's a bit too much that has nothing to do with git processing and more like multiprocessing stuff. Is there some existing open source package that can be used for this instead? Maybe joblib? Or at least I'd prefer to factor it out into its own submodule. |
Realistically I could remove the signal handler and pause/resume and process de/spawner to cut 50 lines, but those are pretty nice to have features, and the entire multiprocessing code takes like 125 lines anyway. I'd prefer to look at it as a heavily optimized (for our purposes) multiprocessing git blamer
Unless somebody writes a multiprocessing git interface library, there's really nothing that can be done. And even if such a library existed, using it will probably cause performance to suffer, since pickling is required for IPC. Most of the multiprocessing libraries I've seen deal with stateless processing, which is not what we've got here: we need to preserve a
Fair enough, but the multiprocessing code also has processing code within it, so I personally think it'd be less confusing (with the current size of the file) for it to stay in just one file. |
77a5436
to
39d2763
Compare
I get it if there's no obvious open source lib does this, but looking at the code, I still think that BlameProc and BlameDriver could probably be split up into (a) some sort of general-purpose multiprocessing helper that can go into a separate module (b) the actual business code stays in analyze.py. This feels like it would keep the code a lot cleaner. Just as an example (but I'm not trying to be prescriptive here), BlameProc could be rewritten into
Can you give it a thought and see if you can think of some easy way to do it? I think multiple modules should be fine – the entrypoint is really just a pointer to a module installed in the usual place where Python modules are installed, so it should be possible to refer to submodules just like you'd normally do. |
Do you mean a single (as in single instance) helper class/function outside of the class described in (1)? Because that would be impossible with the way Python does/has to do multiprocessing, each process needs to be 'self contained' with all the things it needs. An outside function would be in the global scope, and to communicate with (1), it'd have to use queues. I realize there might be some confusion about the queues, so just to be clear those are the communication channels between processes, not the typical general purpose FIFO queue we'd expect in languages like C++ And I do feel like you just described BlameProc in (1), and BlameDriver in (2)
Yea, now that I reinstalled the package that seems to be the case, previously (when I had broken permissions) pip installed a copy of the .py script in my |
Ok ok – you might be right. Yeah I guess I'm fine merging this! Maybe I'll try to refactor it a bit afterwards, but I might possibly fail |
I dropped the ball, but I've used this locally for random things and it's a very nice speedup so I'll merge it! |
Thanks @Northbadge ! |
It was running a tad slow for my liking, so added in multiprocessing & changed out some implementations.
Core logic is essentially the same, though I did change the behavior of multiple
--only
flags from 'AND' to 'OR', otherwise there'd need to be a loop to match each pattern sequentially. I don't really see a use case where the user can't just convert an 'AND' only to an 'OR' only, so I think it should be fine?Other changes
--quiet
, which suppresses all console output. Should close Make progress bar optional #53 , and idk why Open the API #51 is still up--opt
which makes git generate a commit-graph, which speeds up git blame by ~20-40% in my tests--procs
which specifies how many Python processes to run (and in turn how many git processes are run)stack_plot.py
more pixelsprogressbar2
totqdm
(Has iterations/s) and fromfnmatch
towcmatch.fnmatch
(faster, by a lot)analyze
from trying to check out a branch to check if it existsI do believe the minimum python requirement is 3.6 with this PR, though I only tested on 3.9
Performance
Random Windows weirdness