Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor with multiprocessing #75

Merged
merged 5 commits into from
Oct 22, 2021
Merged

Conversation

Northbadge
Copy link
Contributor

It was running a tad slow for my liking, so added in multiprocessing & changed out some implementations.

Core logic is essentially the same, though I did change the behavior of multiple --only flags from 'AND' to 'OR', otherwise there'd need to be a loop to match each pattern sequentially. I don't really see a use case where the user can't just convert an 'AND' only to an 'OR' only, so I think it should be fine?

Other changes

  • Added --quiet, which suppresses all console output. Should close Make progress bar optional #53 , and idk why Open the API #51 is still up
  • Added --opt which makes git generate a commit-graph, which speeds up git blame by ~20-40% in my tests
  • Added --procs which specifies how many Python processes to run (and in turn how many git processes are run)
  • Gave stack_plot.py more pixels
  • Switched from progressbar2 to tqdm (Has iterations/s) and from fnmatch to wcmatch.fnmatch (faster, by a lot)
  • Stopped analyze from trying to check out a branch to check if it exists
  • Added/updated some pics

I do believe the minimum python requirement is 3.6 with this PR, though I only tested on 3.9

Performance

  • Entry discovery is now ~5-10x faster
  • Analyzing is now at least 2x faster, depending on how many processes is configured
  • node/src took ~50 minutes on the old code, ~9 minutes with 12 processes (With maxing out my hexa core 5820k)
  • tensorflow took 2 or 3 hours
  • rust was about 3 hours
  • git's about 30 minutes

Random Windows weirdness

  • Windows defender doesn't quite like having so many git/python processes running and eats some CPU; Turning it off makes it faster, but if anyone knows how to stop defender from doing this in the first place, would love to hear

if opt:
if not quiet:
print('Generating git commit-graph... If you wish, this file is deletable later at .git/objects/info')
repo.git.execute(['git', 'commit-graph', 'write', '--changed-paths']) # repo.git.commit_graph('write --changed-paths') doesn't work for some reason
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git blame learned to used git commit-graph a while back, it speeds up git blame by 20-40%

@erikbern
Copy link
Owner

Cool – I think this looks good to me! Let me try it on a few repos.

@erikbern
Copy link
Owner

I realize separately I should set this up to run Github Actions... I'll do that at some point this week

@erikbern
Copy link
Owner

i'm ok merging this if you're ready for it!

@Northbadge
Copy link
Contributor Author

I’ve noticed some memory issues last night by when I tried to run it on the Linux kernel repo, maybe hold off on it for a day and I’ll see if I can come up with a solution. Memory use went up to 8GB when the parallel processes were started up, not sure why yet

@Northbadge
Copy link
Contributor Author

Should be good to go now

@erikbern
Copy link
Owner

erikbern commented Jul 20, 2021

Looking at the code now it feels like there's a bit too much that has nothing to do with git processing and more like multiprocessing stuff. Is there some existing open source package that can be used for this instead? Maybe joblib? Or at least I'd prefer to factor it out into its own submodule.

@Northbadge
Copy link
Contributor Author

Northbadge commented Jul 21, 2021

Looking at the code now it feels like there's a bit too much that has nothing to do with git processing and more like multiprocessing stuff.

Realistically I could remove the signal handler and pause/resume and process de/spawner to cut 50 lines, but those are pretty nice to have features, and the entire multiprocessing code takes like 125 lines anyway. I'd prefer to look at it as a heavily optimized (for our purposes) multiprocessing git blamer

Is there some existing open source package that can be used for this instead?

Unless somebody writes a multiprocessing git interface library, there's really nothing that can be done. And even if such a library existed, using it will probably cause performance to suffer, since pickling is required for IPC.

Most of the multiprocessing libraries I've seen deal with stateless processing, which is not what we've got here: we need to preserve a git instance and the commit2cohort dictionary. As for joblib', its underlying process manager, loky is essentially trying to be a better concurrent.futures.ProcessPoolExecutor from what I can tell, but even lacks features like imap, so afaik the main process can't do work while the workers are doing work. Implementing it in the default lib gives the most customization and it's also probably the most battle tested library for python multiprocessing

Or at least I'd prefer to factor it out into its own submodule.

Fair enough, but the multiprocessing code also has processing code within it, so I personally think it'd be less confusing (with the current size of the file) for it to stay in just one file. BlameDriver and BlameProc can easily be copied over to another file though if you'd like. I'm not exactly sure how to change setup.py to handle a dependent module though, since I think it makes a copy of the script itself to put in Python's Scripts path so that it can be called from a terminal

@erikbern
Copy link
Owner

erikbern commented Jul 21, 2021

I get it if there's no obvious open source lib does this, but looking at the code, I still think that BlameProc and BlameDriver could probably be split up into (a) some sort of general-purpose multiprocessing helper that can go into a separate module (b) the actual business code stays in analyze.py. This feels like it would keep the code a lot cleaner.

Just as an example (but I'm not trying to be prescriptive here), BlameProc could be rewritten into

  1. A class that gets initialized with the git repo and commit2cohort. That class has method that does the business logic
  2. A separate helper class/function handles the queue stuff and communicates with the class mentioned above

Can you give it a thought and see if you can think of some easy way to do it?

I think multiple modules should be fine – the entrypoint is really just a pointer to a module installed in the usual place where Python modules are installed, so it should be possible to refer to submodules just like you'd normally do.

@Northbadge
Copy link
Contributor Author

Northbadge commented Jul 21, 2021

A separate helper class/function handles the queue stuff and communicates with the class mentioned above

Do you mean a single (as in single instance) helper class/function outside of the class described in (1)? Because that would be impossible with the way Python does/has to do multiprocessing, each process needs to be 'self contained' with all the things it needs. An outside function would be in the global scope, and to communicate with (1), it'd have to use queues. I realize there might be some confusion about the queues, so just to be clear those are the communication channels between processes, not the typical general purpose FIFO queue we'd expect in languages like C++

And I do feel like you just described BlameProc in (1), and BlameDriver in (2)

I think multiple modules should be fine – the entrypoint is really just a pointer to a module installed in the usual place where Python modules are installed, so it should be possible to refer to submodules just like you'd normally do.

Yea, now that I reinstalled the package that seems to be the case, previously (when I had broken permissions) pip installed a copy of the .py script in my Python/Scripts directory and another copy in site-packages

@erikbern
Copy link
Owner

Ok ok – you might be right. Yeah I guess I'm fine merging this! Maybe I'll try to refactor it a bit afterwards, but I might possibly fail

@erikbern
Copy link
Owner

I dropped the ball, but I've used this locally for random things and it's a very nice speedup so I'll merge it!

@erikbern erikbern merged commit 6823d57 into erikbern:master Oct 22, 2021
@erikbern
Copy link
Owner

Thanks @Northbadge !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make progress bar optional
2 participants