Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Gengo to take a generic FileSource #191

Merged
merged 28 commits into from
Sep 26, 2023

Conversation

spenserblack
Copy link
Owner

@spenserblack spenserblack commented Sep 25, 2023

This changes the implementation of Gengo to take a generic FileSource type, moving git-specific
stuff to its own module. This breaks usage by moving the rev argument from Gengo::analyze to
Builder::new (there's a good chance that this will break again once other file sources are implemented).

To do

  • Fix snapshot tests that fail due to ordering
  • Set Rust version Nevermind
  • Figure out why this compiles locally but not in CI

Closes #150
Resolves #165
Related to #149

@spenserblack spenserblack added the semver-major This pull request bumps the major version label Sep 25, 2023
@spenserblack
Copy link
Owner Author

cc @Byron

As someone who has shown a lot of interest in gengo's performance, I'd love to hear your thoughts! I think there's definitely some room for performance improvements, and also probably some areas where the code can be simplified.

I got lazy and used rayon for parallelization -- converting an iterator into a parallel iterator was just too easy not to 😆

My benchmarks showed both performance gains and drops locally between this branch and main, but that's probably due to the fact that they're called on HEAD to HEAD^^.

@codecov
Copy link

codecov bot commented Sep 25, 2023

Codecov Report

Merging #191 (6e6f609) into main (e5afacc) will decrease coverage by 2.57%.
The diff coverage is 81.19%.

@@            Coverage Diff             @@
##             main     #191      +/-   ##
==========================================
- Coverage   85.68%   83.12%   -2.57%     
==========================================
  Files          14       16       +2     
  Lines         552      569      +17     
==========================================
  Hits          473      473              
- Misses         79       96      +17     
Flag Coverage Δ
83.12% <81.19%> (-2.57%) ⬇️
macOS-latest 82.92% <80.17%> (-2.72%) ⬇️
ubuntu-latest 78.14% <69.23%> (-4.14%) ⬇️
windows-latest 77.64% <80.17%> (-2.55%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
gengo-bin/src/cli.rs 89.28% <100.00%> (ø)
gengo/src/analysis/mod.rs 84.61% <100.00%> (-6.30%) ⬇️
gengo/src/builder.rs 100.00% <100.00%> (+11.53%) ⬆️
gengo/src/lib.rs 85.71% <100.00%> (-5.88%) ⬇️
gengo/src/file_source/git.rs 91.66% <91.66%> (ø)
gengo/src/file_source/mod.rs 0.00% <0.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Byron
Copy link
Contributor

Byron commented Sep 26, 2023

Thanks for reeling me in. I browsed through the changes quickly and the diff was interrupted by codecov every couple of lines, so I might not actually get what's happening here.

But from what I think I get, I have a few notes to share.

  • I see that the parallelization was generalized so that sources provided files, and parallelization is done through par_iter()
  • when the source is git, it's more performant if the input is chunked - so each thread should see a couple of (naturally sorted) paths in a row to have more efficient attribute access. If I were you, I'd compare the performance of the pre-rayon git (the "original") with the new one and not stop until the performance is equal at least.
  • when the source is filesystem, then you should benefit from parallel filesystem iteration. jwalk can do that and it's pretty good, but if you want to be even better, then the traversal algorithm of pdu would be the one to copy if license permits.

And that's already it. I also think that rayon is a good choice here even though I'd not use it - I think it's very complex and I have seen too many rayon related bugs (one, I think ;)) to feel 'fearless' around it anymore 😅.

@spenserblack
Copy link
Owner Author

Thanks for the feedback and insight!

@Byron
Copy link
Contributor

Byron commented Sep 26, 2023

You are welcome! I forgot to mention to definitely use Instruments or any other profiling tool at your disposal to see where time is spent. That usually leads to low-hanging optimizations.

I'd be interested in performance numbers when available by the way - in particular, does the new rayon-based parallelization end up being faster or slower than the one that was implemented before. The reason I think it could be faster is better work stealing. And the reason I think it could be slower is if attribute access performance is reduced due to too-much randomization of input paths (i.e. each thread should see a sorted list of input paths as is present in the index, needs chunking).

@spenserblack
Copy link
Owner Author

spenserblack commented Sep 26, 2023

Well, I added a benchmark to run on test/javascript since HEAD will be unreliable (naturally, benchmarks will take longer the bigger this project gets). test/javascript is more fixed and should be reliable between main and development branches.

To my surprise, Criterion's unit of time duration went from ms to µs 😆
So it's an improvement in my local dev environment, at least. I'll look into Instruments.

@spenserblack spenserblack marked this pull request as ready for review September 26, 2023 13:22
@spenserblack spenserblack merged commit e84c746 into main Sep 26, 2023
@spenserblack spenserblack deleted the chore/165/handlers-providing-iterator branch September 26, 2023 13:23
@spenserblack
Copy link
Owner Author

Nevermind #197

😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-major This pull request bumps the major version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

chore: Move git-related stuff to its own module
2 participants