Change `Gengo` to take a generic `FileSource` #191

spenserblack · 2023-09-25T13:29:15Z

This changes the implementation of Gengo to take a generic FileSource type, moving git-specific
stuff to its own module. This breaks usage by moving the rev argument from Gengo::analyze to
Builder::new (there's a good chance that this will break again once other file sources are implemented).

To do

Fix snapshot tests that fail due to ordering
~~Set Rust version~~ Nevermind
Figure out why this compiles locally but not in CI

Closes #150
Resolves #165
Related to #149

Merge remote-tracking branch 'origin/main' into chore/165/handlers-providing-iterator

…oviding-iterator

spenserblack · 2023-09-25T20:24:59Z

cc @Byron

As someone who has shown a lot of interest in gengo's performance, I'd love to hear your thoughts! I think there's definitely some room for performance improvements, and also probably some areas where the code can be simplified.

I got lazy and used rayon for parallelization -- converting an iterator into a parallel iterator was just too easy not to 😆

My benchmarks showed both performance gains and drops locally between this branch and main, but that's probably due to the fact that they're called on HEAD to HEAD^^.

This reverts commit 4b98fd3.

codecov · 2023-09-25T20:31:27Z

Codecov Report

Merging #191 (6e6f609) into main (e5afacc) will decrease coverage by 2.57%.
The diff coverage is 81.19%.

@@            Coverage Diff             @@
##             main     #191      +/-   ##
==========================================
- Coverage   85.68%   83.12%   -2.57%     
==========================================
  Files          14       16       +2     
  Lines         552      569      +17     
==========================================
  Hits          473      473              
- Misses         79       96      +17

Flag	Coverage Δ
	`83.12% <81.19%> (-2.57%)`	⬇️
macOS-latest	`82.92% <80.17%> (-2.72%)`	⬇️
ubuntu-latest	`78.14% <69.23%> (-4.14%)`	⬇️
windows-latest	`77.64% <80.17%> (-2.55%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
gengo-bin/src/cli.rs	`89.28% <100.00%> (ø)`
gengo/src/analysis/mod.rs	`84.61% <100.00%> (-6.30%)`	⬇️
gengo/src/builder.rs	`100.00% <100.00%> (+11.53%)`	⬆️
gengo/src/lib.rs	`85.71% <100.00%> (-5.88%)`	⬇️
gengo/src/file_source/git.rs	`91.66% <91.66%> (ø)`
gengo/src/file_source/mod.rs	`0.00% <0.00%> (ø)`

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Byron · 2023-09-26T06:36:05Z

Thanks for reeling me in. I browsed through the changes quickly and the diff was interrupted by codecov every couple of lines, so I might not actually get what's happening here.

But from what I think I get, I have a few notes to share.

I see that the parallelization was generalized so that sources provided files, and parallelization is done through par_iter()
when the source is git, it's more performant if the input is chunked - so each thread should see a couple of (naturally sorted) paths in a row to have more efficient attribute access. If I were you, I'd compare the performance of the pre-rayon git (the "original") with the new one and not stop until the performance is equal at least.
when the source is filesystem, then you should benefit from parallel filesystem iteration. jwalk can do that and it's pretty good, but if you want to be even better, then the traversal algorithm of pdu would be the one to copy if license permits.

And that's already it. I also think that rayon is a good choice here even though I'd not use it - I think it's very complex and I have seen too many rayon related bugs (one, I think ;)) to feel 'fearless' around it anymore 😅.

spenserblack · 2023-09-26T12:22:54Z

Thanks for the feedback and insight!

Byron · 2023-09-26T12:42:00Z

You are welcome! I forgot to mention to definitely use Instruments or any other profiling tool at your disposal to see where time is spent. That usually leads to low-hanging optimizations.

I'd be interested in performance numbers when available by the way - in particular, does the new rayon-based parallelization end up being faster or slower than the one that was implemented before. The reason I think it could be faster is better work stealing. And the reason I think it could be slower is if attribute access performance is reduced due to too-much randomization of input paths (i.e. each thread should see a sorted list of input paths as is present in the index, needs chunking).

spenserblack · 2023-09-26T13:02:16Z

Well, I added a benchmark to run on test/javascript since HEAD will be unreliable (naturally, benchmarks will take longer the bigger this project gets). test/javascript is more fixed and should be reliable between main and development branches.

To my surprise, Criterion's unit of time duration went from ms to µs 😆
So it's an improvement in my local dev environment, at least. I'll look into Instruments.

spenserblack · 2023-09-27T11:32:01Z

Nevermind #197

😅

spenserblack added 18 commits September 10, 2023 10:43

Create Provider trait

439dac7

Reword

84e4b3d

TEMP

bd00cd4

Add Send/Sync constraints

073736d

Use git::Git

3aaebed

TEMP

bd0de50

Merge over dropped submodule support

b00b956

Merge remote-tracking branch 'origin/main' into chore/165/handlers-providing-iterator

Create git state builder

ca59cc3

Set up overrides for git file source

ea531ad

Clone state

0813a8e

Use shortcut

c875c70

Set up index state

48bf94e

TEMP

28a18b1

Build entries

68bf917

Create iterator

d074130

Clippy and format

51ab6a4

Update builder

423f3ae

Move rev arg to builder

19de50b

spenserblack added the semver-major This pull request bumps the major version label Sep 25, 2023

spenserblack added 6 commits September 25, 2023 12:55

Bump 0.x

2a2a1ad

Reimplement gengo with generic file source

1503fc1

Merge remote-tracking branch 'origin/main' into chore/165/handlers-pr…

ed62bae

…oviding-iterator

Install Rust 1.72.1 in CI

4b98fd3

Update tests

b246666

Force gix's parallel feature even with no default features

5a50eae

Revert "Install Rust 1.72.1 in CI"

3f5b4c7

This reverts commit 4b98fd3.

Order results by using a Vec

0690413

Merge over new benchmark

650c043

Fetch test branches for benchmarks

6e6f609

spenserblack marked this pull request as ready for review September 26, 2023 13:22

spenserblack merged commit e84c746 into main Sep 26, 2023

spenserblack deleted the chore/165/handlers-providing-iterator branch September 26, 2023 13:23

spenserblack mentioned this pull request Sep 26, 2023

Significant performance degredation #197

Closed

spenserblack mentioned this pull request Sep 27, 2023

Stop cloning the index state #200

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change `Gengo` to take a generic `FileSource` #191

Change `Gengo` to take a generic `FileSource` #191

spenserblack commented Sep 25, 2023 •

edited

Loading

spenserblack commented Sep 25, 2023

codecov bot commented Sep 25, 2023 •

edited

Loading

Byron commented Sep 26, 2023

spenserblack commented Sep 26, 2023

Byron commented Sep 26, 2023

spenserblack commented Sep 26, 2023 •

edited

Loading

spenserblack commented Sep 27, 2023

Change Gengo to take a generic FileSource #191

Change Gengo to take a generic FileSource #191

Conversation

spenserblack commented Sep 25, 2023 • edited Loading

To do

spenserblack commented Sep 25, 2023

codecov bot commented Sep 25, 2023 • edited Loading

Codecov Report

Byron commented Sep 26, 2023

spenserblack commented Sep 26, 2023

Byron commented Sep 26, 2023

spenserblack commented Sep 26, 2023 • edited Loading

spenserblack commented Sep 27, 2023

Change `Gengo` to take a generic `FileSource` #191

Change `Gengo` to take a generic `FileSource` #191

spenserblack commented Sep 25, 2023 •

edited

Loading

codecov bot commented Sep 25, 2023 •

edited

Loading

spenserblack commented Sep 26, 2023 •

edited

Loading