feat: Optimize performance for large-scale harvest/enrichment #83

Merged
merged 8 commits into from
Aug 30, 2024

Conversation

motizuki
Contributor

  • Implement folder grouping for extraction files to mitigate performance issues caused by large numbers of files in a single directory.

  • Previously, when enriching a collection of 20k items, performance degraded significantly as the extraction folder grew, because file operations on a single directory holding that many files became slow.

  • Files are now grouped into subdirectories in batches of 100, significantly improving performance for large datasets.

  • ⚠️ BREAKING CHANGE: Existing extractions must be recreated to be compatible with the new directory structure.
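
For readers unfamiliar with the approach, here is a minimal sketch of how grouping in batches of 100 could work. It is not the code from this PR: the helper names (`folder_number`, `grouped_path`, `write_extraction`, `extraction_files`), the 1-based numbering, and the `.json` file naming are all assumptions made for illustration. Only the batch size of 100 comes from the description above.

```ruby
require "fileutils"

# Batch size taken from the PR description; every name below is hypothetical.
BATCH_SIZE = 100

# Maps a 1-based file number to its 1-based folder number:
# files 1..100 -> folder 1, files 101..200 -> folder 2, and so on.
def folder_number(file_number, batch_size = BATCH_SIZE)
  ((file_number - 1) / batch_size) + 1
end

# Builds the grouped path, e.g. <root>/3/250.json for file 250.
def grouped_path(root, file_number)
  File.join(root, folder_number(file_number).to_s, "#{file_number}.json")
end

# Writes one extraction file, creating its batch folder on demand.
def write_extraction(root, file_number, body)
  path = grouped_path(root, file_number)
  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, body)
  path
end

# Lists every file by looping through folders, then files, in numeric order.
# Sorting by to_i avoids the lexicographic trap where "10" sorts before "2".
def extraction_files(root)
  folders = Dir.children(root).select { |name| File.directory?(File.join(root, name)) }
  folders.sort_by(&:to_i).flat_map do |folder|
    Dir.children(File.join(root, folder))
       .sort_by(&:to_i)
       .map { |file| File.join(root, folder, file) }
  end
end
```

Under this assumed scheme, a 20k-item extraction ends up as 200 folders of at most 100 files each, rather than one directory with 20k entries. Any reader of the extraction has to walk two levels instead of one, which is why extractions written with a flat layout would need to be recreated.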


github-actions bot commented Aug 26, 2024

Code quality score

Uh oh! The code quality got worse for this PR! Better take a look!! 🚨

| | Ruby file count | Similarity score (flay) | ABC complexity (flog) | Code smells (reek) | TOTALS |
|---|---|---|---|---|---|
| base | 104 | 7.71 | 5.28 | 16.9 | 29.89 |
| this branch | 104 | 7.71 | 5.28 | 17.35 | 30.34 |
| difference | 0 | 0.0 | 0.0 | ⚠️ 0.45 | 0.45 |

@motizuki force-pushed the gm/extraction-improv branch from cfd6c65 to 141838e, August 27, 2024 03:37
@motizuki force-pushed the gm/extraction-improv branch from 1ab8a7a to 4c055d4, August 27, 2024 20:38
Contributor

@richardmatthewsdev left a comment

Thanks heaps for doing this bro! It looks like it is making a massive difference to the performance of the enrichments :)

Review thread on app/sidekiq/split_worker.rb (outdated, resolved)
Review thread on app/sidekiq/text_extraction_worker.rb (outdated, resolved)
Contributor

@paul-mesnilgrente left a comment

LGTM, I just had one comment.

Review thread on app/supplejack/extraction/enrichment_extraction.rb (outdated, resolved)
@motizuki force-pushed the gm/extraction-improv branch 14 times, most recently from 6c90279 to 9756bc9, August 30, 2024 02:49
- loop through folders and files
@motizuki force-pushed the gm/extraction-improv branch from 9756bc9 to 371837a, August 30, 2024 03:01
@motizuki merged commit 29c70aa into main, Aug 30, 2024
7 of 8 checks passed
@motizuki deleted the gm/extraction-improv branch, August 30, 2024 03:52
3 participants