feat: Optimize performance for large-scale harvest/enrichment #83

motizuki · 2024-08-26T20:00:16Z

Implement folder grouping for extraction files to mitigate performance issues caused by large numbers of files in a single directory.
Previously, when enriching a collection of 20k items, performance degraded significantly as the extraction folder grew in size due to slow file operations.
Files are now grouped into subdirectories in batches of 100, significantly improving performance for large datasets.
⚠️ BREAKING CHANGE: Existing extractions must be recreated to be compatible with the new directory structure.
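
For readers who want to see the grouping concretely, here is a minimal sketch of how a page number could map to a subfolder in batches of 100. The commits in this PR mention a `folder_number` method, but its real implementation is not shown in this conversation. The body below, the `FILES_PER_FOLDER` constant, the `extraction_file_path` helper, and the zero-padded file naming are all assumptions made for illustration.

```ruby
# Hypothetical constant: the PR description states batches of 100.
FILES_PER_FOLDER = 100

# Maps a 1-based page number to the 1-based subfolder that would hold it.
# Pages 1..100 go to folder 1, pages 101..200 to folder 2, and so on.
# (Assumed formula; the real folder_number method may differ.)
def folder_number(page, per_folder = FILES_PER_FOLDER)
  ((page - 1) / per_folder) + 1
end

# Builds the path of a page's extraction file inside its subfolder.
# The file naming scheme here is illustrative only.
def extraction_file_path(base_dir, page)
  File.join(base_dir, folder_number(page).to_s, format('%09d.json', page))
end

puts extraction_file_path('extractions/job-1', 250)
```

With this layout, a 20k-item extraction is spread over 200 subfolders of at most 100 files each, and the location of any page can be computed directly rather than found by listing one large directory.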

github-actions · 2024-08-26T20:00:53Z

Code quality score

Uh oh! The code quality got worse for this PR! Better take a look!! 🚨

| | Ruby file count | Similarity score (flay) | ABC complexity (flog) | Code smells (reek) | TOTALS |
|---|---|---|---|---|---|
| base | 104 | 7.71 | 5.28 | 16.9 | 29.89 |
| this branch | 104 | 7.71 | 5.28 | 17.35 | 30.34 |
| difference | 0 | 0.0 | 0.0 | ⚠️ 0.45 | 0.45 |

richardmatthewsdev

Thanks heaps for doing this bro! It looks like it is making a massive difference to the performance of the enrichments :)

app/sidekiq/split_worker.rb

app/sidekiq/text_extraction_worker.rb

paul-mesnilgrente

LGTM, I just had one comment.

app/supplejack/extraction/enrichment_extraction.rb

motizuki added 2 commits August 27, 2024 11:54

refactor: Remove sorting to find correct file in the file system

2c5025e

refactor: Update TextExtractions and Split to use folder structure

141838e

motizuki force-pushed the gm/extraction-improv branch from cfd6c65 to 141838e Compare August 27, 2024 03:37

test: Fix specs and rubocop

4c055d4

motizuki force-pushed the gm/extraction-improv branch from 1ab8a7a to 4c055d4 Compare August 27, 2024 20:38

richardmatthewsdev approved these changes Aug 28, 2024

app/sidekiq/split_worker.rb (outdated, resolved)

app/sidekiq/text_extraction_worker.rb (outdated, resolved)

refactor: move folder_number method to parent class

d207640

paul-mesnilgrente approved these changes Aug 28, 2024

app/supplejack/extraction/enrichment_extraction.rb (outdated, resolved)

refactor: Move folder_logic to reusable method

340e42b

motizuki force-pushed the gm/extraction-improv branch 14 times, most recently from 6c90279 to 9756bc9 Compare August 30, 2024 02:49

fix: file extraction

371837a

- loop through folders and files
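
As a rough illustration of what looping through folders and files can look like with the grouped layout, here is a sketch that visits each numbered subfolder and then each file inside it. The method name `each_extraction_file` is hypothetical and is not taken from this PR. Sorting the folder names numerically is an assumption made here so that the iteration order is deterministic.

```ruby
# Yields the path of every extraction file under base_dir, visiting the
# numbered subfolders in numeric order (1, 2, ..., 10 rather than 1, 10, 2).
# Hypothetical helper; not the method used in this PR.
def each_extraction_file(base_dir)
  return enum_for(:each_extraction_file, base_dir) unless block_given?

  # Keep only directories whose names are plain integers.
  folders = Dir.children(base_dir).select do |name|
    name.match?(/\A\d+\z/) && File.directory?(File.join(base_dir, name))
  end

  folders.sort_by(&:to_i).each do |folder|
    folder_path = File.join(base_dir, folder)
    # Each subfolder holds a bounded number of files, so listing stays cheap.
    Dir.children(folder_path).sort.each do |file|
      yield File.join(folder_path, file)
    end
  end
end
```

Called without a block, the method returns an enumerator, so a caller can write `each_extraction_file(dir).first(10)` without walking every subfolder.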

motizuki force-pushed the gm/extraction-improv branch from 9756bc9 to 371837a Compare August 30, 2024 03:01

Merge branch 'main' into gm/extraction-improv

d6b5ac2

motizuki merged commit 29c70aa into main Aug 30, 2024
7 of 8 checks passed

motizuki deleted the gm/extraction-improv branch August 30, 2024 03:52
