feat: Optimize performance for large-scale harvest/enrichment #83
Conversation
- Implement folder grouping for extraction files to mitigate performance issues caused by large numbers of files in a single directory.
- Previously, when enriching a collection of 20k items, performance degraded significantly as the extraction folder grew, because file operations slow down in very large directories.
- Files are now grouped into subdirectories in batches of 100, significantly improving performance for large datasets (see the sketch below).
- ⚠️ BREAKING CHANGE: Existing extractions must be recreated to be compatible with the new directory structure.
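The diff itself isn't shown here, so as a rough illustration of the batching scheme described above, here is a minimal sketch in Python. This is not the PR's actual implementation; the project's language may differ, and the function name `extraction_path`, the zero-padded directory names, and the `.json` extension are all assumptions.

```python
import os

BATCH_SIZE = 100  # files per subdirectory, per the PR description

def extraction_path(base_dir: str, index: int) -> str:
    """Return the file path for the item at `index`, grouping extraction
    files into subdirectories that each hold BATCH_SIZE items."""
    # Items 0-99 land in "000000", items 100-199 in "000100", and so on.
    group_start = (index // BATCH_SIZE) * BATCH_SIZE
    group_dir = os.path.join(base_dir, f"{group_start:06d}")
    os.makedirs(group_dir, exist_ok=True)
    return os.path.join(group_dir, f"{index:06d}.json")
```

Capping each directory at around 100 entries avoids the slowdown many filesystems exhibit when listing or creating files in directories with tens of thousands of entries, which matches the 20k-item degradation described above.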
Code quality score: Uh oh! The code quality got worse for this PR! Better take a look!! 🚨
Force-pushed from cfd6c65 to 141838e
Force-pushed from 1ab8a7a to 4c055d4
Thanks heaps for doing this bro! It looks like it is making a massive difference to the performance of the enrichments :)
LGTM, I just had one comment.
Force-pushed from 6c90279 to 9756bc9
- loop through folders and files
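The commit message above implies the read side now walks the batch subdirectories instead of one flat folder. A hedged sketch of such a loop, under the same assumed layout as the earlier example (the name `iter_extraction_files` is hypothetical):

```python
import os

def iter_extraction_files(base_dir: str):
    """Yield every extraction file across all batch subdirectories, in order."""
    for folder in sorted(os.listdir(base_dir)):
        folder_path = os.path.join(base_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # ignore any stray top-level files
        for name in sorted(os.listdir(folder_path)):
            yield os.path.join(folder_path, name)
```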
Force-pushed from 9756bc9 to 371837a