feat: Add page range support to PDF converters. #3965
@danielbichuetti could we add a new unit test to make sure this behavior works as expected? It would also be good to test this behavior in a pipeline so we can understand how nodes further down the pipeline (such as the
OK, let's look into this. The behavior hasn't changed, though: each page is still identified via \f in the downstream nodes. As long as the code isn't trimming them, it should be OK.
That's a great point, but always worth double checking and testing :)
I'll add one to check the results after the PreProcessor runs.
@sjrl I've looked over other Converters, e.g. May I just override that base method? And refactor the base function to avoid code duplication?
Hmm, I think usually we would recommend adding the new parameters to the I think we'll need to ask @julian-risch and @ZanSara here for advice on changing the
I think this refactoring should go in another PR. There are other converters where it should be applied. It could be rolled out to all of them at once, and we would avoid mixing scopes. But my earlier thought was to implement the parameters in a PDFToTextConverter
Definitely, a great thing to discuss in a separate PR or even a proposal, depending on the scope of the refactor.
I have added a test that checks for the correct page numbers after the PreProcessor has run. The page numbers are handled correctly.
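To illustrate the point about \f above, here is a minimal sketch of how a downstream node could derive a page number from form-feed separators. The helper name page_number_at is hypothetical and is not taken from this PR or from the library; it only assumes that pages are separated by a single \f character, as described in the thread.

```python
def page_number_at(text: str, offset: int) -> int:
    """Return the 1-based page number of the character at `offset`.

    Assumes pages are separated by form feed characters ("\f").
    Hypothetical helper for illustration only.
    """
    # Every form feed before the offset marks the end of one earlier page.
    return text.count("\f", 0, offset) + 1


sample = "first page\fsecond page\fthird page"
print(page_number_at(sample, sample.index("second")))  # 2
```

If an earlier step stripped or trimmed the \f characters, this count would come out too low, which is why testing after the PreProcessor is worthwhile.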
This looks great! Thanks for the addition.
The failing Weaviate test should be fixed with this PR. It's not related to your changes.
@sjrl @danielbichuetti the Weaviate fix is merged; if you update the branch, the failure should go away.
* Poppler and Tesseract are not installed on CI
Maybe we should consider installing Tesseract and Poppler on Windows to allow some of the OCR-related tests to run. Poppler for Windows
I'll go ahead and merge this PR now since it is ready, but let's open a new issue for adding support for Windows tests. I opened a new issue for this here: #4001
Related Issues
Proposed Changes:
Add start_page and end_page parameters to allow the PDF converters (PDFToTextConverter and PDFToTextOCRConverter) to process only a user-defined page range.
Especially useful for extremely large documents, even more so when OCR is needed.
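A rough sketch of what 1-based page range selection could look like, for readers unfamiliar with the idea. The function select_page_range and the choice to pad skipped leading pages with empty strings are assumptions made for illustration; the thread does not state how the converters implement the range internally. Only the parameter names start_page and end_page come from this PR.

```python
from typing import List, Optional


def select_page_range(
    pages: List[str],
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> str:
    """Join the pages in [start_page, end_page] (1-based, inclusive) with "\f".

    Hypothetical helper, not the PR's implementation. Skipped leading pages
    are kept as empty pages (an assumption) so that counting form feeds
    downstream still yields the original page numbers.
    """
    first = 1 if start_page is None else start_page
    last = len(pages) if end_page is None else end_page
    if first < 1 or last < first:
        raise ValueError("expected 1 <= start_page <= end_page")

    selected = pages[first - 1:last]          # convert to 0-based slicing
    padded = [""] * (first - 1) + selected    # placeholders for skipped pages
    return "\f".join(padded)


pages = ["p1", "p2", "p3", "p4"]
print(repr(select_page_range(pages, start_page=2, end_page=3)))  # '\x0cp2\x0cp3'
```

Padding is one design option among several; dropping the skipped pages entirely would produce shorter output but would shift page numbers for any node that counts \f separators.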
How did you test it?
Locally
CI
Notes for the reviewer
Both pdf2image and xpdfreader start page count at 1.
Checklist
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.