Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add page range support to PDF converters. #3965

Merged
merged 14 commits into from
Jan 30, 2023

Conversation

danielbichuetti
Copy link
Contributor

Related Issues

Proposed Changes:

Add start_page and end_page parameter to allow PDF converters (PDFToTextConverter and PDFToTextOCRConverter) to process only user defined range.
Especially useful in extremely large documents, even more when OCR is needed.

How did you test it?

Locally
CI

Notes for the reviewer

Both pdf2image and xpdfreader start page count at 1

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added tests that demonstrate the correct behavior of the change
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@danielbichuetti danielbichuetti requested a review from a team as a code owner January 26, 2023 20:25
@danielbichuetti danielbichuetti requested review from silvanocerza and removed request for a team January 26, 2023 20:25
@sjrl sjrl added type:feature New feature or request topic:file_converter labels Jan 27, 2023
@sjrl
Copy link
Contributor

sjrl commented Jan 27, 2023

@danielbichuetti could we add a new unit test to make sure this behavior works as expected? It would also be good to make sure to test this behavior in a pipeline so we can understand how nodes further down the pipeline (such as the PreProcessor and DocumentStore) handle the empty documents. In particular, I'd want to make sure that the PreProcessor combined with the option add_page_number=True saves the page number starting at start_page.

@danielbichuetti
Copy link
Contributor Author

Ok, let's look over this.

But the behavior hasn't been changed, each page is identified via \f in the next nodes. If code is not trimming them, it should be ok.

@sjrl
Copy link
Contributor

sjrl commented Jan 27, 2023

Ok, let's look over this.

But the behavior hasn't been changed, each page is identified via \f in the next nodes. If code is not trimming them, it should be ok.

That's a great point, but always worth double checking and testing :)

@danielbichuetti
Copy link
Contributor Author

I'll add one to check the results after preprocessor runs.

@danielbichuetti
Copy link
Contributor Author

@sjrl I've looked over other Converters, e.g. AzureConverter, it allows custom parameters on the convert method. But since run (used by Pipeline) is set on the base class, it won't be used in such case.

May I just override that base method? And refactor the base function to avoid code duplication?

@sjrl
Copy link
Contributor

sjrl commented Jan 27, 2023

@sjrl I've looked over other Converters, e.g. AzureConverter, it allows custom parameters on the convert method. But since run (used by Pipeline) is set on the base class, it won't be used in such case.

May I just override that base method? And refactor the base function to avoid code duplication?

Hmm, I think usually we would recommend adding the new parameters to the __init__ method so we don't need to modify the BaseConverter.run method, but I could see that you may want to change the start and end page numbers at runtime depending on what pdf is being converted.

I think we'll need to ask @julian-risch and @ZanSara here for advice on changing the BaseConverter.run method. And if we do proceed with that refactoring we will probably want to open a separate PR for this change. What do you think? @danielbichuetti @julian-risch @ZanSara

@danielbichuetti
Copy link
Contributor Author

danielbichuetti commented Jan 27, 2023

I think this refactoring should be on another PR. There are other converters where it should be applied. It could be rolled out at once, and we will avoid the mix of scope.

But my earlier thought was to implement the parameters in a PDFToTextConverter run method. The BaseConverter ligature cleaning could be moved to a private method, and the PDF converter could implement its own run method, which calls the base class ligature cleaning one.

@sjrl
Copy link
Contributor

sjrl commented Jan 27, 2023

But my earlier thought was to implement the parameters in a PDFToTextConverter run method. The BaseConverter ligature cleaning could be moved to a private method, and the PDF converter could implement its own run method, which calls the base class ligature cleaning one.

Definitely, a great thing to discuss in a separate PR or even a proposal depending on the scope of the refactor.

@danielbichuetti
Copy link
Contributor Author

danielbichuetti commented Jan 27, 2023

I have added a test that checks for the correct page numbers after PreProcessor handling has been completed. It is being handled appropriately by them.

Copy link
Contributor

@sjrl sjrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks for the addition.

@sjrl
Copy link
Contributor

sjrl commented Jan 27, 2023

The failing Weaviate test should be fixed with this PR. It's not related to your changes.

@ZanSara
Copy link
Contributor

ZanSara commented Jan 27, 2023

@sjrl @danielbichuetti the Weaviate fix is merged, if you update the branch the failure should go away

@danielbichuetti
Copy link
Contributor Author

Maybe, it should be considered to install Tesseract and Poppler on windows to allow some the OCR related tests to run.

Poppler for Windows
winget install Tesseract-OCR

@sjrl
Copy link
Contributor

sjrl commented Jan 30, 2023

Maybe, it should be considered to install Tesseract and Poppler on windows to allow some the OCR related tests to run.

Poppler for Windows winget install Tesseract-OCR

I'll go ahead and merge this PR now since it is ready, but let's open a new issue for adding support for Windows tests. I opened a new issue for this here #4001

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PDFToTextConverter/PDFToTextOCRConverter: get specific page(s) from document file
3 participants