feat: Add page range support to PDF converters. #3965

danielbichuetti · 2023-01-26T20:25:36Z

Related Issues

fixes PDFToTextConverter/PDFToTextOCRConverter: get specific page(s) from document file #3964

Proposed Changes:

Add start_page and end_page parameters so that the PDF converters (PDFToTextConverter and PDFToTextOCRConverter) can process only a user-defined page range.
This is especially useful for extremely large documents, even more so when OCR is needed.

How did you test it?

Locally
CI

Notes for the reviewer

Both pdf2image and xpdfreader start counting pages at 1.
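
As a rough illustration of the 1-based convention, here is a minimal sketch of how a page range could be turned into a pdftotext call. The helper name build_pdftotext_command is hypothetical and is not taken from this PR; it only relies on pdftotext's -f (first page) and -l (last page) options. pdf2image exposes the same idea through the first_page and last_page arguments of convert_from_path.

```python
from typing import List, Optional


def build_pdftotext_command(
    pdf_path: str,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> List[str]:
    """Build a pdftotext argument list for a 1-based, inclusive page range.

    Hypothetical helper for illustration only; not the code from this PR.
    """
    if start_page is not None and start_page < 1:
        raise ValueError("start_page is 1-based and must be >= 1")
    if start_page is not None and end_page is not None and end_page < start_page:
        raise ValueError("end_page must not be smaller than start_page")

    command = ["pdftotext"]
    if start_page is not None:
        command += ["-f", str(start_page)]  # -f: first page to convert
    if end_page is not None:
        command += ["-l", str(end_page)]  # -l: last page to convert
    # "-" as the output file sends the extracted text to stdout.
    command += [pdf_path, "-"]
    return command


print(build_pdftotext_command("report.pdf", start_page=3, end_page=5))
# ['pdftotext', '-f', '3', '-l', '5', 'report.pdf', '-']
```

The resulting list could be passed to subprocess.run; the sketch stops short of that so it does not depend on pdftotext being installed.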

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

haystack/nodes/file_converter/pdf.py

sjrl · 2023-01-27T12:48:27Z

@danielbichuetti could we add a new unit test to make sure this behavior works as expected? It would also be good to make sure to test this behavior in a pipeline so we can understand how nodes further down the pipeline (such as the PreProcessor and DocumentStore) handle the empty documents. In particular, I'd want to make sure that the PreProcessor combined with the option add_page_number=True saves the page number starting at start_page.

danielbichuetti · 2023-01-27T12:50:30Z

Ok, let's look over this.

But the behavior hasn't changed: each page is identified via \f in the downstream nodes. As long as the code doesn't trim them, it should be OK.
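
To make the \f argument concrete, here is a minimal sketch under one stated assumption: the converter keeps an empty placeholder for every skipped page, so the number of form feeds before a piece of text still reflects its original page. The functions join_pages and page_number_at are hypothetical names for illustration and are not the PreProcessor's actual implementation.

```python
from typing import List


def join_pages(pages: List[str], start_page: int = 1) -> str:
    """Join page texts with form feeds.

    Assumption: pages before start_page are represented as empty strings,
    so the form-feed count still matches the original page positions.
    """
    padding = [""] * (start_page - 1)
    return "\f".join(padding + pages)


def page_number_at(text: str, char_offset: int) -> int:
    """Return the 1-based page number of the character at char_offset."""
    # Every form feed before the offset marks the end of one earlier page.
    return text.count("\f", 0, char_offset) + 1


text = join_pages(["page three", "page four"], start_page=3)
offset = text.index("page four")
print(page_number_at(text, offset))  # 4
```

If the leading form feeds were stripped somewhere along the way, the same lookup would report page 2 instead of 4, which is the case the requested test is meant to rule out.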

sjrl · 2023-01-27T12:51:44Z

> Ok, let's look over this.
>
> But the behavior hasn't changed: each page is identified via \f in the downstream nodes. As long as the code doesn't trim them, it should be OK.

That's a great point, but always worth double checking and testing :)

danielbichuetti · 2023-01-27T13:47:52Z

I'll add one to check the results after preprocessor runs.

danielbichuetti · 2023-01-27T14:04:53Z

@sjrl I've looked over other converters, e.g. AzureConverter, which allows custom parameters on the convert method. But since run (used by Pipeline) is defined on the base class, those parameters won't be used in that case.

May I just override that base method? And refactor the base function to avoid code duplication?

sjrl · 2023-01-27T14:23:28Z

> @sjrl I've looked over other converters, e.g. AzureConverter, which allows custom parameters on the convert method. But since run (used by Pipeline) is defined on the base class, those parameters won't be used in that case.
>
> May I just override that base method? And refactor the base function to avoid code duplication?

Hmm, I think usually we would recommend adding the new parameters to the __init__ method so we don't need to modify the BaseConverter.run method, but I could see that you may want to change the start and end page numbers at runtime depending on what pdf is being converted.

I think we'll need to ask @julian-risch and @ZanSara here for advice on changing the BaseConverter.run method. And if we do proceed with that refactoring we will probably want to open a separate PR for this change. What do you think? @danielbichuetti @julian-risch @ZanSara

danielbichuetti · 2023-01-27T14:28:58Z

I think this refactoring should be in another PR. There are other converters where it should be applied. It could be rolled out to all of them at once, and we would avoid mixing scopes.

But my earlier thought was to implement the parameters in a PDFToTextConverter run method. The BaseConverter ligature cleaning could be moved to a private method, and the PDF converter could implement its own run method, which calls the base class ligature cleaning one.
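
A minimal sketch of that idea, using hypothetical stand-in classes rather than the real Haystack BaseConverter or PDFToTextConverter: the ligature cleaning lives in a private method on the base class, and the PDF converter defines its own run that accepts the page range and then reuses that method. The ligature table and the fake page contents are assumptions made only so the example runs on its own.

```python
from typing import List, Optional


class BaseConverterSketch:
    """Hypothetical stand-in for a base converter (not the Haystack class)."""

    # Small illustrative subset of ligatures: fi, fl, ff.
    _LIGATURES = {"\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff"}

    def convert(self, file_path: str) -> str:
        raise NotImplementedError

    def _clean_ligatures(self, text: str) -> str:
        # Shared post-processing step, extracted so subclasses can reuse it.
        for ligature, replacement in self._LIGATURES.items():
            text = text.replace(ligature, replacement)
        return text

    def run(self, file_paths: List[str]) -> List[str]:
        return [self._clean_ligatures(self.convert(path)) for path in file_paths]


class PDFConverterSketch(BaseConverterSketch):
    """Hypothetical PDF converter whose run accepts a page range."""

    def convert(
        self,
        file_path: str,
        start_page: Optional[int] = None,
        end_page: Optional[int] = None,
    ) -> str:
        # Fake page contents stand in for real PDF extraction.
        pages = ["\ufb01rst page", "second page", "\ufb01nal page"]
        first_index = (start_page or 1) - 1  # pages are 1-based, lists 0-based
        return "\f".join(pages[first_index:end_page])

    def run(
        self,
        file_paths: List[str],
        start_page: Optional[int] = None,
        end_page: Optional[int] = None,
    ) -> List[str]:
        # Own run method: forwards the range, then calls the base class cleaning.
        return [
            self._clean_ligatures(self.convert(path, start_page, end_page))
            for path in file_paths
        ]


result = PDFConverterSketch().run(["example.pdf"], start_page=2, end_page=3)
print(result[0].split("\f"))  # ['second page', 'final page']
```

The trade-off is that the subclass run no longer shares a signature with the base run, which is part of why a wider refactoring across converters would need its own discussion.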

sjrl · 2023-01-27T14:34:27Z

> But my earlier thought was to implement the parameters in a PDFToTextConverter run method. The BaseConverter ligature cleaning could be moved to a private method, and the PDF converter could implement its own run method, which calls the base class ligature cleaning one.

Definitely, a great thing to discuss in a separate PR or even a proposal depending on the scope of the refactor.

danielbichuetti · 2023-01-27T15:13:21Z

I have added a test that checks for the correct page numbers after the PreProcessor has run. The page numbers are handled correctly.

sjrl

This looks great! Thanks for the addition.

sjrl · 2023-01-27T15:47:59Z

The failing Weaviate test should be fixed with this PR. It's not related to your changes.

ZanSara · 2023-01-27T17:15:09Z

@sjrl @danielbichuetti the Weaviate fix is merged, if you update the branch the failure should go away

*Poppler and Tesseract not installed on CI

danielbichuetti · 2023-01-27T19:15:37Z

Maybe we should consider installing Tesseract and Poppler on Windows so that some of the OCR-related tests can run.

Poppler for Windows
winget install Tesseract-OCR

sjrl · 2023-01-30T13:08:59Z

> Maybe we should consider installing Tesseract and Poppler on Windows so that some of the OCR-related tests can run.
>
> Poppler for Windows
> winget install Tesseract-OCR

I'll go ahead and merge this PR now since it is ready, but let's open a new issue for adding support for Windows tests. I opened a new issue for this here #4001

danielbichuetti and others added 2 commits January 26, 2023 16:42

feat: add start and eng page to PDF converters

477fa48

Merge branch 'deepset-ai:main' into pdfconverter_pagerange

5a6577d

danielbichuetti requested a review from a team as a code owner January 26, 2023 20:25

danielbichuetti requested review from silvanocerza and removed request for a team January 26, 2023 20:25

docs: add missing docstrings

13b6d8c