Update pdf2parquet to Docling v2 #756

dolfim-ibm · 2024-10-30T06:12:14Z

Why are these changes needed?

Docling v2 adds the following:

Process more input formats: PDF, DOCX, PPTX, HTML, Markdown, AsciiDoc
Faster PDF backend
Improvements in the generated document
New DoclingDocument
Additional export format to plain text

In progress

upgrade dependencies and adapt code
add a safeguard for downloading model weights from multiple processes
implement a flushing mechanism for creating batches of files
update the doc_chunk transform
propagate parameters to kfp
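
The safeguard for the model-weight download could look roughly like the sketch below. It is not the code from this PR (the commits mention a "multilock"); it assumes a plain standard-library lock file, and the names `file_lock`, `ensure_models`, `.lock` and `.complete` are hypothetical. The idea is that only one process runs the download while the others wait, then find the completion marker and skip it.

```python
import os
import tempfile
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def file_lock(lock_path: Path, poll_seconds: float = 0.05, timeout: float = 60.0):
    """Cross-process lock based on exclusive creation of a lock file."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes the call fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_seconds)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def ensure_models(cache_dir: Path, download) -> bool:
    """Run download(cache_dir) at most once; return True if this call ran it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    marker = cache_dir / ".complete"  # hypothetical "download finished" marker
    with file_lock(cache_dir / ".lock"):
        if marker.exists():
            return False
        download(cache_dir)
        marker.touch()
        return True


# Usage: the second call finds the marker and skips the download.
calls = []
cache = Path(tempfile.mkdtemp())
first = ensure_models(cache, calls.append)
second = ensure_models(cache, calls.append)
```

A lock file like this can be left behind if a process crashes while holding it, so a production version would need stale-lock handling or an OS-level lock.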

Related issue number (if any).

Expose the document hash of the input document. Refs #605 ([Bug] pdf2parquet must calculate hash and size on the file)
Use the new, faster PDF backend. Refs #573 ([Bug] improve performance of pdf2parquet)
Safeguard the download of model weights across multiple processes. Refs #667 ([Bug] pdf2parquet ray version erroring out when downloading models for the very first time)
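
As an illustration of the first item, hash and size can both be derived from the raw bytes of the input file. The sketch below assumes SHA-256 and the key names `document_hash` and `size`; neither the algorithm nor the column names are stated in this PR.

```python
import hashlib


def describe_document(raw: bytes) -> dict:
    """Return a content hash and byte size for one input file."""
    return {
        # Assumption: SHA-256 over the unmodified file content.
        "document_hash": hashlib.sha256(raw).hexdigest(),
        "size": len(raw),
    }


# Usage: identical bytes always produce the same hash.
info_a = describe_document(b"%PDF-1.7 example")
info_b = describe_document(b"%PDF-1.7 example")
info_c = describe_document(b"%PDF-1.7 other")
```

Hashing the input bytes rather than the converted output keeps the value stable across converter versions.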

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

transforms/language/doc_chunk/python/pyproject.toml

daw3rd

Do any of your tests exercise self.buffer and flush_binary()?
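
A minimal sketch of what such a test could check, assuming a simplified stand-in rather than the real transform: `BufferedTransform` is hypothetical, and its method signatures are reduced compared with the toolkit's actual `transform_binary` and `flush_binary`. The point is that a full batch is emitted as soon as `batch_size` rows accumulate, and that leftover rows only come out of the flush call.

```python
from typing import Any


class BufferedTransform:
    """Hypothetical stand-in for a transform that emits rows in batches."""

    def __init__(self, batch_size: int):
        # Assumption: batch_size <= 0 means "emit one table per input file".
        self.batch_size = batch_size
        self.buffer: list[dict[str, Any]] = []

    def transform_binary(self, file_name: str, byte_array: bytes) -> list[list[dict[str, Any]]]:
        """Turn one input file into a row and return any completed batches."""
        row = {"filename": file_name, "size": len(byte_array)}
        if self.batch_size <= 0:
            return [[row]]
        self.buffer.append(row)
        if len(self.buffer) < self.batch_size:
            return []
        batch, self.buffer = self.buffer, []
        return [batch]

    def flush_binary(self) -> list[list[dict[str, Any]]]:
        """Emit the rows left in the buffer after the last input file."""
        if not self.buffer:
            return []
        batch, self.buffer = self.buffer, []
        return [batch]


# Usage: three files with batch_size=2 give one full batch plus one flushed row.
t = BufferedTransform(batch_size=2)
out_1 = t.transform_binary("a.pdf", b"x")
out_2 = t.transform_binary("b.pdf", b"yy")
out_3 = t.transform_binary("c.pdf", b"z")
rest = t.flush_binary()
```

A test along these lines would fail if the flush path were never called, because the row for the third file would be silently dropped.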

transforms/language/doc_chunk/python/src/doc_chunk_transform.py

daw3rd · 2024-10-30T18:41:03Z

transforms/language/pdf2parquet/python/requirements.txt

@@ -1,6 +1,6 @@
 data-prep-toolkit==0.2.2.dev1


Should this be >= 0.2.2.dev1? dev1 may eventually go away.

Here I rely on @daw3rd's and @touma-I's expertise. Let me know what the correct value is.

@touma-I I defer to you on this one.

@dolfim-ibm @daw3rd This will be reset during the release process or whenever we push to PyPI. For now, I wouldn't worry about it. You're good.

transforms/language/pdf2parquet/python/src/pdf2parquet_transform.py

daw3rd

LGTM modulo @touma-I's call on the >= versioning.

dolfim-ibm added 11 commits October 29, 2024 14:16

update to docling v2 and expose new parameters

f90f134

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

update to docling v2

1436480

new parameters and input formats
faster backend
revalidated the test results

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

add lock

8c22f0d

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

add batch_size

d55e6bd

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

update parameter in README

261230c

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

fix multilock with default parameters

e396e16

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

use multilock with fix

5095c1b

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

propagate new param to kfp

f62e6de

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

update new models download in Dockerfile

7e5ea90

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

update doc_chunk with new docling v2

e929903

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

update to 2.3.1 with initialize_pipeline

b4eb978

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

touma-I requested a review from daw3rd October 30, 2024 18:10

Merge remote-tracking branch 'origin/dev' into docling-v2

622ea4c

dolfim-ibm marked this pull request as ready for review October 30, 2024 18:27

touma-I self-requested a review October 30, 2024 18:32

daw3rd reviewed Oct 30, 2024

transforms/language/doc_chunk/python/pyproject.toml

remove debug log

8395a25

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

daw3rd requested changes Oct 30, 2024

dolfim-ibm added 3 commits October 31, 2024 13:12

improve parsing of metadata

4c693f9

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

add test case for batch_size

26b429a

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

notify users about the deprecated argument

269d732

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

daw3rd approved these changes Oct 31, 2024

touma-I merged commit 5877e4d into dev Oct 31, 2024
14 checks passed

This was referenced Nov 1, 2024

[Bug] pdf2parquet must calculate hash and size on the file #605

Open

[Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667

Open

[Bug] improve performance of pdf2parquet #573

Open

shahrokhDaijavad mentioned this pull request Nov 14, 2024

Rename the "Intro" notebooks to call out specific functionality it supports (PDF to Embedings) #782

Open
