Advanced text extraction on columns, tables, equations #38

gunnsth · 2019-01-01T14:40:16Z

To properly extract certain text from a PDF, it may be necessary to detect and group lines and to identify tables and equations. This can be done either before or after object extraction, depending on which is easier to implement and gives good results.

We also need to assemble a solid corpus for testing and to prototype an API. Tabular extraction may need a different approach than equations, and possibly a different API.

At this point we are collecting input so that we can define this issue better.

Ben-harder · 2020-08-26T17:53:30Z

Any update on this? Unfortunately, I've found that table extraction seems to be only about 50% correct at the moment.

gunnsth · 2020-08-26T18:01:01Z

@Ben-harder Can you post some cases where it falls short? And also the code you are using.

Ben-harder · 2020-08-26T18:14:55Z

Sure. We're using v3.9.0. We get the pageText object for each page, and I've iterated through pageText.Tables(), drawing each cell's bbox in blue and the union of those bboxes in red to represent the whole table. You can ignore the green outlines.

Here's an example of what we want with one of the tables it picked up perfectly:

And here are some examples of where it either picks up lists and thinks they're tables, or it misses some cells in an actual table:
1.

2.

3.

It picked up a few numbered lists as well.

Alttaf · 2020-08-26T18:37:53Z

+1 for this

gunnsth · 2020-08-26T18:53:33Z

@Ben-harder Can you share some PDFs that we can use for testing and include in our QA and automated test suites?

Ben-harder · 2020-08-26T19:33:38Z

Sure, I can give you the one from the images:
Speer_Permit.pdf
Speer_Permit_overlay.pdf

peterwilliams97 · 2020-08-28T00:16:43Z

Ben, I will investigate this.
I am working on a few versions of table extraction code that I have not submitted yet. They address most or all of the issues you raise, but they make other trade-offs, so I have been holding them back.

Some of the things I am working on are:

Grid line detection fixes a lot of cases
Sparse table detection can be tricky
Detecting tables without gridlines requires making some judgments

I will see if I can make a small commit that addresses your specific issues next week.
Have you been using any other PDF table extractors? If so, can you tell me which one does the best job on your files?

Ben-harder · 2020-08-28T01:44:09Z

Hi Peter, thank you, that sounds great!

And no, I haven't used any other PDF table extractors.

Ben-harder · 2020-08-31T17:54:22Z

So we actually have used AWS Textract, my bad. The results from it on the same document are attached. It's a JSON file; I just had to convert it so GitHub would let me upload it.

Speer_Permit 18WE0486.CP1_Blowdown vent.txt

peterwilliams97 · 2020-09-01T11:36:33Z

Thanks. That will give me a benchmark to compare against.

Elikrag · 2020-10-02T23:51:44Z

Following up on the examples @Ben-harder posted. Examples 1 and 2 are fixed as of v3.11.1, but the issue with 3 remains. Here are two more examples from the same PDF:

Along with Ben's 3rd example still not getting fully picked up:

Seems like table identification improved, but cell identification within a table didn't. Curious if there's any update on this? Thanks!

gunnsth · 2020-10-19T16:59:17Z

Table extractions have been improved in v3.13.0 and you should see much better results with your files.

anovik · 2024-12-23T06:40:24Z

In the current version, https://github.com/unidoc/unipdf/releases/tag/v3.65.0, detection of tables and lines is done by default. If you need to disable it for a particular file, use simple extraction, as in the example https://github.com/unidoc/unipdf-examples/blob/master/extract/pdf_simple_extraction.go.

In case of any problem, feel free to open a new issue.

gunnsth transferred this issue from unidoc/unidoc May 24, 2019

gunnsth added enhancement New feature or request extract feature New feature labels Jun 2, 2020

gunnsth mentioned this issue Sep 14, 2020

Prepare release of UniPDF v3.11.1 #410

Merged

gunnsth mentioned this issue Oct 19, 2020

Prepare release of UniPDF v3.13.0 #420

Merged

traitman mentioned this issue Jan 25, 2023

[BUG] text extraction for list not work as exptected #508

Closed

anovik closed this as completed Dec 23, 2024
