Advanced text extraction on columns, tables, equations #38
Comments
Any update on this? I've found that, unfortunately, table extraction seems to be about 50% correct at the moment.
@Ben-harder Can you post some cases where it falls short, along with the code you are using?
+1 for this
@Ben-harder Can you share some PDFs that we can use for testing and include in our QA and automated test suites?
Sure, I can give you the one from the images.
Ben, I will investigate this. Some of the things I am working on are:
I will see if I can make a small commit that addresses your specific issues next week.
Hi Peter, thank you, that sounds great! And no, I haven't used any other PDF table extractors.
So we actually have used AWS Textract, my bad. The results from it on the same document are attached. It's a JSON file; I just had to convert it so GitHub would let me upload it.
Thanks. That will give me a benchmark to compare against.
Following up on the examples @Ben-harder posted. Examples 1 and 2 are fixed as of v3.11.1, but the issue with example 3 remains. Here are two more examples from the same PDF, along with Ben's 3rd example, which is still not getting fully picked up. It seems that table identification improved, but cell identification within a table didn't. Is there any update on this? Thanks!
Table extraction has been improved in v3.13.0, and you should see much better results with your files.
In the current version, https://github.com/unidoc/unipdf/releases/tag/v3.65.0, detection of tables and lines is done by default. If you need to disable it for a particular file, use simple extraction as in the example at https://github.com/unidoc/unipdf-examples/blob/master/extract/pdf_simple_extraction.go. If you run into any problems, feel free to open a new issue.
To properly extract certain text from a PDF, it may be necessary to detect and group lines and to identify tables and equations. This may be done either after or before the extraction of objects, depending on which is easier to implement and gives good results.
We also need to assemble a solid corpus for testing and to prototype an API. Tabular extraction may need a different approach than equations, and possibly a different API.
At this point we are collecting input so that we can define this issue better.
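To make the "detect and group lines" step more concrete for discussion, here is a minimal sketch of one possible post-extraction approach: cluster text fragments by baseline and then order each cluster left to right. This is not how unipdf does it. The `Fragment` type, the `GroupIntoLines` function, and the tolerance value are all hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// Fragment is a hypothetical stand-in for one extracted piece of text and its
// position on the page. It is NOT a unipdf type.
type Fragment struct {
	Text string
	X, Y float64 // left edge and baseline in PDF points (origin at bottom-left)
}

// GroupIntoLines clusters fragments whose baseline lies within tol of the
// first fragment of the current line. Lines are returned top to bottom, and
// the fragments inside each line are joined left to right.
func GroupIntoLines(frags []Fragment, tol float64) []string {
	// Work on a copy so the caller's slice is not reordered.
	sorted := append([]Fragment(nil), frags...)
	// A larger Y is higher on the page, so sort by descending Y.
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var groups [][]Fragment
	for _, f := range sorted {
		n := len(groups)
		if n > 0 && math.Abs(groups[n-1][0].Y-f.Y) <= tol {
			groups[n-1] = append(groups[n-1], f)
			continue
		}
		groups = append(groups, []Fragment{f})
	}

	lines := make([]string, 0, len(groups))
	for _, g := range groups {
		// Restore reading order within the line.
		sort.SliceStable(g, func(i, j int) bool { return g[i].X < g[j].X })
		words := make([]string, len(g))
		for i, f := range g {
			words[i] = f.Text
		}
		lines = append(lines, strings.Join(words, " "))
	}
	return lines
}

func main() {
	// Made-up coordinates resembling a two-row, two-column table.
	frags := []Fragment{
		{"Total", 72, 700.2},
		{"Item", 72, 720},
		{"Price", 300, 719.6},
		{"12.50", 300, 700},
	}
	for _, line := range GroupIntoLines(frags, 2.0) {
		fmt.Println(line)
	}
}
```

A fixed tolerance is the simplest choice, but it would likely break on superscripts, mixed font sizes, and rotated text, which is part of why tables and equations may need different handling.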