ExtractText2 #929

Merged
merged 7 commits into from
Jun 5, 2022

Conversation

pubpub-zz
Collaborator

New proposal for evaluation for the time being.

New proposal with a deeper analysis of font data and text positioning.
@pubpub-zz
Collaborator Author

New proposal.
@MartinThoma, can you run this proposal through the benchmark?

@MartinThoma
Member

I'll start it and will post the results this evening (might take 1-2h; I need to finish some other stuff)

@MartinThoma
Member

The average stayed the same. Most files improved, but two became worse, one of them drastically:

https://arxiv.org/pdf/1601.03642 : 0.9438654353562005 -> 0.95
https://arxiv.org/pdf/1602.06541 : 0.8978933061501869 -> 0.91
https://arxiv.org/pdf/1707.09725 : 0.9100581720093184 -> 0.94
https://arxiv.org/pdf/2201.00021 : 0.9499215589133845 -> 0.97
https://arxiv.org/pdf/2201.00022 : 0.9102201679631884 -> 0.93
https://arxiv.org/pdf/2201.00029 : 0.0 -> 0.0
https://arxiv.org/pdf/2201.00037 : 0.9155486607869612 -> 0.94
https://arxiv.org/pdf/2201.00069 : 0.8980679211032767 -> 0.91
https://arxiv.org/pdf/2201.00151 : 0.8859883219294902 -> 0.64 <------
https://arxiv.org/pdf/2201.00178 : 0.8927337030785306 -> 0.92
https://arxiv.org/pdf/2201.00200 : 0.9683510183687691 -> 0.98
https://arxiv.org/pdf/2201.00201 : 0.9747879942829919 -> 0.99
https://arxiv.org/pdf/2201.00214 : 0.8850769765492426 -> 0.81 <----
https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo-book : 0.7860457992901709 -> 0.86

@MartinThoma
Member

This is an excerpt from the file that became so much worse (left is the current PyPDF2==1.28.4 version, right is this PR's version):

image

@pubpub-zz
Collaborator Author

pubpub-zz commented May 30, 2022

New draft proposal in which bugs (which also affected the first proposal) are fixed. @MartinThoma, can you rerun the benchmark?

I will also have a look at #858 in order to get the best of both.

Includes:
* XObject processing
* choice between the encoding and ToUnicode fields
* partial compliance with Identity-H/V encoding (2-byte processing is still missing)

* legacy conversion reintroduced as the old function, for comparison
* debug extraction
* typing and tests
* more tests, and refactoring of the deprecation-warning ignore in the tests
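As an illustration of the "choice between encoding and ToUnicode fields" item, here is a minimal sketch of how such a choice might look. It is not the code from this PR: the helper name `decode_glyphs`, its parameters, and the rule "prefer the ToUnicode map, fall back to the font encoding" are assumptions made for the example. It handles single-byte codes only, in line with the note that 2-byte processing is still missing.

```python
from typing import Dict, Optional


def decode_glyphs(
    codes: bytes,
    to_unicode: Optional[Dict[int, str]] = None,
    encoding: str = "latin-1",
) -> str:
    """Decode single-byte character codes to text (hypothetical helper).

    Assumption: when a ToUnicode-style map is available it takes priority,
    and codes missing from it fall back to the font's byte encoding.
    """
    out = []
    for code in codes:  # iterating over bytes yields ints
        if to_unicode and code in to_unicode:
            out.append(to_unicode[code])
        else:
            out.append(bytes([code]).decode(encoding, errors="replace"))
    return "".join(out)


# A ToUnicode-style map may expand one code to several characters (ligatures).
cmap = {0x01: "A", 0x02: "fi"}
print(decode_glyphs(b"\x01\x02", cmap))  # Afi
print(decode_glyphs(b"abc"))             # abc
```

A dictionary lookup with a per-code fallback keeps the two sources independent, so a font with an incomplete map still produces text for the codes the map does not cover.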
@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Jun 1, 2022
@MartinThoma
Member

MartinThoma commented Jun 4, 2022

@pubpub-zz I would like to get the Charmap support into PyPDF2 soon and give you (plus some others who made very similar PRs before) full credit for your work. For this reason I would like to avoid merging #924.

I suggest the following:

  1. Improve Text Extraction #881 is the PR we merge into main next. Currently the CI is failing - I can take care of that if you want. Also, I need to check that the quality according to the benchmark stays roughly the same. I would add asabramo and VictorCarlquist as co-authored-by as they have done similar PRs in the past. Would that be ok for you?
  2. I close Pubpub zz extract text #924 - I just created that branch to show some minor mypy / style things I would change in Improve Text Extraction #881.
  3. We / I go through the following PRs to check if something is missing:

@pubpub-zz
Collaborator Author

@MartinThoma sorry to bother you, can you rerun the benchmark on this version?
I will have a look at the others.

@MartinThoma
Member

No problem - I'm happy that you're doing the heavy-lifting 😄

I've just started the benchmark run. I'll share the results tomorrow morning (takes ~20 minutes and I'll go to bed now 😄 )

@MartinThoma
Member

MartinThoma commented Jun 5, 2022

I get

Traceback (most recent call last):
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1331, in buildCharMap
    raise Exception("null width")
Exception: null width

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/moose/Github/py-pdf/benchmarks/benchmark.py", line 530, in <module>
    main(docs, libraries, add_text_extraction_quality=True)
  File "/home/moose/Github/py-pdf/benchmarks/benchmark.py", line 235, in main
    text = lib.text_extraction_function(data)
  File "/home/moose/Github/py-pdf/benchmarks/benchmark.py", line 140, in pypdf2_get_text
    text += page.extractText()
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1506, in extractText
    return self.extract_text(Tj_sep=Tj_sep, TJ_sep=TJ_sep)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1482, in extract_text
    return self._extract_text(self,self.pdf,space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1357, in _extract_text
    cmaps[f] = buildCharMap(f)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1344, in buildCharMap
    sp_width = m / cpt / 2
ZeroDivisionError: division by zero

for https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf - reader.pages[13]:

reader = PyPDF2.PdfFileReader("GeoTopo.pdf")
page = reader.pages[13]
page.extract_text()

@MartinThoma
Member

MartinThoma commented Jun 5, 2022

I've added the fallback

if cpt == 0:
    cpt = 1

With that fallback, your PR currently boosts the average from 86% to 96%!
edit: That means PyPDF2 has better text extraction than pdfminer.six and pdftotext 🎉
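The fallback can be sketched in isolation as follows. This is a hedged reconstruction, not the code in `_page.py`: the function name `estimate_space_width` is hypothetical, and it assumes that `m` in the traceback is the sum of the non-zero glyph widths and `cpt` their count, which the thread does not confirm.

```python
from typing import Iterable


def estimate_space_width(widths: Iterable[float]) -> float:
    """Estimate a space width as half the mean non-zero glyph width.

    Assumption: mirrors `sp_width = m / cpt / 2` from the traceback,
    with `m` as the width total and `cpt` as the number of widths used.
    """
    total = 0.0
    count = 0
    for width in widths:
        if width > 0:
            total += width
            count += 1
    if count == 0:
        # Same guard as the fallback above: avoid dividing by zero
        # when a font reports no usable widths.
        count = 1
    return total / count / 2


print(estimate_space_width([500, 600, 0]))  # 275.0
print(estimate_space_width([]))             # 0.0
```

With this guard a font without usable widths yields a space width of 0.0 instead of raising, so whether 0.0 is a sensible value for the later spacing logic is a separate question that the guard alone does not answer.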

Looking at the single files:

            "1601.03642": 0.9789762968052216, -> 99%
            "1602.06541": 0.9607310932031617, -> 98%
            "1707.09725": 0.9160059659313918, -> 94%
            "2201.00021": 0.92414829121734, -> 97%
            "2201.00022": 0.9581322751904328, -> 98%
            "2201.00029": 0.0,  -> 98% -- you managed to do it! You're so awesome! Thank you!
            "2201.00037": 0.9228385160911429, -> 94%
            "2201.00069": 0.9320819588347349, -> 96%
            "2201.00151": 0.8986238392139712, -> 93%
            "2201.00178": 0.9035859338326836, -> 93%
            "2201.00200": 0.9411056870547374, -> 97%
            "2201.00201": 0.9444251579563376, -> 98%
            "2201.00214": 0.9625399637918416, -> 97%
            "GeoTopo-book": 0.7924142197687146 -> 86%
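For reference, the quoted averages can be re-derived from the per-file numbers above. The snippet below is only a check of the arithmetic: it assumes the average is a plain unweighted mean over the 14 files, and because the new scores are given as rounded percentages, the new average is approximate.

```python
from statistics import mean

# Scores before this PR, as fractions (copied from the list above).
old_scores = {
    "1601.03642": 0.9789762968052216,
    "1602.06541": 0.9607310932031617,
    "1707.09725": 0.9160059659313918,
    "2201.00021": 0.92414829121734,
    "2201.00022": 0.9581322751904328,
    "2201.00029": 0.0,
    "2201.00037": 0.9228385160911429,
    "2201.00069": 0.9320819588347349,
    "2201.00151": 0.8986238392139712,
    "2201.00178": 0.9035859338326836,
    "2201.00200": 0.9411056870547374,
    "2201.00201": 0.9444251579563376,
    "2201.00214": 0.9625399637918416,
    "GeoTopo-book": 0.7924142197687146,
}

# Scores with this PR, as rounded percentages, in the same file order.
new_percent = [99, 98, 94, 97, 98, 98, 94, 96, 93, 93, 97, 98, 97, 86]

old_avg = mean(list(old_scores.values())) * 100
new_avg = mean(new_percent)
print(f"{old_avg:.0f}% -> {new_avg:.0f}%")  # 86% -> 96%
```

Most of the gain comes from 2201.00029, which moves from 0.0 to 98%.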

@MartinThoma
Member

MartinThoma commented Jun 5, 2022

@pubpub-zz I love you 🤩 🤗 This is a crazy improvement! Now I really want it to be merged 😄

Please let me know how you would like me to continue. Should I merge pubpub-zz:ExtractText2 into py-pdf:pubpub-zz-extractText and then that one into main?

@pubpub-zz
Collaborator Author

The PR you've referenced will surely improve some translations.
What I would propose:
a) I clean up the flake8 / mypy issues to confirm that we pass all tests.
b) You merge ExtractText2 into pubpub-zz-extractText and then into main.

In my current branch the legacy function is still present as extract_oldtext, for people to revert to if they prefer.
I will carry on this work with a new PR from the latest main to introduce the other changes.

@MartinThoma
Member

Sounds good! Then I'll wait for your ok to get started :-)

@pubpub-zz
Collaborator Author

I think you should be able to merge this release

@MartinThoma
Member

You mean I can merge this PR now? (just want to be sure :-) )

@pubpub-zz
Collaborator Author

Go :)

@MartinThoma MartinThoma merged commit d957d4d into py-pdf:pubpub-zz-extractText Jun 5, 2022
@pubpub-zz pubpub-zz deleted the ExtractText2 branch June 10, 2022 19:57