Type Error during extracting pages in some pdfs #720

Closed
psrubing opened this issue Feb 22, 2022 · 11 comments

@psrubing

Hello,

I've encountered a bug while extracting pages with the extract_pages() function from the pdfminer.high_level module. It only happens with some PDFs.
The screenshot below shows the error:

screenshot

The following PDF triggers the bug:
pdf_bug.pdf

Environment:
Python - 3.7.11
pdfminer.six - 20201018

@pietermarsman
Member

pietermarsman commented Feb 22, 2022

Can replicate:

$ PYTHONPATH=. python tools/pdf2txt.py ~/Downloads/pdf_bug.pdf 
Traceback (most recent call last):
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdffont.py", line 920, in char_width
    return cast(Dict[int, float], self.widths)[cid] * self.hscale
KeyError: 67

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdffont.py", line 924, in char_width
    return str_widths[self.to_unichr(cid)] * self.hscale
KeyError: 'a'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 313, in <module>
    sys.exit(main())
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 307, in main
    outfp = extract_text(**vars(parsed_args))
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 62, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/high_level.py", line 121, in extract_text_to_fp
    interpreter.process_page(page)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 991, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 1010, in render_contents
    self.execute(list_value(streams))
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 1036, in execute
    func(*args)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 896, in do_TJ
    self.device.render_string(
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfdevice.py", line 133, in render_string
    textstate.linematrix = self.render_string_horizontal(
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfdevice.py", line 173, in render_string_horizontal
    x += self.render_char(
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/converter.py", line 206, in render_char
    textwidth = font.char_width(cid)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdffont.py", line 926, in char_width
    return self.default_width * self.hscale
TypeError: unsupported operand type(s) for *: 'PDFObjRef' and 'float'
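
Reading the traceback from the top, the width lookup falls through three steps: the CID key (67) is missing, the character key ('a') is missing, and the final fallback multiplies a default width that is still an indirect reference. The sketch below only mimics that control flow with stand-in classes. `IndirectRef` and `SketchFont` are hypothetical names, not pdfminer.six code.

```python
class IndirectRef:
    """Hypothetical stand-in for an unresolved indirect object reference."""

    def __init__(self, objid):
        self.objid = objid


class SketchFont:
    """Hypothetical font that mimics the lookup order seen in the traceback."""

    def __init__(self, widths, cid_to_char, default_width, hscale=0.001):
        self.widths = widths            # keyed by CID or by character
        self.cid_to_char = cid_to_char  # stand-in for the font's Unicode map
        self.default_width = default_width
        self.hscale = hscale

    def char_width(self, cid):
        try:
            # Step 1: look the width up by character ID.
            return self.widths[cid] * self.hscale
        except KeyError:
            try:
                # Step 2: look the width up by the decoded character.
                return self.widths[self.cid_to_char[cid]] * self.hscale
            except KeyError:
                # Step 3: fall back to the default width. This fails if the
                # default is still a reference rather than a number.
                return self.default_width * self.hscale


if __name__ == "__main__":
    font = SketchFont({}, {67: "a"}, default_width=IndirectRef(12))
    try:
        font.char_width(67)
    except TypeError as exc:
        print("TypeError:", exc)
```

With a plain number as the default width, the same call returns a scaled float, so the failure depends only on the reference reaching the multiplication unresolved.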

@pietermarsman
Member

Probably the solution is to call resolve1 when getting the default width.
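
A minimal sketch of that idea, assuming the default width may arrive either as a plain number or as an indirect reference. `IndirectRef`, `resolve_fully`, and `scaled_default_width` are hypothetical stand-ins. In pdfminer.six the role of `resolve_fully` is played by `resolve1` from `pdfminer.pdftypes`, and the actual patch may look different.

```python
class IndirectRef:
    """Hypothetical stand-in for an indirect reference into an object table."""

    def __init__(self, table, objid):
        self.table = table
        self.objid = objid

    def resolve(self):
        # Look up the referenced object; it may itself be another reference.
        return self.table[self.objid]


def resolve_fully(value):
    """Follow references until a direct (non-reference) value is reached."""
    while isinstance(value, IndirectRef):
        value = value.resolve()
    return value


def scaled_default_width(default_width, hscale):
    """Resolve the default width first, then apply the horizontal scale."""
    return resolve_fully(default_width) * hscale


if __name__ == "__main__":
    table = {12: 500}
    print(scaled_default_width(IndirectRef(table, 12), 0.001))
    print(scaled_default_width(500, 0.001))
```

Resolving once, where the value is read from the font dictionary, keeps the arithmetic code free of reference checks.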

@psrubing
Author

Hi, thanks for the response, but I don't understand your comment. As far as I can tell from the documentation, extract_pages() doesn't take any parameter related to resolve1:

image

Or did I miss something?
Best regards

@pietermarsman
Member

I mean that to fix this issue we have to change pdfminer.six itself, using resolve1(). This is a bug in the current code, not something you can configure from your side.

@psrubing
Author

Okay, I understand now :) Do you know approximately when a release with this fix will be available?

@pietermarsman
Member

Nobody is working on it as far as I know.

Do you have time to work on this?

@psrubing
Author

Unfortunately I don't :/ I have to work on other projects, but if something changes I will post an update and could look at this bug.

gosiafilipek added a commit to gosiafilipek/pdfminer.six that referenced this issue Jun 22, 2022
resolve1 when getting the default width.
pietermarsman added a commit that referenced this issue Jun 25, 2022
* Issue #720

resolve1 when getting the default width.

* Add CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
@gosiafilipek
Contributor

image
I found another file that generates a similar error.
I have already found a solution, so I will create a pull request.

The following PDF triggers this bug:
pdf_bug2.pdf

gosiafilipek added a commit to gosiafilipek/pdfminer.six that referenced this issue Jun 28, 2022
self.attrs['MediaBox'] contains params of type PDFObjRef instead of int.
I used resolve1 on all params in self.attrs['MediaBox'] to eliminate the problem.
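
The same pattern applied to a box of four coordinates can be sketched as follows. This is an illustration under assumptions, not the code from the referenced commit: `IndirectRef`, `resolve_fully`, and `resolve_box` are hypothetical names, and in pdfminer.six the resolving step would be `resolve1`.

```python
class IndirectRef:
    """Hypothetical stand-in for an indirect reference into an object table."""

    def __init__(self, table, objid):
        self.table = table
        self.objid = objid

    def resolve(self):
        return self.table[self.objid]


def resolve_fully(value):
    """Follow references until a direct (non-reference) value is reached."""
    while isinstance(value, IndirectRef):
        value = value.resolve()
    return value


def resolve_box(box):
    """Return (x0, y0, x1, y1) as floats, resolving each entry on its own."""
    entries = list(resolve_fully(box))  # the list itself may be a reference
    if len(entries) != 4:
        raise ValueError("a box needs exactly four entries")
    return tuple(float(resolve_fully(entry)) for entry in entries)


if __name__ == "__main__":
    table = {5: 612, 6: 792}
    media_box = [0, 0, IndirectRef(table, 5), IndirectRef(table, 6)]
    print(resolve_box(media_box))
```

Each entry is resolved separately because a PDF may store some coordinates directly and others as references within the same array.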
@datatalking

@pietermarsman @psrubing This was an issue for me in a past project, and they ended up using an OCR solution. I was going to offer to debug this, but I noticed that @gosiafilipek has already created a PR with a solution?

If those are done, I have time in the next 2 months to contribute. I didn't see a 'good first issue' label (or whatever it's called), so I looked back and found these issues that I could start with. Does anyone have requests or recommendations on where I should start?

#470
#154
#499
#497

@pietermarsman
Member

Hi @datatalking,

Thanks for reaching out, and for wanting to help! You can get in touch on gitter.im, in a private or group chat. We can sync there about what to work on.

I'll try and see if I can create a good-first-issue label.

pietermarsman added the "component: converter" label (Related to any PDFLayoutAnalyzer) and removed the "component:converter" label on Aug 8, 2022
@pietermarsman
Member

Fixed by #772
