AttributeError: 'PDFStream' object has no attribute 'replace' #210

Closed
panoptikum opened this issue Nov 12, 2018 · 19 comments

Comments

@panoptikum

Hello everybody,

At the moment I'm parsing tons of PDFs, but pdfminer.six fails on one of them. Any suggestions? I can open the PDF, but maybe pdfminer.six can't handle it properly. All the other PDFs originate from the same author/organization...

Traceback (most recent call last):
  File "/home/felix/anaconda3/bin/pdf2txt.py", line 136, in <module>
    if __name__ == '__main__': sys.exit(main())
  File "/home/felix/anaconda3/bin/pdf2txt.py", line 131, in main
    outfp = extract_text(**vars(A))
  File "/home/felix/anaconda3/bin/pdf2txt.py", line 63, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/high_level.py", line 82, in extract_text_to_fp
    interpreter.process_page(page)    
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 852, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 862, in render_contents
    self.init_resources(resources)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 362, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 212, in get_font
    font = self.get_font(None, subspec)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 203, in get_font
    font = PDFCIDFont(self, spec)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/pdffont.py", line 658, in __init__
    self.cmap = CMapDB.get_cmap(name)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/cmapdb.py", line 259, in get_cmap
    data = klass._load_data(name)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/cmapdb.py", line 233, in _load_data
    name = name.replace("\0", "")
AttributeError: 'PDFStream' object has no attribute 'replace'

@tataganesh
Member

Is it possible for you to post the problematic PDF?

@panoptikum
Author

panoptikum commented Nov 12, 2018

Yes.

Problematic PDF

OR

Problematic PDF 2

@marcelhekking

I am encountering the same error

@skewed91

Same error

@bimbocant

I'm encountering the same error when trying to parse

@euarthuurr

I am encountering the same error when I run on the Mac (OS 10.10.5), but this error doesn't happen when I'm using Windows 10.

@euarthuurr

If I open the file Problematic PDF and save it as a PDF on Mac, pdfminer works correctly.
Problematic PDF works

@bimbocant

I tried saving it as PDF on Windows 10. Still not working.

@bimbocant

#228 works fine.

@pforero

pforero commented Mar 11, 2019

Hi! I am still getting the same issue. I have found that in the problematic PDFs the specs seem to be mixed up: the 'Encoding' spec is a PDFStream object, and the 'ToUnicode' spec is 'Identity-H'. I have also tried #228, but it does not resolve it; the final 'name' value is 'CM20', which does not map to Identity-H/V.

I am not sure why some PDFs (a reasonable number of them, close to 10% of those I have) mix up ToUnicode and Encoding. I hope you can offer some further insight into the cause.

EDIT: I tried to work around it by swapping the Encoding with the ToUnicode. The file then parses, but all characters just show as (cid:75)(cid:101)(cid:121)....

EDIT2: I continued down the rabbit hole. The difference between the PDFs that raise the error and those that do not seems to be in the CIDSystemInfo Registry. If the Registry is Adobe, all is good. If it is Actuate, the problems arise. Besides the ToUnicode and Encoding differences (which trigger the error in this thread), the DescendantFonts in Actuate PDFs do not have a CIDToGIDMap but instead have a DW (?) field. It seems strange that pdfminer can only handle PDFs built with Adobe and not with Actuate.
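
The distinction described above can be sketched with plain dictionaries. This is only an illustration under assumptions: StreamLike and encoding_is_named are hypothetical names, and the dictionaries imitate the two spec shapes reported here rather than anything read from a real PDF or from pdfminer.

```python
class StreamLike:
    """Hypothetical stand-in for a PDF stream object (not pdfminer's class)."""


def encoding_is_named(spec):
    """Return True when spec['Encoding'] is a plain name (str or bytes).

    A stream-like value, as reported for the failing PDFs, returns False.
    """
    return isinstance(spec.get("Encoding"), (str, bytes))


# Shape reported to work: Encoding is a predefined CMap name.
named_spec = {"Encoding": "Identity-H"}

# Shape reported to fail: Encoding is a stream, ToUnicode holds the name.
stream_spec = {"Encoding": StreamLike(), "ToUnicode": "Identity-H"}

for label, spec in (("named", named_spec), ("stream", stream_spec)):
    print(label, encoding_is_named(spec))
```

A check like this only tells the two cases apart; it does not decode the embedded stream.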

@tangxfei

Modify cmapdb.py (\Lib\site-packages\pdfminer\cmapdb.py) at line 233:

    def _load_data(klass, name):
        name = name.replace("\0", "")  # change to: name = str(name).replace("\0", "")

With this change the problem can be solved.
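
A minimal sketch of what the str() change does, using a hypothetical stand-in class rather than pdfminer's real PDFStream. The call no longer raises, but the resulting name is only the object's text representation, not a real CMap name, which may be why some readers still see no usable text. The helper clean_cmap_name is an assumption for illustration and is not part of pdfminer.

```python
class FakeStream:
    """Hypothetical stand-in for a stream object; like the PDFStream in the
    traceback, it has no .replace() method."""

    def __repr__(self):
        return "<FakeStream>"


def clean_cmap_name(name):
    """Hypothetical helper: strip NUL characters from a CMap name.

    Returns None when name is not text, so the caller can decide how to
    handle an embedded stream instead of looking up a bogus name.
    """
    if isinstance(name, bytes):
        name = name.decode("latin-1")
    if not isinstance(name, str):
        return None
    return name.replace("\0", "")


stream = FakeStream()

# The original line fails because the object is not a string.
try:
    stream.replace("\0", "")
    raised = False
except AttributeError:
    raised = True

# The str() workaround avoids the exception but yields the repr text.
workaround_name = str(stream).replace("\0", "")
```

Returning None for non-text input is one possible design; it keeps the failure visible to the caller instead of hiding it behind a made-up name.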

@benzkji

benzkji commented Jun 20, 2019

I am encountering the same error on (probably) one of around 1000 files. I will investigate and report whether re-saving it (I am on Linux) helps...

@tangxfei mind a PR ;-)

@igormp
Contributor

igormp commented Jun 28, 2019

@benzkji there's already a PR at #228 that fixes this, but the project seems kinda dead

@vinayak-mehta
Contributor

@0xabu Looks like you have merge access since you merged #230. Please look into #228 and merge it :)

@vinayak-mehta
Contributor

@0xabu Do you also have access to push a new release to PyPI? cc: @ganeshtata

@0xabu
Contributor

0xabu commented Jul 22, 2019

@vinayak-mehta Someone (I think @goulu) merged some bugfix PRs and then added me to the org back in early 2017, but like you I just depend on pdfminer, so I'm not comfortable (not to mention I don't have time) taking on responsibility for it. I don't know anything about the PyPI package.

@pinnnip

pinnnip commented Jul 23, 2019

I have tried all the methods mentioned in all the related threads, but it is still not working on most of my PDFs. Are there any more updates?

@pinnnip

pinnnip commented Jul 23, 2019

Some of the output contains only the string '\x0c', with no content from the PDF file.

@pietermarsman
Member

Fixed by #283
