AttributeError: 'PDFStream' object has no attribute 'replace' #210

panoptikum · 2018-11-12T11:46:36Z

Hello everybody,

At the moment I'm parsing tons of PDFs, but pdfminer.six fails on one of them. Any suggestions? I can open the PDF, but maybe pdfminer.six can't handle it properly. All the other PDFs originate from the same author/organization...

Traceback (most recent call last):
  File "/home/felix/anaconda3/bin/pdf2txt.py", line 136, in <module>
    if __name__ == '__main__': sys.exit(main())
  File "/home/felix/anaconda3/bin/pdf2txt.py", line 131, in main
    outfp = extract_text(**vars(A))
  File "/home/felix/anaconda3/bin/pdf2txt.py", line 63, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/high_level.py", line 82, in extract_text_to_fp
    interpreter.process_page(page)    
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 852, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 862, in render_contents
    self.init_resources(resources)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 362, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 212, in get_font
    font = self.get_font(None, subspec)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 203, in get_font
    font = PDFCIDFont(self, spec)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/pdffont.py", line 658, in __init__
    self.cmap = CMapDB.get_cmap(name)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/cmapdb.py", line 259, in get_cmap
    data = klass._load_data(name)
  File "/home/felix/anaconda3/lib/python3.6/site-packages/pdfminer/cmapdb.py", line 233, in _load_data
    name = name.replace("\0", "")
AttributeError: 'PDFStream' object has no attribute 'replace'


tataganesh · 2018-11-12T14:26:27Z

Is it possible for you to post the problematic PDF?

panoptikum · 2018-11-12T15:07:45Z

Yes.

Problematic PDF

OR

Problematic PDF 2

marcelhekking · 2018-11-28T08:27:57Z

I am encountering the same error

skewed91 · 2019-01-30T21:52:20Z

Same error

bimbocant · 2019-02-22T10:01:53Z

I'm encountering the same error when trying to parse

euarthuurr · 2019-02-22T21:06:24Z

I am encountering the same error when I run on the Mac (OS 10.10.5), but this error doesn't happen when I'm using Windows 10.

euarthuurr · 2019-02-22T23:27:05Z

If I open the file Problematic PDF and save it as PDF on the Mac, pdfminer works correctly.
Problematic PDF works

bimbocant · 2019-02-25T14:19:56Z

I tried saving it as PDF on Windows 10. Still not working.

bimbocant · 2019-02-27T19:47:29Z

#228 works fine.

pforero · 2019-03-11T15:14:47Z

Hi! I am still getting the same issue. I have found that in the problematic PDFs the specs seem to be mixed up: the 'Encoding' spec is a PDFStream object, and the 'ToUnicode' spec is 'Identity-H'. I have also tried #228, but it does not resolve it; the final 'name' value is 'CM20', which does not map to Identity-H/V.

I am not sure why some PDFs (a reasonable number of them, close to 10% of those I have) mix up ToUnicode and Encoding. I hope you can provide some further insight into the cause.

EDIT: I tried to work around it by swapping the Encoding with the ToUnicode. It parses this way, but all characters just show up as (cid:75)(cid:101)(cid:121)....

EDIT2: I continued down the rabbit hole. The difference between the PDFs where I get an error and those where I don't seems to be in the CIDSystemInfo Registry. If the Registry is Adobe, all is good. If it is Actuate, the problems arise. Besides the ToUnicode and Encoding differences (which are what trigger the error in this thread), in DescendantFonts the Actuate PDFs do not have a CIDToGIDMap but instead have a DW (?) field. It seems strange that it can only handle PDFs built with Adobe and not with Actuate.
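
The checks described in this comment can be sketched as a small diagnostic. This is only an illustration on hand-written data, not pdfminer code: `StreamStandIn` and `describe_font_spec` are hypothetical names, and the two dictionaries merely imitate the font specs described above (a name as Encoding with an Adobe registry, versus a stream as Encoding with an Actuate registry and a DW field).

```python
class StreamStandIn:
    """Hypothetical stand-in for a stream object such as pdfminer's PDFStream."""

    def __init__(self, attrs):
        self.attrs = attrs


def describe_font_spec(spec):
    """Summarise the font-spec fields that differ between the two kinds of PDF."""
    encoding = spec.get("Encoding")
    # Use the first descendant font, or an empty dict if there is none.
    descendants = spec.get("DescendantFonts") or [{}]
    descendant = descendants[0]
    registry = descendant.get("CIDSystemInfo", {}).get("Registry")
    return {
        "encoding_is_name": isinstance(encoding, str),
        "registry": registry,
        "has_cid_to_gid_map": "CIDToGIDMap" in descendant,
        "has_dw": "DW" in descendant,
    }


# Assumed shape of a font spec that parses without problems.
adobe_like = {
    "Encoding": "Identity-H",
    "DescendantFonts": [
        {"CIDSystemInfo": {"Registry": "Adobe"}, "CIDToGIDMap": "Identity"}
    ],
}

# Assumed shape of a font spec that triggers the AttributeError.
actuate_like = {
    "Encoding": StreamStandIn({}),
    "ToUnicode": "Identity-H",
    "DescendantFonts": [{"CIDSystemInfo": {"Registry": "Actuate"}, "DW": 1000}],
}

for label, spec in (("adobe_like", adobe_like), ("actuate_like", actuate_like)):
    print(label, describe_font_spec(spec))
```

A spec whose `encoding_is_name` is false is the kind that reaches `name.replace` with a non-string value.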

tangxfei · 2019-05-26T16:15:24Z

Modify "cmapdb.py" (\Lib\site-packages\pdfminer\cmapdb.py), line 233:
def _load_data(klass, name):
    name = name.replace("\0", "")  # change to: name = str(name).replace("\0", "")

The problem can be solved.
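
One caveat: `str(name)` on a stream object most likely yields its printable representation rather than a real character-map name, so the change avoids the AttributeError without necessarily selecting the right map. A slightly more careful approach is sketched below, under the assumption that the stream's dictionary may carry a `CMapName` entry. `StreamStandIn` and `cmap_name_from_encoding` are hypothetical names used for illustration; this is not pdfminer's API and not the code of any of the pull requests mentioned in this thread.

```python
class StreamStandIn:
    """Hypothetical stand-in for a stream object with an attribute dictionary."""

    def __init__(self, attrs):
        self.attrs = attrs


def cmap_name_from_encoding(encoding):
    """Return a cleaned character-map name, or None if none can be determined."""
    # Plain names: strip NUL characters, as the original line 233 does.
    if isinstance(encoding, str):
        return encoding.replace("\0", "")
    # Stream-like objects: look for a CMapName entry (an assumption).
    attrs = getattr(encoding, "attrs", None)
    if isinstance(attrs, dict) and isinstance(attrs.get("CMapName"), str):
        return attrs["CMapName"].replace("\0", "")
    # Nothing usable: let the caller decide on a fallback.
    return None


print(cmap_name_from_encoding("Identity-H\0"))
print(cmap_name_from_encoding(StreamStandIn({"CMapName": "Identity-H"})))
print(cmap_name_from_encoding(StreamStandIn({})))
```

Returning `None` instead of guessing a default keeps the decision about a fallback map with the caller.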

benzkji · 2019-06-20T18:40:07Z

I am encountering the same error on (probably) one of around 1000 files. I will investigate and report whether re-saving it (I am on Linux) helps...

@tangxfei mind a PR ;-)

igormp · 2019-06-28T14:17:57Z

@benzkji there's already a PR at #228 that fixes this, but the project seems kinda dead

vinayak-mehta · 2019-07-09T20:44:15Z

@0xabu Looks like you have merge access since you merged #230. Please look into #228 and merge it :)

vinayak-mehta · 2019-07-09T20:46:00Z

@0xabu Do you also have access to push a new release to PyPI? cc: @ganeshtata


0xabu · 2019-07-22T17:38:59Z

@vinayak-mehta Someone (I think @goulu) merged some bugfix PRs and then added me to the org back in early 2017, but like you I just depend on pdfminer, so I'm not comfortable taking on responsibility for it (not to mention that I don't have the time). I don't know anything about the PyPI package.

pinnnip · 2019-07-23T04:29:51Z

I have tried all the methods mentioned in all the related threads, it is still not working on most of my PDFs. Are there any more updates?

pinnnip · 2019-07-23T04:37:07Z

Some of the output contains only the string '\x0c', with no content from the PDF file.

pietermarsman · 2019-10-15T14:29:26Z

Fixed by #283

panoptikum closed this as completed Nov 12, 2018

panoptikum reopened this Nov 12, 2018

jtkese mentioned this issue Feb 25, 2019

Handle PDFStream as character map name in PDFCIDFont #228

Closed

igormp mentioned this issue Jun 28, 2019

AttributeError from PDFMiner atlanhq/camelot#348

Closed

vinayak-mehta mentioned this issue Jul 6, 2019

AttributeError from PDFMiner camelot-dev/camelot#23

Closed

pietermarsman mentioned this issue Jul 14, 2019

Pdfstream as cmap #264

Merged

herrdiener added a commit to herrdiener/pdfminer3 that referenced this issue Jul 18, 2019

Fix failure on Actuate-generated PDFs

0442ea2

See pdfminer/pdfminer.six#210

herrdiener mentioned this issue Jul 18, 2019

Fix failure on Actuate-generated PDFs gwk/pdfminer3#6

Open

fakabbir mentioned this issue Aug 10, 2019

Pdfstream as cmap #283

Merged

pietermarsman added the type: bug label Oct 13, 2019

pietermarsman closed this as completed Oct 15, 2019
