AttributeError: 'PDFStream' object has no attribute 'replace' #210
Comments
Is it possible for you to post the problematic PDF? |
Yes. OR |
I am encountering the same error |
Same error |
I'm encountering the same error when trying to parse |
I am encountering the same error when I run on the Mac (OS 10.10.5), but this error doesn't happen when I'm using Windows 10. |
If I open the problematic PDF and re-save it as a PDF on the Mac, pdfminer works correctly. |
I tried saving it as PDF on Windows 10. Still not working. |
#228 works fine. |
Hi! I am still getting the same issue. In the problematic PDFs the font specs seem to be mixed up: the 'Encoding' spec is a PDFStream object, and the 'ToUnicode' spec is 'Identity-H'. I have also tried #228, but it does not resolve the problem; the final 'name' value is 'CM20', which does not map to Identity-H/V. I am not sure why some PDFs (a reasonable number of them, close to 10% of those I have) mix up ToUnicode and Encoding. I hope you can provide some further insight into the cause.
EDIT: I tried to work around it by swapping the Encoding with the ToUnicode. It parses this way, but all characters just show up as (cid:75)(cid:101)(cid:121)....
EDIT2: I continued down the rabbit hole. The difference between the PDFs that raise the error and those that do not seems to be the CIDSystemInfo Registry. If the Registry is Adobe, all is good. If it is Actuate, the problems arise. Besides the ToUnicode and Encoding differences (which trigger the error in this thread), the DescendantFonts of Actuate PDFs do not have a CIDToGIDMap but instead have a DW (?) field. It seems strange that pdfminer can only handle PDFs built with Adobe and not with Actuate. |
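The mismatch described above can be illustrated with a small standalone sketch. It does not use pdfminer itself: `StreamLike` is a hypothetical stand-in for a PDFStream object, and `encoding_name` is an illustrative helper, not a function from the library. It assumes the error comes from code that treats the font's 'Encoding' entry as a string name and calls `.replace` on it, and that an embedded CMap stream may expose its name under a `CMapName` attribute.

```python
class StreamLike:
    """Hypothetical stand-in for a PDF stream object (not pdfminer's PDFStream)."""

    def __init__(self, attrs):
        self.attrs = attrs


def encoding_name(spec, default="unknown"):
    """Return a CMap name from a font spec without assuming 'Encoding' is a string."""
    encoding = spec.get("Encoding")
    if isinstance(encoding, str):
        # The usual case: 'Encoding' is a plain name such as 'Identity-H'.
        return encoding.replace("\0", "")
    if isinstance(encoding, StreamLike):
        # Assumption: an embedded CMap stream carries its name in its attributes.
        name = encoding.attrs.get("CMapName")
        if isinstance(name, str):
            return name
    return default


# Reproduce the failure mode: calling .replace on a non-string object.
try:
    StreamLike({}).replace("\0", "")
except AttributeError as exc:
    error_message = str(exc)

plain = encoding_name({"Encoding": "Identity-H"})
embedded = encoding_name({"Encoding": StreamLike({"CMapName": "CM20"})})
missing = encoding_name({"Encoding": StreamLike({})})
```

A guard like this only avoids the crash. As the comment above notes, a name such as 'CM20' still does not map to Identity-H/V, so the characters may remain unresolved.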
The problem can be solved by modifying “cmapdb.py” (\Lib\site-packages\pdfminer\cmapdb.py). |
I am encountering the same error, on (probably) one of around 1000 files. Will investigate, report if re-saving it (I am on Linux) will help... @tangxfei mind a PR ;-) |
@0xabu Do you also have access to push a new release to PyPI? cc: @ganeshtata |
@vinayak-mehta Someone (I think @goulu) merged some bugfix PRs and then added me to the org back in early 2017, but like you I just depend on pdfminer, so I'm not comfortable (not to mention don't have time) taking on responsibility for it. I don't know anything about the PyPI package. |
I have tried all the methods mentioned in all the related threads, it is still not working on most of my PDFs. Are there any more updates? |
Some of the output contains only the string '\x0c', with no content from the PDF file. |
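'\x0c' is the ASCII form feed character, so output consisting of it alone suggests that page breaks were written but no text was extracted. A minimal sketch for flagging such results follows; `has_extracted_text` is a hypothetical helper name, not part of pdfminer.

```python
def has_extracted_text(output):
    """Return True if output contains anything besides whitespace and form feeds."""
    return bool(output.strip(" \t\r\n\x0c"))


only_form_feed = has_extracted_text("\x0c")
with_content = has_extracted_text("Hello\x0c")
```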
Fixed by #283 |
Hello everybody,
At the moment I'm parsing tons of PDFs, but pdfminer.six fails on one of them. Any suggestions? I can open the PDF, but maybe pdfminer.six can't handle it properly. All the other PDFs originate from the same author/organization...
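When one file out of many fails, it can help to isolate the failures so that the rest of the batch still completes. The sketch below assumes a caller-supplied `extract` callable (hypothetical; any function that takes a path and returns text) and records the exception for each failing file instead of aborting.

```python
def run_batch(paths, extract):
    """Apply extract to each path, collecting results and per-file errors."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:  # keep going; note which file failed and why
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_extract(path):
    """Illustrative extractor that fails on one file, mimicking the reported error."""
    if path == "bad.pdf":
        raise AttributeError("'PDFStream' object has no attribute 'replace'")
    return f"text of {path}"


results, failures = run_batch(["a.pdf", "bad.pdf", "c.pdf"], fake_extract)
```

The `failures` mapping then gives a short list of files worth attaching to a bug report.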