OCRMyPDF - AttributeError: 'ArrayObject' object has no attribute 'getData' #220 #111

Closed
tuxasus opened this issue Aug 14, 2015 · 15 comments

@tuxasus

tuxasus commented Aug 14, 2015

Hey, I am trying to get OCRmyPDF running on some PDFs generated and archived using my scanner. The program appears to be installed correctly, but when I run it on an existing, freshly scanned PDF I get the following error message:

[code]
/usr/lib/python3/dist-packages/pkg_resources.py:1031: UserWarning: /home/florian/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
warnings.warn(msg, UserWarning)
Traceback (most recent call last):
  File "/usr/local/bin/ocrmypdf", line 9, in <module>
    load_entry_point('ocrmypdf==3.0rc4', 'console_scripts', 'ocrmypdf')()
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf-3.0rc4-py3.4.egg/ocrmypdf/main.py", line 848, in run_pipeline
    cmdline.run(options)
  File "/usr/local/lib/python3.4/dist-packages/ruffus-2.6.3-py3.4.egg/ruffus/cmdline.py", line 824, in run
    **appropriate_options)
  File "/usr/local/lib/python3.4/dist-packages/ruffus-2.6.3-py3.4.egg/ruffus/task.py", line 5938, in pipeline_run
    raise job_errors
ruffus.ruffus_exceptions.RethrownJobError:

Original exception:
Exception #1
'builtins.AttributeError('ArrayObject' object has no attribute 'getData')' raised in ...
Task = def ocrmypdf.main.repair_pdf(...):
Job = [source.pdf -> .../com.github.ocrmypdf.gao5vxz1/source.repaired.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus-2.6.3-py3.4.egg/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus-2.6.3-py3.4.egg/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf-3.0rc4-py3.4.egg/ocrmypdf/main.py", line 332, in repair_pdf
    pdfinfo.extend(pdf_get_all_pageinfo(output_file))
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf-3.0rc4-py3.4.egg/ocrmypdf/pageinfo.py", line 137, in pdf_get_all_pageinfo
    return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf-3.0rc4-py3.4.egg/ocrmypdf/pageinfo.py", line 137, in <listcomp>
    return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf-3.0rc4-py3.4.egg/ocrmypdf/pageinfo.py", line 118, in _pdf_get_pageinfo
    if _page_has_inline_images(page):
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf-3.0rc4-py3.4.egg/ocrmypdf/pageinfo.py", line 45, in _page_has_inline_images
    data = contents.getData()
AttributeError: 'ArrayObject' object has no attribute 'getData'

[/code]

It doesn't matter which scanned file I use. As an example, I attached a file printed and scanned from liquidweb.

jbarlow83 pushed a commit that referenced this issue Aug 14, 2015
@jbarlow83 jbarlow83 added the bug label Aug 14, 2015
@jbarlow83 jbarlow83 self-assigned this Aug 14, 2015
@jbarlow83
Collaborator

I took a guess at what the problem is and I think fixed it in the develop branch.

Could you send me the actual PDF you used (Dropbox or something)? There's something unusual about the particular PDF you tried. At least, it's different from the other PDFs I've tested. I'd also like to add it to the test suite. Thanks!
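
For readers following along: the traceback shows `_page_has_inline_images` calling `getData()` on the page contents and receiving an `ArrayObject`. The PDF format allows a page's contents to be either a single stream or an array of streams, so one plausible reading is that the code only expected the single-stream case. The sketch below illustrates handling both shapes. `FakeStream` and `contents_bytes` are hypothetical names used only for illustration; this is not the fix committed to the develop branch, and it does not use the real PDF library's classes.

```python
class FakeStream:
    """Hypothetical stand-in for a PDF content stream object."""

    def __init__(self, data: bytes):
        self._data = data

    def getData(self) -> bytes:
        return self._data


def contents_bytes(contents) -> bytes:
    """Return page content bytes for a single stream or a list of streams."""
    if contents is None:
        # A page without contents is treated as empty.
        return b""
    if hasattr(contents, "getData"):
        # Single content stream.
        return contents.getData()
    # Array of streams: the parts form one logical stream, so join them
    # with whitespace to keep adjacent operators separated.
    return b"\n".join(part.getData() for part in contents)


single = FakeStream(b"q 1 0 0 1 0 0 cm Q")
array = [FakeStream(b"q 1 0 0 1 0 0 cm"), FakeStream(b"Q")]
print(contents_bytes(single))
print(contents_bytes(array))
```

In a real PDF library the array elements may be indirect references that need resolving before their data can be read; that step is left out here.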

@jbarlow83
Collaborator

Presumed to be fixed

@tuxasus
Author

tuxasus commented Aug 23, 2015

I am sorry for the delay. I managed to change my partition table so that Ubuntu would no longer boot, and it took some time to fix it. You'll find the PDF at the link below:
https://www.hidrive.strato.com/lnk/bQDmOlCS

Edit: Using the current version, I get the message: xxx.pdf: not a valid PDF, and could not repair it

@jbarlow83
Collaborator

Thank you for providing the PDF file. It's very helpful to have examples of all kinds of PDFs out there.

In -rc8 I fixed a bug that this PDF triggered because it lacks the document info dictionary, which is technically optional but present in almost all PDF files. That problem, however, does not explain the error message you encountered.

That error message comes from qpdf. What is the output of qpdf --version on your system? I have 5.1.3. It could be that there is a bug in qpdf that is responsible. Acrobat XI and qpdf 5.1.3 both say your file is a valid PDF.

I think if you upgrade qpdf to >= 5.1.3 and ocrmypdf to -rc8, you should see the problem fixed.

@jbarlow83 jbarlow83 reopened this Aug 24, 2015
@tuxasus
Author

tuxasus commented Aug 24, 2015

I upgraded qpdf and ocrmypdf to the current versions and it works as long as I don't use ImageMagick and unpaper (ocrmypdf -i input.pdf output.pdf), but using one of them (ocrmypdf -d -c input.pdf output.pdf) generates the following error:

ocrmypdf -c Input.pdf Output.pdf
Input.pdf: not a valid PDF, and could not repair it.
Details:
open Output.pdf: No such file or directory

@jbarlow83
Collaborator

ImageMagick is no longer used.

What version of unpaper and ocrmypdf are you trying that with? Using your file, ocrmypdf -d and ocrmypdf -c both work for me on the Docker and OS X versions.

@tuxasus
Author

tuxasus commented Aug 25, 2015

I am using unpaper 0.4.2 and ocrmypdf 3.0rc8, with Ubuntu as the OS.

Edit: same problem using unpaper 6.1

@jbarlow83
Collaborator

unpaper 0.4.2 is the problem. It seems to produce invalid output files sometimes. I check for it at install time, but not at runtime.

The Dockerfile shows how to build unpaper 6.1 from source.
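
As an illustration of the runtime check mentioned above, here is a minimal sketch that runs `unpaper --version` (the command shown elsewhere in this thread) and compares the result against a minimum version. The function names and the `MIN_UNPAPER` threshold are assumptions made for this example, with 6.1 chosen only because it is the version the Dockerfile is said to build. It is not OCRmyPDF's actual code.

```python
import re
import subprocess

# Assumed threshold for illustration: the version the Dockerfile builds.
MIN_UNPAPER = (6, 1)


def parse_version(text: str) -> tuple:
    """Extract the first dotted version number from --version output."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


def unpaper_version() -> tuple:
    """Run `unpaper --version` and return the parsed version tuple."""
    output = subprocess.check_output(
        ["unpaper", "--version"], stderr=subprocess.STDOUT
    )
    return parse_version(output.decode("utf-8", errors="replace"))


def is_supported(version: tuple) -> bool:
    """Tuples compare element by element, so (0, 4, 2) sorts below (6, 1)."""
    return version >= MIN_UNPAPER
```

A caller could run `is_supported(unpaper_version())` at startup and refuse the `-c` option with a clear message when it returns `False`, instead of failing later in the pipeline.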

@tuxasus
Author

tuxasus commented Aug 26, 2015

I get the same problem with unpaper 6.1:

ocrmypdf -c Input.pdf Output.pdf
Input.pdf: not a valid PDF, and could not repair it.
Details:
open Output.pdf: No such file or Directory

@jbarlow83
Collaborator

Thanks for your patience with this. What is the output of ocrmypdf -v 1 -c Input.pdf Output.pdf ?

@tuxasus
Author

tuxasus commented Aug 27, 2015

ocrmypdf --version
3.0rc8
unpaper --version
6.1

ocrmypdf -v 1 -c Input.pdf Output.pdf


Tasks which will be run:

Task enters queue = 'ocrmypdf.main.repair_pdf'

[{'yres': Decimal('300.158'), 'height_inches': Decimal('6.34'), 'pageno': 0, 'images': [{'comp': 1, 'bpc': 1, 'color': 'gray', 'enc': 'ccitt', 'height': 1903, 'width': 944, 'dpi_h': Decimal('300.158'), 'dpi_w': Decimal('299.683'), 'dpi': Decimal('299.920')}], 'xres': Decimal('299.683'), 'height_pixels': 1903, 'width_inches': Decimal('3.15'), 'width_pixels': 944, 'has_text': False}, {'yres': Decimal('300.158'), 'height_inches': Decimal('6.34'), 'pageno': 1, 'images': [{'comp': 1, 'bpc': 1, 'color': 'gray', 'enc': 'ccitt', 'height': 1903, 'width': 944, 'dpi_h': Decimal('300.158'), 'dpi_w': Decimal('299.683'), 'dpi': Decimal('299.920')}], 'xres': Decimal('299.683'), 'height_pixels': 1903, 'width_inches': Decimal('3.15'), 'width_pixels': 944, 'has_text': False}]

Completed Task = 'ocrmypdf.main.repair_pdf'
Task enters queue = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.generate_postscript_stub'
os.symlink(/tmp/com.github.ocrmypdf.mmjygyhw/000002.page.pdf, /tmp/com.github.ocrmypdf.mmjygyhw/000002.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.mmjygyhw/000001.page.pdf, /tmp/com.github.ocrmypdf.mmjygyhw/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.skip_page'
Uptodate Task = 'ocrmypdf.main.skip_page'

WARNING:
In Task 'ocrmypdf.main.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.

Completed Task = 'ocrmypdf.main.generate_postscript_stub'
GPL Ghostscript 9.10 (2013-08-30)
Copyright (C) 2013 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1

GPL Ghostscript 9.10 (2013-08-30)

Copyright (C) 2013 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1

Completed Task = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.preprocess_deskew'
os.symlink(/tmp/com.github.ocrmypdf.mmjygyhw/000001.page.png, /tmp/com.github.ocrmypdf.mmjygyhw/000001.pp-deskew.png)
os.symlink(/tmp/com.github.ocrmypdf.mmjygyhw/000002.page.png, /tmp/com.github.ocrmypdf.mmjygyhw/000002.pp-deskew.png)
Completed Task = 'ocrmypdf.main.preprocess_deskew'
Task enters queue = 'ocrmypdf.main.preprocess_clean'

Original exceptions:

Exception #1
  'builtins.KeyError('P')' raised in ...
   Task = def ocrmypdf.main.preprocess_clean(...):
   Job  = [.../com.github.ocrmypdf.mmjygyhw/000001.pp-deskew.png -> .../com.github.ocrmypdf.mmjygyhw/000001.pp-clean.png, <ocrmypdf.main.WrappedLogger>, [{'yres': Decimal('300.158'), 'height_inches': Decimal('6.34'), 'pageno': 0, 'images': [{'comp': 1, 'bpc': 1, 'color': 'gray', 'enc': 'ccitt', 'height': 1903, 'width': 944, 'dpi_h': Decimal('300.158'), 'dpi_w': Decimal('299.683'), 'dpi': Decimal('299.920')}], 'xres': Decimal('299.683'), 'height_pixels': 1903, 'width_inches': Decimal('3.15'), 'width_pixels': 944, 'has_text': False}, {'yres': Decimal('300.158'), 'height_inches': Decimal('6.34'), 'pageno': 1, 'images': [{'comp': 1, 'bpc': 1, 'color': 'gray', 'enc': 'ccitt', 'height': 1903, 'width': 944, 'dpi_h': Decimal('300.158'), 'dpi_w': Decimal('299.683'), 'dpi': Decimal('299.920')}], 'xres': Decimal('299.683'), 'height_pixels': 1903, 'width_inches': Decimal('3.15'), 'width_pixels': 944, 'has_text': False}], <_thread.lock>]

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/home/florian/Downloads/OCRmyPDF-3.0-rc8/ocrmypdf/main.py", line 529, in preprocess_clean
    unpaper.clean(input_file, output_file, dpi, log)
  File "/home/florian/Downloads/OCRmyPDF-3.0-rc8/ocrmypdf/unpaper.py", line 83, in clean
    '--no-deskew',        # don't deskew
  File "/home/florian/Downloads/OCRmyPDF-3.0-rc8/ocrmypdf/unpaper.py", line 44, in run
    suffix = SUFFIXES[im.mode]
KeyError: 'P'


Exception #2
  'builtins.KeyError('P')' raised in ...
   Task = def ocrmypdf.main.preprocess_clean(...):
   Job  = [.../com.github.ocrmypdf.mmjygyhw/000002.pp-deskew.png -> .../com.github.ocrmypdf.mmjygyhw/000002.pp-clean.png, <ocrmypdf.main.WrappedLogger>, [{'yres': Decimal('300.158'), 'height_inches': Decimal('6.34'), 'pageno': 0, 'images': [{'comp': 1, 'bpc': 1, 'color': 'gray', 'enc': 'ccitt', 'height': 1903, 'width': 944, 'dpi_h': Decimal('300.158'), 'dpi_w': Decimal('299.683'), 'dpi': Decimal('299.920')}], 'xres': Decimal('299.683'), 'height_pixels': 1903, 'width_inches': Decimal('3.15'), 'width_pixels': 944, 'has_text': False}, {'yres': Decimal('300.158'), 'height_inches': Decimal('6.34'), 'pageno': 1, 'images': [{'comp': 1, 'bpc': 1, 'color': 'gray', 'enc': 'ccitt', 'height': 1903, 'width': 944, 'dpi_h': Decimal('300.158'), 'dpi_w': Decimal('299.683'), 'dpi': Decimal('299.920')}], 'xres': Decimal('299.683'), 'height_pixels': 1903, 'width_inches': Decimal('3.15'), 'width_pixels': 944, 'has_text': False}], <_thread.lock>]

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/home/florian/Downloads/OCRmyPDF-3.0-rc8/ocrmypdf/main.py", line 529, in preprocess_clean
    unpaper.clean(input_file, output_file, dpi, log)
  File "/home/florian/Downloads/OCRmyPDF-3.0-rc8/ocrmypdf/unpaper.py", line 83, in clean
    '--no-deskew',        # don't deskew
  File "/home/florian/Downloads/OCRmyPDF-3.0-rc8/ocrmypdf/unpaper.py", line 44, in run
    suffix = SUFFIXES[im.mode]
KeyError: 'P'

Input.pdf: not a valid PDF, and could not repair it.
Details:
open Output.pdf: No such file or directory

Output file: The generated PDF/A file is INVALID

@jbarlow83
Collaborator

Should be fixed now (-rc9 and above).
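
For context on the `KeyError: 'P'` in the log above: the failing line is `suffix = SUFFIXES[im.mode]`, and `'P'` is the image mode Pillow reports for palette-based images, so the lookup table evidently had no entry for it. The thread does not say how -rc9 resolved this. The sketch below shows one possible approach, falling back to RGB for modes without a suffix. The contents of `SUFFIXES` and the helper `pnm_target` are assumptions for illustration, not the actual OCRmyPDF source.

```python
# Assumed mapping from image modes to PNM file suffixes; the real SUFFIXES
# table in ocrmypdf/unpaper.py may differ.
SUFFIXES = {"1": ".pbm", "L": ".pgm", "RGB": ".ppm"}


def pnm_target(mode: str) -> tuple:
    """Return (mode_to_save_in, file_suffix) for a given image mode."""
    if mode in SUFFIXES:
        # Mode already maps directly onto a PNM format.
        return mode, SUFFIXES[mode]
    # Palette ('P') and any other unlisted mode: fall back to RGB.
    return "RGB", SUFFIXES["RGB"]


print(pnm_target("P"))
print(pnm_target("L"))
```

With Pillow, the caller would convert the image with `im.convert(target_mode)` when the returned mode differs from `im.mode`, then save it using the returned suffix.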

@jbargil

jbargil commented Oct 4, 2015

I have the same problem on a Mac Pro running Yosemite, but I cannot find the -rc9 fix. How would I download it?

@jbarlow83
Collaborator

@jbargil Sorry for the slow reply.
Newer versions will be posted here: https://github.com/jbarlow83/OCRmyPDF

@tuxasus
Author

tuxasus commented Jan 9, 2016

Thank you for your help. Using version 3.1, everything works fine!
Many thanks!
