OCRMyPDF - AttributeError: 'ArrayObject' object has no attribute 'getData' #220 #111

tuxasus · 2015-08-14T05:18:06Z

Hey, I am trying to get OCRmyPDF running on some PDFs generated and archived with my scanner. As far as I can tell the program should run, but when I run it on an existing (freshly scanned) PDF I get the following error message:

[code]
/usr/lib/python3/dist-packages/pkg_resources.py:1031: UserWarning: /home/florian/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
  warnings.warn(msg, UserWarning)
Traceback (most recent call last):
  File "/usr/local/bin/ocrmypdf", line 9, in <module>
    load_entry_point('ocrmypdf==3.0rc4', 'console_scripts', 'ocrmypdf')()
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf-3.0rc4-py3.4.egg/ocrmypdf/main.py", line 848, in run_pipeline
    cmdline.run(options)
  File "/usr/local/lib/python3.4/dist-packages/ruffus-2.6.3-py3.4.egg/ruffus/cmdline.py", line 824, in run
    **appropriate_options)
  File "/usr/local/lib/python3.4/dist-packages/ruffus-2.6.3-py3.4.egg/ruffus/task.py", line 5938, in pipeline_run
    raise job_errors
ruffus.ruffus_exceptions.RethrownJobError:

Original exception:

Exception #1
  'builtins.AttributeError('ArrayObject' object has no attribute 'getData')' raised in ...
   Task = def ocrmypdf.main.repair_pdf(...):
   Job  = [source.pdf -> .../com.github.ocrmypdf.gao5vxz1/source.repaired.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus-2.6.3-py3.4.egg/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus-2.6.3-py3.4.egg/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf-3.0rc4-py3.4.egg/ocrmypdf/main.py", line 332, in repair_pdf
    pdfinfo.extend(pdf_get_all_pageinfo(output_file))
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf-3.0rc4-py3.4.egg/ocrmypdf/pageinfo.py", line 137, in pdf_get_all_pageinfo
    return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf-3.0rc4-py3.4.egg/ocrmypdf/pageinfo.py", line 137, in <listcomp>
    return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf-3.0rc4-py3.4.egg/ocrmypdf/pageinfo.py", line 118, in _pdf_get_pageinfo
    if _page_has_inline_images(page):
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf-3.0rc4-py3.4.egg/ocrmypdf/pageinfo.py", line 45, in _page_has_inline_images
    data = contents.getData()
AttributeError: 'ArrayObject' object has no attribute 'getData'

[/code]

It doesn't matter which scanned file I use. As an example, I attached a file printed and scanned from liquidweb.

jbarlow83 · 2015-08-14T06:16:34Z

I took a guess at what the problem is, and I think I fixed it in the develop branch.

Could you send me the actual PDF you used (Dropbox or something)? There's something unusual about the particular PDF you tried. At least, it's different from the other PDFs I've tested. I'd also like to add it to the test suite. Thanks!
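
For readers following the traceback: the failing call is `contents.getData()` on an `ArrayObject`. The PDF format allows a page's /Contents entry to be either a single stream or an array of streams, so code that assumes a single stream can fail this way. The following is only a minimal sketch of handling both forms, not the actual change made in the develop branch. It assumes PyPDF2 1.x-style method names (`getObject`, `getData`) as seen in the traceback; `read_page_contents` and `FakeStream` are hypothetical names used for illustration.

```python
class FakeStream:
    """Hypothetical stand-in for a PDF stream object (illustration only)."""

    def __init__(self, data):
        self._data = data

    def getObject(self):
        # Real indirect references resolve to their target; a direct
        # object simply returns itself.
        return self

    def getData(self):
        return self._data


def read_page_contents(contents):
    """Return the raw bytes of a page's /Contents entry.

    `contents` may be None (empty page), a single stream, or a
    list-like array of streams.
    """
    if contents is None:
        return b""
    if isinstance(contents, (list, tuple)):
        # Array form: the streams are treated as one logical stream.
        # Joining with a newline keeps tokens from running together.
        return b"\n".join(item.getObject().getData() for item in contents)
    # Single-stream form.
    return contents.getObject().getData()
```

With a helper like this, a check such as `_page_has_inline_images` could inspect the combined bytes regardless of which form the scanner's PDF uses.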

jbarlow83 · 2015-08-18T08:35:04Z

Presumed to be fixed

tuxasus · 2015-08-23T16:35:46Z

I am sorry for the delay. I managed to change my partition table so that Ubuntu wouldn't boot any longer, and it took some time to fix it. You'll find the PDF by following the link below:
https://www.hidrive.strato.com/lnk/bQDmOlCS

Edit: Using the current version, I get the message: xxx.pdf: not a valid PDF, and could not repair it

jbarlow83 · 2015-08-24T08:32:26Z

Thank you for providing the PDF file. It's very helpful to have examples of all kinds of PDFs out there.

In -rc8 I fixed a bug that this PDF file triggered by not having the document info dictionary, which is technically optional but present in almost all PDF files. That problem, however, does not explain the error message you encountered.

That error message comes from qpdf. What is the output of qpdf --version on your system? I have 5.1.3. It could be that there is a bug in qpdf that is responsible. Acrobat XI and qpdf 5.1.3 both say your file is a valid PDF.

I think if you upgrade qpdf to >= 5.1.3 and ocrmypdf to -rc8 you should see the problem fixed.
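
As a rough way to check the qpdf requirement mentioned above, the sketch below parses the text printed by `qpdf --version` and compares it with 5.1.3. It assumes the output contains a dotted version number (for example `qpdf version 5.1.3`); the function names are hypothetical and this is not part of ocrmypdf.

```python
import re
import subprocess

MIN_QPDF = (5, 1, 3)  # minimum version suggested in this thread


def parse_version(text):
    """Extract the first dotted version number from text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return tuple(int(part) for part in match.group(1).split("."))


def qpdf_is_recent_enough(version_output, minimum=MIN_QPDF):
    """Return True if the reported version is at least `minimum`."""
    return parse_version(version_output) >= minimum


def installed_qpdf_output():
    """Run `qpdf --version` and return its output (needs qpdf on PATH)."""
    return subprocess.check_output(["qpdf", "--version"],
                                   universal_newlines=True)
```

Usage would be `qpdf_is_recent_enough(installed_qpdf_output())` on a machine where qpdf is installed.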

tuxasus · 2015-08-24T19:42:06Z

I upgraded qpdf and ocrmypdf to the current versions and it works as long as I don't use ImageMagick and unpaper (ocrmypdf -i input.pdf output.pdf), but using one of them (ocrmypdf -d -c input.pdf output.pdf) generates the following error:

[code]
ocrmypdf -c Input.pdf Output.pdf
Input.pdf: not a valid PDF, and could not repair it.
Details:
open Output.pdf: No such file or directory
[/code]

jbarlow83 · 2015-08-25T09:37:23Z

ImageMagick is no longer used.

What version of unpaper and ocrmypdf are you trying that with? Using your file, ocrmypdf -d and ocrmypdf -c both work for me on the Docker and OS X versions.

tuxasus · 2015-08-25T17:25:06Z

I am using unpaper 0.4.2 and ocrmypdf 3.0rc8, with Ubuntu as the OS.

Edit: same problem using unpaper 6.1

jbarlow83 · 2015-08-25T20:42:49Z

unpaper 0.4.2 is the problem. It seems to produce invalid output files sometimes. I do an install-time check for it, but not runtime.

Dockerfile shows how to build unpaper 6.1 from source.

tuxasus · 2015-08-26T11:26:53Z

I get the same problem with unpaper 6.1:

[code]
ocrmypdf -c Input.pdf Output.pdf
Input.pdf: not a valid PDF, and could not repair it.
Details:
open Output.pdf: No such file or Directory
[/code]

jbarlow83 · 2015-08-27T04:57:33Z

Thanks for your patience with this. What is the output of ocrmypdf -v 1 -c Input.pdf Output.pdf ?

tuxasus · 2015-08-27T16:12:47Z

ocrmypdf --version
3.0rc8
unpaper --version
6.1

ocrmypdf -v 1 -c Input.pdf Output.pdf

Tasks which will be run:

Task enters queue = 'ocrmypdf.main.repair_pdf'

[{'yres': Decimal('300.158'), 'height_inches': Decimal('6.34'), 'pageno': 0, 'images': [{'comp': 1, 'bpc': 1, 'color': 'gray', 'enc': 'ccitt', 'height': 1903, 'width': 944, 'dpi_h': Decimal('300.158'), 'dpi_w': Decimal('299.683'), 'dpi': Decimal('299.920')}], 'xres': Decimal('299.683'), 'height_pixels': 1903, 'width_inches': Decimal('3.15'), 'width_pixels': 944, 'has_text': False}, {'yres': Decimal('300.158'), 'height_inches': Decimal('6.34'), 'pageno': 1, 'images': [{'comp': 1, 'bpc': 1, 'color': 'gray', 'enc': 'ccitt', 'height': 1903, 'width': 944, 'dpi_h': Decimal('300.158'), 'dpi_w': Decimal('299.683'), 'dpi': Decimal('299.920')}], 'xres': Decimal('299.683'), 'height_pixels': 1903, 'width_inches': Decimal('3.15'), 'width_pixels': 944, 'has_text': False}]

Completed Task = 'ocrmypdf.main.repair_pdf'
Task enters queue = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.generate_postscript_stub'
os.symlink(/tmp/com.github.ocrmypdf.mmjygyhw/000002.page.pdf, /tmp/com.github.ocrmypdf.mmjygyhw/000002.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.mmjygyhw/000001.page.pdf, /tmp/com.github.ocrmypdf.mmjygyhw/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.skip_page'
Uptodate Task = 'ocrmypdf.main.skip_page'

WARNING:
In Task 'ocrmypdf.main.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.

Completed Task = 'ocrmypdf.main.generate_postscript_stub'
GPL Ghostscript 9.10 (2013-08-30)
Copyright (C) 2013 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1

GPL Ghostscript 9.10 (2013-08-30)

Completed Task = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.preprocess_deskew'
os.symlink(/tmp/com.github.ocrmypdf.mmjygyhw/000001.page.png, /tmp/com.github.ocrmypdf.mmjygyhw/000001.pp-deskew.png)
os.symlink(/tmp/com.github.ocrmypdf.mmjygyhw/000002.page.png, /tmp/com.github.ocrmypdf.mmjygyhw/000002.pp-deskew.png)
Completed Task = 'ocrmypdf.main.preprocess_deskew'
Task enters queue = 'ocrmypdf.main.preprocess_clean'

Original exceptions:

Exception #1
  'builtins.KeyError('P')' raised in ...
   Task = def ocrmypdf.main.preprocess_clean(...):
   Job  = [.../com.github.ocrmypdf.mmjygyhw/000001.pp-deskew.png -> .../com.github.ocrmypdf.mmjygyhw/000001.pp-clean.png, <ocrmypdf.main.WrappedLogger>, [{'yres': Decimal('300.158'), 'height_inches': Decimal('6.34'), 'pageno': 0, 'images': [{'comp': 1, 'bpc': 1, 'color': 'gray', 'enc': 'ccitt', 'height': 1903, 'width': 944, 'dpi_h': Decimal('300.158'), 'dpi_w': Decimal('299.683'), 'dpi': Decimal('299.920')}], 'xres': Decimal('299.683'), 'height_pixels': 1903, 'width_inches': Decimal('3.15'), 'width_pixels': 944, 'has_text': False}, {'yres': Decimal('300.158'), 'height_inches': Decimal('6.34'), 'pageno': 1, 'images': [{'comp': 1, 'bpc': 1, 'color': 'gray', 'enc': 'ccitt', 'height': 1903, 'width': 944, 'dpi_h': Decimal('300.158'), 'dpi_w': Decimal('299.683'), 'dpi': Decimal('299.920')}], 'xres': Decimal('299.683'), 'height_pixels': 1903, 'width_inches': Decimal('3.15'), 'width_pixels': 944, 'has_text': False}], <_thread.lock>]

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/home/florian/Downloads/OCRmyPDF-3.0-rc8/ocrmypdf/main.py", line 529, in preprocess_clean
    unpaper.clean(input_file, output_file, dpi, log)
  File "/home/florian/Downloads/OCRmyPDF-3.0-rc8/ocrmypdf/unpaper.py", line 83, in clean
    '--no-deskew',        # don't deskew
  File "/home/florian/Downloads/OCRmyPDF-3.0-rc8/ocrmypdf/unpaper.py", line 44, in run
    suffix = SUFFIXES[im.mode]
KeyError: 'P'


Exception #2
  'builtins.KeyError('P')' raised in ...
   Task = def ocrmypdf.main.preprocess_clean(...):
   Job  = [.../com.github.ocrmypdf.mmjygyhw/000002.pp-deskew.png -> .../com.github.ocrmypdf.mmjygyhw/000002.pp-clean.png, <ocrmypdf.main.WrappedLogger>, [{'yres': Decimal('300.158'), 'height_inches': Decimal('6.34'), 'pageno': 0, 'images': [{'comp': 1, 'bpc': 1, 'color': 'gray', 'enc': 'ccitt', 'height': 1903, 'width': 944, 'dpi_h': Decimal('300.158'), 'dpi_w': Decimal('299.683'), 'dpi': Decimal('299.920')}], 'xres': Decimal('299.683'), 'height_pixels': 1903, 'width_inches': Decimal('3.15'), 'width_pixels': 944, 'has_text': False}, {'yres': Decimal('300.158'), 'height_inches': Decimal('6.34'), 'pageno': 1, 'images': [{'comp': 1, 'bpc': 1, 'color': 'gray', 'enc': 'ccitt', 'height': 1903, 'width': 944, 'dpi_h': Decimal('300.158'), 'dpi_w': Decimal('299.683'), 'dpi': Decimal('299.920')}], 'xres': Decimal('299.683'), 'height_pixels': 1903, 'width_inches': Decimal('3.15'), 'width_pixels': 944, 'has_text': False}], <_thread.lock>]

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/home/florian/Downloads/OCRmyPDF-3.0-rc8/ocrmypdf/main.py", line 529, in preprocess_clean
    unpaper.clean(input_file, output_file, dpi, log)
  File "/home/florian/Downloads/OCRmyPDF-3.0-rc8/ocrmypdf/unpaper.py", line 83, in clean
    '--no-deskew',        # don't deskew
  File "/home/florian/Downloads/OCRmyPDF-3.0-rc8/ocrmypdf/unpaper.py", line 44, in run
    suffix = SUFFIXES[im.mode]
KeyError: 'P'

Input.pdf: not a valid PDF, and could not repair it.
Details:
open Output.pdf: No such file or directory

Output file: The generated PDF/A file is INVALID

jbarlow83 · 2015-09-07T22:19:09Z

Should be fixed now (-rc9 and above).
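
For context on the `KeyError: 'P'` in the log above: the lookup `SUFFIXES[im.mode]` fails when the image is in palette mode (`'P'` in Pillow's naming), because that mode has no entry in the table. The sketch below shows one way to fall back to a supported mode. It is not necessarily how -rc9 fixed it; the contents of `SUFFIXES` and the name `choose_pnm_target` are assumptions made for illustration.

```python
# Assumed mapping from image mode to PNM file suffix; the real table in
# ocrmypdf/unpaper.py may differ.
SUFFIXES = {"1": ".pbm", "L": ".pgm", "RGB": ".ppm"}


def choose_pnm_target(mode):
    """Return (mode_to_use, suffix) for an image mode.

    Modes already in SUFFIXES are kept as they are. Anything else, such
    as palette mode 'P', falls back to RGB so the lookup cannot raise
    KeyError.
    """
    if mode in SUFFIXES:
        return mode, SUFFIXES[mode]
    return "RGB", SUFFIXES["RGB"]
```

With Pillow, the caller could then convert the image with `im.convert(target_mode)` whenever the returned mode differs from `im.mode`, before writing the temporary file that is handed to unpaper.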

jbargil · 2015-10-04T08:57:24Z

I have the same problem on a Mac Pro running Yosemite, but I cannot find the -rc9 fix. How would I download it?

jbarlow83 · 2015-10-15T21:52:06Z

@jbargil Sorry for slow reply.
Newer versions will be posted here: https://github.com/jbarlow83/OCRmyPDF

tuxasus · 2016-01-09T15:06:31Z

Thank you for your help. Using version 3.1, everything works fine!
Many thanks!

jbarlow83 pushed a commit that referenced this issue Aug 14, 2015

Possible fix for issue #111

a4702bf

jbarlow83 added the bug label Aug 14, 2015

jbarlow83 self-assigned this Aug 14, 2015

jbarlow83 closed this as completed Aug 18, 2015

jbarlow83 reopened this Aug 24, 2015

jbarlow83 closed this as completed Sep 7, 2015
