PDFMiner seems to be unable to extract images from scanned PDFs #773

pettzilla1 · 2022-06-23T12:36:38Z

Hi,

I'm trying to extract the images from a scanned PDF so that I can run OCR on them. I can't include the PDF for data privacy reasons.

The code I ran:

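from pdfminer.converter import PDFPageAggregator
from pdfminer.image import ImageWriter
from pdfminer.layout import LAParams, LTFigure, LTImage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def _parse_lt_objs(lt_objs):
    # Recursively walk the layout tree and export every image found.
    exported = []
    for lt_obj in lt_objs:
        if isinstance(lt_obj, LTImage):
            iw = ImageWriter('output_dir')
            exported.append(iw.export_image(lt_obj))
        elif isinstance(lt_obj, LTFigure):
            # A figure can hold nested images, so recurse into it.
            exported += _parse_lt_objs(lt_obj)
    return exported

pdfrsrcmgr = PDFResourceManager()
pdflaparams = LAParams()
pdfdevice = PDFPageAggregator(pdfrsrcmgr, laparams=pdflaparams)
pdfinterpreter = PDFPageInterpreter(pdfrsrcmgr, pdfdevice)
img_content = []
# file is the path to the scanned PDF
with open(file, 'rb') as fp:
    for page in PDFPage.get_pages(fp):
        pdfinterpreter.process_page(page)
        # receive the LTPage object for this page
        pdflayout = pdfdevice.get_result()
        img_content.append(_parse_lt_objs(pdflayout))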

The error I get:
ValueError: Unsupported `bitspercomponent': 1

Hi,
It is possible to have PNGs with one bit per component, especially in scanned PDFs.

The relevant code is in https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/utils.py

Please also look at this page: https://stackoverflow.com/questions/757265/how-does-pdfs-bitspercomponent-translate-to-bits-per-pixel-for-images

Some work is needed to ensure that such images are extracted correctly.
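To make the bits-per-component point concrete, here is a minimal sketch of how packed 1-bit samples map to pixels. It assumes a single-component (grayscale) image whose rows are padded to a whole number of bytes with the high-order bit first, and it assumes that a set bit means white. The function name unpack_1bit_rows is hypothetical and is not part of pdfminer.six.

```python
from typing import List


def unpack_1bit_rows(data: bytes, width: int, height: int) -> List[List[int]]:
    """Expand packed 1-bit samples into 8-bit grayscale values (0 or 255)."""
    # Each row is padded up to a whole number of bytes.
    row_bytes = (width + 7) // 8
    rows = []
    for y in range(height):
        row = data[y * row_bytes:(y + 1) * row_bytes]
        pixels = []
        for x in range(width):
            # Samples are packed high-order bit first within each byte.
            bit = (row[x // 8] >> (7 - x % 8)) & 1
            # Assumption: 1 means white and 0 means black.
            pixels.append(255 if bit else 0)
        rows.append(pixels)
    return rows
```

With a width of 10, each row takes two bytes and the last six bits of the second byte are padding.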

pietermarsman · 2022-06-25T20:53:25Z

The [PNG specification] only mentions bytes, not bits. apply_png_predictor() would need a lot of rework to operate on bits instead of bytes. It might be easier if your PNG does not use a filter, which would make sense if it only uses bits. Can you check the filter type?
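One way to check the filter type, sketched under assumptions: with a PNG predictor, each row of the decompressed stream starts with one filter-type byte (0 means no filter), followed by the row's sample bytes. The helper below counts those leading bytes. The name png_row_filter_types and its parameters are hypothetical, and the caller is assumed to already have the decompressed but not yet un-predicted data.

```python
import math
from collections import Counter


def png_row_filter_types(data: bytes, columns: int, colors: int = 1,
                         bitspercomponent: int = 8) -> Counter:
    """Count the filter-type byte that starts each PNG-predicted row."""
    # Sample bytes per row, rounded up for bit depths below 8.
    row_bytes = math.ceil(columns * colors * bitspercomponent / 8)
    # One filter-type byte precedes the samples of every row.
    stride = row_bytes + 1
    if len(data) % stride != 0:
        raise ValueError("data length is not a whole number of rows")
    return Counter(data[i] for i in range(0, len(data), stride))
```

If the result only contains the key 0, no row is filtered and the sample bytes can be used as they are.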

pettzilla1 · 2022-06-28T08:20:02Z

Hi Pietermarsman,
I removed the check that throws the error and the image was successfully saved as a .bmp file. That file was not readable as it was, but after I used Pillow to convert it to JPG it could be displayed. It seems that the code around line 136 of utils.py should be amended, as a bits per component of 1 is supported.

pettzilla1 · 2022-07-05T16:04:43Z

This fix worked for me, @pietermarsman: line 135 in utils.py should be "if bitspercomponent not in [1, 8]:"
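As a stand-alone illustration of that relaxed check, here is a minimal sketch. The function check_bitspercomponent is hypothetical and only mirrors the condition and error message quoted in this thread; it is not the actual code in utils.py.

```python
def check_bitspercomponent(bitspercomponent: int) -> None:
    """Accept 1 or 8 bits per component and reject anything else."""
    # The condition proposed above: allow 1 in addition to 8.
    if bitspercomponent not in [1, 8]:
        raise ValueError(f"Unsupported `bitspercomponent': {bitspercomponent}")
```

With this condition, a value of 1 passes while other values such as 4 still raise the ValueError.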

pietermarsman added the type:anomaly (Errors caused by deviations from the PDF Reference) and status: needs more info labels · Jun 25, 2022

pettzilla1 mentioned this issue Jul 18, 2022

Update utils (fixes #773) #784

Merged

pettzilla1 mentioned this issue Jul 28, 2022

Image extraction does not handle the case when colorspace is a PdfObjRef and bmp handling is probably broken #754

Closed

pietermarsman closed this as completed in f79ad56 Aug 8, 2022
