PDFMiner seems to be unable to extract images from scanned PDFs #773

Closed
pettzilla1 opened this issue Jun 23, 2022 · 3 comments
Labels
status: needs more info type:anomaly Errors caused by deviations from the PDF Reference

Comments

@pettzilla1
Contributor

pettzilla1 commented Jun 23, 2022

Hi,

I'm trying to extract the images from a scanned PDF so that I can run OCR on them. I can't attach the PDF for data privacy reasons.

The code I ran:

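from pdfminer.converter import PDFPageAggregator
from pdfminer.image import ImageWriter
from pdfminer.layout import LAParams, LTFigure, LTImage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def _parse_lt_objs(lt_objs, image_writer):
    """Export every LTImage in the layout tree; return the exported file names."""
    exported = []
    for lt_obj in lt_objs:
        if isinstance(lt_obj, LTImage):
            exported.append(image_writer.export_image(lt_obj))
        elif isinstance(lt_obj, LTFigure):
            # Images are often nested inside figures, so recurse into them.
            exported.extend(_parse_lt_objs(lt_obj, image_writer))
    return exported

pdfrsrcmgr = PDFResourceManager()
pdflaparams = LAParams()
pdfdevice = PDFPageAggregator(pdfrsrcmgr, laparams=pdflaparams)
pdfinterpreter = PDFPageInterpreter(pdfrsrcmgr, pdfdevice)
image_writer = ImageWriter('output_dir')
img_content = []
with open(file, 'rb') as fp:  # `file` is the path to the scanned PDF
    for page in PDFPage.get_pages(fp):
        pdfinterpreter.process_page(page)
        # receive the LTPage object for this page
        pdflayout = pdfdevice.get_result()
        img_content.append(_parse_lt_objs(pdflayout, image_writer))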

Error given
ValueError: Unsupported `bitspercomponent': 1

PNG data with one bit per component is possible, especially in scanned PDFs.

The check that raises this error is in https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/utils.py

For background, please see https://stackoverflow.com/questions/757265/how-does-pdfs-bitspercomponent-translate-to-bits-per-pixel-for-images

Some work is needed to ensure that these images can be extracted correctly.
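
To make the relationship between bits per component and bits per pixel concrete, here is a minimal sketch of the arithmetic. The function names are hypothetical and are not part of pdfminer.six; the parameters mirror the PDF `Colors`, `BitsPerComponent` and `Columns` entries, and rows are assumed to be padded to a whole number of bytes.

```python
def bits_per_pixel(colors, bitspercomponent):
    """Bits used by one pixel: components per pixel times bits per component."""
    return colors * bitspercomponent


def bytes_per_row(columns, colors, bitspercomponent):
    """Bytes in one image row, rounded up to a whole byte."""
    return (columns * bits_per_pixel(colors, bitspercomponent) + 7) // 8


# A 1-bit grayscale scan that is 1700 pixels wide packs 8 pixels per byte.
print(bytes_per_row(1700, colors=1, bitspercomponent=1))
# An 8-bit RGB image of the same width uses 3 bytes per pixel.
print(bytes_per_row(1700, colors=3, bitspercomponent=8))
```

With one bit per component a single byte holds several pixels, which is why code that assumes one byte per component rejects such images.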

@pietermarsman
Member

The PNG specification only mentions bytes, not bits. apply_png_predictor() would need a lot of rework to operate on bits instead of bytes. It might be easier if your PNG does not use a filter, which would make sense if it only uses bits. Can you check the filter type?
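
One way to check the filter type, sketched under assumptions: in PNG-predicted data every row starts with a single filter-type byte (0 None, 1 Sub, 2 Up, 3 Average, 4 Paeth). The helper below is hypothetical and not part of pdfminer.six. It assumes `data` is the image stream after Flate decompression but before the predictor is undone, and that `columns`, `colors` and `bitspercomponent` come from the stream's `DecodeParms`.

```python
from collections import Counter

PNG_FILTER_NAMES = {0: "None", 1: "Sub", 2: "Up", 3: "Average", 4: "Paeth"}


def png_filter_types(data, columns, colors=1, bitspercomponent=8):
    """Count how often each PNG filter type is used across the rows of `data`."""
    # Bytes of pixel data per row, rounded up to a whole byte.
    stride = (columns * colors * bitspercomponent + 7) // 8
    counts = Counter()
    # Each row is one filter-type byte followed by `stride` data bytes.
    for start in range(0, len(data), stride + 1):
        filter_type = data[start]
        name = PNG_FILTER_NAMES.get(filter_type, "unknown (%d)" % filter_type)
        counts[name] += 1
    return counts


# Three rows of a 16-pixel-wide, 1-bit image: two unfiltered rows and one Up row.
sample = bytes([0, 0xF0, 0x0F, 2, 0x01, 0x01, 0, 0x00, 0x00])
print(dict(png_filter_types(sample, columns=16, colors=1, bitspercomponent=1)))
```

If only "None" shows up, the rows can be copied through unchanged once the filter-type bytes are dropped.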

pietermarsman added the "type:anomaly" (Errors caused by deviations from the PDF Reference) and "status: needs more info" labels on Jun 25, 2022
@pettzilla1
Contributor Author

Hi Pietermarsman,
I removed the check that throws the error, and the image was successfully saved as a .bmp file. That file was not readable as it was, but once I used Pillow to convert it to JPG it could be displayed. It seems that the code around line 136 of "utils.py" should be amended, since a bits per component of 1 is supported.

@pettzilla1
Contributor Author

pettzilla1 commented Jul 5, 2022

This fix worked for me, @pietermarsman: line 135 in utils.py should be "if bitspercomponent not in [1, 8]:"
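
For illustration only, here is a sketch of how PNG row filters can be reversed when the bit depth is below 8. It relies on the PNG specification's rule that filters operate on bytes and that the pixel offset is the number of bytes per complete pixel, rounded up to one. This is not pdfminer.six's apply_png_predictor(); the name `undo_png_predictor` and its signature are hypothetical, and it accepts only the bit depths 1 and 8 discussed in this issue.

```python
def _paeth(a, b, c):
    """Paeth predictor: pick whichever of left, up, upper-left is closest to a + b - c."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def undo_png_predictor(data, columns, colors=1, bitspercomponent=8):
    """Reverse PNG row filtering on `data` and return the raw row bytes."""
    if bitspercomponent not in (1, 8):
        raise ValueError("Unsupported bitspercomponent: %d" % bitspercomponent)
    stride = (columns * colors * bitspercomponent + 7) // 8  # data bytes per row
    bpp = max(1, colors * bitspercomponent // 8)  # byte offset to the "left" pixel
    out = bytearray()
    prev = bytes(stride)  # the row above the first row counts as all zeros
    for start in range(0, len(data), stride + 1):
        filter_type = data[start]
        raw = data[start + 1:start + 1 + stride]
        row = bytearray(len(raw))
        for i, x in enumerate(raw):
            left = row[i - bpp] if i >= bpp else 0
            up = prev[i]
            up_left = prev[i - bpp] if i >= bpp else 0
            if filter_type == 0:
                pred = 0
            elif filter_type == 1:
                pred = left
            elif filter_type == 2:
                pred = up
            elif filter_type == 3:
                pred = (left + up) // 2
            elif filter_type == 4:
                pred = _paeth(left, up, up_left)
            else:
                raise ValueError("Unsupported filter type: %d" % filter_type)
            row[i] = (x + pred) & 0xFF
        out += row
        prev = bytes(row)
    return bytes(out)


# A 16-pixel-wide, 1-bit image: one unfiltered row followed by an Up-filtered row.
print(undo_png_predictor(bytes([0, 0xF0, 0x0F, 2, 0x01, 0x01]), 16, 1, 1).hex())
```

The output bytes are still packed at eight pixels per byte for 1-bit data, so whatever writes the image file has to handle that packing as well.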
