Add check_extractable argument to high_level.extract_text #350

Recursing · 2020-01-05T12:40:50Z

Is your feature request related to a problem? Please describe.
When using pdfminer.high_level.extract_text on some files, I get pdfminer.pdfdocument.PDFTextExtractionNotAllowed: Text extraction is not allowed

Describe the solution you'd like
Add a keyword argument check_extractable to pdfminer.high_level.extract_text, and pass it to PDFPage.get_pages


igormp · 2020-01-06T03:31:43Z

Is it possible for you to give us a sample PDF so we can reproduce this issue?

Recursing · 2020-01-06T14:42:26Z

asw_oct06_p33-41.pdf

pietermarsman · 2020-01-06T19:35:15Z

Hi all,

I am against adding the check_extractable parameter to the high-level functions extract_text() and extract_text_to_fp(). I think these function signatures are already bloated, especially that of extract_text_to_fp().

The high-level functions should cover the most common use cases, and changing the check_extractable flag is not, imho, one of them. You should use the more adaptable composable API instead. Also see the (currently minimal) docs for this.

In your case, I recommend something like this:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

layout_parameters = LAParams()

with open('asw_oct06_p33-41.pdf', "rb") as fp, StringIO() as output_string:
    resource_manager = PDFResourceManager()
    # TextConverter writes the extracted text into output_string.
    device = TextConverter(resource_manager, output_string,
                           laparams=layout_parameters)
    interpreter = PDFPageInterpreter(resource_manager, device)

    # check_extractable=False skips the "extraction not allowed" check.
    for page in PDFPage.get_pages(fp, check_extractable=False):
        interpreter.process_page(page)

    print(output_string.getvalue())

You could also wrap this as a function.
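For readers who want that wrapper, here is a minimal sketch of what it might look like. The function name extract_text_unchecked and its parameters are hypothetical, not part of pdfminer; the pdfminer calls are the same ones used in the snippet above. The imports are kept inside the function only so that the sketch can be defined without pdfminer.six installed.

```python
from io import StringIO


def extract_text_unchecked(pdf_path, check_extractable=False):
    """Hypothetical helper: extract text while skipping the extractability check."""
    # Local imports: pdfminer.six is only needed when the function is called.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    with open(pdf_path, "rb") as fp, StringIO() as output_string:
        resource_manager = PDFResourceManager()
        device = TextConverter(resource_manager, output_string,
                               laparams=LAParams())
        interpreter = PDFPageInterpreter(resource_manager, device)

        # Forward the flag to PDFPage.get_pages, as in the snippet above.
        for page in PDFPage.get_pages(fp, check_extractable=check_extractable):
            interpreter.process_page(page)

        # Read the buffer before the with-block closes it.
        return output_string.getvalue()
```

It would be called as extract_text_unchecked('asw_oct06_p33-41.pdf'), returning the text as a string instead of printing it.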

I would like to know whether this suits your needs, or whether I am missing something.

Recursing · 2020-01-06T20:33:27Z

Of course rewriting the high level extract_text myself with check_extractable=False would suit my needs, but I think that if adding an argument is a problem, a more reasonable default for the high level extract_text would be to work even on pdfs like the attached one.

Also, I think we might want to streamline the parameters of extract_text and extract_text_to_fp (and remove its unused kwargs parameter)

pietermarsman · 2020-01-06T21:17:08Z

> Of course rewriting the high level extract_text myself with check_extractable=False would suit my needs

Great! I'm closing this issue and the corresponding PR, let me know if you disagree.

> but I think that if adding an argument is a problem, a more reasonable default for the high level extract_text would be to work even on pdfs like the attached one.

A PDF is not extractable if the document itself signals that extracting its content is not allowed when it is opened with user-level permissions. In my opinion, we should respect the choice of the application that created the PDF, and consequently check_extractable=True is a great default.

> Also, I think we might want to streamline the parameters of extract_text and extract_text_to_fp (and remove its unused kwargs parameter)

I agree! The function signature of extract_text_to_fp has been around for a long time, and changing it would be a breaking change. The extract_text() function was created only recently, and its arguments are much more concise and understandable.

If this bothers you, you can create a new issue for this. We could deprecate the use of extract_text_to_fp and create some new methods.

Recursing · 2020-01-06T21:58:42Z

I kind of disagree on closing this, as the main purpose of the high level API is to extract text from pdfs and in my tests pdfs that raise PDFTextExtractionNotAllowed are not that rare (happened to both me and another user).
But it's not that big of a deal as it's easy to rewrite the high level api

I would argue that laparams, codec, caching and maxpages are all less useful parameters (especially the last one, as a user can just specify a range of pages).

pietermarsman · 2020-01-07T08:19:00Z

Then I will reopen this. Let's see what others think.

igormp · 2020-01-07T09:21:37Z

While I think that the high level functions should be a no-brainer to use in most situations, where you just pass the file and it extracts the text without problems, I do think that we should adopt some sane defaults that would at least throw a warning about something such as a PDF not meant to be extractable.

Passing another parameter to a method that is meant to be simple doesn't sound right to me, so maybe we should just change check_extractable's default to False and issue a warning of some kind, since people will want to extract its contents anyway.

Recursing · 2020-01-07T11:12:09Z

I would agree with the warning

pietermarsman · 2020-01-07T17:21:09Z

I like it too.

In summary:

- Set the default value for check_extractable to False.
- If check_extractable is True, we raise an error; if it is False, we issue a warning.
- Remove the explicit check_extractable arguments from the high_level module.

As an alternative; we could also drop the check_extractable argument completely and always raise a warning and never an error.
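A rough sketch of the proposed behavior, using only the standard library. The names guard_extraction and TextExtractionNotAllowed are hypothetical stand-ins for illustration, not pdfminer's actual code, and the message text is an assumption:

```python
import warnings


class TextExtractionNotAllowed(Exception):
    """Hypothetical stand-in for pdfminer's PDFTextExtractionNotAllowed."""


def guard_extraction(is_extractable, check_extractable=False):
    """Raise or warn when a document signals that extraction is not allowed."""
    if is_extractable:
        return  # Nothing to report for extractable documents.

    message = "The PDF signals that text extraction is not allowed."
    if check_extractable:
        # Strict mode: the caller asked for the flag to be enforced.
        raise TextExtractionNotAllowed(message)
    # Default mode: tell the user, but carry on extracting.
    warnings.warn(message, stacklevel=2)
```

The alternative mentioned above would amount to dropping the check_extractable branch so that only the warning remains.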

igormp · 2020-01-07T23:20:26Z

> As an alternative; we could also drop the check_extractable argument completely and always raise a warning and never an error.

I really like that.

These PDFs are extractable, but pdfminer won't extract them pdfminer/pdfminer.six#350 This forks the high level `extract_text` function to fix this. I could have combined `_get_pdf_page_count` with but then I wouldn't be able to delete this code in the future if pdfminer implements a fix. Part of #38

filipopo · 2020-03-25T16:58:38Z

Encountered this too.
Or just add an ignore flag. There are many tools that ignore this; hell, even opening the .pdf in Firefox allows me to copy the text of protected PDFs.

pietermarsman · 2020-03-25T17:08:36Z

@filip98, if you feel like creating a PR, I'll make sure that it gets reviewed and merged.

HeroadZ · 2020-06-16T06:50:31Z

Can't believe this problem still exists.
In my opinion, some users have suffered from this problem. There are also some discussions on Stack Overflow.
I mean, "add an optional argument" is simple and doesn't have any bad effect on existing code.
I agree with the optional argument, check_extractable.

pietermarsman · 2020-06-23T13:28:29Z

Hi @HeroadZ, feel free to work on this. I can review, merge and publish the code.

Recursing · 2020-06-23T13:33:24Z

You can still merge #351, and maybe see if you want to change the default to False.

madhurcodes · 2020-07-03T01:14:16Z

Encountered this error today, which forced me to redo a batch job :<
Submitted PR #453 to change the error to a warning with a descriptive message to the user.

pietermarsman · 2020-07-05T11:58:13Z

Hi @madhurcodes, thanks for your work on this. I'll review the PR.

Recursing mentioned this issue Jan 5, 2020

add check_extractable argument to high level functions #351

Closed

5 tasks

pietermarsman closed this as completed Jan 6, 2020

pietermarsman mentioned this issue Jan 7, 2020

Pack the permissions (the /P entry) as unsigned long, fix #186 #352

Merged

5 tasks

pietermarsman reopened this Jan 7, 2020

pietermarsman added the type: new feature label Jan 7, 2020

crccheck mentioned this issue Jan 21, 2020

fix: Handle PDFTextExtractionNotAllowed pdfs crccheck/atx-bandc#42

Merged

madhurcodes mentioned this issue Jul 3, 2020

Change Text extraction is not allowed error to warning #453

Merged

6 tasks

pietermarsman closed this as completed in #453 Jul 11, 2020
