Add check_extractable argument to high_level.extract_text #350
Is it possible for you to give us a sample PDF so we can reproduce this issue?
Hi all, I am against adding the `check_extractable` argument. The high-level functions (should) cover the most common use cases. Changing the In your case, I recommend something like this:

    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    layout_parameters = LAParams()
    with open('asw_oct06_p33-41.pdf', "rb") as fp, StringIO() as output_string:
        resource_manager = PDFResourceManager()
        device = TextConverter(resource_manager, output_string,
                               laparams=layout_parameters)
        interpreter = PDFPageInterpreter(resource_manager, device)
        # check_extractable=False skips the check that raises
        # PDFTextExtractionNotAllowed.
        for page in PDFPage.get_pages(fp, check_extractable=False):
            interpreter.process_page(page)
        print(output_string.getvalue())

You could also wrap this as a function. I would like to know if this suits your needs, and if I am missing something.
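Wrapping the snippet above as a function might look like the following sketch. The name `extract_text_unchecked` and its parameters are hypothetical and not part of pdfminer; only the pdfminer calls already used in the snippet are assumed. The imports are deferred into the function body purely so the sketch can be defined without pdfminer installed.

```python
from io import StringIO


def extract_text_unchecked(pdf_path, layout_parameters=None):
    """Return the text of a PDF, skipping pdfminer's extractability check.

    Hypothetical helper for illustration; not part of pdfminer.
    """
    # Deferred imports: pdfminer is only needed when the function is called.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    if layout_parameters is None:
        layout_parameters = LAParams()

    with open(pdf_path, "rb") as fp, StringIO() as output_string:
        resource_manager = PDFResourceManager()
        device = TextConverter(resource_manager, output_string,
                               laparams=layout_parameters)
        interpreter = PDFPageInterpreter(resource_manager, device)
        # check_extractable=False avoids PDFTextExtractionNotAllowed.
        for page in PDFPage.get_pages(fp, check_extractable=False):
            interpreter.process_page(page)
        return output_string.getvalue()
```

Calling `extract_text_unchecked('asw_oct06_p33-41.pdf')` would then return the text instead of printing it.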
Of course rewriting the high level Also, I think we might want to streamline the parameters of `extract_text` and `extract_text_to_fp` (and remove its unused `kwargs` parameter)
Great! I'm closing this issue and the corresponding PR, let me know if you disagree.
A PDF is not extractable if the document itself signals that extracting its content is not allowed when it is opened with user-level permissions. In my opinion, we should respect the choice of the application that created the PDF, and consequently
I agree! The function signature of If this bothers you, you can create a new issue for this. We could deprecate the use of
I kind of disagree on closing this, as the main purpose of the high-level API is to extract text from PDFs, and in my tests PDFs that raise `PDFTextExtractionNotAllowed` are not that rare (it happened to both me and another user). I would argue that
Then I will reopen this. Let's see what others think.
While I think that the high-level functions should be a no-brainer to use in most situations, where you just pass the file and it extracts the text without problems, I do think that we should adopt some sane defaults that would at least throw a warning about something such as a PDF not meant to be extractable. Passing another parameter to a method that is meant to be simple doesn't sound right to me, so maybe we should just change
I would agree with the warning
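The warn-instead-of-raise behaviour discussed here could be sketched as below, using only the standard `warnings` module. Every name in it (`ensure_extractable`, `TextExtractionNotAllowedWarning`, the `strict` flag, and the local `TextExtractionNotAllowed` stand-in for pdfminer's exception) is an assumption made for illustration, not pdfminer's actual implementation.

```python
import warnings


class TextExtractionNotAllowedWarning(UserWarning):
    """Hypothetical warning category; not part of pdfminer."""


class TextExtractionNotAllowed(Exception):
    """Local stand-in for pdfminer's PDFTextExtractionNotAllowed."""


def ensure_extractable(is_extractable, strict=False):
    """Warn by default, or raise when strict=True, if extraction is disallowed.

    Returns True whenever processing should continue.
    """
    if is_extractable:
        return True
    message = "The PDF signals that text extraction is not allowed"
    if strict:
        raise TextExtractionNotAllowed(message)
    # Default path: tell the caller, but carry on extracting.
    warnings.warn(message, TextExtractionNotAllowedWarning, stacklevel=2)
    return True
```

With a default like this, the simple call path keeps working on protected PDFs, while callers who want the old behaviour can opt back into the exception.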
I like it too. In summary:
As an alternative, we could also drop the
I really like that.
These PDFs are extractable, but pdfminer won't extract them pdfminer/pdfminer.six#350 This forks the high level `extract_text` function to fix this. I could have combined `_get_pdf_page_count` with but then I wouldn't be able to delete this code in the future if pdfminer implements a fix. Part of #38
Encountered this too
@filip98, if you feel like creating a PR, I'll make sure that it gets reviewed and merged.
Can't believe this problem still exists.
Hi @HeroadZ, feel free to work on this. I can review, merge and publish the code.
You can still merge #351, maybe see if you want to change the default to
Encountered this error today causing me to redo a batch job :<
Hi @madhurcodes, thanks for your work on this. I'll review the PR.
Is your feature request related to a problem? Please describe.
When using `pdfminer.high_level.extract_text` on some files, I get `pdfminer.pdfdocument.PDFTextExtractionNotAllowed: Text extraction is not allowed`
Describe the solution you'd like
Add a keyword argument `check_extractable` to `pdfminer.high_level.extract_text`, and pass it to `PDFPage.get_pages`