Segmentation fault (core dumped) #64

Open
avirala-eightfold opened this issue Jul 26, 2022 · 9 comments

Comments


avirala-eightfold commented Jul 26, 2022

Hi, thank you for this amazing work. Recently I was working with some PDFs, and poppler worked well for most of them, but for some of them I see the following error:

Segmentation fault (core dumped)

Since this is a crash at the memory level rather than a Python exception, I can't wrap the call in try/except, so my workers keep restarting only to get stuck on the same file again. This has been a major problem for me.
Here is some context and the debugging I have done so far:

  1. The segmentation fault happens when I call page.text_list(page.TextListOption.text_list_include_font).
  2. If I remove the optional enum, the error no longer occurs. pdf_document.create_font_iterator() also works; the crash only appears when I request font information at the text box level.
  3. The process stops with the error as soon as it reaches boxes = self._page.text_list(opt_flag) in page.py.
  4. I initially thought this might be an upstream bug in the C++ code itself, but other libraries based on poppler seem to work fine on this PDF, so I suspect the problem is in the Python bindings.

The metadata of the PDFs that trigger this error is mostly (not always):

{'Producer': 'macOS Version 11.2.3 (Build 20D91) Quartz PDFContext', 'Creator': 'Pages'}

Code to reproduce the error:

from poppler import load_from_file
file_path = "sample_pdf.pdf"
pdf_document = load_from_file(file_path)
no_of_pages = pdf_document.pages
for page_ind in range(no_of_pages):
    page = pdf_document.create_page(page_ind)
    text_list = page.text_list(page.TextListOption.text_list_include_font)

Link to the PDF: https://drive.google.com/file/d/180CDGyiJRfytvuzVsAiYKppHvaBABGkJ/view?usp=sharing
Please request access to the PDF, as I can't share it publicly. (Really sorry about this, but I hope you understand.)

Author

avirala-eightfold commented Sep 21, 2022

Hi @cbrunet @bzamecnik, I have encountered many such files. Would it be possible to skip the font information in these cases, return just the text, and avoid the core dump?
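
Until the crash itself is fixed, one possible workaround for the request above is to run the extraction in a separate interpreter, so that a segfault only kills that child, and to retry without the font option when the child dies. The following is a minimal sketch, not part of python-poppler: run_child, extract_text_pages and CHILD_SCRIPT are hypothetical names, and it only assumes the calls already used in this thread (load_from_file, create_page, text_list) plus the text attribute of each text box.

```python
import json
import subprocess
import sys

# Child program: runs in its own interpreter, so a segfault inside poppler
# terminates only this child, not the worker that launched it.
CHILD_SCRIPT = r"""
import json, sys
from poppler import load_from_file

path, with_fonts = sys.argv[1], sys.argv[2] == "1"
doc = load_from_file(path)
pages = []
for i in range(doc.pages):
    page = doc.create_page(i)
    if with_fonts:
        boxes = page.text_list(page.TextListOption.text_list_include_font)
    else:
        boxes = page.text_list()
    # Only the text is serialized here; font attributes could be added
    # to this structure in the with_fonts case.
    pages.append([box.text for box in boxes])
json.dump(pages, sys.stdout)
"""


def run_child(script, *args, timeout=60):
    """Run `script` in a fresh interpreter and return (returncode, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", script, *args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def extract_text_pages(path):
    """Return (pages, fonts_included); fall back to plain text on a crash."""
    code, out = run_child(CHILD_SCRIPT, path, "1")
    if code == 0:
        return json.loads(out), True
    # The font-aware run crashed or failed: retry without the font option.
    code, out = run_child(CHILD_SCRIPT, path, "0")
    if code != 0:
        raise RuntimeError(f"text extraction failed for {path} (exit code {code})")
    return json.loads(out), False
```

Calling extract_text_pages("sample_pdf.pdf") would then return the per-page text together with a flag saying whether the font-aware call survived. Starting a process per document costs some time, but it keeps the worker itself alive.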

@avirala-eightfold
Author

@cbrunet @bzamecnik sorry to tag you again, but this issue has become more frequent and now shows up on many documents. It would be really helpful if you could take a look at it.

@avirala-eightfold
Author

@bzamecnik did you by any chance get to take a look at this? Sorry for tagging again.

@bzamecnik
Contributor

@avirala-eightfold Hi, I can possibly check that. I made the request for sharing the file. Have you managed to confirm it yet?

Author

avirala-eightfold commented Oct 13, 2022

@bzamecnik Yes, I somehow missed it, sorry for the delay. I have shared it again. Can you please confirm that you can access it?

@avirala-eightfold
Author

> @avirala-eightfold Hi, I can possibly check that. I made the request for sharing the file. Have you managed to confirm it yet?

@bzamecnik did you get a chance to look at it?

@avirala-eightfold
Author

Sorry for bugging you again but @bzamecnik did you get a chance to look into it?

Contributor

bzamecnik commented Nov 14, 2022

@avirala-eightfold Sorry, no, I didn't have a chance to look at it. Is there anything that prevents you from investigating it yourself?

UPDATE: I can confirm that it crashes on a Segmentation fault. That's all I can see without rebuilding the code. 🤷

Running with gdb gives some hint:

$ gdb
(gdb) file python
Reading symbols from python...
(gdb) run script.py
Starting program: /usr/local/bin/python script.py
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000ffff990e54a8 in TextFontInfo::matches(Ref const*) const () from /usr/lib/aarch64-linux-gnu/libpoppler.so.102
(gdb) 

Enabling the faulthandler module gives a similar hint:

import faulthandler

faulthandler.enable()

# ... rest of the code...

Output:

Fatal Python error: Segmentation fault

Current thread 0x0000ffffa61f8d30 (most recent call first):
  File "/usr/local/lib/python3.11/site-packages/poppler/page.py", line 128 in text_list
  File "/usr/local/lib/python3.11/site-packages/poppler/utilities.py", line 90 in wrapped
  File "/app/script.py", line 11 in <module>
Segmentation fault
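
The same traceback can be obtained without editing the script by starting Python with -X faulthandler (or by setting PYTHONFAULTHANDLER=1). A small sketch of a wrapper that does this and reports which signal ended the process; run_with_faulthandler is a hypothetical helper name, and the negative-return-code convention it relies on applies to POSIX systems:

```python
import signal
import subprocess
import sys


def run_with_faulthandler(python_args):
    """Run `python -X faulthandler <python_args>` and report how it ended.

    Returns (returncode, killed_by, stderr), where killed_by is the signal
    name (for example "SIGSEGV") or None if the process exited normally.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", *python_args],
        capture_output=True,
        text=True,
    )
    killed_by = None
    if proc.returncode < 0:
        # On POSIX, subprocess reports death by signal N as returncode -N.
        try:
            killed_by = signal.Signals(-proc.returncode).name
        except ValueError:
            killed_by = f"signal {-proc.returncode}"
    return proc.returncode, killed_by, proc.stderr
```

For the script in this issue, run_with_faulthandler(["script.py"]) would return the faulthandler traceback in the third element, which can be logged next to the name of the PDF that caused the crash.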

Some fiddling with the code:

  • passing the option seems OK (enum, 1, any odd number)
  • removing the conversion of poppler.cpp.page.text_box to TextBox still leads to a crash, i.e. the problem is likely not in that conversion
  • using pdfinfo, pdffonts, pdftotext does not crash
  • using a different PDF works
  • qpdf --check sample_pdf.pdf does not see any errors
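
The command-line checks in the list above can be repeated over many files with a short script. This is only a sketch: run_checks and DEFAULT_CHECKS are hypothetical names, the tools are assumed to be on PATH, and a tool that is not installed is reported as None rather than treated as a failure.

```python
import shutil
import subprocess

# Command templates for the tools mentioned above; "{pdf}" is replaced
# by the path of the file under test.
DEFAULT_CHECKS = {
    "pdfinfo": ["pdfinfo", "{pdf}"],
    "pdffonts": ["pdffonts", "{pdf}"],
    "pdftotext": ["pdftotext", "{pdf}", "-"],  # "-" writes the text to stdout
    "qpdf": ["qpdf", "--check", "{pdf}"],
}


def run_checks(pdf_path, checks=DEFAULT_CHECKS):
    """Run each tool on pdf_path and return {name: returncode or None}."""
    results = {}
    for name, template in checks.items():
        if shutil.which(template[0]) is None:
            results[name] = None  # tool not installed
            continue
        cmd = [pdf_path if part == "{pdf}" else part for part in template]
        proc = subprocess.run(cmd, capture_output=True)
        results[name] = proc.returncode
    return results
```

A return code of 0 means the tool finished without reporting a problem; on POSIX a negative value means the tool itself was killed by a signal, which would point at the C++ library rather than the Python bindings.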

Looking at the gdb output, the crash may come from this place: https://github.com/freedesktop/poppler/blob/master/cpp/poppler-page.cpp#L461

if (cur_text_font_info->matches(&(tb_font_info->font_info_cache[k].d->ref))) {

...which would mean that some reference to the font info is wrong (either cur_text_font_info or the one in the cache).

Author

avirala-eightfold commented Nov 16, 2022

Thank you so much for looking into it. Let me take this as a starting point and see if I can find anything else.
