Undefined array key 1,crash on parsing #673

Closed
micos7 opened this issue Feb 6, 2024 · 5 comments · Fixed by #692
@micos7

micos7 commented Feb 6, 2024

  • PHP Version: 8.3
  • PDFParser Version: 2.8

Description:

PDF input

Cracking-the-Coding-Interview-6th-Edition-189-Programming-Questions-and-Solutions.pdf

Expected output & actual output

It crashes in RawDataParser, line 890:

        elseif ($startxrefPreg) {
            // startxref found
            $startxref = $matches[1][0];
        }

$matches is an empty array.

Code

Just the usual stuff.

@GreyWyvern
Contributor

GreyWyvern commented Feb 6, 2024

Looks like this is happening because the test in the elseif chain before this is a preg_match that also sets a $matches variable, overwriting the one from the first preg_match.

...
        } elseif (strpos($pdfData, 'xref', $offset) == $offset) {
            // Already pointing at the xref table
            $startxref = $offset;
        } elseif (preg_match('/([0-9]+[\s][0-9]+[\s]obj)/i', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset)) {
                                                // $matches gets set here ^ by this test, even if it fails

            // Cross-Reference Stream object
            $startxref = $offset;
        } elseif ($startxrefPreg) {
            // startxref found
            $startxref = $matches[1][0];
// This is the wrong $matches ^ now

        } else {
            throw new \Exception('Unable to find startxref');
        }
...
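
A minimal sketch of the overwrite described above, using synthetic data rather than the reported file. The first pattern (`/startxref[\s]*([0-9]+)/i`) and the `$saved` copy are assumptions made for illustration; only the second pattern is taken from the snippet above.

```php
<?php
// Synthetic stand-in for PDF data; not taken from the reported file.
$pdfData = "junk startxref\n42\n%%EOF";
$offset = 0;

// First test (assumed pattern): captures the number after "startxref".
$startxrefPreg = preg_match('/startxref[\s]*([0-9]+)/i', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset);
$saved = $matches; // copy taken before any later test can touch $matches

// Second test (pattern from the elseif chain): it fails on this data,
// but preg_match() still resets $matches to an empty array.
$objFound = preg_match('/([0-9]+[\s][0-9]+[\s]obj)/i', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset);

var_dump($startxrefPreg); // int(1)
var_dump($objFound);      // int(0)
var_dump($matches);       // empty array, so $matches[1][0] is undefined
var_dump($saved[1][0]);   // string(2) "42"
```

Giving each `preg_match` call its own result variable would avoid this kind of clobbering, whatever the root cause of the failed lookup turns out to be.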

Also, the example file from the OP is a gigantic 712 page PDF that so far has not finished parsing since I started writing this post! :D I would not recommend using PdfParser to extract text from this file. You should probably use an online tool that separates all the pages into individual PDF files and run PdfParser on those.

Edit: After churning for 10 minutes on this file, PHP ran out of memory, lol.

@micos7
Author

micos7 commented Feb 6, 2024

I'm using it to count the pages for a middleware in Laravel; for extracting text I use Python, which can digest weird formats. I have PDFs with 2000+ pages...

k00ni added the bug label Feb 7, 2024
@GreyWyvern
Contributor

GreyWyvern commented Feb 7, 2024

On further investigation, this is actually happening because either the example PDF isn't giving the correct, to-the-byte offset for the start of an xref object, or PdfParser isn't being lenient enough when checking the current offset against the content of the file.

One of the tests in RawDataParser->getXrefData() is as follows:

        } elseif (strpos($pdfData, 'xref', $offset) == $offset) {

This checks to see if, for the given $offset in the PDF, there is an xref statement, and if so we should start parsing content here. However, this check is to-the-byte strict. In the case of this PDF, the $offset value given actually points to a whitespace character (carriage return followed by a newline) two bytes before the xref. So when PdfParser fails to find the xref at the exact $offset value, it actually falls into a loop trying (and failing) to find it over and over and over, which is where PHP was running out of memory.

When I add the following code to "bump the caret" past any whitespace at the current offset, the xref command is found and this huge PDF is actually parsed and displayed in a remarkably short time:

        while (preg_match('/\s/', substr($pdfData, $offset, 1))) {
            $offset++;
        }

        if (0 == $offset) {
            ...

I'm not sure this is the best solution yet, and I haven't run it through the unit tests either. However, adding this code with no other changes allows parsing of the OP's example file.
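
As a self-contained illustration of the strict check and the whitespace skip, here is a sketch on synthetic data. The helper name `skipWhitespace` is hypothetical and not part of PdfParser; the bounds check on `$offset` is an addition for the standalone example.

```php
<?php
// Hypothetical helper: advance $offset past any whitespace bytes.
function skipWhitespace(string $pdfData, int $offset): int
{
    $length = strlen($pdfData);
    while ($offset < $length && preg_match('/\s/', $pdfData[$offset])) {
        $offset++;
    }
    return $offset;
}

// Synthetic data: the recorded offset points at "\r\n", two bytes before "xref".
$pdfData = "%PDF-1.3\r\nxref\n0 1\n0000000000 65535 f\n";
$offset = 8;

var_dump(strpos($pdfData, 'xref', $offset) === $offset); // bool(false)
$offset = skipWhitespace($pdfData, $offset);
var_dump($offset);                                        // int(10)
var_dump(strpos($pdfData, 'xref', $offset) === $offset); // bool(true)
```

The sketch compares with `===` so that a `false` return from `strpos()` can never be mistaken for offset 0.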

Edit: All unit tests pass with addition of this code. I'm studying the PDF Reference to see if there are considerations for offset values to be lenient with whitespace like this.

This might be because in the PDF header, when loaded as ISO-8859-1, we see the following:

%PDF-1.3
%âãÏÓ
3390 0 obj
...

But when loaded as UTF-8, the four special characters are merged into two unknown characters, perhaps lopping off two bytes from every offset value:

%PDF-1.3
%??
3390 0 obj
...

It's a plausible cause, but I'm not sure it's the actual one. Other PDFs have headers just like these and their offset values aren't off by two bytes.
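
One point that may help weigh this theory: PHP's `strlen()`, `strpos()` and `substr()` count bytes, so offsets computed on the raw file string should not depend on how an editor decodes the header. The sketch below uses a synthetic header and assumes the data is read without any encoding conversion.

```php
<?php
// The four bytes after "%" in the header, written as raw byte values
// (they display as "âãÏÓ" when decoded as ISO-8859-1).
$header = "%PDF-1.3\n%\xE2\xE3\xCF\xD3\n3390 0 obj\n";

// Both functions count bytes, regardless of how a viewer renders them.
var_dump(strlen("\xE2\xE3\xCF\xD3"));    // int(4)
var_dump(strpos($header, '3390 0 obj')); // int(15)
```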

@GreyWyvern
Contributor

So, this file contains a Prev 7123863 command which references the character position of the previous XRef block. Loading the file as a string and doing a var_dump(substr($pdfdata, 7123863, 200)); results in:

string(200) "
xref
0 3390 
0000000000 65535 f
0000667726 00000 n
0000667861 00000 n
0000668830 00000 n
0000668970 00000 n
0000669939 00000 n
0000670272 00000 n
0000670294 00000 n
0000670434 00000 n
00006"

You can see that the string begins with a newline character (in fact a carriage-return plus newline \r\n) and the xref starts on the next line. PdfParser expects the xref text to be at exactly character position 7123863, instead of 7123865. When it does not find the xref text, it stops looking for xref and instead scans the document from this offset for the next startxref command. The one it finds is one it's seen before though, the one that contains the Prev 7123863 command, so PdfParser falls into an endless loop at this point.
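
Separately from the offset fix, a loop like this can be made to fail fast by remembering which offsets have already been visited. The following is a generic sketch under a simplified model, not what PdfParser does: `walkXrefChain` and the `$prevByOffset` map are hypothetical, and the offsets are made up.

```php
<?php
// Hypothetical model: each xref section offset maps to the "Prev" offset it
// points to (null means there is no earlier section).
function walkXrefChain(array $prevByOffset, int $start): array
{
    $visited = [];
    $offset = $start;
    while ($offset !== null) {
        if (isset($visited[$offset])) {
            // Seen this section before: stop instead of looping forever.
            throw new \RuntimeException("xref chain loops back to offset $offset");
        }
        $visited[$offset] = true;
        $offset = $prevByOffset[$offset] ?? null;
    }
    return array_keys($visited);
}

// A healthy chain ends; a chain that points back at itself is rejected.
print_r(walkXrefChain([900 => 500, 500 => null], 900));
try {
    walkXrefChain([900 => 500, 500 => 900], 900);
} catch (\RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```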

The PDF Reference is not exactly clear on this, but in theory, an incorrect XRef offset value should cause an error and the PDF should fail to display. However, in practice, Adobe Acrobat is loading the OP's sample file and displaying it without error. Obviously Acrobat accounts for this and deals with it internally.

Therefore I believe that my "bump the caret" code above is probably an acceptable solution to this. What do you think, @k00ni ?

@k00ni
Collaborator

k00ni commented Mar 13, 2024

> Therefore I believe that my "bump the caret" code above is probably an acceptable solution to this. What do you think, @k00ni?

👍 Sounds reasonable. Can you provide a PR?

@micos7 Can we use your PDF for our test environment (it must be free of charge and without any obligations)? If so, please reupload.
