Undefined array key 1,crash on parsing #673
Comments
Looks like this is happening because the test in the ...
} elseif (strpos($pdfData, 'xref', $offset) == $offset) {
// Already pointing at the xref table
$startxref = $offset;
} elseif (preg_match('/([0-9]+[\s][0-9]+[\s]obj)/i', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset)) {
// $matches gets set here ^ by this test, even if it fails
// Cross-Reference Stream object
$startxref = $offset;
} elseif ($startxrefPreg) {
// startxref found
$startxref = $matches[1][0];
// This is the wrong $matches ^ now
} else {
throw new \Exception('Unable to find startxref');
}
... Also, the example file from the OP is a gigantic 712-page PDF that still had not finished parsing by the time I finished writing this post! :D I would not recommend using PdfParser to extract text from this file. You should probably use an online tool that splits the pages into individual PDF files and then run PdfParser on those. Edit: After churning for 10 minutes on this file, PHP ran out of memory, lol.
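The `$matches` clash described above can be reproduced in isolation. This is a minimal sketch, not PdfParser's actual code: the sample string and the simplified `startxref` pattern are assumptions made for illustration, and `$startxrefMatches` / `$objMatches` are hypothetical variable names. It shows that a `preg_match()` call that finds nothing still resets the array passed as `$matches`, and that giving each test its own array avoids the problem.

```php
<?php
// Hypothetical stand-in for the tail of a PDF; not taken from the OP's file.
$pdfData = "startxref\n116\n%%EOF";

// Simplified, assumed pattern: the first match fills $matches with the offset digits.
$startxrefPreg = preg_match('/startxref[\s]*([0-9]+)/', $pdfData, $matches);
$before = $matches;

// A later preg_match() that does not match still overwrites $matches with an empty array.
$found = preg_match('/([0-9]+[\s][0-9]+[\s]obj)/i', $pdfData, $matches);
$after = $matches;

// One possible fix: keep each result in its own variable.
preg_match('/startxref[\s]*([0-9]+)/', $pdfData, $startxrefMatches);
preg_match('/([0-9]+[\s][0-9]+[\s]obj)/i', $pdfData, $objMatches);
$startxref = (int) $startxrefMatches[1];
```

After the second call, reading `$matches[1]` raises the "Undefined array key 1" warning from the issue title, while `$startxrefMatches[1]` is still intact.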
I'm using it to count the pages for a middleware in Laravel; for extracting text I use Python, which can digest weird formats. I have PDFs with 2000+ pages...
On further investigation, this is actually happening because either the example PDF isn't giving the correct, to-the-byte offset for the start of an
One of the tests in
} elseif (strpos($pdfData, 'xref', $offset) == $offset) {
This checks to see if, for the given
When I add the following code to "bump the caret" past any whitespace at the current offset, the
while (preg_match('/\s/', substr($pdfData, $offset, 1))) {
$offset++;
}
if (0 == $offset) {
... I'm not sure this is the best solution yet, and
Edit: All unit tests pass with the addition of this code. I'm studying the PDF Reference to see if there are considerations for offset values to be lenient with whitespace like this. This might be because in the PDF header, when loaded as ISO-8859-1, we see the following:
But when loaded as UTF-8, the four special characters are merged into two unknown characters, perhaps lopping off two bytes from every offset value:
It's a plausible cause, but I'm not sure it's the actual one. Other PDFs have headers just like these and their offset values aren't off by two bytes.
So, this file contains a string(200) "
xref
0 3390
0000000000 65535 f
0000667726 00000 n
0000667861 00000 n
0000668830 00000 n
0000668970 00000 n
0000669939 00000 n
0000670272 00000 n
0000670294 00000 n
0000670434 00000 n
00006"
You can see that the string begins with a newline character (in fact a carriage-return plus newline
The PDF Reference is not exactly clear on this, but in theory, an incorrect XRef offset value should cause an error and the PDF should fail to display. However, in practice, Adobe Acrobat loads the OP's sample file and displays it without error. Evidently Acrobat accounts for this and deals with it internally. Therefore I believe that my "bump the caret" code above is probably an acceptable solution to this. What do you think, @k00ni?
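To make the off-by-two behaviour concrete, here is a small sketch of the check failing and then passing once whitespace is skipped. The sample data and the offset value are invented for illustration and are not the OP's file; the loop mirrors the "bump the caret" idea, with a bounds check added as an extra precaution that the snippet above does not include.

```php
<?php
// Hypothetical sample: the recorded offset lands on "\r\n", two bytes before "xref".
$pdfData = "%PDF-1.4\r\nxref\r\n0 1\r\n0000000000 65535 f\r\n";
$offset = 8; // points at "\r" rather than at the "x" of "xref"

// The existing test fails because "xref" actually starts at byte 10.
$matchesBefore = (strpos($pdfData, 'xref', $offset) == $offset);

// Skip any whitespace at the current offset, stopping at the end of the data.
$length = strlen($pdfData);
while ($offset < $length && preg_match('/\s/', $pdfData[$offset])) {
    $offset++;
}

// With the offset advanced to the keyword, the same test now succeeds.
$matchesAfter = (strpos($pdfData, 'xref', $offset) === $offset);
```

The second comparison uses `===` so that a `false` return from `strpos()` can never compare equal to an offset of `0`; whether that matters in the real code path is an assumption, since the surrounding code is not shown here.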
👍 Sounds reasonable. Can you provide a PR? @micos7 Can we use your PDF for our test environment (it must be free of charge and without any obligations)? If so, please reupload.
Description:
PDF input
Cracking-the-Coding-Interview-6th-Edition-189-Programming-Questions-and-Solutions.pdf
Expected output & actual output
It crashes in the RawDataParser, line 890
elseif ($startxrefPreg) {
// startxref found
$startxref = $matches[1][0];
}
$matches is an empty array.
Code
Just the usual stuff.