-
Notifications
You must be signed in to change notification settings - Fork 539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
text encoding breaks in the middle of the line #586
Comments
This is another example (related to #607 ) of where re-saving the PDF using Adobe Acrobat fixes the encoding issue. The listed "producer" is "3-Heights(TM) PDF Security Shell 4.8.25.2 (http://www.pdf-tools.com)" Obviously Adobe Acrobat can interpret the mess, but doesn't save it that way. |
@GreyWyvern the problem is, i do not want to detect issues, resave any pdfs and hope, that issues would disappear somehow. I'd prefer to get pdf's url, extract text from pdf, and forever forget the url, pdf -- everything but text. |
Probably i got You wrong, and you wanted to say, that my example was also generated by well-known to You faulty producer? |
Oh, I know PdfParser should be able to decode it somehow, but it's something to do with the way the application is saving it, rather than PdfParser decoding it incorrectly. My speculation here is that some error-correction code needs to be added, not that PdfParser is doing something wrong. |
No. Enjoy this comment for decodeContentByToUnicodeCMapOrDescendantFonts :( pdfparser/src/Smalot/PdfParser/Font.php Line 486 in 66ddf47
So, a person required, who would understand current algorithm, pdf spec, and implement spec-compatible algorithm, that woudn't destroy the rest of the lib. |
More research on this. The issue affects the last complete "line" in the PDF data (this is different from the last line in the visual document). For some reason this line does not get assigned a font nor a font-table so it does not get parsed and just passes through undecoded. If I force the So it seems that the issue is... why is this one line not getting assigned the same font as all the other lines? Adobe can obviously cope with it; is it using some kind of fall-back if it can't find a linked font? I think that might be the solution for PdfParser. |
More research. I now believe PdfParser is actually working as intended, and Adobe is doing some convoluted error correction. I think Adobe is somehow detecting that the current line of text cannot be rendered in the specified font, and actually disobeys the PDFs instructions by overriding the requested font. Here is the last section of the content stream:
As explanation, every line that ends with " Tf" means a font change for the content following, and there are two fonts defined in the document: TT1 and C2_1. The font with ID C2_1 is the one that has the correct code page to display Cyrillic characters. TT1 gets used only for spaces and punctuation like commas and periods. Lines that end with " Tj" or " TJ" are lines of content (you might have to scroll the code window to the right to see these line endings since they're long). You'll notice that for the first three " TJ" lines, the font has been correctly set to C2_1. But then, before the last line of " TJ" content (the line that begins In this case, PdfParser is simply obeying the instructions of the PDF and displaying the last line of content using the incorrectly specified font code page. So unless you want to add the same kind of error-detection and recovery code that Adobe probably has, I think this Issue can be closed with "working as intended." PdfParser is just doing what it's told. Edit: Stop the presses... I just might have a fix for this. |
Thank you for looking into this @GreyWyvern. It would be great if you could add a link to the part of the specification you referring to. |
Does it make sense to add an option to use the "recovery code" if someone wants it? (ref: CustomConfig) |
Yes, you can use it. You can also try page92.pdf from |
Thanks!
Thanks for that too. Unfortunately it looks like #585 is a different issue. Probably a missing single mapping in a CIDMap (character code dictionary) rather than a whole missing CIDMap which #614 addresses. I'll have to take a look at that later. :) |
I believe my PR #614 falls under the Handling Undefined Characters in section 5.6 of the PDF Reference 1.7. (pages 454-455) Specifically the paragraph beginning "If the code is invalid..." and the following two numbered points. The only difference being I'm not doing it in a bytewise fashion, I'm decoding the whole string using a different codespace and then checking for invalid characters again. |
Maybe? I think the major caveat about this fix is that a "correctly" decoded text string that may contain actual UTF-8 control characters would make the script cycle through all the fonts in the document trying to get rid of them. However, such a string would only end up incorrectly decoded if all of its characters successfully mapped to another font in the document which doesn't seem very likely. Perhaps we could add it as a config option defaulting to On? Then if it does end up causing problems, it can easily be disabled. Famous last words and all, but I personally don't see this change causing problems. :) |
Add a unit test for correctly decoding an emdash in Cyrillic text. Use sample PDF from issue smalot#585 User @se-ti allowed use of this file in issue smalot#586 (comment) In `cleanContent()` once all strings and dictionaries are hidden, do a MIME-type check on the remaining content. If it doesn't register as text/plain, then return an empty string. This prevents non-document-stream data from being passed to `cleanContent()` such as JPEG data in file '12249.pdf' from smalot#458 Remove whitespace-adding code from **Font.php**. I originally added this code as a "shim" because `decodeText()` did not take into account the current Text Matrix when considering what counted as "words". Now that it does, the previous code of just `implode()`-ing with a space character works.
* Rewrite PDFObject.php + ancillary This is a major update to the PHPObject.php file. Where previously PdfParser would attempt to gather document stream data using a series of multiline regular expressions, this update changes the behaviour of `cleanContent()` to the following: - Hide all (strings) - Remove all newlines and carriage-returns - Hide all <<dictionaries>> - Normalize all whitespace - Using a list of valid Operators from the PDF Reference, add \r\n back to the remaining text so that there is a single PDF command on every line - Restore the <<dictionaries>> and (strings) By using this system, it is then much easier to examine and parse the document stream in a line-by-line manner, instead of PREG extraction. `getSections()` text has been updated to do just this, stepping through the output of `cleanContent()` line-by-line and returning an array of only the relevant commands needed to position and display text. The guts of `getText()` have been moved to `getTextArray()` to reduce code duplication. `getTextArray()` now takes into account both the current graphics position `cm` and the scaling factors of the text matrix `Tm` when adding \n and \t whitespace for positioning. Positioning is only taken into account at the point of inserting text, rather than whenever a `Tm` or `Td`/`TD` was found. It also treats `Q` and `q` as a stack of stored states rather than a single stored state. The presence of `ActualText` `BDC` commands is also taken into account and the contents of the `ActualText` will replace the formerly output text in both content and position. This requires the new `parseDictionary()` method to reliably extract such commands as well as any others PdfParser may take into account in the future. `decodeText()` in **Font.php** now takes into account the current text matrix when considering whether or not to add spaces between words. Instead of `implode()`-ing the result array with a space joiner, rely on the positioning check to add all required spacing. In `decodeContent()` in **Font.php** add a check to see if the string to decode has the UTF-16BE BOM and decode it directly as Unicode if so. * Add test, check for text/plain Add a unit test for correctly decoding an emdash in Cyrillic text. Use sample PDF from issue #585 User @se-ti allowed use of this file in issue #586 (comment) In `cleanContent()` once all strings and dictionaries are hidden, do a MIME-type check on the remaining content. If it doesn't register as text/plain, then return an empty string. This prevents non-document-stream data from being passed to `cleanContent()` such as JPEG data in file '12249.pdf' from #458 Remove whitespace-adding code from **Font.php**. I originally added this code as a "shim" because `decodeText()` did not take into account the current Text Matrix when considering what counted as "words". Now that it does, the previous code of just `implode()`-ing with a space character works. * Modify some code comments + q and Q update Modify several code comments to be clearer. Remove the `$key => ` from `decodeText()` in **Font.php** as it's no longer needed. Also, now that `cleanContent()` is ignoring non `text/plain` content, there should be no errant `q` or `Q` commands that cause the stored-state stack to try restoring a state that doesn't exist. Remove the kludgy code that prevented this. * Update Font.php Remove unnecessary `$whitespace` line. * PHP-cs-fixer edits * Address some reviewed changes * Use text matrix 'i' instead of 'b' The correct matrix elements to use for scaling the x-axis are actually the first *column*, so 'a' and 'i', not 'a' and 'b'. My bad! It worked before because almost always the x-axis scaling is equal to the y-axis scaling. * Use mb_detect_encoding() instead of finfo() The Fileinfo functions are not installed by default on Windows, so use a different method to determine whether the stream is valid or binary. * Update Font.php PHP CS Fixer native_function_invocation * Sort switch statement cases in getTextArray() Make the cases a little bit more alphabetical. Remove cases/commands that aren't relevant to getting and positioning text. * Edit documentation comments, add tests Add some documentation comment text to PDFObject.php and fix a comment typo in Font.php. Add a test accounting for text-matrix scaling in Font::decodeText(). Add a test verifying that a string prefixed by a UTF-16BE BOM is decoded directly by Font::decodeContent(). Move "ET in font name" test from testCleanContent() to testGetSectionsText() as that is the function the test uses. Add a test that verifies cleanContent() returns an empty string for binary content. Remove unnecessary variable reset from ET command in Page::getDataTm. Only needed under BT. * Update ParserTest.php Add a fixed numerical value instead of a multiple of $baselineMemory so there's less dependence on the initial memory value and more focus on how much memory the loading of the PDF actually takes. Hopefully fixes the last unit-test failure? Change tests to assertGreaterThan and assertLessThan so it provides better failure info than just assertTrue (will tell us the values). In my PHP 8.2.7 test env, the $usedMemory exceeds $baselineMemory by ~209,000,000 bytes at this point (line 379). 180,000,000 should be safe? * Restore original cleanContent() Rename the new function formatContent() and restore the original cleanContent() function and tests. If anyone relied on these functions before then this will not be a backwards incompatible change. * added a few PDFs generated with various tools to check conformance * fixed CS issue * Remove textMatrix from decodeText() As it turns out, the current text matrix *can* be used to position text in decodeText(), but it is actually applied **both** to the $offset and the user-selected font space limit. In this way it cancels out, and importing the text matrix into the function is not necessary. Remove two tests in FontTest.php that were associated with this, as they're no longer valid. In decodeText() account for strings of text that have spaces with large offsets, such as 'a line of text that is full-width justified.' Treat these as single spaces instead of honouring the offsets given and spacing them out unnecessarily. Save and use the current font-size when guessing the "fudge" factor for text block width. Make sure this is saved as well when the q command is seen. Adjust the position factor threshold up from 10 to 24 for deciding when to insert tabs or spaces. Also update the current horizontal position when storing the last-written-position for use by an ActualText. Add a test for Microsoft Print-to-PDF v1.7. This change actually reverts the output value in a ParserTest assertion back to the original value before this PR. So that's good! * Account for text leading, font-size factor Keep track of, and account for, the current text leading (space between lines) value when determining whether or not to insert newlines \n. Use the current font size as a factor in determining whether or not to insert positioning whitespace. Take more care wrt adding whitespace. Examine some points where whitespace might be added, and remove it if it's unnecessary. Implode words in Font.php, but examine the words to see if they already have space characters on their fronts and ends. TD and Td "move to the start of the next line" so use a $current_position_td['x'] of zero (0) to start with. This change shouldn't really affect much since enough of a y-axis leading should be added that a newline \n results. In Integration\FontTest.php remove "group linux-only" from testDecodeTextForFontWithIndirectEncodingWithoutTypeEncoding() as this now succeeds in any environment. Before where PdfParser was adding spaces after the "zůstatek:" lines whereas now PdfParser adds tabs, which makes far more sense when looking at the document /samples/bugs/PullRequest500.pdf So change the expected results to match the output. * Update FontTest.php Don't use HEREDOC syntax to compose the expected value for the testDecodeTextForFontWithIndirectEncodingWithoutTypeEncoding() test in FontTest.php. This should allow it to work regardless of the PHP file line-endings. * Misc fixes - When assigning $current_text_leading, take the negative as described in the PDF Reference. Initialize it to zero, also as in the Reference. - Add back $current_position_td['x'] when considering a Td or TD command as this appears to be the way it's supposed to be handled according to the reference. - Add another check to remove unnecessary whitespace by examining the previous string of text for a trailing space. - $current_position_tm['x'] and $current_position_tm['y'] aren't tested for an initial false value, so initialize them to 0. - Use double null bytes to implode the string in Font.php to avoid catching UTF-8 bytes that might include a single null byte. - Base the text-box width guess on the current font size instead of a constant value of 4. - In PageTest.php, PdfParser now correctly inserts a newline between Aardappelen and Zak, so change the test. - In ParserTest.php, we are now removing unnecessary whitespace, including double spaces, so change this test to match the output. (Double spaces changed to single spaces) * Add SmallPDF sample PDF and unit test * Pass horizontal fontFactor to decodeText() Pass the fontFactor value (font size multiplied by the text-matrix scale value) to decodeText() so it can more accurately separate text into words by position. A font factor divided by four seems to give the best results for both horizontal and vertical positioning, although this value is selected arbitrarily. In Font::decodeText() some PDFs may have text-matrix scale values so large that directly encoded horizontal tabs are used in place of spaces. Remove internal tabs from texts and replace them with plain spaces. * Update PDFObject.php Restore the $offset parameter to public function getCommandsText() for backwards compatibility. * moved new PDFs into special folder adapted paths in tests * split DocumentTest class spit into a general one, one that is issue related and one which is PDF generator related * added 3 PDFs generated by different generators + tests * More accounting for horizontal font factor Account for the entire font-factor (font-size multiplied by the horizontal scaling factor of the text-matrix) when estimating the width of the current text block. Insert a fix when decoding octal strings in Font::decodeOctal(), check further ahead for escaped backslashes. Remove tests for images in DocumentGeneratorFocusTest.php. These also fail in the current v2.7.0 release and they should be looked at in a separate PR. * Revert decodeOctal(), add @internal Revert the change to Font::decodeOctal() as it's been superceded by PR #640. Add @internal notes to formatContent() and parseDictionary(). * Proper use of PHPDoc "internal" The `@internal` tag hides the content that comes _after_ it from the documentation, so adjust these comments as appropriate. See: https://manual.phpdoc.org/HTMLSmartyConverter/HandS/phpDocumentor/tutorial_tags.internal.pkg.html * Disable a failing test This test will succeed once PR #640 is merged. It doesn't have anything to do with the current PR, so disable it for now. * Update PDFObject.php Switch to tagging method for `@internal`. Adjust comments. * Update PDFObject.php PHP-CS-Fixer requires spaces between `@` statements I guess. * Return empty array if no valid command detected In some edge cases, the formatContent() method may return a document stream row containing an invalid command. Make sure we just ignore these commands instead of triggering warnings for missing $matches array elements. * Update DocumentGeneratorFocusTest.php Re-enable this assertion, now that we have merged #640. * Make formatContent() private Make the formatContent() method private to PDFObject so that `@internal` isn't required. Adjust the unit tests with `ReflectionMethod` to account for this. --------- Co-authored-by: Konrad Abicht <hi@inspirito.de>
Description:
In the middle of the last line text encoding changes to a wierd one with squares instead of characters.
Acrobat Reader and Chrome pdf-viewer shows this file correct;
PDF input
page91.pdf
Expected output & actual output
actual: "Тем не менее, стаявший ледник �\�h�d�j�m�]�� �g�m�g�Z�l�Z�d�Z���� �k�m�^�y�� �i�h�� �\�k�_�f�m���� �k�^�_�e�Z�e�� �i�_�j�_�\�Z�e�� �k�e�h�`�g�_�_�� �b��"
expected: "Тем не менее, стаявший ледник вокруг нунатака, судя по всему, сделал перевал сложнее и"
Code
The text was updated successfully, but these errors were encountered: