Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better octal and hex-entity decode #640

Merged
merged 4 commits into from
Sep 26, 2023
Merged

Conversation

GreyWyvern
Copy link
Contributor

@GreyWyvern GreyWyvern commented Sep 19, 2023

Type of pull request

  • Bug fix (involves code and configuration changes)
  • New feature (involves code and configuration changes)
  • Documentation update
  • Something else

About

A follow-up and improvement over #627.

Octal strings can include series of backslashes of arbitrary length. If there is an odd number of backslashes, a following octal code is valid, but if there's an even number, the following octal code should not be translated.

Previously PdfParser would only account for two backslashes directly preceding an octal code. A commit from in-progress PR #634 extended this to three which probably covers 99.99% of all cases. This change ups that to 100% in that there could be a string with any number of backslashes in a row, and codes will be correctly translated.

Also update decodeEntities() to use a preg_replace_callback() instead of the bulkier preg_split() + foreach loop. Make sure it matches all hexadecimal digits including a-f.

Add new tests for both of these.

Checklist for code / configuration changes

In case you changed the code/configuration, please read each of the following checkboxes as they contain valuable information:

  • Please add at least one test case (unit test, system test, ...) to demonstrate that the change is working. If existing code was changed, your tests cover these code parts as well.
    By the way, you don't have to provide a full fledged PDF file to demonstrate a fix. Instead a unit test may be sufficient sometimes, please have a look at FontTest for example code.
    Code changes without any tests are likely to be rejected. If you dont know how to write tests, no problem, tell us upfront and we may add them ourselves or discuss other ways.
  • Please run PHP-CS-Fixer before committing, to confirm with our coding styles. See https://github.com/smalot/pdfparser/blob/master/.php-cs-fixer.php for more information about our coding styles.

Octal strings can include series of backslashes of arbitrary length. If there is an odd number of backslashes, a following octal code is valid, but if there's an even number, the following octal code should not be translated.

Previously PdfParser would only account for two backslashes directly preceding an octal code. A commit from in-progress PR smalot#634 extended this to three which probably covers 99.99% of all cases. This change ups that to 100% in that there could be a string with any number of backslashes in a row, and codes will be correctly translated.

Also update decodeEntities() to use a preg_replace_callback() instead of the bulkier preg_split() + foreach loop. Make sure it matches all hexadecimal digits including a-f.

Add new tests for both of these.
Move the special string replacement after the unescaping of parentheses so we don't unescape any parentheses we shouldn't.

Add more tests to make sure this is working.
Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and clean. Thank you again. 👍

GreyWyvern added a commit to GreyWyvern/pdfparser that referenced this pull request Sep 20, 2023
Revert the change to Font::decodeOctal() as it's been superceded by PR smalot#640.

Add @internal notes to formatContent() and parseDictionary().
GreyWyvern added a commit to GreyWyvern/pdfparser that referenced this pull request Sep 20, 2023
This test will succeed once PR smalot#640 is merged. It doesn't have anything to do with the current PR, so disable it for now.
Co-authored-by: Konrad Abicht <hi@inspirito.de>
@k00ni
Copy link
Collaborator

k00ni commented Sep 22, 2023

Ready for merge?

@GreyWyvern
Copy link
Contributor Author

Ready for merge?

Yes. All done. 👍

@k00ni k00ni merged commit 5c48261 into smalot:master Sep 26, 2023
GreyWyvern added a commit to GreyWyvern/pdfparser that referenced this pull request Sep 26, 2023
Re-enable this assertion, now that we have merged smalot#640.
k00ni added a commit that referenced this pull request Nov 7, 2023
* Rewrite PDFObject.php + ancillary

This is a major update to the PHPObject.php file. Where previously PdfParser would attempt to gather document stream data using a series of multiline regular expressions, this update changes the behaviour of `cleanContent()` to the following:
- Hide all (strings)
- Remove all newlines and carriage-returns
- Hide all <<dictionaries>>
- Normalize all whitespace
- Using a list of valid Operators from the PDF Reference, add \r\n back to the remaining text so that there is a single PDF command on every line
- Restore the <<dictionaries>> and (strings)

By using this system, it is then much easier to examine and parse the document stream in a line-by-line manner, instead of PREG extraction. `getSections()` text has been updated to do just this, stepping through the output of `cleanContent()` line-by-line and returning an array of only the relevant commands needed to position and display text.

The guts of `getText()` have been moved to `getTextArray()` to reduce code duplication. `getTextArray()` now takes into account both the current graphics position `cm` and the scaling factors of the text matrix `Tm` when adding \n and \t whitespace for positioning. Positioning is only taken into account at the point of inserting text, rather than whenever a `Tm` or `Td`/`TD` was found.

It also treats `Q` and `q` as a stack of stored states rather than a single stored state.

The presence of `ActualText` `BDC` commands is also taken into account and the contents of the `ActualText` will replace the formerly output text in both content and position. This requires the new `parseDictionary()` method to reliably extract such commands as well as any others PdfParser may take into account in the future.

`decodeText()` in **Font.php** now takes into account the current text matrix when considering whether or not to add spaces between words. Instead of `implode()`-ing the result array with a space joiner, rely on the positioning check to add all required spacing.

In `decodeContent()` in **Font.php** add a check to see if the string to decode has the UTF-16BE BOM and decode it directly as Unicode if so.

* Add test, check for text/plain

Add a unit test for correctly decoding an emdash in Cyrillic text. Use sample PDF from issue #585 User @se-ti allowed use of this file in issue #586 (comment)

In `cleanContent()` once all strings and dictionaries are hidden, do a MIME-type check on the remaining content. If it doesn't register as text/plain, then return an empty string. This prevents non-document-stream data from being passed to `cleanContent()` such as JPEG data in file '12249.pdf' from #458

Remove whitespace-adding code from **Font.php**. I originally added this code as a "shim" because `decodeText()` did not take into account the current Text Matrix when considering what counted as "words". Now that it does, the previous code of just `implode()`-ing with a space character works.

* Modify some code comments + q and Q update

Modify several code comments to be clearer.

Remove the `$key => ` from `decodeText()` in **Font.php** as it's no longer needed.

Also, now that `cleanContent()` is ignoring non `text/plain` content, there should be no errant `q` or `Q` commands that cause the stored-state stack to try restoring a state that doesn't exist. Remove the kludgy code that prevented this.

* Update Font.php

Remove unnecessary `$whitespace` line.

* PHP-cs-fixer edits

* Address some reviewed changes

* Use text matrix 'i' instead of 'b'

The correct matrix elements to use for scaling the x-axis are actually the first *column*, so 'a' and 'i', not 'a' and 'b'. My bad! It worked before because almost always the x-axis scaling is equal to the y-axis scaling.

* Use mb_detect_encoding() instead of finfo()

The Fileinfo functions are not installed by default on Windows, so use a different method to determine whether the stream is valid or binary.

* Update Font.php

PHP CS Fixer native_function_invocation

* Sort switch statement cases in getTextArray()

Make the cases a little bit more alphabetical. Remove cases/commands that aren't relevant to getting and positioning text.

* Edit documentation comments, add tests

Add some documentation comment text to PDFObject.php and fix a comment typo in Font.php.

Add a test accounting for text-matrix scaling in Font::decodeText().

Add a test verifying that a string prefixed by a UTF-16BE BOM is decoded directly by Font::decodeContent().

Move "ET in font name" test from testCleanContent() to testGetSectionsText() as that is the function the test uses.

Add a test that verifies cleanContent() returns an empty string for binary content.

Remove unnecessary variable reset from ET command in Page::getDataTm. Only needed under BT.

* Update ParserTest.php

Add a fixed numerical value instead of a multiple of $baselineMemory so there's less dependence on the initial memory value and more focus on how much memory the loading of the PDF actually takes. Hopefully fixes the last unit-test failure?

Change tests to assertGreaterThan and assertLessThan so it provides better failure info than just assertTrue (will tell us the values).

In my PHP 8.2.7 test env, the $usedMemory exceeds $baselineMemory by ~209,000,000 bytes at this point (line 379). 180,000,000 should be safe?

* Restore original cleanContent()

Rename the new function formatContent() and restore the original cleanContent() function and tests. If anyone relied on these functions before then this will not be a backwards incompatible change.

* added a few PDFs generated with various tools to check conformance

* fixed CS issue

* Remove textMatrix from decodeText()

As it turns out, the current text matrix *can* be used to position text in decodeText(), but it is actually applied **both** to the $offset and the user-selected font space limit. In this way it cancels out, and importing the text matrix into the function is not necessary. Remove two tests in FontTest.php that were associated with this, as they're no longer valid.

In decodeText() account for strings of text that have spaces with large offsets, such as
'a    line    of    text    that    is    full-width    justified.'
Treat these as single spaces instead of honouring the offsets given and spacing them out unnecessarily.

Save and use the current font-size when guessing the "fudge" factor for text block width. Make sure this is saved as well when the q command is seen.

Adjust the position factor threshold up from 10 to 24 for deciding when to insert tabs or spaces.

Also update the current horizontal position when storing the last-written-position for use by an ActualText.

Add a test for Microsoft Print-to-PDF v1.7.

This change actually reverts the output value in a ParserTest assertion back to the original value before this PR. So that's good!

* Account for text leading, font-size factor

Keep track of, and account for, the current text leading (space between lines) value when determining whether or not to insert newlines \n.

Use the current font size as a factor in determining whether or not to insert positioning whitespace.

Take more care wrt adding whitespace. Examine some points where whitespace might be added, and remove it if it's unnecessary. Implode words in Font.php, but examine the words to see if they already have space characters on their fronts and ends.

TD and Td "move to the start of the next line" so use a $current_position_td['x'] of zero (0) to start with. This change shouldn't really affect much since enough of a y-axis leading should be added that a newline \n results.

In Integration\FontTest.php remove "group linux-only" from testDecodeTextForFontWithIndirectEncodingWithoutTypeEncoding() as this now succeeds in any environment. Before where PdfParser was adding spaces after the "zůstatek:" lines whereas now PdfParser adds tabs, which makes far more sense when looking at the document /samples/bugs/PullRequest500.pdf So change the expected results to match the output.

* Update FontTest.php

Don't use HEREDOC syntax to compose the expected value for the testDecodeTextForFontWithIndirectEncodingWithoutTypeEncoding() test in FontTest.php. This should allow it to work regardless of the PHP file line-endings.

* Misc fixes

- When assigning $current_text_leading, take the negative as described in the PDF Reference. Initialize it to zero, also as in the Reference.
- Add back $current_position_td['x'] when considering a Td or TD command as this appears to be the way it's supposed to be handled according to the reference.
- Add another check to remove unnecessary whitespace by examining the previous string of text for a trailing space.
- $current_position_tm['x'] and $current_position_tm['y'] aren't tested for an initial false value, so initialize them to 0.
- Use double null bytes to implode the string in Font.php to avoid catching UTF-8 bytes that might include a single null byte.
- Base the text-box width guess on the current font size instead of a constant value of 4.

- In PageTest.php, PdfParser now correctly inserts a newline between Aardappelen and Zak, so change the test.
- In ParserTest.php, we are now removing unnecessary whitespace, including double spaces, so change this test to match the output. (Double spaces changed to single spaces)

* Add SmallPDF sample PDF and unit test

* Pass horizontal fontFactor to decodeText()

Pass the fontFactor value (font size multiplied by the text-matrix scale value) to decodeText() so it can more accurately separate text into words by position. A font factor divided by four seems to give the best results for both horizontal and vertical positioning, although this value is selected arbitrarily.

In Font::decodeText() some PDFs may have text-matrix scale values so large that directly encoded horizontal tabs are used in place of spaces. Remove internal tabs from texts and replace them with plain spaces.

* Update PDFObject.php

Restore the $offset parameter to public function getCommandsText() for backwards compatibility.

* moved new PDFs into special folder

adapted paths in tests

* split DocumentTest class

spit into a general one, one that is issue related and
one which is PDF generator related

* added 3 PDFs generated by different generators + tests

* More accounting for horizontal font factor

Account for the entire font-factor (font-size multiplied by the horizontal scaling factor of the text-matrix) when estimating the width of the current text block.

Insert a fix when decoding octal strings in Font::decodeOctal(), check further ahead for escaped backslashes.

Remove tests for images in DocumentGeneratorFocusTest.php. These also fail in the current v2.7.0 release and they should be looked at in a separate PR.

* Revert decodeOctal(), add @internal

Revert the change to Font::decodeOctal() as it's been superceded by PR #640.

Add @internal notes to formatContent() and parseDictionary().

* Proper use of PHPDoc "internal"

The `@internal` tag hides the content that comes _after_ it from the documentation, so adjust these comments as appropriate. See: https://manual.phpdoc.org/HTMLSmartyConverter/HandS/phpDocumentor/tutorial_tags.internal.pkg.html

* Disable a failing test

This test will succeed once PR #640 is merged. It doesn't have anything to do with the current PR, so disable it for now.

* Update PDFObject.php

Switch to tagging method for `@internal`. Adjust comments.

* Update PDFObject.php

PHP-CS-Fixer requires spaces between `@` statements I guess.

* Return empty array if no valid command detected

In some edge cases, the formatContent() method may return a document stream row containing an invalid command. Make sure we just ignore these commands instead of triggering warnings for missing $matches array elements.

* Update DocumentGeneratorFocusTest.php

Re-enable this assertion, now that we have merged #640.

* Make formatContent() private

Make the formatContent() method private to PDFObject so that `@internal` isn't required. Adjust the unit tests with `ReflectionMethod` to account for this.

---------

Co-authored-by: Konrad Abicht <hi@inspirito.de>
@GreyWyvern GreyWyvern deleted the better-octal branch February 14, 2024 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants