Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for two bugs related to Unicode translation support by Font objects #698

Merged
merged 20 commits into from
May 20, 2024

Conversation

unixnut
Copy link
Contributor

@unixnut unixnut commented Apr 5, 2024

Symptom was that some documents' contents was rendering as a bunch of control characters. These are the untranslated strings. This was happening because for two different reasons, these strings weren't being translated \Smalot\PdfParser\Font::decodeContent() in some circumstances.

First fix is to \Smalot\PdfParser\Font::loadTranslateTable():

  • Fixed bug where bfchar sections weren't loaded due to mistake in regexp.
  • It now uses * instead of + and thus supports translation tables with lines like <0000><0000>. (Required <0000> <0000> before.)

Second fix is for documents that attach their Font objects to the Pages object instead of each Page object:

  • \Smalot\PdfParser\Page now has a setFonts() method
  • \Smalot\PdfParser\Pages now declares its $fonts variable
  • \Smalot\PdfParser\Pages::getPages() now applies the object's fonts to each child Page
  • \Smalot\PdfParser\Pages::getFonts() copied from Page class

Type of pull request

  • Bug fix (involves code and configuration changes)

About

Checklist for code / configuration changes

In case you changed the code/configuration, please read each of the following checkboxes as they contain valuable information:

  • Please add at least one test case (unit test, system test, ...) to demonstrate that the change is working. If existing code was changed, your tests cover these code parts as well. [...]
  • Please run PHP-CS-Fixer before committing, to confirm with our coding styles. See https://github.com/smalot/pdfparser/blob/master/.php-cs-fixer.php for more information about our coding styles.

unixnut and others added 2 commits April 5, 2024 14:34
Symptom was that some documents' contents was rendering as a bunch of
control characters.  These are the untranslated strings.  This was
happening because for two different reasons, these strings weren't being
translated \Smalot\PdfParser\Font::decodeContent() in some circumstances.

First fix is to \Smalot\PdfParser\Font::loadTranslateTable():

  - Fixed bug where bfchar sections weren't loaded due to mistake in regexp.
  - It now uses `*` instead of `+` and thus supports translation tables with
    lines like `<0000><0000>`.  (Required `<0000> <0000>` before.)

Second fix is for documents that attach their Font objects to the Pages
object instead of each Page object:

  - \Smalot\PdfParser\Page now has a setFonts() method
  - \Smalot\PdfParser\Pages now declares its $fonts variable
  - \Smalot\PdfParser\Pages::getPages() now applies the object's fonts to each child Page
  - \Smalot\PdfParser\Pages::getFonts() copied from Page class
Added relative namespace for `ElementMissing`
Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unixnut thank you for your pull request! Just a few questions.

Besides, would you mind adding unit tests to cover these changes?

src/Smalot/PdfParser/Page.php Show resolved Hide resolved
src/Smalot/PdfParser/Page.php Show resolved Hide resolved
src/Smalot/PdfParser/Pages.php Outdated Show resolved Hide resolved
@unixnut
Copy link
Contributor Author

unixnut commented Apr 10, 2024

Unit tests have been requested by @k00ni but I lack the time to create them at this point. I think integration tests would be more useful given that the code deals with fonts attached to Pages objects, which only applies to some documents.

@GreyWyvern
Copy link
Contributor

The first "fix" is very easy to understand; just making a regexp more robust.

The second fix is more difficult to comprehend at first. For sure I would want a unit test to provide a "broken-before", "fixed-after" demonstration. Does it even fix a specific reported issue? Is there an example PDF we can see that shows the error?

@unixnut
Copy link
Contributor Author

unixnut commented Apr 18, 2024

@k00ni @GreyWyvern The PDFs that were experiencing this problem cannot be shared because of privacy reasons. I'm not sure how to generate an example document with fonts attached to the Pages object.

Can I please get some assistance in writing the integration tests for this issue?

k00ni added 2 commits April 19, 2024 12:42
new getFonts method not only
returned stored fonts but also built related fonts
list. With this changes its easier to test and the
replacement (setupFonts) only builds the fonts list.
    
Also refined some phpdoc comments.
@k00ni
Copy link
Collaborator

k00ni commented Apr 19, 2024

@unixnut sorry for the late response. I added two integration tests, but had to adapt Pages class a bit. Please have a look if its OK for you. If there are no objections, we are done here and ready to merge.

CC @GreyWyvern

@GreyWyvern
Copy link
Contributor

After some review of the added tests, it doesn't seem like anything is actually being tested. The tests are using mock documents and setting them up in such a way that the PR code added runs successfully. However, it doesn't appear to be solving any actual issue.

The first test function tests the expectation that the added function setFonts() is called exactly once, which it is, because of the newly added code. Yet this isn't an error condition that existed before this PR was created. The data supplied for testing in a unit test should be able to cause an error in the unmodified PdfParser. In this case, the test is just testing that the added code is running, and not that it's performing some kind of fix for an issue.

The second test adds a font to $page1, then performs a bunch of seemingly unrelated actions (comments in the code needed?) before testing whether $page1->getFonts() returns the font that was added at the top of the function. It seems this should always be true? When I remove lines 106-117 of PagesTest.php the assertion still passes. What is the rest of the code testing for? The comment at the top of the function also seems duplicated from the function above.

I would really prefer that there be a test document added for this change that clearly shows a "this was not working before the PR, and now it's working" conclusion. As it stands, I still don't know what (at least the second part of) this PR is actually doing.

@k00ni
Copy link
Collaborator

k00ni commented Apr 29, 2024

@unixnut we need your feedback here, please comment.

@GreyWyvern thanks for your feedback, my comments are below:

The first test function tests the expectation that the added function setFonts() is called exactly once, which it is, because of the newly added code. Yet this isn't an error condition that existed before this PR was created. The data supplied for testing in a unit test should be able to cause an error in the unmodified PdfParser. In this case, the test is just testing that the added code is running, and not that it's performing some kind of fix for an issue.

I based my work on the assumption, that there is a PDF which fails to parse without this changes, as @unixnut said here #698 (comment) . The PDF might be faulty in the first place, but I assume its valid and PDFParser is missing something for now.

Sorry if I am too picky here, but its an integration test not a unit test you are talking about. The reason is that I use a few actual classes together with a with mocks. Method setFonts is not called if $deep is false or Pages instance has no kids. I am a bit confused, without the new code this test would fail, because there is no setFonts function, or did I miss something in your comment?

About the last part of your comment, a test can be used for various things. One is to check if a functionality is implemented. It doesn't have to be a check for a fix.

The second test adds a font to $page1, then performs a bunch of seemingly unrelated actions (comments in the code needed?) before testing whether $page1->getFonts() returns the font that was added at the top of the function. It seems this should always be true? When I remove lines 106-117 of PagesTest.php the assertion still passes. What is the rest of the code testing for? The comment at the top of the function also seems duplicated from the function above.

Having thought about your comment and looking again on my code I see a few problems. The addition of a dummy class to pre-set fonts variable is not relevant here. It should be more about the conditions which lead to the setupFonts call (e.g. if $this->has('Kids') returns true or false, ref). I will propose something soon.

I would really prefer that there be a test document added for this change that clearly shows a "this was not working before the PR, and now it's working" conclusion. As it stands, I still don't know what (at least the second part of) this PR is actually doing.

@unixnut can you provide us the PDF in private?

@k00ni
Copy link
Collaborator

k00ni commented Apr 29, 2024

I've checked PagesTest again and added a few comments to make it more clear whats going on. Hope it helps, but if not, please comment.


I partly revise my previous comment about the test testPullRequest698DontOverride

First of all, PagesDummy was introduced to avoid the tedious work to setup fonts in Pages instance. Because of internal structure of PDFParser, you need to traverse over a couple of objects to get what you want. setFonts bypasses this.

If you remove line 106-117 the test still passes, because setupFonts works correctly. It must not override fonts of Page instances, if they are already set. Without its check the code in line 106-117 would lead to a replacement of these Font instances.

Having $font1 and $font2 to be an actual class and a mock is helpful in this case, because testing for equality leads to an error if they are not both of the same kind. The idea is to see if Page instances keep their fonts. I was thinking how to differentiate one from another, other than providing each one with a different bunch of dummy instances/mocks. I am glad I've found this, because it makes the test code leaner IMHO.

In the end all this code is obsolete if we can't confirm there is a PDF available which can not be parsed with the unfixed PDFParser.

@GreyWyvern
Copy link
Contributor

If you remove line 106-117 the test still passes, because setupFonts works correctly. It must not override fonts of Page instances, if they are already set. Without its check the code in line 106-117 would lead to a replacement of these Font instances.

Okay, this functionality makes sense. My main concern is that the setFonts() method is added by this PR, and the Integration tests require its use for testing. This means that the tests can't be back-ported to an older version of PdfParser to see if they fail there and succeed here; they just cause an error for a missing function.

Is there a way we can test for the correct handling of fonts between Page and Pages objects without using any of the added functions?

@k00ni
Copy link
Collaborator

k00ni commented May 6, 2024

Okay, this functionality makes sense. My main concern is that the setFonts() method is added by this PR, and the Integration tests require its use for testing. This means that the tests can't be back-ported to an older version of PdfParser to see if they fail there and succeed here; they just cause an error for a missing function.
Is there a way we can test for the correct handling of fonts between Page and Pages objects without using any of the added functions?

Good question. We could check if constructed instance(s) of Pages and Page contain correct Font instances, without explicitly use/check for setFonts. I have no complete plan right now, but will think about it.

To help constructing such a test, first, I would want know if there are tests which use the new functionality. I ran all tests but let the new method Page::setFonts threw an exception

public function setFonts($fonts)
{
    if (empty($this->fonts)) {
        if (0 < count($fonts)) { //                 <=====
            throw new Exception('fff');
        }

        $this->fonts = $fonts;
    }
}

and got the following list of "failing" tests:

  • DocumentGeneratorFocusTest::testGetTextPull634LibreOffice
  • PageTest::testExtractDecodedRawDataIssue450
  • PageTest::testGetDataTmIssue450
  • PagesTes1t::testPullRequest698DontOverride
  • ParserTest::testIssue202
  • ParserTest::testIssue557
  • ParserTest::testRetainImageContentImpact
  • RawDataParserTest::testDecodeObjectHeaderIssue405
  • RawDataParserTest::testGetXrefDataIssue673

A possible next step could be to check fonts in Pages + Page + the PDF itself. Then compare and see if the function works. Thats as far as I can go for now, but I hope I have further time soon.


Ping @unixnut

@unixnut
Copy link
Contributor Author

unixnut commented May 9, 2024

Hi, @k00ni @GreyWyvern . I cannot provide the PDFs to anyone because they contain private medical data and I'm under NDA.

The second change allows PDFParser to open files that store fonts differently to what the author expected. These files open fine with Poppler-based tools.

I don't know which tool is being used to generate these files. All I know is that these changes are essential to be able to open them. Any assistance from anyone able to generate PDF files with various structures would be helpful.

Just FYI, I'm hoping we can wrap this PR up soon so I can make one for the working PDF decryption code I have written.

I am fine with your changes mentioned above, @k00ni .

@k00ni
Copy link
Collaborator

k00ni commented May 10, 2024

@GreyWyvern what do you think about this small test?

https://github.com/smalot/pdfparser/pull/698/files#diff-b13743b1a6f35b028e4476142bf872adb07a012dd619fc42f722aff19bb0152aR150-R187

I used the PDF of one of the tests which were shown in my little experiment here: #698 (comment)

Therefore one can assume the Font instances were created in Pages and later on passed down to each Page instance. The test doesn't depend on the setFonts method explicitly, but I am not sure though, if it wholly covers the new functionality and is what you meant before.

@k00ni
Copy link
Collaborator

k00ni commented May 15, 2024

@GreyWyvern I would like to merge this.

You asked if there is a way to test for the correct handling of fonts between Page and Pages objects without using any of the added functions? Does my (minimal) example test help you in this regard?

@GreyWyvern
Copy link
Contributor

You asked if there is a way to test for the correct handling of fonts between Page and Pages objects without using any of the added functions? Does my (minimal) example test help you in this regard?

Sorry, I missed this. I don't think this really works because the test succeeds in the current version of PdfParser without this PR. Let me see if I can come up with something.

@GreyWyvern
Copy link
Contributor

GreyWyvern commented May 15, 2024

After looking at your original testPullRequest698DontOverride() it turns out I was wrong and this is actually almost what we need, @k00ni. The key change is that we shouldn't use the setFonts() method on the $page object (leave it empty) to be sure that the fonts are being passed from Pages to Page. We can use setFonts() on $pages because it's declared in the PagesDummy class earlier in the test file.

Here's what I came up with:

    public function testFontsArePassedFromPagesToPage(): void
    {
        // Create mock Document, Font and Page objects
        $document = $this->createMock(Document::class);
        $font1 = new Font($document);
        $page = new Page($document);

        // Create a Header object that indicates $page is a child
        $header = new Header([
            'Kids' => new ElementArray([
                $page,
            ]),
        ], $document);

        // Use this header to create a mock Pages object
        $pages = new PagesDummy($document, $header);

        // Apply $font1 as a Font object to this Pages object;
        // setFonts is used here as part of PagesDummy, only to access
        // the protected Pages::fonts variable; it is not a method
        // available in production
        $pages->setFonts([$font1]);

        // Trigger setupFonts method in $pages
        $pages->getPages(true);

        // Since the $page object font list is empty, $font1 from Pages
        // object must be passed to the Page object
        $this->assertEquals([$font1], $page->getFonts());

        // Create a second $font2 using a different method
        $font2 = $this->createMock(Font::class);

        // Update the fonts in $pages
        $pages->setFonts([$font1, $font2]);

        // Trigger setupFonts method in $pages
        $pages->getPages(true);

        // Now that $page already has a font, updates from $pages
        // should not overwrite it
        $this->assertEquals([$font1], $page->getFonts());
    }

At the first assertEquals() call, without the added code from this PR, $page->getFonts() returns an empty array. With the added code from the PR, the test succeeds.

Since in the current version of PdfParser fonts from Pages aren't passed from Pages to child Page objects at all, testing that existing fonts in Page objects aren't overwritten by those from the parent Pages object is not something we can say "wasn't working before and now it is" because it's essentially a new feature. The second assertEquals() call tests that updated fonts from Pages don't overwrite existing fonts in child Page objects.

This should be the only unit test needed.

@k00ni
Copy link
Collaborator

k00ni commented May 16, 2024

@GreyWyvern Thanks, yes that's better. I deployed your test and I think we have it. Any objections left or can we merge?

@k00ni k00ni merged commit 4b86c66 into smalot:master May 20, 2024
29 checks passed
@k00ni
Copy link
Collaborator

k00ni commented May 20, 2024

Thank you for your effort and time here @unixnut @GreyWyvern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants