Fix for two bugs related to Unicode translation support by Font objects #698
Symptom: the contents of some documents were rendering as a bunch of control characters. These are the untranslated strings. For two different reasons, these strings weren't being translated by \Smalot\PdfParser\Font::decodeContent() in some circumstances.

First fix is to \Smalot\PdfParser\Font::loadTranslateTable():
- Fixed a bug where bfchar sections weren't loaded due to a mistake in the regexp.
- It now uses `*` instead of `+` and thus supports translation tables with lines like `<0000><0000>`. (Previously `<0000> <0000>` was required.)

Second fix is for documents that attach their Font objects to the Pages object instead of each Page object:
- \Smalot\PdfParser\Page now has a setFonts() method
- \Smalot\PdfParser\Pages now declares its $fonts variable
- \Smalot\PdfParser\Pages::getPages() now applies the object's fonts to each child Page
- \Smalot\PdfParser\Pages::getFonts() copied from Page class
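To illustrate the `*` versus `+` point above, here is a minimal sketch. The two patterns are simplified, hypothetical stand-ins and are not copied from PdfParser; they only show why a quantifier that demands at least one space skips a line such as `<0000><0000>`.

```php
<?php
// Hypothetical, simplified patterns for illustration only;
// the real expressions in Font::loadTranslateTable() may differ.
$strict  = '/<([0-9A-F]+)> +<([0-9A-F]+)>/i'; // needs at least one space between the pair
$lenient = '/<([0-9A-F]+)> *<([0-9A-F]+)>/i'; // the space is optional

// Made-up bfchar-style section: one pair without a space, one with.
$section = "<0000><0000>\n<0041> <0061>\n";

$strictCount  = preg_match_all($strict, $section, $strictMatches);
$lenientCount = preg_match_all($lenient, $section, $lenientMatches);

echo "strict:  {$strictCount} pair(s)\n";
echo "lenient: {$lenientCount} pair(s)\n";
```

With the strict pattern the `<0000><0000>` pair is skipped, so any string that relies on that mapping would stay untranslated.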
Added relative namespace for `ElementMissing`
@unixnut thank you for your pull request! Just a few questions.
Besides, would you mind adding unit tests to cover these changes?
Co-authored-by: Konrad Abicht <hi@inspirito.de>
Unit tests have been requested by @k00ni but I lack the time to create them at this point. I think integration tests would be more useful given that the code deals with fonts attached to Pages objects, which only applies to some documents.
Requested by @k00ni
The first "fix" is very easy to understand; just making a regexp more robust. The second fix is more difficult to comprehend at first. For sure I would want a unit test to provide a "broken-before", "fixed-after" demonstration. Does it even fix a specific reported issue? Is there an example PDF we can see that shows the error?
@k00ni @GreyWyvern The PDFs that were experiencing this problem cannot be shared for privacy reasons. I'm not sure how to generate an example document with fonts attached to the Pages object. Can I please get some assistance in writing the integration tests for this issue?
The new getFonts method not only returned stored fonts but also built the related fonts list. With this change it's easier to test, and the replacement (setupFonts) only builds the fonts list. Also refined some phpdoc comments.
@unixnut sorry for the late response. I added two integration tests, but had to adapt the Pages class a bit. Please have a look to see if it's OK for you. If there are no objections, we are done here and ready to merge. CC @GreyWyvern
After some review of the added tests, it doesn't seem like anything is actually being tested. The tests are using mock documents and setting them up in such a way that the PR code added runs successfully. However, it doesn't appear to be solving any actual issue.

The first test function tests the expectation that the added function The second test adds a font to

I would really prefer that there be a test document added for this change that clearly shows a "this was not working before the PR, and now it's working" conclusion. As it stands, I still don't know what (at least the second part of) this PR is actually doing.
@unixnut we need your feedback here, please comment. @GreyWyvern thanks for your feedback, my comments are below:
I based my work on the assumption that there is a PDF which fails to parse without these changes, as @unixnut said here #698 (comment). The PDF might be faulty in the first place, but I assume it's valid and PDFParser is missing something for now. Sorry if I am too picky here, but it's an integration test, not a unit test, you are talking about. The reason is that I use a few actual classes together with mocks. Method About the last part of your comment, a test can be used for various things. One is to check if a functionality is implemented. It doesn't have to be a check for a fix.
@unixnut can you provide us with the PDF in private?
I've checked I partly revise my previous comment about the test testPullRequest698DontOverride First of all, If you remove lines 106-117 the test still passes, because Having In the end all this code is obsolete if we can't confirm there is a PDF available which cannot be parsed with the unfixed PDFParser.
Okay, this functionality makes sense. My main concern is that the Is there a way we can test for the correct handling of fonts between Page and Pages objects without using any of the added functions?
Good question. We could check if constructed instance(s) of

To help construct such a test, I would first want to know whether any tests use the new functionality. I ran all tests with the new method modified as follows:

    public function setFonts($fonts)
    {
        if (empty($this->fonts)) {
            if (0 < count($fonts)) { // <=====
                throw new Exception('fff');
            }
            $this->fonts = $fonts;
        }
    }

and got the following list of "failing" tests:
A possible next step could be to check fonts in Ping @unixnut
Hi, @k00ni @GreyWyvern. I cannot provide the PDFs to anyone because they contain private medical data and I'm under NDA. The second change allows PDFParser to open files that store fonts differently to what the author expected. These files open fine with Poppler-based tools. I don't know which tool is being used to generate these files. All I know is that these changes are essential to be able to open them. Any assistance from anyone able to generate PDF files with various structures would be helpful. Just FYI, I'm hoping we can wrap this PR up soon so I can make one for the working PDF decryption code I have written. I am fine with your changes mentioned above, @k00ni.
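For readers following the second change, here is a minimal sketch of the inheritance rule discussed in this thread: fonts held by a Pages node are handed to each child Page, but only when that Page has no fonts of its own. `SketchPage` and `SketchPages` are hypothetical stand-ins written for illustration, and fonts are represented as plain strings; this is not the PdfParser implementation.

```php
<?php
// Hypothetical stand-ins for illustration; not the PdfParser classes.
final class SketchPage
{
    /** @var string[] font names held by this page */
    private $fonts = [];

    /** Accept fonts from a parent only if the page has none yet. */
    public function setFonts(array $fonts): void
    {
        if (empty($this->fonts)) {
            $this->fonts = $fonts;
        }
    }

    public function getFonts(): array
    {
        return $this->fonts;
    }
}

final class SketchPages
{
    /** @var SketchPage[] child pages */
    private $kids;

    /** @var string[] fonts attached to the Pages node itself */
    private $fonts;

    public function __construct(array $kids, array $fonts)
    {
        $this->kids = $kids;
        $this->fonts = $fonts;
    }

    /** Offer the node's fonts to every child, then return the children. */
    public function getPages(): array
    {
        foreach ($this->kids as $kid) {
            $kid->setFonts($this->fonts);
        }

        return $this->kids;
    }
}

$bare = new SketchPage();      // page without fonts of its own
$own = new SketchPage();
$own->setFonts(['F2']);        // page that already carries a font

$pages = new SketchPages([$bare, $own], ['F1']);
$pages->getPages();

print_r($bare->getFonts());    // inherited from the Pages node
print_r($own->getFonts());     // kept as it was
```

The `empty()` guard mirrors the check visible in the `setFonts()` snippet quoted earlier in the conversation; everything else is assumed for the sake of a runnable example.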
@GreyWyvern what do you think about this small test? I used the PDF of one of the tests which were shown in my little experiment here: #698 (comment) Therefore one can assume the
@GreyWyvern I would like to merge this. You asked if there is a way to test for the correct handling of fonts between Page and Pages objects without using any of the added functions. Does my (minimal) example test help you in this regard?
Sorry, I missed this. I don't think this really works because the test succeeds in the current version of PdfParser without this PR. Let me see if I can come up with something.
After looking at your original Here's what I came up with:

    public function testFontsArePassedFromPagesToPage(): void
    {
        // Create a mock Document, plus real Font and Page objects
        $document = $this->createMock(Document::class);
        $font1 = new Font($document);
        $page = new Page($document);

        // Create a Header object that indicates $page is a child
        $header = new Header([
            'Kids' => new ElementArray([
                $page,
            ]),
        ], $document);

        // Use this header to create a PagesDummy object
        $pages = new PagesDummy($document, $header);

        // Apply $font1 as a Font object to this Pages object;
        // setFonts is used here as part of PagesDummy, only to access
        // the protected Pages::fonts variable; it is not a method
        // available in production
        $pages->setFonts([$font1]);

        // Trigger setupFonts method in $pages
        $pages->getPages(true);

        // Since the $page object font list is empty, $font1 from Pages
        // object must be passed to the Page object
        $this->assertEquals([$font1], $page->getFonts());

        // Create a second $font2 using a different method
        $font2 = $this->createMock(Font::class);

        // Update the fonts in $pages
        $pages->setFonts([$font1, $font2]);

        // Trigger setupFonts method in $pages
        $pages->getPages(true);

        // Now that $page already has a font, updates from $pages
        // should not overwrite it
        $this->assertEquals([$font1], $page->getFonts());
    }

At the first Since in the current version of PdfParser fonts aren't passed from Pages to child Page objects at all, testing that existing fonts in Page objects aren't overwritten by those from the parent Pages object is not something we can say "wasn't working before and now it is" because it's essentially a new feature. The second This should be the only unit test needed.
@GreyWyvern Thanks, yes that's better. I deployed your test and I think we have it. Any objections left or can we merge?
Thank you for your effort and time here @unixnut @GreyWyvern.