-
Notifications
You must be signed in to change notification settings - Fork 540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read XMP Metadata and add it to data returned by getDetails() #606
Conversation
Try to get XMP Metadata from the PDF document, and if successful, prefer it over any decoded "details" we may find.
Don't need $index in this loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GreyWyvern Thank you for this pull request!
I made some remarks. Also, please add one or two tests which demonstrate that your code is working properly. The coding style issues can be fixed by running PHP-CS-Fixer locally (https://github.com/smalot/pdfparser/blob/master/doc/Developer.md#php-cs-fixer).
There should also be a remark in our documentation (maybe in https://github.com/smalot/pdfparser/blob/master/doc/Usage.md#extract-metadata?) about these new meta data and how to get them. |
Change function name from getXMPMetadata to extractXMPMetadata, returns void.
I'm sorry, I'm having a rough time running the tests with PHPUnit as I'm developing on Windows. I keep getting the error: PHP Fatal error: Uncaught Error: Class "PHPUnitTests\TestCase" not found in C:\Users...\Documents\GitHub\pdfparser\tests\PHPUnit\Integration\ConfigTest.php:41 I've made a small example PDF that demonstrates the XMP-reading vs. no XMP-reading difference very well, so it should not be too difficult. |
This metadata is gotten in the same way as before, by using getDetails(). If any XMP data was found, it will over-write (via array_merge) any header values with the same array keys obtained by regular extraction or gzuncompress(). |
Add XMP_Metadata.pdf sample file. Change name of ModifyDate to ModDate to match the array key name already used by PdfParser.
It looks like your auto loading isn't working. You ran |
On second thought, remove testing for the Registered Trademark symbol, as it is an encodable ISO-8851-1 glyph (00AE) and proper reading of it might be fixed in the rest of PdfParser eventually.
I have added the Unit Test. One thing I noticed is that while the XMP Metadata is encoded in UTF-8, the xref data is encoded in CP1252 which means the Registered Sign (U+00AE) glyph from the XMP_Metadata.pdf is decodable with mb_convert_encoding($str, 'UTF-8', 'CP1252') or displaying the content on a page with charset CP1252, however high-level UTF-8 glyphs are not encodable in CP1252 and cannot be decoded properly. Might be something to note for xref decoding problems you might have in the Issues list? |
Good catch! Could you make a small reference in (some of) these issues? The pull request looks good now, after remaining coding style issues (importing a class which is not used) are fixed, we can merge it. |
Sure. Where would you like me to do that? I've done some more studying about this and now I'm not so precisely sure what character encoding Adobe is using to store encoded Properties values, only that the characters it destroys are all in the 32 character range where ISO-8859-1 differs from CP1252. CP1252 actually has a code point for Right Single Quotation mark at 0x92, but Adobe saves it as 0x90 which is an empty code point in CP1252. I'm not sure if this is a bug, or if one of Adobe's special encodings map to this. No other character set seems to: https://www.fileformat.info/info/unicode/char/2019/charset_support.htm FWIW, WinAnsiEncoding does map to CP1252. Edit: Here is the mapping Adobe is using when saving encoded Properties, if you can find it useful somehow. The second code point is the CP1252 code point the Adobe character is translated to.
Likely several of the already reported issues fall under this. For example, this issue: #585 the troublesome character is — (0x97). Although I'm not sure how you would fix this in getText(). Luckily the XMP Metadata is always encoded in UTF-8 as per standard, so it can be relied upon.
Oh, yes, haha. I noticed PHP-CS-Fixer flagged that, but my edits didn't touch Encoding.php so I didn't commit it. |
Hey hey! Outside the scope of this PR, but I believe I found it: https://github.com/maxwell-bland/pdf-latin-text-encodings Apparently PdfParser is missing an encoding PDFDocEncoding, which the github readme file above describes it as:
Adobe has a class for it, so it's a real thing: https://developer.adobe.com/experience-manager/reference-materials/cloud-service/javadoc/com/adobe/internal/pdftoolkit/core/util/PDFDocEncoding.html The github above has a JSON file with all the characters from the set: https://github.com/maxwell-bland/pdf-latin-text-encodings/blob/main/pdf-encoding.json and it matches what I see. (eg. 0x92 gets translated to 0x90 and 0x90 is a Right Single Quotation Mark in PDFDocEncoding) Maybe we can use this to create a new Encoding file specifically for Document Properties. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the liberty to fix remaining coding style and PHPStan issues.
Sure. Where would you like me to do that?
I assumed that you had particular issues in mind. We have a lot of encoding and white space related issues, so every bit of help there is appreciated.
Lets move the discussion about PDFDocEncoding to #609, so we can keep this pull request clean.
Thank you for your work! I will leave this PR open for a few days. If there are no objections/feedback I will merge it.
One caveat that I've just discovered, I don't think it's a blocker: When saving a PDF with properties, Adobe uses the contents of the Subject field to populate the dc:description XMP metadata element, even though there is a dc:subject element in the XMP specifications. This means that $details['Subject'] will retain the original decoded Subject, while the XMP 'Subject' will be in $details['Description']. I think in the future I could refine this so that XMP values are represented in their own namespace within the getDetails() response. I assumed it was a 1:1 relationship, it is almost but not quite. |
@k00ni, if possible, I would like to submit some new code for this PR before you merge it. More and more I feel like overwriting the regular decoded getDetails() values is the wrong idea, and XMP data should be distinguished from it. In some cases, XMP data can even conflict with itself! Such as when there are pdf:creator and dc:creator elements in the same XMP block. You can't just assign them all to $details['Creator']. I propose adding XMP data to the getDetails() array using the lowercased and namespaced element name. So it would look like this: Array
(
[Creator] => Adobe Acrobat
[Producer] => Adobe Acrobat
[CreatedOn] => 2022-01-28T16:36:11+00:00
[Pages] => 35
[dc:creator] => My Name
[pdf:producer] => Adobe Acrobat
[dc:title] => My Document Title
...
) Let me know and I'll push the commit I've created. |
Sure, take the time you need to polish the pull request. No rush. |
Instead of overwriting various values from the getDetails() metadata array, add all collected XMP values as new array keys matching the lowercased and namespaced XML element names.
Make sure the XMP metadata block is referring to a PDF as some programs will add XMP blocks for included assets not of type application/pdf.
Better explanation of XMP and flow.
Only do this for strings. If it's a single element list but the value is an array, leave it so it makes better sense.
Resolve suggested changes.
I've added your explanation to Document.php and fixed remaining coding style issues. Declined PHP-CS-Fixer to use rule |
Looks good, thanks! :) After this, I'll have another PR that decodes the regular metadata with PDFDocEncoding, which is why I removed some tests related to that from DocumentTest.php. |
Can I merge it now? |
Sure, go ahead! |
Try to get XMP Metadata from the PDF document, and if successful, prefer it over any decoded header "details" we may find.
As an example, here's a PDF where the FlateDecode'd header data results in Title and Creator "details" values that contain invalid UTF-8 characters: https://evertz.com/resources/press/Studer-Vista-X-Mixbus-At-Eurovision.pdf
The PDF however also contains an XMP Metadata block where the Title and Creator values are encoded properly. If it exists, we should prefer that over gzuncompress()'ed data which may potentially have invalid UTF-8 characters.