Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read XMP Metadata and add it to data returned by getDetails() #606

Merged
merged 19 commits into from
Jul 4, 2023
Merged

Read XMP Metadata and add it to data returned by getDetails() #606

merged 19 commits into from
Jul 4, 2023

Conversation

GreyWyvern
Copy link
Contributor

Try to get XMP Metadata from the PDF document, and if successful, prefer it over any decoded header "details" we may find.

As an example, here's a PDF where the FlateDecode'd header data results in Title and Creator "details" values that contain invalid UTF-8 characters: https://evertz.com/resources/press/Studer-Vista-X-Mixbus-At-Eurovision.pdf

The PDF however also contains an XMP Metadata block where the Title and Creator values are encoded properly. If it exists, we should prefer that over gzuncompress()'ed data which may potentially have invalid UTF-8 characters.

Try to get XMP Metadata from the PDF document, and if successful, prefer it over any decoded "details" we may find.
Don't need $index in this loop.
Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GreyWyvern Thank you for this pull request!

I made some remarks. Also, please add one or two tests which demonstrate that your code is working properly. The coding style issues can be fixed by running PHP-CS-Fixer locally (https://github.com/smalot/pdfparser/blob/master/doc/Developer.md#php-cs-fixer).

src/Smalot/PdfParser/Parser.php Outdated Show resolved Hide resolved
src/Smalot/PdfParser/Document.php Outdated Show resolved Hide resolved
@k00ni
Copy link
Collaborator

k00ni commented Jun 23, 2023

There should also be a remark in our documentation (maybe in https://github.com/smalot/pdfparser/blob/master/doc/Usage.md#extract-metadata?) about these new meta data and how to get them.

Change function name from getXMPMetadata to extractXMPMetadata, returns void.
@GreyWyvern
Copy link
Contributor Author

@GreyWyvern Thank you for this pull request!

I made some remarks. Also, please add one or two tests which demonstrate that your code is working properly. The coding style issues can be fixed by running PHP-CS-Fixer locally (https://github.com/smalot/pdfparser/blob/master/doc/Developer.md#php-cs-fixer).

I'm sorry, I'm having a rough time running the tests with PHPUnit as I'm developing on Windows. I keep getting the error:

PHP Fatal error: Uncaught Error: Class "PHPUnitTests\TestCase" not found in C:\Users...\Documents\GitHub\pdfparser\tests\PHPUnit\Integration\ConfigTest.php:41

I've made a small example PDF that demonstrates the XMP-reading vs. no XMP-reading difference very well, so it should not be too difficult.

@GreyWyvern
Copy link
Contributor Author

There should also be a remark in our documentation (maybe in https://github.com/smalot/pdfparser/blob/master/doc/Usage.md#extract-metadata?) about these new meta data and how to get them.

This metadata is gotten in the same way as before, by using getDetails(). If any XMP data was found, it will over-write (via array_merge) any header values with the same array keys obtained by regular extraction or gzuncompress().

Add XMP_Metadata.pdf sample file.
Change name of ModifyDate to ModDate to match the array key name already used by PdfParser.
@k00ni
Copy link
Collaborator

k00ni commented Jun 25, 2023

I'm sorry, I'm having a rough time running the tests with PHPUnit as I'm developing on Windows. I keep getting the error:

PHP Fatal error: Uncaught Error: Class "PHPUnitTests\TestCase" not found in C:\Users...\Documents\GitHub\pdfparser\tests\PHPUnit\Integration\ConfigTest.php:41

It looks like your auto loading isn't working. You ran composer update? Please check https://github.com/smalot/pdfparser/blob/master/.github/workflows/continuous-integration.yml#L203 (especially this line). These are our Windows related tests.

On second thought, remove testing for the Registered Trademark symbol, as it is an encodable ISO-8851-1 glyph (00AE) and proper reading of it might be fixed in the rest of PdfParser eventually.
@GreyWyvern
Copy link
Contributor Author

I have added the Unit Test. One thing I noticed is that while the XMP Metadata is encoded in UTF-8, the xref data is encoded in CP1252 which means the Registered Sign (U+00AE) glyph from the XMP_Metadata.pdf is decodable with mb_convert_encoding($str, 'UTF-8', 'CP1252') or displaying the content on a page with charset CP1252, however high-level UTF-8 glyphs are not encodable in CP1252 and cannot be decoded properly.

Might be something to note for xref decoding problems you might have in the Issues list?

@k00ni
Copy link
Collaborator

k00ni commented Jun 27, 2023

Might be something to note for xref decoding problems you might have in the Issues list?

Good catch! Could you make a small reference in (some of) these issues?

The pull request looks good now, after remaining coding style issues (importing a class which is not used) are fixed, we can merge it.

@GreyWyvern
Copy link
Contributor Author

GreyWyvern commented Jun 27, 2023

Good catch! Could you make a small reference in (some of) these issues?

Sure. Where would you like me to do that?

I've done some more studying about this and now I'm not so precisely sure what character encoding Adobe is using to store encoded Properties values, only that the characters it destroys are all in the 32 character range where ISO-8859-1 differs from CP1252. CP1252 actually has a code point for Right Single Quotation mark at 0x92, but Adobe saves it as 0x90 which is an empty code point in CP1252. I'm not sure if this is a bug, or if one of Adobe's special encodings map to this. No other character set seems to: https://www.fileformat.info/info/unicode/char/2019/charset_support.htm

FWIW, WinAnsiEncoding does map to CP1252.

Edit: Here is the mapping Adobe is using when saving encoded Properties, if you can find it useful somehow. The second code point is the CP1252 code point the Adobe character is translated to.

  • CP1252 -> Adobe ?
  • € (0x80) -> (0x20) Space
  • (0x81) -> € (0x80)
  • ‚ (0x82) -> ‘ (0x91)
  • ƒ (0x83) -> † (0x86)
  • „ (0x84) -> Œ (0x8c)
  • … (0x85) -> ƒ (0x83)
  • † (0x86) -> (0x81)
  • ‡ (0x87) -> ‚ (0x82)
  • ˆ (0x88) -> (0x1a)
  • ‰ (0x89) -> ‹ (0x8b)
  • Š (0x8a) -> — (0x97)
  • ‹ (0x8b) -> ˆ (0x88)
  • Œ (0x8c) -> – (0x96)
  • (0x8d) -> € (0x80)
  • Ž (0x8e) -> ™ (0x99)
  • (0x8f) -> € (0x80)
  • (0x90) -> € (0x80)
  • ‘ (0x91) -> (0x8f)
  • ’ (0x92) -> (0x90)
  • “ (0x93) -> (0x8d)
  • ” (0x94) -> Ž (0x8e)
  • • (0x95) -> € (0x80)
  • – (0x96) -> … (0x85)
  • — (0x97) -> „ (0x84)
  • ˜ (0x98) -> (0x1f)
  • ™ (0x99) -> ’ (0x92)
  • š (0x9a) -> (0x9d)
  • › (0x9b) -> ‰ (0x89)
  • œ (0x9c) -> œ (0x9c)
  • (0x9d) -> € (0x80)
  • ž (0x9e) -> ž (0x9e)
  • Ÿ (0x9f) -> ˜ (0x98)

Likely several of the already reported issues fall under this. For example, this issue: #585 the troublesome character is — (0x97). Although I'm not sure how you would fix this in getText().

Luckily the XMP Metadata is always encoded in UTF-8 as per standard, so it can be relied upon.

The pull request looks good now, after remaining coding style issues (importing a class which is not used) are fixed, we can merge it.

Oh, yes, haha. I noticed PHP-CS-Fixer flagged that, but my edits didn't touch Encoding.php so I didn't commit it.

@GreyWyvern
Copy link
Contributor Author

GreyWyvern commented Jun 27, 2023

Hey hey! Outside the scope of this PR, but I believe I found it: https://github.com/maxwell-bland/pdf-latin-text-encodings

Apparently PdfParser is missing an encoding PDFDocEncoding, which the github readme file above describes it as:

Encoding for text strings in a PDF document outside the document's content streams.

Adobe has a class for it, so it's a real thing: https://developer.adobe.com/experience-manager/reference-materials/cloud-service/javadoc/com/adobe/internal/pdftoolkit/core/util/PDFDocEncoding.html

The github above has a JSON file with all the characters from the set: https://github.com/maxwell-bland/pdf-latin-text-encodings/blob/main/pdf-encoding.json and it matches what I see. (eg. 0x92 gets translated to 0x90 and 0x90 is a Right Single Quotation Mark in PDFDocEncoding)

Maybe we can use this to create a new Encoding file specifically for Document Properties.

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the liberty to fix remaining coding style and PHPStan issues.

Sure. Where would you like me to do that?

I assumed that you had particular issues in mind. We have a lot of encoding and white space related issues, so every bit of help there is appreciated.

Lets move the discussion about PDFDocEncoding to #609, so we can keep this pull request clean.

Thank you for your work! I will leave this PR open for a few days. If there are no objections/feedback I will merge it.

@GreyWyvern
Copy link
Contributor Author

One caveat that I've just discovered, I don't think it's a blocker: When saving a PDF with properties, Adobe uses the contents of the Subject field to populate the dc:description XMP metadata element, even though there is a dc:subject element in the XMP specifications. This means that $details['Subject'] will retain the original decoded Subject, while the XMP 'Subject' will be in $details['Description'].

I think in the future I could refine this so that XMP values are represented in their own namespace within the getDetails() response. I assumed it was a 1:1 relationship, it is almost but not quite.

@k00ni k00ni self-assigned this Jun 29, 2023
@GreyWyvern
Copy link
Contributor Author

GreyWyvern commented Jun 29, 2023

@k00ni, if possible, I would like to submit some new code for this PR before you merge it. More and more I feel like overwriting the regular decoded getDetails() values is the wrong idea, and XMP data should be distinguished from it. In some cases, XMP data can even conflict with itself! Such as when there are pdf:creator and dc:creator elements in the same XMP block. You can't just assign them all to $details['Creator'].

I propose adding XMP data to the getDetails() array using the lowercased and namespaced element name. So it would look like this:

Array
(
    [Creator] => Adobe Acrobat
    [Producer] => Adobe Acrobat
    [CreatedOn] => 2022-01-28T16:36:11+00:00
    [Pages] => 35
    [dc:creator] => My Name
    [pdf:producer] => Adobe Acrobat
    [dc:title] => My Document Title
    ...
)

Let me know and I'll push the commit I've created.

@k00ni
Copy link
Collaborator

k00ni commented Jun 30, 2023

Sure, take the time you need to polish the pull request. No rush.

Instead of overwriting various values from the getDetails() metadata array, add all collected XMP values as new array keys matching the lowercased and namespaced XML element names.
Make sure the XMP metadata block is referring to a PDF as some programs will add XMP blocks for included assets not of type application/pdf.
Better explanation of XMP and flow.
Only do this for strings. If it's a single element list but the value is an array, leave it so it makes better sense.
Resolve suggested changes.
@GreyWyvern GreyWyvern changed the title Read XMP Metadata in place of decoded header 'details' Read XMP Metadata and add it to data returned by getDetails() Jul 3, 2023
@k00ni
Copy link
Collaborator

k00ni commented Jul 4, 2023

I've added your explanation to Document.php and fixed remaining coding style issues. Declined PHP-CS-Fixer to use rule modernize_strpos, because it replaces strpos with str_starts_with, which is only available since PHP 8.0 (ref).

@GreyWyvern
Copy link
Contributor Author

Looks good, thanks! :)

After this, I'll have another PR that decodes the regular metadata with PDFDocEncoding, which is why I removed some tests related to that from DocumentTest.php.

@k00ni
Copy link
Collaborator

k00ni commented Jul 4, 2023

Can I merge it now?

@GreyWyvern
Copy link
Contributor Author

Can I merge it now?

Sure, go ahead!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants