Read XMP Metadata and add it to data returned by getDetails() #606

GreyWyvern · 2023-06-22T19:28:32Z

Try to get XMP Metadata from the PDF document, and if successful, prefer it over any decoded header "details" we may find.

As an example, here's a PDF where the FlateDecode'd header data results in Title and Creator "details" values that contain invalid UTF-8 characters: https://evertz.com/resources/press/Studer-Vista-X-Mixbus-At-Eurovision.pdf

The PDF, however, also contains an XMP Metadata block in which the Title and Creator values are encoded properly. If that block exists, we should prefer it over gzuncompress()'ed data, which may contain invalid UTF-8 characters.

k00ni

@GreyWyvern Thank you for this pull request!

I made some remarks. Also, please add one or two tests which demonstrate that your code is working properly. The coding style issues can be fixed by running PHP-CS-Fixer locally (https://github.com/smalot/pdfparser/blob/master/doc/Developer.md#php-cs-fixer).

k00ni · 2023-06-23T08:06:15Z

There should also be a remark in our documentation (maybe in https://github.com/smalot/pdfparser/blob/master/doc/Usage.md#extract-metadata?) about this new metadata and how to get it.

GreyWyvern · 2023-06-23T15:23:59Z

> @GreyWyvern Thank you for this pull request!
>
> I made some remarks. Also, please add one or two tests which demonstrate that your code is working properly. The coding style issues can be fixed by running PHP-CS-Fixer locally (https://github.com/smalot/pdfparser/blob/master/doc/Developer.md#php-cs-fixer).

I'm sorry, I'm having a rough time running the tests with PHPUnit as I'm developing on Windows. I keep getting the error:

PHP Fatal error: Uncaught Error: Class "PHPUnitTests\TestCase" not found in C:\Users...\Documents\GitHub\pdfparser\tests\PHPUnit\Integration\ConfigTest.php:41

I've made a small example PDF that demonstrates the XMP-reading vs. no XMP-reading difference very well, so it should not be too difficult.

GreyWyvern · 2023-06-23T15:31:35Z

> There should also be a remark in our documentation (maybe in https://github.com/smalot/pdfparser/blob/master/doc/Usage.md#extract-metadata?) about this new metadata and how to get it.

This metadata is retrieved in the same way as before, by using getDetails(). If any XMP data was found, it will overwrite (via array_merge) any header values with the same array keys obtained by regular extraction or gzuncompress().
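A minimal sketch of the overwrite behaviour described in this comment, using plain arrays rather than the library itself. The variable names and sample values are hypothetical; the only thing relied upon is that array_merge() lets later string keys replace earlier ones.

```php
<?php
// Hypothetical values: details decoded from the PDF header, where the
// Title contains a byte that is not valid UTF-8.
$headerDetails = [
    'Title' => "Broken \x92 title",
    'Pages' => 2,
];

// Hypothetical values read from the XMP block, already valid UTF-8.
$xmpDetails = [
    'Title' => 'Clean title',
];

// For string keys, array_merge() keeps the value from the later array,
// so the XMP Title replaces the header Title while Pages is untouched.
$details = array_merge($headerDetails, $xmpDetails);

print_r($details);
```

Keys that exist only in the header array survive the merge, which is why getDetails() callers would not need to change anything to pick up the XMP values.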

k00ni · 2023-06-25T08:52:14Z

> I'm sorry, I'm having a rough time running the tests with PHPUnit as I'm developing on Windows. I keep getting the error:
>
> PHP Fatal error: Uncaught Error: Class "PHPUnitTests\TestCase" not found in C:\Users...\Documents\GitHub\pdfparser\tests\PHPUnit\Integration\ConfigTest.php:41

It looks like your autoloading isn't working. Did you run composer update? Please check https://github.com/smalot/pdfparser/blob/master/.github/workflows/continuous-integration.yml#L203 (especially this line). These are our Windows-related tests.

GreyWyvern · 2023-06-26T15:24:04Z

I have added the unit test. One thing I noticed: while the XMP Metadata is encoded in UTF-8, the xref data is encoded in CP1252. The Registered Sign (U+00AE) glyph from XMP_Metadata.pdf can therefore be decoded with mb_convert_encoding($str, 'UTF-8', 'CP1252'), or by displaying the content on a page with charset CP1252. Characters that have no code point in CP1252, however, cannot be stored in that encoding and so cannot be decoded properly.

Might be something to note for xref decoding problems you might have in the Issues list?
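The conversion mentioned above can be tried in isolation. This is a small sketch, assuming the mbstring extension is loaded; the sample byte strings are made up for illustration and do not come from XMP_Metadata.pdf.

```php
<?php
// A CP1252-encoded string: 0xAE is the Registered Sign in CP1252.
$raw = "Registered\xAE";

// Convert from CP1252 to UTF-8 as suggested in the comment.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'CP1252');

// In CP1252, 0x92 is the Right Single Quotation Mark (U+2019).
$quote = mb_convert_encoding("\x92", 'UTF-8', 'CP1252');

echo bin2hex($utf8), "\n";
echo bin2hex($quote), "\n";
```

The conversion only works in this direction for bytes that CP1252 defines; a character with no CP1252 code point could never have been stored in such a string in the first place.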

k00ni · 2023-06-27T06:47:37Z

> Might be something to note for xref decoding problems you might have in the Issues list?

Good catch! Could you make a small reference in (some of) these issues?

The pull request looks good now. Once the remaining coding style issues (importing a class which is not used) are fixed, we can merge it.

GreyWyvern · 2023-06-27T15:31:30Z

> Good catch! Could you make a small reference in (some of) these issues?

Sure. Where would you like me to do that?

I've studied this some more, and I'm no longer sure exactly which character encoding Adobe uses to store encoded Properties values, only that the characters it destroys are all in the 32-character range (0x80 to 0x9F) where ISO-8859-1 differs from CP1252. CP1252 actually has a code point for Right Single Quotation Mark at 0x92, but Adobe saves it as 0x90, which is an empty code point in CP1252. I'm not sure if this is a bug, or if one of Adobe's special encodings maps to this. No other character set seems to: https://www.fileformat.info/info/unicode/char/2019/charset_support.htm

FWIW, WinAnsiEncoding does map to CP1252.

Edit: Here is the mapping Adobe is using when saving encoded Properties, in case you find it useful. The second code point is the CP1252 code point the Adobe character is translated to.

CP1252 -> Adobe ?
€ (0x80) -> (0x20) Space
(0x81) -> € (0x80)
‚ (0x82) -> ‘ (0x91)
ƒ (0x83) -> † (0x86)
„ (0x84) -> Œ (0x8c)
… (0x85) -> ƒ (0x83)
† (0x86) -> (0x81)
‡ (0x87) -> ‚ (0x82)
ˆ (0x88) -> (0x1a)
‰ (0x89) -> ‹ (0x8b)
Š (0x8a) -> — (0x97)
‹ (0x8b) -> ˆ (0x88)
Œ (0x8c) -> – (0x96)
(0x8d) -> € (0x80)
Ž (0x8e) -> ™ (0x99)
(0x8f) -> € (0x80)
(0x90) -> € (0x80)
‘ (0x91) -> (0x8f)
’ (0x92) -> (0x90)
“ (0x93) -> (0x8d)
” (0x94) -> Ž (0x8e)
• (0x95) -> € (0x80)
– (0x96) -> … (0x85)
— (0x97) -> „ (0x84)
˜ (0x98) -> (0x1f)
™ (0x99) -> ’ (0x92)
š (0x9a) -> (0x9d)
› (0x9b) -> ‰ (0x89)
œ (0x9c) -> œ (0x9c)
(0x9d) -> € (0x80)
ž (0x9e) -> ž (0x9e)
Ÿ (0x9f) -> ˜ (0x98)

Several of the already reported issues likely fall under this. For example, in issue #585 the troublesome character is — (0x97), although I'm not sure how you would fix this in getText().

Luckily the XMP Metadata is always encoded in UTF-8 as per standard, so it can be relied upon.

> The pull request looks good now. Once the remaining coding style issues (importing a class which is not used) are fixed, we can merge it.

Oh, yes, haha. I noticed PHP-CS-Fixer flagged that, but my edits didn't touch Encoding.php so I didn't commit it.

GreyWyvern · 2023-06-27T19:09:03Z

Hey hey! Outside the scope of this PR, but I believe I found it: https://github.com/maxwell-bland/pdf-latin-text-encodings

Apparently PdfParser is missing an encoding, PDFDocEncoding, which the GitHub readme linked above describes as:

Encoding for text strings in a PDF document outside the document's content streams.

Adobe has a class for it, so it's a real thing: https://developer.adobe.com/experience-manager/reference-materials/cloud-service/javadoc/com/adobe/internal/pdftoolkit/core/util/PDFDocEncoding.html

The GitHub repository above has a JSON file with all the characters from the set: https://github.com/maxwell-bland/pdf-latin-text-encodings/blob/main/pdf-encoding.json and it matches what I see (e.g. 0x92 gets translated to 0x90, and 0x90 is a Right Single Quotation Mark in PDFDocEncoding).

Maybe we can use this to create a new Encoding file specifically for Document Properties.
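To illustrate the idea, here is a rough sketch of decoding a PDFDocEncoding string to UTF-8. It is not PdfParser code: the constant and function names are hypothetical, and the table is deliberately partial, covering only a handful of the code points listed in the mapping above. Every other byte is treated as ISO-8859-1, which is an assumption that does not hold for all of PDFDocEncoding.

```php
<?php
// Partial PDFDocEncoding table (hypothetical name), limited to a few
// punctuation characters from the mapping discussed in this thread.
const PDFDOC_PARTIAL_MAP = [
    0x84 => "\u{2014}", // em dash
    0x85 => "\u{2013}", // en dash
    0x8D => "\u{201C}", // left double quotation mark
    0x8E => "\u{201D}", // right double quotation mark
    0x8F => "\u{2018}", // left single quotation mark
    0x90 => "\u{2019}", // right single quotation mark
    0x92 => "\u{2122}", // trade mark sign
];

// Hypothetical helper: decode byte by byte, using the table where it has
// an entry and falling back to ISO-8859-1 everywhere else (assumption).
function decodePdfDocPartial(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $code = ord($byte);
        $out .= PDFDOC_PARTIAL_MAP[$code]
            ?? mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
    }

    return $out;
}

$decoded = decodePdfDocPartial("It\x90s here \x84 now");
echo $decoded, "\n";
```

A complete Encoding file would need the full table from the JSON file linked above rather than this excerpt.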

k00ni

I took the liberty to fix remaining coding style and PHPStan issues.

> Sure. Where would you like me to do that?

I assumed that you had particular issues in mind. We have a lot of encoding and white space related issues, so every bit of help there is appreciated.

Let's move the discussion about PDFDocEncoding to #609, so we can keep this pull request clean.

Thank you for your work! I will leave this PR open for a few days. If there are no objections/feedback I will merge it.

GreyWyvern · 2023-06-28T19:24:34Z

One caveat that I've just discovered, I don't think it's a blocker: When saving a PDF with properties, Adobe uses the contents of the Subject field to populate the dc:description XMP metadata element, even though there is a dc:subject element in the XMP specifications. This means that $details['Subject'] will retain the original decoded Subject, while the XMP 'Subject' will be in $details['Description'].

I think in the future I could refine this so that XMP values are represented in their own namespace within the getDetails() response. I assumed it was a 1:1 relationship; it almost is, but not quite.

GreyWyvern · 2023-06-29T19:34:00Z

@k00ni, if possible, I would like to submit some new code for this PR before you merge it. More and more I feel that overwriting the regular decoded getDetails() values is the wrong idea, and that XMP data should be kept distinct from them. In some cases, XMP data can even conflict with itself, such as when there are pdf:creator and dc:creator elements in the same XMP block. You can't just assign them all to $details['Creator'].

I propose adding XMP data to the getDetails() array using the lowercased and namespaced element name. So it would look like this:

Array
(
    [Creator] => Adobe Acrobat
    [Producer] => Adobe Acrobat
    [CreatedOn] => 2022-01-28T16:36:11+00:00
    [Pages] => 35
    [dc:creator] => My Name
    [pdf:producer] => Adobe Acrobat
    [dc:title] => My Document Title
    ...
)

Let me know and I'll push the commit I've created.
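The proposal above could look roughly like the following sketch. It is not the code from this PR: the function name addXmpToDetails and the sample packet are hypothetical, the ext-xml extension is assumed to be available, and only flat elements with a text value are handled, whereas real XMP often nests values in lists.

```php
<?php
// Hypothetical helper: parse an XMP packet and add each simple element to
// the details array under its lowercased, namespaced name.
function addXmpToDetails(array $details, string $xmpPacket): array
{
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);

    $values = [];
    if (1 !== xml_parse_into_struct($parser, $xmpPacket, $values)) {
        return $details; // Not well-formed XML: leave details unchanged.
    }

    foreach ($values as $node) {
        // "complete" nodes are elements without child elements.
        if ('complete' === $node['type'] && isset($node['value'])) {
            $details[strtolower($node['tag'])] = $node['value'];
        }
    }

    return $details;
}

// Made-up XMP packet with two simple properties.
$xmp = <<<'XML'
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
      <pdf:Producer>Adobe Acrobat</pdf:Producer>
      <xmp:CreatorTool>My Editor</xmp:CreatorTool>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
XML;

$details = addXmpToDetails(['Creator' => 'Adobe Acrobat', 'Pages' => 35], $xmp);
print_r($details);
```

Because the XMP keys carry a namespace prefix and are lowercased, they cannot collide with existing keys such as Creator or Producer, so nothing from the regular extraction is overwritten.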

k00ni · 2023-06-30T04:36:41Z

Sure, take the time you need to polish the pull request. No rush.

k00ni · 2023-07-04T06:23:23Z

I've added your explanation to Document.php and fixed remaining coding style issues. Declined PHP-CS-Fixer to use rule modernize_strpos, because it replaces strpos with str_starts_with, which is only available since PHP 8.0 (ref).
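For context, a prefix check that works on PHP versions older than 8.0 can be written with strpos alone. The sketch below is illustrative only; the startsWith helper is a hypothetical name and is not part of PdfParser.

```php
<?php
// Hypothetical helper: PHP 7 compatible equivalent of str_starts_with().
// The empty-needle guard mirrors str_starts_with(), which returns true
// for an empty needle, and avoids the strpos() warning on PHP 7.
function startsWith(string $haystack, string $needle): bool
{
    return '' === $needle || 0 === strpos($haystack, $needle);
}

var_dump(startsWith('application/pdf', 'application/'));
var_dump(startsWith('image/png', 'application/'));
```

The strict comparison with 0 matters: strpos() returns false when the needle is absent, and a loose comparison would treat that the same as a match at position 0.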

GreyWyvern · 2023-07-04T12:58:32Z

Looks good, thanks! :)

After this, I'll have another PR that decodes the regular metadata with PDFDocEncoding, which is why I removed some tests related to that from DocumentTest.php.

k00ni · 2023-07-04T13:02:11Z

Can I merge it now?

GreyWyvern · 2023-07-04T13:03:48Z

> Can I merge it now?

Sure, go ahead!

GreyWyvern added 2 commits June 22, 2023 15:17

Read XMP Metadata

320781d

Try to get XMP Metadata from the PDF document, and if successful, prefer it over any decoded "details" we may find.

Update Document.php

72fdc5b

Don't need $index in this loop.

k00ni added the enhancement and tests required labels Jun 23, 2023

k00ni requested changes Jun 23, 2023

src/Smalot/PdfParser/Parser.php (outdated, resolved)

src/Smalot/PdfParser/Document.php (outdated, resolved)

get => extract

aff2bb2

Change function name from getXMPMetadata to extractXMPMetadata, returns void.

GreyWyvern added 2 commits June 23, 2023 11:35

Add XMP_Metadata.pdf sample file

2af4b8c

Add XMP_Metadata.pdf sample file. Change name of ModifyDate to ModDate to match the array key name already used by PdfParser.

PHP-CS-Fixer edits

9f802d1

GreyWyvern added 2 commits June 26, 2023 10:21

Add testExtractXMPMetadata() unit test

f560110

Update testExtractXMPMetadata()

40e6cf4

On second thought, remove testing for the Registered Trademark symbol, as it is an encodable ISO-8859-1 glyph (00AE) and proper reading of it might be fixed in the rest of PdfParser eventually.

k00ni removed the tests required label Jun 27, 2023

Remove Exception class, not used.

cfb7d1f

k00ni added 3 commits June 28, 2023 09:18

fixes coding style issue in EncodingTest.php

b2bd543

Update Document.php

b80de6c

Update DocumentTest.php

fd137ef

k00ni approved these changes Jun 28, 2023

k00ni reviewed Jun 28, 2023

src/Smalot/PdfParser/Document.php (outdated, resolved)

k00ni self-assigned this Jun 29, 2023

Add XMP values, don't overwrite

7a6b9d7

Instead of overwriting various values from the getDetails() metadata array, add all collected XMP values as new array keys matching the lowercased and namespaced XML element names.

GreyWyvern added 3 commits June 30, 2023 12:56

Only save application/pdf XMP metadata

2160bd7

Make sure the XMP metadata block is referring to a PDF as some programs will add XMP blocks for included assets not of type application/pdf.

Update Usage.md

5bf3d0a

Better explanation of XMP and flow.

Only apply list simplification for strings.

56f8224

Only do this for strings. If it's a single element list but the value is an array, leave it so it makes better sense.

k00ni reviewed Jul 3, 2023

src/Smalot/PdfParser/Document.php (outdated, resolved)

Update Document.php

f78fdf6

Resolve suggested changes.

GreyWyvern changed the title from "Read XMP Metadata in place of decoded header 'details'" to "Read XMP Metadata and add it to data returned by getDetails()" Jul 3, 2023

k00ni added 3 commits July 4, 2023 08:06

Document.php: added short description

61abfdc

fixed coding style issues

5f604e7

Update Document.php

8ea4b65

k00ni merged commit 66ddf47 into smalot:master Jul 4, 2023

k00ni mentioned this pull request Jul 13, 2023

Discussions about how to organize further maintenance of this library #286

Open

k00ni mentioned this pull request Jul 21, 2023

undefined function Smalot\PdfParser\xml_parser_create() #616

Closed

GreyWyvern mentioned this pull request Jul 21, 2023

getDetails() returning single element arrays instead of strings #617

Open
