Read XMP Metadata and add it to data returned by getDetails() #606

Merged: 19 commits, Jul 4, 2023
Changes from 15 commits
40 changes: 40 additions & 0 deletions doc/Usage.md
@@ -140,9 +140,49 @@ Array
```php
    [Producer] => Adobe Acrobat
    [CreatedOn] => 2022-01-28T16:36:11+00:00
    [Pages] => 35
    ...
)
```

If the PDF contains Extensible Metadata Platform (XMP) XML metadata, its values are appended to the data returned by `getDetails()`, keyed by their lowercased, namespace-prefixed property names (for example `dc:title`). You can read more about commonly used values and namespaces in the [XMP Specifications](https://github.com/adobe/XMP-Toolkit-SDK/tree/main/docs).

```php
Array
(
    ...
    [Pages] => 35
    [dc:creator] => My Name
    [pdf:producer] => Adobe Acrobat
    [dc:title] => My Document Title
    ...
)
```
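
A minimal sketch of reading these values defensively. The `$details` array is a hand-written stand-in for what `getDetails()` might return (in real use it would come from `(new \Smalot\PdfParser\Parser())->parseFile(...)->getDetails()`), and the helper `xmpValue()` is hypothetical, not part of the library. It assumes that an XMP key is only present when the PDF actually carries that property, so it falls back to a default instead of indexing the array directly.

```php
<?php

// Stand-in for $document->getDetails(); the values are illustrative only.
$details = [
    'Producer'     => 'Adobe Acrobat',
    'Pages'        => 35,
    'dc:creator'   => 'My Name',
    'pdf:producer' => 'Adobe Acrobat',
    'dc:title'     => 'My Document Title',
];

/**
 * Hypothetical helper: look up an XMP value by its namespaced key and
 * return $default when the PDF did not provide that property.
 * The key is lowercased because the XMP keys are stored in lowercase.
 */
function xmpValue(array $details, string $key, $default = null)
{
    $key = strtolower($key);

    return array_key_exists($key, $details) ? $details[$key] : $default;
}

echo xmpValue($details, 'dc:title', '(untitled)'), PHP_EOL;
echo xmpValue($details, 'dc:description', '(no description)'), PHP_EOL;
```

Note that the helper only lowercases the lookup key, so it suits the XMP entries; the non-XMP keys such as `Pages` keep their original case and should be read directly.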

Some XMP properties hold multiple values, or named children with their own values. In these cases, the value is an array. The returned data follows the structure of the XML, so values can be nested several levels deep.

```php
Array
(
    ...
    [dc:title] => My Document Title
    [xmptpg:maxpagesize] => Array
        (
            [stdim:w] => 21.500000
            [stdim:h] => 6.222222
            [stdim:unit] => Inches
        )
    [xmptpg:platenames] => Array
        (
            [0] => Cyan
            [1] => Magenta
            [2] => Yellow
            [3] => Black
        )
    ...
)
```
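
Because the nesting depth is not fixed, code that consumes this data may want to walk it recursively. The following is a sketch under stated assumptions: `$details` is a hand-written stand-in for the `getDetails()` result, and `flattenDetails()` is a hypothetical helper (not part of the library) that turns nested arrays into `path => value` pairs, using `/` as an arbitrary path separator.

```php
<?php

// Stand-in for $document->getDetails(); the values are illustrative only.
$details = [
    'dc:title' => 'My Document Title',
    'xmptpg:maxpagesize' => [
        'stdim:w'    => '21.500000',
        'stdim:h'    => '6.222222',
        'stdim:unit' => 'Inches',
    ],
    'xmptpg:platenames' => ['Cyan', 'Magenta', 'Yellow', 'Black'],
];

/**
 * Hypothetical helper: flatten nested metadata into "path => scalar" pairs.
 * List items get their numeric index as the last path segment.
 */
function flattenDetails(array $details, string $prefix = ''): array
{
    $flat = [];
    foreach ($details as $key => $value) {
        $path = '' === $prefix ? (string) $key : $prefix.'/'.$key;
        if (is_array($value)) {
            // Descend into named children or list items
            $flat += flattenDetails($value, $path);
        } else {
            $flat[$path] = $value;
        }
    }

    return $flat;
}

foreach (flattenDetails($details) as $path => $value) {
    echo $path, ' => ', $value, PHP_EOL;
}
```

A flat view like this can be convenient for logging or searching, while the nested form stays closer to the original XML.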


## Read Base64 encoded PDFs

If working with [Base64](https://en.wikipedia.org/wiki/Base64) encoded PDFs, you might want to parse the PDF without saving the file to disk.
Binary file added samples/XMP_Metadata.pdf
Binary file not shown.
93 changes: 93 additions & 0 deletions src/Smalot/PdfParser/Document.php
@@ -61,6 +61,11 @@ class Document
*/
protected $trailer;

/**
* @var array<mixed>
*/
protected $metadata = [];

/**
* @var array
*/
@@ -144,9 +149,97 @@ protected function buildDetails()
$details['Pages'] = 0;
}

$details = array_merge($details, $this->metadata);

$this->details = $details;
}

/**
* Extract XMP Metadata
*/
public function extractXMPMetadata(string $content): void
{
$xml = xml_parser_create();
xml_parser_set_option($xml, \XML_OPTION_SKIP_WHITE, 1);

if (xml_parse_into_struct($xml, $content, $values, $index)) {

$metadata = [];
$stack = [];
foreach ($values as $val) {

// Standardize to lowercase
$val['tag'] = strtolower($val['tag']);

// Ignore structural x: and rdf: XML elements
if (strpos($val['tag'], 'x:') === 0) continue;
if (strpos($val['tag'], 'rdf:') === 0 && 'rdf:li' != $val['tag']) continue;

switch ($val['type']) {
case 'open':
// Create an array of list items
if ('rdf:li' == $val['tag']) {
$metadata[] = [];

// Move up one level in the stack
$stack[count($stack)] = &$metadata;
$metadata = &$metadata[count($metadata) - 1];

// Else create an array of named values
} else {
$metadata[$val['tag']] = [];

// Move up one level in the stack
$stack[count($stack)] = &$metadata;
$metadata = &$metadata[$val['tag']];
}
break;

case 'complete':
if (isset($val['value'])) {

// Assign a value to this list item
if ('rdf:li' == $val['tag']) {
$metadata[] = $val['value'];

// Else assign a value to this property
} else {
$metadata[$val['tag']] = $val['value'];
}
}
break;

case 'close':
// If the value of this property is a single-
// element array where the element is of type
// string, use the value of the first list item
// as the value for this property
if (is_array($metadata) && isset($metadata[0]) && count($metadata) == 1 && is_string($metadata[0])) {
$metadata = $metadata[0];
}

// Move down one level in the stack
$metadata = &$stack[count($stack) - 1];
unset($stack[count($stack) - 1]);
break;

}
}

// Only use this metadata if it's referring to a PDF
if (isset($metadata['dc:format']) && 'application/pdf' == $metadata['dc:format']) {

// According to the XMP specifications: 'Conflict resolution
// for separate packets that describe the same resource is
// beyond the scope of this document.' - Section 6.1
// So if there are multiple XMP blocks, just merge the values
// of each found block over top of the existing values
$this->metadata = array_merge($this->metadata, $metadata);
}
}
xml_parser_free($xml);
}

public function getDictionary(): array
{
return $this->dictionary;
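
For readers following the loop in `extractXMPMetadata()`, here is a small standalone sketch of what `xml_parse_into_struct()` produces for an XMP-like packet. The packet is hand-written for illustration and not taken from a real PDF. The sketch assumes PHP's default case folding, which upper-cases tag names (the reason the method lowercases `$val['tag']`), and shows the `open`, `complete`, and `close` entry types that the `switch` statement handles.

```php
<?php

// Tiny hand-written XMP-like packet (illustrative only).
$xmp = <<<'XML'
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">My Document Title</rdf:li></rdf:Alt></dc:title>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
XML;

$parser = xml_parser_create();
// Skip whitespace-only text between elements, as the method does
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
$ok = xml_parse_into_struct($parser, $xmp, $values);
unset($parser);

// Each entry has a tag name and a type; 'complete' entries for
// elements with text content also carry a 'value'.
foreach ($values as $val) {
    $value = isset($val['value']) ? ' = '.$val['value'] : '';
    printf("%-8s %s%s\n", $val['type'], strtolower($val['tag']), $value);
}
```

Elements with children appear as an `open`/`close` pair, which is what drives the push and pop on `$stack`, while leaf elements such as `dc:format` and `rdf:li` appear as a single `complete` entry.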
1 change: 0 additions & 1 deletion src/Smalot/PdfParser/Encoding.php
@@ -32,7 +32,6 @@

namespace Smalot\PdfParser;

use Exception;
use Smalot\PdfParser\Element\ElementNumeric;
use Smalot\PdfParser\Encoding\EncodingLocator;
use Smalot\PdfParser\Encoding\PostScriptGlyphs;
3 changes: 3 additions & 0 deletions src/Smalot/PdfParser/Parser.php
@@ -214,6 +214,9 @@ protected function parseObject(string $id, array $structure, ?Document $document
// It is not necessary to store this content.

return;
} elseif ($header->get('Type')->equals('Metadata')) {
// Attempt to parse XMP XML Metadata
$document->extractXMPMetadata($content);
}
break;

16 changes: 16 additions & 0 deletions tests/PHPUnit/Integration/DocumentTest.php
@@ -255,4 +255,20 @@ public function testGetTextWithPageLimit(): void
// given text is on page 2, it has to be ignored because of that
self::assertStringNotContainsString('Medeni Usul ve İcra İflas Hukuku', $document->getText(1));
}

/**
* Tests extraction of XMP Metadata vs. getHeader() data.
*
* @see https://github.com/smalot/pdfparser/pull/606
*/
public function testExtractXMPMetadata(): void
{
$document = (new Parser())->parseFile($this->rootDir.'/samples/XMP_Metadata.pdf');

$details = $document->getDetails();

// Test that the dc:title data was extracted from the XMP
// Metadata.
self::assertStringContainsString("Enhance PdfParser\u{2019}s Metadata Capabilities", $details['dc:title']);
}
}
1 change: 0 additions & 1 deletion tests/PHPUnit/Integration/EncodingTest.php
@@ -35,7 +35,6 @@

namespace PHPUnitTests\Integration;

use Exception;
use PHPUnitTests\TestCase;
use Smalot\PdfParser\Document;
use Smalot\PdfParser\Element;