Weird characters and missing text in multiline metadata text #366

gnojus · 2020-11-03T18:51:01Z

Hello.
I have an issue with the way this library extracts metadata from a PDF file: the text comes out corrupted, and I'm not sure why.
It looks to me as if, for some reason, a backslash (\) is inserted before each newline, but I may be wrong.
Code used:

$parser  = new \Smalot\PdfParser\Parser();
$pdf     = $parser->parseFile('./file.pdf');
$details = $pdf->getDetails();
echo $details['Subject'];

I have attached the PDFs I tested, together with the metadata descriptions used, which were inserted with Adobe Acrobat.
pdf_1.pdf
pdf_2.pdf
pdf_3.pdf
subject_in_1.txt
subject_in_2.txt
subject_in_3.txt
subject_out_1.txt
subject_out_2.txt
subject_out_3.txt

Also, I actually came here because of another library, https://github.com/pauln/tcpdi. However, that one didn't seem very active, and both of these libraries seemed to have problems with metadata. That leads to my other question: what is the relationship between tcpdi_parser and tcpdf_parser? Could tcpdi_parser be replaced by tcpdf_parser to get a more up-to-date version?


k00ni · 2020-11-09T09:45:21Z

> Also, I actually came here because of another library, https://github.com/pauln/tcpdi. However, that one didn't seem very active, and both of these libraries seemed to have problems with metadata. That leads to my other question: what is the relationship between tcpdi_parser and tcpdf_parser? Could tcpdi_parser be replaced by tcpdf_parser to get a more up-to-date version?

PDFParser used some code from https://github.com/tecnickcom/TCPDF, but I ported it and removed TCPDF as a Composer dependency. The files I ported can be found here: https://github.com/smalot/pdfparser/tree/master/src/Smalot/PdfParser/RawData

Last time I checked tecnickcom's TCPDF project, it had the following message in the README.md:

> A new version of this library is under development at https://github.com/tecnickcom/tc-lib-pdf
> and as a consequence this version will not receive any additional development or support.
> This version should be considered obsolete, new projects should use the new version as soon
> it will become stable.

I don't know if tcpdf_parser is more up to date, but the project it belongs to is not actively maintained, and I don't think it's a good idea to add it as a dependency again. But we could discuss potential code contributions, as a PR for instance.

gnojus · 2020-11-11T16:06:59Z

All right, thanks for the information.
Do you have any idea about the corrupted metadata (description)?

gnojus · 2020-11-19T13:13:43Z

All right, I think I found the bug myself: special characters in PDF strings may be escaped (e.g. (, ), \r), so they should be un-escaped when querying metadata.
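
The un-escaping described above can be sketched roughly as follows. This is a minimal sketch, assuming the backslash escape rules for literal strings in the PDF specification (single-character escapes, octal codes of one to three digits, and backslash plus end-of-line as a line continuation). The function name `unescapePdfLiteralString` is hypothetical and not part of PDFParser's API; it expects the raw bytes of the string without the enclosing parentheses and does not deal with text encodings.

```php
<?php

/**
 * Hypothetical helper (NOT part of smalot/pdfparser): decodes backslash
 * escapes in the raw bytes of a PDF literal string (outer parentheses removed).
 */
function unescapePdfLiteralString(string $raw): string
{
    // Single-character escapes defined for PDF literal strings.
    $simple = [
        'n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\x0C",
        '(' => '(', ')' => ')', '\\' => '\\',
    ];

    $out = '';
    $len = strlen($raw);
    for ($i = 0; $i < $len; $i++) {
        $ch = $raw[$i];
        if ($ch !== '\\') {
            $out .= $ch; // Ordinary byte: copy as-is.
            continue;
        }
        if (++$i >= $len) {
            break; // A trailing lone backslash is dropped.
        }
        $next = $raw[$i];
        if (isset($simple[$next])) {
            $out .= $simple[$next];
        } elseif (strpos('01234567', $next) !== false) {
            // Octal escape: take up to three consecutive octal digits.
            $digits = substr($raw, $i, strspn($raw, '01234567', $i, 3));
            $i += strlen($digits) - 1;
            $out .= chr(octdec($digits) & 0xFF); // High-order overflow is ignored.
        } elseif ($next === "\r") {
            // Backslash + end-of-line is a line continuation: emit nothing.
            if ($i + 1 < $len && $raw[$i + 1] === "\n") {
                $i++; // Swallow the LF of a CRLF pair.
            }
        } elseif ($next === "\n") {
            // Line continuation with a bare LF: emit nothing.
        } else {
            // Unknown escape: the backslash is ignored, the character is kept.
            $out .= $next;
        }
    }

    return $out;
}
```

With something like this, a raw value such as `a\(b\)c` would come back as `a(b)c`, and an escaped `\r` would become a real carriage return. Whether the library exposes the raw string at a point where such a helper could be applied is an assumption on my part.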

k00ni · 2020-11-19T15:04:10Z

> All right, I think I found the bug myself: special characters in PDF strings may be escaped (e.g. (, ), \r), so they should be un-escaped when querying metadata.

So does one have to take care of this oneself, or do you suggest a change in PDFParser?

gnojus · 2020-11-20T12:40:04Z

I strongly believe that this should be handled by the library: the end user shouldn't be required to read the PDF specification and implement their own un-escaping function just to extract metadata as simple text.

k00ni · 2023-07-06T08:00:47Z

@gnojus Can you please test whether #611 fixes your problems? That would be helpful.

gnojus · 2023-07-06T21:09:26Z

Yes, from a quick test, it seems to work correctly with my first test pdf.

k00ni added the "missing or incomplete functionality", "needs more info", and "question" labels Nov 9, 2020

k00ni removed the "needs more info" label Nov 19, 2020

GreyWyvern mentioned this issue Jul 4, 2023

Enable PDFDocEncoding support for metadata #611

Merged

k00ni linked a pull request Jul 6, 2023 that will close this issue

Enable PDFDocEncoding support for metadata #611

Merged

k00ni closed this as completed in #611 Jul 11, 2023
