Weird characters and missing text in multiline metadata text #366

Closed
gnojus opened this issue Nov 3, 2020 · 7 comments · Fixed by #611
Labels: missing or incomplete functionality (for something which is not a bug, but more like an incomplete feature), question

Comments


gnojus commented Nov 3, 2020

Hello.
I have an issue with the way this library extracts metadata from a PDF file. It comes out corrupted, and I'm not sure why.
It looks to me as though backslashes (\) are being inserted before newlines for some reason, but I may be wrong.
Code used:

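// Parse the file and read the document-level metadata.
$parser  = new \Smalot\PdfParser\Parser();
$pdf     = $parser->parseFile('./file.pdf');
$details = $pdf->getDetails();
echo $details['Subject']; // the multiline Subject comes out corrupted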

I have attached the PDFs I tested, together with the metadata descriptions I used (inserted with Adobe Acrobat) and the output I got.
pdf_1.pdf
pdf_2.pdf
pdf_3.pdf
subject_in_1.txt
subject_in_2.txt
subject_in_3.txt
subject_out_1.txt
subject_out_2.txt
subject_out_3.txt

Also, I actually came here because of another library, https://github.com/pauln/tcpdi. However, that one didn't seem very active, and both of these libraries seem to have problems with metadata. That brings me to my other question: what is the relationship between tcpdi_parser and tcpdf_parser? Could tcpdi_parser be replaced by tcpdf_parser, so as to have a more up-to-date version?

k00ni added the labels "missing or incomplete functionality", "needs more info" and "question" on Nov 9, 2020

k00ni (Collaborator) commented Nov 9, 2020

> Also, I actually came here because of another library, https://github.com/pauln/tcpdi. However, that one didn't seem very active, and both of these libraries seem to have problems with metadata. That brings me to my other question: what is the relationship between tcpdi_parser and tcpdf_parser? Could tcpdi_parser be replaced by tcpdf_parser, so as to have a more up-to-date version?

PDFParser used some code from https://github.com/tecnickcom/TCPDF, but I ported it and removed TCPDF as a Composer dependency. The ported files can be found here: https://github.com/smalot/pdfparser/tree/master/src/Smalot/PdfParser/RawData

Last time I checked tecnickcom's TCPDF project, it had the following message in the README.md:

A new version of this library is under development at https://github.com/tecnickcom/tc-lib-pdf 
and as a consequence this version will not receive any additional development or support. 
This version should be considered obsolete, new projects should use the new version as soon 
it will become stable.

I don't know if tcpdf_parser is more up to date, but the project it belongs to is not actively maintained, and I don't think it's a good idea to add it as a dependency again. We could, however, discuss potential code contributions, for instance as a PR.

gnojus (Author) commented Nov 11, 2020

All right, thanks for the information.
Do you have any idea about the corrupted metadata (description)?

gnojus (Author) commented Nov 19, 2020

All right, I think I found the bug myself: special characters in PDF strings may be escaped (e.g. (, ), \r), so they should be unescaped when querying metadata.
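
As an illustration of the unescaping described above, here is a minimal sketch in PHP. The function name `unescapePdfLiteralString` is hypothetical and is not part of PdfParser's API. The sketch assumes its input is the raw content of a PDF literal string (the text between the outer parentheses) and handles only backslash escapes, not the text encoding (PDFDocEncoding or UTF-16BE) that metadata strings may additionally use.

```php
<?php

/**
 * Hypothetical helper (not part of PdfParser): decode the backslash escapes
 * of a PDF literal string, given the text between the outer parentheses.
 */
function unescapePdfLiteralString(string $raw): string
{
    // Single-character escapes defined for PDF literal strings.
    $simple = [
        'n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\x0C",
        '(' => '(', ')' => ')', '\\' => '\\',
    ];

    $out = '';
    $len = strlen($raw);
    for ($i = 0; $i < $len; $i++) {
        $ch = $raw[$i];
        if ($ch !== '\\') {
            $out .= $ch;
            continue;
        }
        // Lone trailing backslash: nothing follows it, so drop it.
        if (++$i >= $len) {
            break;
        }
        $next = $raw[$i];
        // Count octal digits at the current position (at most three).
        $octalLen = strspn($raw, '01234567', $i, 3);
        if (isset($simple[$next])) {
            $out .= $simple[$next];
        } elseif ($octalLen > 0) {
            // Octal escape \d, \dd or \ddd; overflow above one byte is discarded.
            $out .= chr(octdec(substr($raw, $i, $octalLen)) & 0xFF);
            $i += $octalLen - 1;
        } elseif ($next === "\r" || $next === "\n") {
            // Backslash + end-of-line is a line continuation: emit nothing.
            if ($next === "\r" && $i + 1 < $len && $raw[$i + 1] === "\n") {
                $i++; // swallow the LF of a CRLF pair
            }
        } else {
            // Unknown escape: the backslash is ignored, the character is kept.
            $out .= $next;
        }
    }

    return $out;
}
```

Under those assumptions, a value such as `Line 1\rLine 2` (a literal backslash followed by `r`) decodes to two lines separated by a carriage return. Whether `getDetails()` returns the raw escaped form depends on the library version, so applying this to an already decoded value could damage text that legitimately contains backslashes.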

k00ni (Collaborator) commented Nov 19, 2020

> All right, I think I found the bug myself: special characters in PDF strings may be escaped (e.g. (, ), \r), so they should be unescaped when querying metadata.

So does one have to take care of this oneself, or are you suggesting a change in PDFParser?

gnojus (Author) commented Nov 20, 2020

I strongly believe that this should be handled by the library: the end user shouldn't be required to read the PDF specification and implement their own unescaping function just to extract metadata as plain text.

k00ni (Collaborator) commented Jul 6, 2023

@gnojus Can you please test whether #611 fixes your problem? That would be helpful.

k00ni linked a pull request on Jul 6, 2023 that will close this issue

gnojus (Author) commented Jul 6, 2023

Yes, from a quick test, it seems to work correctly with my first test pdf.
