Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode PDFString containing non-Latin characters #162

Closed
wants to merge 1 commit into from

Conversation

PlushBeaver
Copy link

ASCII representation for PDFString cannot handle code points above 127 (7-bit ASCII). Such strings have to be encoded as Unicode (UTF16+BOM). This implementation opts to represent them as hex strings for ease of encoding.

@Hopding
Copy link
Owner

Hopding commented Sep 3, 2019

Hello @PlushBeaver! I just had a chance to review this. My apologies for the late response.

Thank you very much for the time and effort you spent on this! I have a couple of questions/concerns:

  1. The PDFString class is meant to represent PDF literal strings, specifically. The PDFHexString class exists to represent hex strings. So I would prefer not to have a PDFString class encode itself as a hex string.
    However, the PDF spec does allow for literal strings to contain characters outside the ASCII set via octal character codes. E.g. (Foo\053Bar) represents Foo+Bar. What do you think about implementing things this way?

  2. PDF files should not contain non-ascii characters within string literals (this would violate the spec). And checking to see if they do whenever a PDFString is created will negatively impact parsing performance (which is already a very intensive operation). So if we adopt these changes, we'll need a way to ensure they do not run when creating strings during document parsing.

  3. Aside from checking for non-ascii characters when parsing, what use cases are you thinking of that these changes would address?

@PlushBeaver
Copy link
Author

Thanks for the concerns raised, @Hopding.

  1. I find the need for library users to choose between PDFString and PDFHexString impractical, because it exposes PDF internals and limitations (while providing flexibility if needed). Would you recommend using PDFHexString when there's any chance that the string contains characters outside ASCII? If so, this PR should not exist at all.

    Otherwise, I must indeed take care of octal representation when decoding strings and consider using it when encoding. Hex seems preferable for non-English text (Cyrillic, CJK, etc), while octal may be better for seldom special characters in mostly Latin text (diacritics, punctuation, etc).

  2. Parsing got completely off my mind (see below). Are parsing tests used as benchmarks? Note that decoding needs not be done while parsing, it can be a lazy operation on a string stored "as is" (encoded).

  3. Primary use case is inserting Unicode metadata and outline entries. I stumbled into this with pagedjs-cli. That tool uses an outdated version of pdf-lib and it uses PDFStrings. However, you raised an interesting question of loading documents with such data.

@Hopding
Copy link
Owner

Hopding commented Sep 27, 2019

Hello @PlushBeaver! My apologies for the delayed response - I've been quite busy. But I haven't forgotten about this. I've been thinking about the best way to handle all of this. I finally made some decisions, and have been working on implementing things in #204 over the past few days. The changes I made in #204 are all based on your work here, but with a few differences. Please take a look at #204 and let me know if I missed anything that you solved here. If it looks good to you, then I'll merge it (closing both #162 and #204) and the changes will go out in the next pdf-lib release!

@PlushBeaver
Copy link
Author

Hello, @Hopding. Specialized facilities from #204 are both handy and flexible enough not only to solve metatata encoding issues, but also to implement Unicode outline using new PDFHexString.fromText(string). This PR can indeed be closed in favor of #204. Thank you for the excellent library you provide!

@Hopding Hopding closed this in #204 Oct 1, 2019
@Hopding
Copy link
Owner

Hopding commented Oct 9, 2019

Version 1.2.0 is now published. It contains the changes from this #204. The full release notes are available here.

You can install this new version with npm:

npm install pdf-lib@1.2.0

It's also available on unpkg:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants