Definition of octal codes in literal strings related to UTF-16BE encoding #494

Open
jmlehton opened this issue Nov 22, 2024 · 11 comments
Labels
documentation Improvements or additions to documentation proposed solution Proposed solution is ready for review


@jmlehton

jmlehton commented Nov 22, 2024

We feel there is a small ambiguity regarding literal strings in ISO 32000-2:2020 (and previous versions). In ch. 7.3.4.2 "Literal strings", Table 3, a single "\ddd" octal code is defined as a "character code". Isn't a "character code" something that maps to a character in the code page in question? For strings encoded with UTF-16BE, a single octal code cannot really be used as a mapping character code (i.e. \ddd does not map to a Unicode character). Of course this can be done with multiple octal codes, but the definition is about a single octal code \ddd. As a result, it may be unclear to the reader whether octal codes can be used in a UTF-16BE encoded string with multi-byte characters.

It is true that ch. 7.3.4.2 "Literal strings" also states that any 8-bit value may appear in a string, including with the octal "notation described". But this can still be read so that "notation described" refers to the definition of an octal code as a mapping character code (i.e. limits it to that), which leads back to the original ambiguity.

In future revisions, we suggest reconsidering or explaining the term "character code" used for octal codes, and adding a short sentence about its usage with Unicode where a single character requires multiple bytes.

@jmlehton jmlehton added the bug Something isn't correct label Nov 22, 2024
@petervwyatt petervwyatt added documentation Improvements or additions to documentation and removed bug Something isn't correct labels Nov 22, 2024
@petervwyatt
Member

I cannot find the use of the phrase "character code" anywhere in 7.3.4 subclauses - and that would be incorrect. But I agree that the language is confusing since the word "character" is used for both the bytes comprising the string in the input PDF as well as what they mean once lexed/de-escaped:

7.3.4.1: "A string object shall consist of a series of zero or more bytes."
7.3.4.2: "A literal string shall be written as an arbitrary number of characters enclosed ..."

The correct terminology should be that "characters" are what make up the string in "raw PDF" (pre-lexing), while "bytes" are what they represent post-lexing. So an octal code \ddd in a literal string comprises up to 4 "characters" but represents a single "byte" in that string object. The interpretation of those bytes (such as in a specific encoding) depends on type definitions elsewhere in the spec, according to 7.9.2 and Figure 7.
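The distinction between pre-lexing "characters" and post-lexing "bytes" can be illustrated with a minimal Python sketch. The helper name `unescape_octal` is hypothetical, and the sketch handles only `\ddd` escapes (one to three octal digits, high-order overflow ignored); a real lexer would also have to handle the other escape sequences, such as an escaped backslash or parentheses.

```python
import re

# A backslash followed by one to three octal digits (greedy, so up to three
# digits are consumed when they are available).
_OCTAL_ESCAPE = re.compile(rb"\\([0-7]{1,3})")


def unescape_octal(raw: bytes) -> bytes:
    """Replace each \\ddd escape in the raw string body with one byte.

    Hypothetical helper: other escapes (\\n, \\\\, \\( ...) are NOT handled.
    """
    def to_byte(match):
        # "& 0xFF" drops high-order overflow, e.g. for \777.
        return bytes([int(match.group(1), 8) & 0xFF])

    return _OCTAL_ESCAPE.sub(to_byte, raw)


# Body of the literal string (\376\377\000\101): 16 characters in the file.
raw = rb"\376\377\000\101"
data = unescape_octal(raw)          # 4 bytes: FE FF 00 41

# Only now does an encoding come into play: the FE FF prefix marks the
# bytes as UTF-16BE, so the remaining two bytes form a single character.
text = data[2:].decode("utf-16-be")  # "A"
```

Here no individual `\ddd` escape corresponds to a character; each one yields exactly one byte, and the character appears only after the resulting byte string is interpreted as UTF-16BE.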

@petervwyatt
Member

Proposed solution:

The \ddd escape sequence provides a way to represent ~~characters~~ bytes outside the printable ASCII character set.

and

Since any 8-bit value may appear in a string (with proper escaping for REVERSE SOLIDUS (backslash) and unbalanced PARENTHESES) this \ddd notation provides a way to specify ~~characters~~ bytes outside the ASCII character set by using ASCII characters only. However, any 8-bit value may appear in a string, represented either as itself or with the \ddd notation described.

@petervwyatt petervwyatt self-assigned this Nov 23, 2024
@petervwyatt petervwyatt added the proposed solution Proposed solution is ready for review label Nov 23, 2024
@car222222

I think that at least one change is also needed in 7.3.4.1:

The term “literal characters” is also used there (meaning “bytes”), so this probably needs to be changed or clarified too:

As a sequence of literal characters enclosed in parentheses () (using LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)); see 7.3.4.2, "Literal strings"

Also, maybe the final sentence of 7.3.4.1 could be expanded somewhat, to say that “7.9.1+2 explains the use of such “byte strings” to represent characters in string objects, using various character encodings including multi-byte schemes”.

Currently it is:

Subclause 7.9.2, "String object types" describes the encoding schemes used for the contents of string objects.

@petervwyatt
Member

I agree 7.3.4.1, 1st bullet should drop the word "literal" - it should just state "characters" so it is consistent with the terminology throughout 7.3.4.2:

  • As a sequence of ~~literal~~ characters enclosed in parentheses () (using LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)); see 7.3.4.2, "Literal strings"

I think other errata we have already applied sufficiently cover encodings and the fact that the lexical form of a string object is orthogonal to any character encoding in string data - see from this point down https://pdf-issues.pdfa.org/32000-2-2020/clause07.html#H7.9.1

@jmlehton
Author

> I cannot find the use of the phrase "character code" anywhere in 7.3.4 subclauses - and that would be incorrect.

The "character code" definition is in the last row of Table 3 (in ch. 7.3.4.2).

@car222222

car222222 commented Nov 24, 2024

That definition could be made more precise, and understandable, as follows:

\ddd 8-bit Character code ddd (3 octal digits)

(Assuming I interpreted it correctly 😄 !)

@car222222

I edited this last comment to make the following correction:
ASCII changed to 8-bit

@jmlehton
Author

I would suggest:

\ddd Byte code ddd (3 octal digits)

I feel that the word "character" is somewhat problematic here. When octal coding is used for a UTF-16BE encoded string, an octal code \ddd does not map to any character, but to a single byte of a multi-byte character, since UTF-16BE has only 2- and 4-byte characters.
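To make this concrete, the following Python sketch writes a text string as a literal string in which every byte is an octal escape. The function name `to_octal_literal` is hypothetical, and prefixing the FE FF byte order marker assumes the string is a text string encoded as UTF-16BE.

```python
def to_octal_literal(text: str) -> str:
    """Write text as a PDF literal string made only of \\ddd escapes.

    Hypothetical helper: assumes a UTF-16BE text string with an FE FF prefix.
    """
    data = b"\xfe\xff" + text.encode("utf-16-be")
    # One \ddd escape per byte, never per character.
    return "(" + "".join(f"\\{byte:03o}" for byte in data) + ")"


# U+20AC EURO SIGN is one character but two bytes (20 AC) in UTF-16BE,
# so it needs two escapes after the two escapes of the byte order marker.
euro = to_octal_literal("\u20ac")   # (\376\377\040\254)

# A character outside the Basic Multilingual Plane is a surrogate pair,
# i.e. four bytes and therefore four escapes after the marker.
emoji = to_octal_literal("\U0001F604")
```

Neither `\040` nor `\254` means anything as a character on its own; only the pair of bytes, interpreted as UTF-16BE, identifies the euro sign.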

@petervwyatt
Member

Thanks. The proposed fix for Table 3 is quite simple - it's a byte:

\ddd Byte with value ddd in octal

I would NOT add "(3 octal digits)" as that is incorrect - it can be 1-3 octal digits as per normative text further down.

@car222222

OK with me.

@jmlehton
Author

> Thanks. The proposed fix for Table 3 is quite simple - it's a byte:
>
> \ddd Byte with value ddd in octal
>
> I would NOT add "(3 octal digits)" as that is incorrect - it can be 1-3 octal digits as per normative text further down.

This is great. Thanks. And yes, "(3 octal digits)" would be incorrect.


3 participants