Definition of octal codes in literal strings related to UTF-16BE encoding #494

jmlehton · 2024-11-22T17:40:08Z

We feel there is a small ambiguity regarding literal strings in ISO 32000-2:2020 (and previous versions). In ch. 7.3.4.2 "Literal strings", Table 3, a single "\ddd" octal code is defined as a "character code". Isn't a "character code" something that maps to a character in the code page in question? For strings encoded with UTF-16BE, a single octal code cannot really be used as a mapping character code (i.e. \ddd does not map to a Unicode character). Of course this can be done with multiple octal codes, but the definition is about a single octal code \ddd. As a result, it may be unclear to the reader whether octal escapes can be used in a UTF-16BE encoded string, where every character takes multiple bytes.

It is true that ch. 7.3.4.2 "Literal strings" also states that any 8-bit value can appear in a string using the octal "notation described". But this can still be read so that "notation described" refers to the definition of an octal code as a mapping character code (i.e. limits it to that), which leads back to the original ambiguity.

For future revisions, we suggest reconsidering or explaining the term "character code" used for octal codes, and adding a short sentence about its use with Unicode where a single character requires multiple bytes.

petervwyatt · 2024-11-23T00:10:21Z

I cannot find the use of the phrase "character code" anywhere in 7.3.4 subclauses - and that would be incorrect. But I agree that the language is confusing since the word "character" is used for both the bytes comprising the string in the input PDF as well as what they mean once lexed/de-escaped:

7.3.4.1: "A string object shall consist of a series of zero or more bytes."
7.3.4.2: "A literal string shall be written as an arbitrary number of characters enclosed ..."

The correct terminology should be: "characters" are what comprise the string in "raw PDF" (pre-lexing), while "bytes" are what they represent post-lexing. So an octal code \ddd in a literal string comprises up to 4 "characters" but represents a single "byte" in that string object. The interpretation of those bytes (such as in a specific encoding) depends on type definitions elsewhere in the spec, per 7.9.2 and Figure 7.
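To illustrate the two layers described here, a minimal Python sketch follows. It is not taken from the spec or from any PDF library: `unescape_octal` is a hypothetical helper that handles only `\ddd` escapes (the other escapes such as `\n`, `\\`, `\(` are deliberately ignored, so an escaped backslash followed by digits would be mishandled), and wrapping out-of-range values to 8 bits is an assumption of this sketch. Lexing yields bytes first; the UTF-16BE interpretation is a separate, later step.

```python
import re

# Hypothetical helper: matches a backslash followed by 1 to 3 octal digits.
_OCTAL = re.compile(rb"\\([0-7]{1,3})")

def unescape_octal(raw: bytes) -> bytes:
    """Replace each \\ddd escape with the single byte it denotes."""
    # Assumption of this sketch: values above 0o377 are wrapped to 8 bits.
    return _OCTAL.sub(lambda m: bytes([int(m.group(1), 8) & 0xFF]), raw)

# Contents of a literal string as written in the file (parentheses removed):
# 21 ASCII characters on the "raw PDF" side.
raw = rb"\376\377\000A\003\251"

# Step 1 (lexing): each \ddd becomes exactly one byte.
data = unescape_octal(raw)

# Step 2 (interpretation): only now is an encoding applied. The first two
# bytes are the UTF-16BE byte order marker, the rest are 2-byte code units.
text = data[2:].decode("utf-16-be")
```

Here no single escape corresponds to a character: `\003` and `\251` are two bytes that together form the one character U+03A9.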

petervwyatt · 2024-11-23T00:15:19Z

Proposed solution:

The \ddd escape sequence provides a way to represent ~~characters~~ bytes outside the printable ASCII character set.

and

Since any 8-bit value may appear in a string (with proper escaping for REVERSE SOLIDUS (backslash) and unbalanced PARENTHESES) this \ddd notation provides a way to specify ~~characters~~ bytes outside the ASCII character set by using ASCII characters only. However, any 8-bit value may appear in a string, represented either as itself or with the \ddd notation described.

car222222 · 2024-11-23T09:29:04Z

I think that at least one change is also needed in 7.3.4.1:

The term “literal characters” is also used there (meaning “bytes”), so this probably needs to be changed/clarified too:

• As a sequence of literal characters enclosed in parentheses () (using LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)); see 7.3.4.2, "Literal strings"

Also, maybe the final sentence of 7.3.4.1 could be expanded somewhat, to say that “7.9.1+2 explains the use of such “byte strings” to represent characters in string objects, using various character encodings including multi-byte schemes”.

Currently it is:

Subclause 7.9.2, "String object types" describes the encoding schemes used for the contents of string objects.

petervwyatt · 2024-11-24T00:37:46Z

I agree 7.3.4.1, 1st bullet should drop the word "literal" - it should just state "characters" so it is consistent with the terminology throughout 7.3.4.2:

As a sequence of ~~literal~~ characters enclosed in parentheses () (using LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)); see 7.3.4.2, "Literal strings"

I think other errata we have already applied sufficiently cover encodings and the fact that the lexical form of a string object is orthogonal to any character encoding in string data - see from this point down https://pdf-issues.pdfa.org/32000-2-2020/clause07.html#H7.9.1

jmlehton · 2024-11-24T06:40:15Z

> I cannot find the use of the phrase "character code" anywhere in 7.3.4 subclauses - and that would be incorrect.

The "character code" definition is in the last row of Table 3 (in ch. 7.3.4.2).

car222222 · 2024-11-24T08:59:36Z

That definition could be made more precise, and understandable, as follows:

\ddd 8-bit Character code ddd (3 octal digits)

(Assuming I interpreted it correctly 😄 !)

car222222 · 2024-11-24T12:17:24Z

I edited this last comment to make the following correction:
ASCII changed to 8-bit

jmlehton · 2024-11-24T14:36:29Z

I would suggest:

\ddd Byte code ddd (3 octal digits)

I feel that the word "character" is somewhat problematic here. When octal coding is used for a UTF-16BE encoded string, then an octal code \ddd does not map to any character, but a byte of a multiple-byte character, since UTF-16BE has only 2- and 4-byte characters.
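The point can be sketched in the other direction with a few lines of Python. The helper name `to_octal_literal` is made up for this illustration, and escaping every byte (rather than only the non-printable ones) is a choice made here for clarity, not something the spec requires.

```python
import codecs

def to_octal_literal(text: str) -> str:
    """Write text as a PDF literal string: UTF-16BE with BOM, every byte as \\ddd."""
    data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return "(" + "".join(f"\\{b:03o}" for b in data) + ")"

# One character in the Basic Multilingual Plane needs two bytes, so two escapes.
omega = to_octal_literal("\u03a9")

# Number of bytes (and therefore \ddd escapes) needed per character.
bmp_bytes = len("\u03a9".encode("utf-16-be"))
# A character outside the BMP is a surrogate pair: four bytes, four escapes.
astral_bytes = len("\U0001F604".encode("utf-16-be"))
```

In `omega`, the first two escapes are the byte order marker and the last two carry the single character, so no individual `\ddd` stands for a character on its own.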

petervwyatt · 2024-11-25T00:06:59Z

Thanks. Table 3 proposed fix is quite simple - it's a byte:

\ddd Byte with value ddd in octal

I would NOT add "(3 octal digits)" as that is incorrect - it can be 1-3 octal digits as per normative text further down.
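The variable digit count can be sketched with a small hand-written scanner in Python. This is an illustration under stated assumptions, not a full literal-string lexer: `decode_octal_escapes` is a hypothetical name, only `\ddd` escapes are handled, and values above 377 octal are wrapped to 8 bits as an assumption of this sketch.

```python
OCTAL_DIGITS = b"01234567"
BACKSLASH = 0x5C

def decode_octal_escapes(raw: bytes) -> bytes:
    """Decode \\d, \\dd and \\ddd escapes; all other input is copied through."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] == BACKSLASH and i + 1 < len(raw) and raw[i + 1] in OCTAL_DIGITS:
            # Consume at most three octal digits after the backslash.
            j = i + 1
            while j < len(raw) and j < i + 4 and raw[j] in OCTAL_DIGITS:
                j += 1
            # Assumption of this sketch: wrap out-of-range values to 8 bits.
            out.append(int(raw[i + 1:j], 8) & 0xFF)
            i = j
        else:
            out.append(raw[i])
            i += 1
    return bytes(out)

one_digit = decode_octal_escapes(rb"\5")
two_digits = decode_octal_escapes(rb"\05")
three_digits = decode_octal_escapes(rb"\005")
# A fourth digit is not part of the escape: byte 5 followed by the digit "3".
followed_by_digit = decode_octal_escapes(rb"\0053")
plus_sign = decode_octal_escapes(rb"\53")
```

One, two and three digits all produce a single byte, which is why "(3 octal digits)" would be too narrow, and why three digits are needed when the next character in the string is itself a digit.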

car222222 · 2024-11-25T04:05:06Z

OK with me.

jmlehton · 2024-11-25T06:58:34Z

> Thanks. Table 3 proposed fix is quite simple - it's a byte:
>
> \ddd Byte with value ddd in octal
>
> I would NOT add "(3 octal digits)" as that is incorrect - it can be 1-3 octal digits as per normative text further down.

This is great. Thanks. And yes, "(3 octal digits)" would be incorrect.

jmlehton added the "bug" (Something isn't correct) label Nov 22, 2024

jmlehton mentioned this issue Nov 22, 2024: "PDF-hul: various issues with parsing PDFs" openpreserve/jhove#927 (Open)

petervwyatt added the "documentation" (Improvements or additions to documentation) label and removed the "bug" (Something isn't correct) label Nov 22, 2024

petervwyatt self-assigned this Nov 23, 2024

petervwyatt added the "proposed solution" (Proposed solution is ready for review) label Nov 23, 2024
