From beecec90f33f266372178254ac3b4e75adb4de26 Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Wed, 21 Oct 2020 12:18:12 +0200
Subject: [PATCH 1/2] Add get an encoder and encode or fail for URLs

Fixes #235.
---
 encoding.bs | 67 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 53 insertions(+), 14 deletions(-)

diff --git a/encoding.bs b/encoding.bs
index 969476a..ac5b87c 100644
--- a/encoding.bs
+++ b/encoding.bs
@@ -1045,12 +1045,17 @@ optional I/O queue of bytes output (default « »), return the result

Legacy hooks for standards

-Standards are strongly discouraged from using decode, encode, and
-BOM sniff, except as needed for compatibility. Standards needing these legacy hooks will most
-likely also need to use get an encoding (to turn a label into an
-encoding) and get an output encoding (to turn an encoding into
-another encoding that is suitable to pass into encode). Other algorithms are not
-to be used directly.
+ Standards are strongly discouraged from using decode, BOM sniff, and
+ encode, except as needed for compatibility. Standards needing these legacy hooks will
+ most likely also need to use get an encoding (to turn a label into an
+ encoding) and get an output encoding (to turn an encoding into
+ another encoding that is suitable to pass into encode).
+
+ For an extremely niche case custom encoder error handling is needed. The get an encoder
+ and encode or fail algorithms are to be used for that. Other algorithms are not to be used
+ directly.

 To decode an I/O queue of bytes ioQueue given a fallback encoding
 encoding and an optional I/O queue of scalar values output (default « »), run
@@ -1111,19 +1116,52 @@ corresponding to the byte order mark found, or null otherwise.
 steps:

-1. Assert: encoding is not replacement or UTF-16BE/LE.
+1. Let encoder be the result of getting an encoder from encoding.

-2. Run encoding's encoder with ioQueue,
- output, and "html".
+2. Run encoder with ioQueue, output, and
+ "html".

 3. Return output.

-This is mostly a legacy hook for URLs and HTML forms. Layering
-UTF-8 encode on top is safe as it never triggers
-errors.
-[[URL]]
-[[HTML]]
+This is a legacy hook for HTML forms. Layering UTF-8 encode on top
+is safe as it never triggers errors. [[HTML]]
+
+
+
+To get an encoder from an
+encoding encoding:
+
+1. Assert: encoding is not replacement or UTF-16BE/LE.
+
+2. Return encoding's encoder.
+
+To encode or fail an I/O queue of scalar values ioQueue given an
+encoder encoder and an I/O queue of bytes output, run these
+steps:
+
+1. Let potentialError be the result of running encoder with
+ ioQueue, output, and "fatal".
+
+2. If potentialError is an error, then return error's
+ code point's value.
+
+3. Return null.
+
+ This is a legacy hook for URLs. The caller will have to keep an encoder alive as
+ the ISO-2022-JP encoder can be in two different states when returning an error. That
+ also means that if the caller emits bytes to encode the error in some way, these have to be in the
+ range 0x00 to 0x7F, inclusive, excluding 0x0E, 0x0F, 0x1B, 0x5C, and 0x7E. [[URL]]
+
+ The return value is either the number representing the code point that could not be
+ encoded or null, if there was no error. When it returns non-null the caller will have to
+ invoke it again, supplying the same encoder and a new output I/O queue.
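
The calling convention described in the note above (keep one encoder alive, call again with a fresh output after each error) can be sketched as follows. This is only an illustration under stated assumptions: Python's `codecs` incremental encoders stand in for the standard's encoders and do not match them byte for byte, the function names are hypothetical, and successfully encoded bytes are passed through unchanged rather than percent-encoded. The `%26%23` + decimal digits + `%3B` replacement follows what the URL Standard emits for an unencodable code point.

```python
import codecs
from collections import deque

# Hypothetical stand-in for "get an encoder": one stateful encoder per caller.
def get_an_encoder(label):
    normalized = label.lower().replace("_", "-")
    # Mirrors the assert in the algorithm (labels here are Python codec names).
    assert normalized not in {"replacement", "utf-16", "utf-16-be", "utf-16-le"}
    return codecs.getincrementalencoder(label)(errors="strict")

# Hypothetical stand-in for "encode or fail": returns the code point that
# could not be encoded, or None when io_queue was fully consumed.
def encode_or_fail(io_queue, encoder, output):
    while io_queue:
        ch = io_queue.popleft()
        try:
            output.extend(encoder.encode(ch))
        except UnicodeEncodeError:
            return ord(ch)
    return None

def encode_with_numeric_fallback(text, label):
    encoder = get_an_encoder(label)      # kept alive across all calls
    io_queue = deque(text)
    result = bytearray()
    while True:
        output = bytearray()             # a new output for every invocation
        code_point = encode_or_fail(io_queue, encoder, output)
        result += output
        if code_point is None:
            return bytes(result)
        # Only bytes in the safe range are emitted for the error:
        # '%', ASCII digits, and 'B'.
        result += b"%26%23" + str(code_point).encode("ascii") + b"%3B"

print(encode_with_numeric_fallback("a\u2603b", "cp1252"))
```

The bytes written for the error avoid 0x0E, 0x0F, 0x1B, 0x5C, and 0x7E, which is what makes them safe to emit regardless of the state the encoder was in when it failed.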

@@ -3399,6 +3437,7 @@ Glenn Maynard,
 Gordon P. Hemsley,
 Henri Sivonen,
 Ian Hickson,
+J. King,
 James Graham,
 Jeffrey Yasskin,
 John Tamplin,

From 4632a055a47f9a77eed7e1467532ab3eb972f73d Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Fri, 23 Oct 2020 13:52:51 +0200
Subject: [PATCH 2/2] address some of the review feedback

---
 encoding.bs | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/encoding.bs b/encoding.bs
index ac5b87c..7afd377 100644
--- a/encoding.bs
+++ b/encoding.bs
@@ -1052,9 +1052,9 @@ optional I/O queue of bytes output (default « »), return the result
 encoding) and get an output encoding (to turn an encoding into
 another encoding that is suitable to pass into encode).

- For an extremely niche case custom encoder error handling is needed. The get an encoder
- and encode or fail algorithms are to be used for that. Other algorithms are not to be used
- directly.
+ For the extremely niche case of URL percent-encoding, custom encoder error handling is needed.
+ The get an encoder and encode or fail algorithms are to be used for that. Other
+ algorithms are not to be used directly.

 To decode an I/O queue of bytes ioQueue given a fallback encoding
@@ -1146,17 +1146,28 @@ steps:
 1. Let potentialError be the result of running encoder with
 ioQueue, output, and "fatal".

+2. Push end-of-queue to output.
+
 3. If potentialError is an error, then return error's
 code point's value.

 4. Return null.

- This is a legacy hook for URLs. The caller will have to keep an encoder alive as
- the ISO-2022-JP encoder can be in two different states when returning an error. That
- also means that if the caller emits bytes to encode the error in some way, these have to be in the
- range 0x00 to 0x7F, inclusive, excluding 0x0E, 0x0F, 0x1B, 0x5C, and 0x7E. [[URL]]
+ This is a legacy hook for URL percent-encoding. The caller will have to keep an
+ encoder alive as the ISO-2022-JP encoder can be in two different states when
+ returning an error. That also means that if the caller emits bytes to encode the error in
+ some way, these have to be in the range 0x00 to 0x7F, inclusive, excluding 0x0E, 0x0F, 0x1B, 0x5C,
+ and 0x7E. [[URL]]
+
+ In particular, if upon returning an error the ISO-2022-JP encoder is in the
+ Roman state, the caller cannot output 0x5C (\) as it will not
+ decode as U+005C (\). For this reason, applications using encode or fail for unintended
+ purposes ought to take care to prevent the use of the ISO-2022-JP encoder in combination
+ with replacement schemes, such as those of JavaScript and CSS, that use U+005C (\) as part of the
+ replacement syntax (e.g., \u2603) or make sure to pass the replacement syntax through
+ the encoder (in contrast to URL percent-encoding).
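
The 0x5C hazard described above can be observed with a small experiment. This is a hedged illustration only: it assumes Python's `iso2022_jp` codec, which is not the Encoding Standard's ISO-2022-JP decoder but, to my knowledge, shares the JIS X 0201 Roman mapping of 0x5C to U+00A5. The variable names are hypothetical.

```python
# Escape sequences that switch the ISO-2022-JP byte stream between states.
TO_ROMAN = b"\x1b(J"   # JIS X 0201 Roman
TO_ASCII = b"\x1b(B"   # ASCII

# A caller that writes the raw bytes of "\u2603" directly into the output.
replacement = b"\\u2603"

in_roman = (TO_ROMAN + replacement).decode("iso2022_jp")
in_ascii = (TO_ASCII + replacement).decode("iso2022_jp")

print(repr(in_roman))  # the backslash byte is read back as the yen sign
print(repr(in_ascii))  # the backslash survives
```

In the Roman state the escape syntax is silently corrupted, which is why such replacement schemes need to be passed through the encoder (so it can switch back to ASCII first) rather than appended as raw bytes.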

    The return value is either the number representing the code point that could not be encoded or null, if there was no error. When it returns non-null the caller will have to