-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove streaming API, fill in spec text, etc #27
Conversation
1. NOTE: When _strict_ is *false*, the algorithm below is equivalent to the <a href="https://infra.spec.whatwg.org/#forgiving-base64-decode">forgiving-base64 decode</a> operation in HTML. | ||
1. If _strict_ is *false*, then | ||
1. Remove all occurrences of U+0009 (TAB), U+000A (LF), U+000C (FF), U+000D (CR), and U+0020 (SPACE) from _input_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: The WHATWG algorithm (and thus atob
) does not consider U+000B VERTICAL TABULATION to be whitespace, but ECMA-262 in general does, as does e.g. base64 --decode --ignore-garbage
(which isn't saying much though, because --ignore-garbage
discards any non-alphabet character and only U+000A LINE FEED is ignored without it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
262 has a much bigger set of whitespace. I'd prefer to match WHATWG unless there's a specific reason to deviate.
spec.html
Outdated
1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_ twice. | ||
1. Else if _lastChunkSize_ is 3, then | ||
1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_. | ||
1. Let _bytes_ be the unique (possibly empty) sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _input_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revisiting this step, solving a puzzle seems a bit too opaque for what is fundamentally a simple algorithm. Maybe it would be better to specify things more directly?
1. Let _inputLength_ be the length of _input_.
1. If _inputLength_ modulo 4 is 0, then
1. NOTE: If _input_ is valid, it might end with one or two padding characters that must be removed before comparing its contents against the standard base64 alphabet.
1. If _input_ is not empty and the last element of _input_ is U+003D (EQUALS SIGN), remove the last element of _input_.
1. If _input_ is not empty and the last element of _input_ is U+003D (EQUALS SIGN), remove the last element of _input_.
1. Set _inputLength_ to the length of _input.
1. Else,
1. If _strict_ is *true*, throw a *SyntaxError* exception.
1. If _input_ contains any elements which are not also elements of the standard base64 alphabet, throw a *SyntaxError* exception.
1. Let _lastChunkSize_ be _inputLength_ modulo 4.
1. If _lastChunkSize_ is 1, then
1. Throw a *SyntaxError* exception.
1. Else if _lastChunkSize_ is 2, then
1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_ twice.
1. Else if _lastChunkSize_ is 3, then
1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_.
1. Let _paddedInputLength_ be the length of _input_.
1. Let _bytes_ be an empty List of octets.
1. Let _buffer_ be 0.
1. Let _i_ be 0.
1. Repeat, while _i_ < _paddedInputLength_,
1. Set _buffer_ to (_buffer_ × 64) + StringIndexOf(CodePointsToString(standard base64 alphabet), _input_[_i_], 0).
1. If _i_ modulo 4 is 3, then
1. Append floor(_buffer_ / 256<sup>2</sup>) to _bytes_.
1. Append floor((_buffer_ modulo 256<sup>2</sup>) / 256) to _bytes_.
1. Append (_buffer_ modulo 256) to _bytes_.
1. Set _buffer_ to 0.
1. Set _i_ to _i_ + 1.
1. Let _byteLength_ be the length of _bytes_.
1. If _lastChunkSize_ is 2, then
1. If _strict_ is *true* and _bytes_[_byteLength_ - 2] is not 0, throw a *SyntaxError* exception.
1. Set _byteLength_ to _byteLength_ - 2.
1. Else if _lastChunkSize_ is 3, then
1. If _strict_ is *true* and _bytes_[_byteLength_ - 1] is not 0, throw a *SyntaxError* exception.
1. Set _byteLength_ to _byteLength_ - 1.
1. Let _result_ be ? AllocateTypedArray(*"Uint8Array"*, %Uint8Array%, %Uint8Array.prototype%, _byteLength_).
1. Set the value at each index of _result_.[[ViewedArrayBuffer]].[[ArrayBufferData]] to the value at the corresponding index of _bytes_.
1. Return _result_.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems much more opaque to me. What we mean is just do bog-standard base64 decoding. We shouldn't make implementations figure that that we've written down the standard base64 decoding algorithm, which they'll then ignore and use their own existing base64 decoding logic; we should just tell them to do base64 decoding.
The only reason it's phrased somewhat awkwardly is because the RFC doesn't actually specify a decoding algorithm, only encoding. But despite that awkwardness the current text seems much better to me than an explicit algorithm. The world does not need the base64 decoding algorithm written out yet again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The world does not need the base64 decoding algorithm written out yet again.
Perhaps not, but "the <value> such that applying <encoding algorithm> would produce input" is hardly a good reference to any prior example. And from another perspective, ECMA-262 explicitly specifies all kinds of common specialized algorithms, including near-trivial ones like UTF16EncodeCodePoint and longer ones like Encode, and also a number of non-specialized algorithms such as StringIndexOf and StringPad—so what makes base64 special?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps not, but "the such that applying <encoding algorithm> would produce input" is hardly a good reference to any prior example.
... Isn't it? I mean, I'm open to other wordings for the thing I'm getting at but I don't really think there's much possibility for confusion with that wording. I am not going to write down the base64 algorithm again.
so what makes base64 special?
None of those algorithms are widely implemented under a common name, specified elsewhere in a document easy to incorporate by reference, and unambiguous. This one is.
(I guess UTF16EncodeCodePoint might meet all those criteria, but that one is small enough that there's very little you need to read to figure out that it's doing the obvious thing.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this one be made into a link?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of those algorithms are widely implemented under a common name, specified elsewhere in a document easy to incorporate by reference, and unambiguous. This one is.
Actually, many of them are. For example, UTF16EncodeCodePoint is RFC 2781 section 2.1 and encodeURIComponent is the WHATWG URL Standard UTF-8 percent-encode … using the component percent-encode set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be OK swapping out our encodeURIComponent with a reference to WHATWG. But I would not exactly describe it as widely implemented, at least not nearly as much as base64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceded. I still really dislike algorithm steps that are basically "solve this puzzle" or "find a preimage", but base64 decoding is in fact well understood so in practice implementations will "figure that that we've written down obliquely referenced the standard base64 decoding algorithm [implied by RFC 4648 but not written therein], which they'll then ignore and use their own existing base64 decoding logic". To that end, one final suggestion:
1. Let _bytes_ be the unique (possibly empty) sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _input_. | |
1. Let _bytes_ be the unique (possibly empty) sequence of bytes resulting from decoding _input_ as base64 (such that applying the base64 encoding specified in section 4 of RFC 4648 to _bytes_ would produce _input_). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, done.
Also made the references to RFC 4648 be links per @ljharb.
1. If _characters_ cannot result from applying the base64 encoding specified in section 4 of RFC 4648 to some sequence of bytes, throw a *SyntaxError* exception. | ||
1. Let _bytes_ be the unique sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _characters_. | ||
1. Let _strict_ be ToBoolean(? Get(_opts_, *"strict"*)). | ||
1. NOTE: The order of validation and decoding in the algorithm below is not observable. Implementations are encouraged to perform them in whatever order is most efficient, possibly interleaving validation and stripping whitespace with decoding, as long as the behaviour is observably equivalent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1. NOTE: The order of validation and decoding in the algorithm below is not observable. Implementations are encouraged to perform them in whatever order is most efficient, possibly interleaving validation and stripping whitespace with decoding, as long as the behaviour is observably equivalent. | |
1. NOTE: The order of validation and decoding in the algorithm below is not observable except that both must fully complete before invocation of AllocateTypedArray. Implementations are encouraged to perform them in whatever order is most efficient, possibly interleaving validation and stripping whitespace with decoding, as long as the behaviour is observably equivalent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ehhh, I don't think that's necessary. This is a non-normative note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the impracticality of covering this with tests, I consider it important to emphasize somewhere in the spec text that it actually is required to fully validate arbitrarily-long input before doing something that may produce a different kind of observable error (e.g., passing to Uint8Array.fromBase64
input that is a very long string of valid characters followed by an invalid character [such as from "A".repeat(1e9) + "~"
] must throw a SyntaxError rather than a RangeError).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handling for excessively-long inputs is not generally well-specified at all. Many other algorithms implicitly ask implementations to allocate memory during execution of the algorithm, which is always a fallible operation, and do not specify what happens when that fails, even though in practice it implementations generally throw a RangeError. I don't want to imply there's more restrictions on this algorithm than any other. So I don't want to write down they can't throw a RangeError in the middle of this algorithm; they can, in practice, throw a RangeError in the middle of basically any algorithm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, it does seem common for observable behavior of implementations to correspond with validity checks preempting allocation failures (see below), and I would like the spec to be clear about that everywhere. But point taken regarding this algorithm not being distinguished in that dimension.
$ eshost -se 'encodeURIComponent("€".repeat(4e8)+"\uDBFF").length'
#### ChakraCore
URIError: The URI to be encoded contains an invalid character
#### JavaScriptCore
#### Moddable XS
Error: memory full
#### SpiderMonkey
URIError: malformed URI sequence
#### V8
URIError: URI malformed
(the JavaScriptCore blank seems a silent failure similar to that of e.g. encodeURIComponent("€".repeat(4e8)).length
—more like XS than like SpiderMonkey and V8)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, definitely agreed that it would be nice to be clear everywhere. I just don't want to do it in this place and not others, when there's nothing really special about this one. Rectifying memory handling is a larger project.
I'm also not totally sure that requiring validity checks up front is what we'd actually want - the case of excessively long inputs is quite rare, and requiring an extra pass prior to decoding just so the right error can get thrown in those cases seems like it would be a shame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not totally sure that requiring validity checks up front is what we'd actually want - the case of excessively long inputs is quite rare, and requiring an extra pass prior to decoding just so the right error can get thrown in those cases seems like it would be a shame.
Sure. And that behavior doesn't require an extra pass, but it does require at least one pass (e.g., an implementation could check the remainder of input for validity before deciding to expose allocation failure as an error). Regardless, this discussion at least is settled as non-specific to base64.
Since it's been a week I'm going to go ahead and merge this so the proposal is in a more coherent state, but we can continue discussion here or in a new issue. |
This removes the streaming API per #13 (comment).
I've also filled out the details of the spec text; documented base64 variations in the RFCs, in other languages, and in existing base64 libraries; settled all the outstanding questions about the API; and updated the polyfill and added tests for it.
cc @phoddie