Writing a string with NUL characters to a socket results in 0xc0 0x80 utf-8 sequences #1271
Comments
Yes, this is known. No plans to fix it in this server. The caller can use TextEncoder to convert the string to binary UTF-8 and write that instead. FWIW – the ECMA-419 HTTP client and server only operate on binary data (not strings). That's deliberate so that these text conversion issues remain external.
Noting this in the docs would have saved me a bunch of time and aggravation...
Thanks for pointing this out again,
(Added a note to the docs)
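For readers who land here looking for the workaround, the TextEncoder suggestion above might look roughly like the following. This is a minimal sketch, not code from the SDK: `bodyToBytes` is a hypothetical helper name, and the sketch assumes `TextEncoder` is available as a global (as in browsers and Node.js). On the Moddable SDK it may need to be imported from a module first.

```javascript
// Assumption: TextEncoder is a global here. On other runtimes it may have to be imported.
// bodyToBytes is a hypothetical helper name used only for this illustration.
function bodyToBytes(body) {
	// encode() emits standard UTF-8, so U+0000 becomes a single 0x00 byte
	return new TextEncoder().encode(body);
}

const bytes = bodyToBytes("a\0b");
// bytes is a Uint8Array holding 0x61, 0x00, 0x62
```

Writing the resulting bytes instead of the string keeps the text conversion in the caller's hands, which is the design choice described for the ECMA-419 HTTP client and server.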
FYI: the issue with NUL characters seems to also affect ArrayBuffer.fromString. Some uses, like in the ECMA-419 MQTT client, look vulnerable. I assume this is known and won't-fix.
The default XS string encoding is almost UTF-8. The exception is NULLs which use the CESU-8 encoding. Here's a modified version of fx_ArrayBuffer_fromString:
void fx_ArrayBuffer_fromString(txMachine* the)
{
	txSize length;
	if (mxArgc < 1)
		mxTypeError("no argument");
	length = mxStringLength(fxToString(the, mxArgv(0)));
	txString c = mxArgv(0)->value.string, end = c + length - 1;
	txInteger nulls = 0;
	while (c < end) {
		if ((0xc0 == (uint8_t)c[0]) && (0x80 == (uint8_t)c[1]))
			nulls += 1;
		c++;
	}
	fxConstructArrayBufferResult(the, mxThis, length - nulls);
	if (!nulls)
		c_memcpy(mxResult->value.reference->next->value.arrayBuffer.address, mxArgv(0)->value.string, length);
	else {
		txString c = mxArgv(0)->value.string, end = c + length;
		txByte *out = mxResult->value.reference->next->value.arrayBuffer.address;
		while (c < end) {
			if ((0xc0 == (uint8_t)c[0]) && (0x80 == (uint8_t)c[1])) {
				*out++ = 0;
				c += 2;
			}
			else
				*out++ = *c++;
		}
	}
}
Here's the corresponding change to fx_String_fromArrayBuffer:
void fx_String_fromArrayBuffer(txMachine* the)
{
	txSlot* slot;
	txSlot* arrayBuffer = C_NULL, *sharedArrayBuffer = C_NULL;
	txSlot* bufferInfo;
	txInteger limit, offset;
	txInteger inLength, outLength = 0, nulls = 0;
	unsigned char *in;
	txString string;
	if (mxArgc < 1)
		mxTypeError("no argument");
	slot = mxArgv(0);
	if (slot->kind == XS_REFERENCE_KIND) {
		slot = slot->value.reference->next;
		if (slot) {
			bufferInfo = slot->next;
			if (slot->kind == XS_ARRAY_BUFFER_KIND)
				arrayBuffer = slot;
			else if (slot->kind == XS_HOST_KIND) {
				if (!(slot->flag & XS_HOST_CHUNK_FLAG) && bufferInfo && (bufferInfo->kind == XS_BUFFER_INFO_KIND))
					sharedArrayBuffer = slot;
			}
		}
	}
	if (!arrayBuffer && !sharedArrayBuffer)
		mxTypeError("argument is no ArrayBuffer instance");
	limit = bufferInfo->value.bufferInfo.length;
	offset = fxArgToByteLength(the, 1, 0);
	if (limit < offset)
		mxRangeError("out of range byteOffset %ld", offset);
	inLength = fxArgToByteLength(the, 2, limit - offset);
	if ((limit < (offset + inLength)) || ((offset + inLength) < offset))
		mxRangeError("out of range byteLength %ld", inLength);
	in = offset + (unsigned char *)(arrayBuffer ? arrayBuffer->value.arrayBuffer.address : sharedArrayBuffer->value.host.data);
	while (inLength > 0) {
		unsigned char first = c_read8(in++), clen;
		if (first < 0x80){
			if (0 == first)
				nulls += 1;
			inLength -= 1;
			outLength += 1;
			continue;
		}
		if (0xC0 == (first & 0xE0))
			clen = 2;
		else if (0xE0 == (first & 0xF0))
			clen = 3;
		else if (0xF0 == (first & 0xF0))
			clen = 4;
		else
			goto badUTF8;
		inLength -= clen;
		if (inLength < 0)
			goto badUTF8;
		outLength += clen;
		clen -= 1;
		do {
			if (0x80 != (0xc0 & c_read8(in++)))
				goto badUTF8;
		} while (--clen > 0);
	}
	string = fxNewChunk(the, outLength + nulls + 1);
	if (!nulls)
		c_memcpy(string, offset + (txString)(arrayBuffer ? arrayBuffer->value.arrayBuffer.address : sharedArrayBuffer->value.host.data), outLength);
	else {
		txString c = string, end = c + outLength + nulls;
		txString buf = offset + (txString)(arrayBuffer ? arrayBuffer->value.arrayBuffer.address : sharedArrayBuffer->value.host.data);
		while (c < end) {
			txByte b = *buf++;
			if (b)
				*c++ = b;
			else {
				*c++ = 0xC0;
				*c++ = 0x80;
			}
		}
	}
	string[outLength + nulls] = 0;
	mxResult->value.string = string;
	mxResult->kind = XS_STRING_KIND;
	return;
badUTF8:
	mxTypeError("invalid UTF-8");
}
Let me know if this works for you. If it does, I'll merge it.
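For readers who don't work in the XS sources, the two-pass idea in the C functions above (count first, then copy while rewriting) can be sketched on plain byte arrays. This is an illustration only: `collapseNuls` and `expandNuls` are hypothetical helper names, not part of XS or the Moddable SDK, and the sketch skips the UTF-8 validation that fx_String_fromArrayBuffer performs.

```javascript
// Hypothetical helpers for illustration; not part of XS or the Moddable SDK.

// Internal form -> standard UTF-8: each 0xC0 0x80 pair becomes one 0x00 byte.
function collapseNuls(input) {
	// Pass 1: count the pairs so the output can be allocated at its exact size.
	let nulls = 0;
	for (let i = 0; i + 1 < input.length; i++) {
		if (input[i] === 0xc0 && input[i + 1] === 0x80)
			nulls += 1;
	}
	if (!nulls)
		return input.slice();	// nothing to rewrite, plain copy

	// Pass 2: copy, replacing each pair with a single zero byte.
	const out = new Uint8Array(input.length - nulls);
	let o = 0;
	for (let i = 0; i < input.length; i++) {
		if (input[i] === 0xc0 && input[i + 1] === 0x80) {
			out[o++] = 0;
			i += 1;	// skip the 0x80 half of the pair
		}
		else
			out[o++] = input[i];
	}
	return out;
}

// Standard UTF-8 -> internal form: each 0x00 byte becomes 0xC0 0x80.
function expandNuls(input) {
	let nulls = 0;
	for (const b of input) {
		if (b === 0)
			nulls += 1;
	}
	const out = new Uint8Array(input.length + nulls);
	let o = 0;
	for (const b of input) {
		if (b)
			out[o++] = b;
		else {
			out[o++] = 0xc0;
			out[o++] = 0x80;
		}
	}
	return out;
}

const internal = Uint8Array.of(0x61, 0xc0, 0x80, 0x62);	// "a\0b" in the internal form
const wire = collapseNuls(internal);	// 0x61, 0x00, 0x62
const back = expandNuls(wire);	// 0x61, 0xc0, 0x80, 0x62
```

Counting before copying costs an extra pass over the data, but it lets the result be allocated once at the right size.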
CESU-8: learn something new every day...
Great. How'd it go?
My test for Array.fromString looks good, ~the impact on larger buffers (1KB) is
Edit:
Thanks for the tests and benchmarks. Given the need for an extra pass over the data, it is going to take more time. Since the actual work is trivial, memory bandwidth is the limiting factor. Since this isn't generally performance critical, I think it is an OK tradeoff and keeps the code small. I just noticed that fx_ArrayBuffer_fromString 2
void fx_ArrayBuffer_fromString(txMachine* the)
{
	txSize length = 0;
	if (mxArgc < 1)
		mxTypeError("no argument");
	txString c = mxArgv(0)->value.string;
	txInteger nulls = 0;
	while (true) {
		uint8_t b = (uint8_t)c_read8(c++);
		if (!b) break;
		length += 1;
		if ((0xc0 == b) && (0x80 == (uint8_t)c_read8(c)))
			nulls += 1;
	}
	fxConstructArrayBufferResult(the, mxThis, length - nulls);
	if (!nulls)
		c_memcpy(mxResult->value.reference->next->value.arrayBuffer.address, mxArgv(0)->value.string, length);
	else {
		txString c = mxArgv(0)->value.string, end = c + length;
		txByte *out = mxResult->value.reference->next->value.arrayBuffer.address;
		while (c < end) {
			uint8_t b = (uint8_t)c_read8(c++);
			if ((0xc0 == (uint8_t)b) && (0x80 == (uint8_t)c_read8(c))) {
				*out++ = 0;
				c += 1;
			}
			else
				*out++ = b;
		}
	}
}
How's that look? Separately, I would expect
For ArrayBuffer.fromString
For String.fromArrayBuffer
Thanks for rechecking. The fromString optimization seems to be working nicely. I'll integrate the changes.
Moddable SDK version: 4.3
Target device: esp32
Description
In an HTTP prepareResponse handler I'm returning a string body that happens to contain NUL characters ('\0'). On the client side these come across as 0xc0 0x80 byte pairs and a byte is missing at the end.
Steps to Reproduce
Notice the 0xc0 (0300) and 0x80 (0200) instead of the NUL and the missing 'e' at the end.
Update: any response string with a non-ASCII Unicode character will trigger the incorrect content-length issue. For example,
return { status: 200, body: "Here is a [©] byte" }
results in
Note the missing 'e' at the end.
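One plausible reading of the missing final byte is that the length sent for the body was counted in characters while the body itself went out as encoded bytes. The thread does not show the server code, so that is an assumption. The sketch below only demonstrates the size mismatch for the example string, assuming `TextEncoder` is available as a global:

```javascript
// Assumption: TextEncoder is a global here. On other runtimes it may have to be imported.
const body = "Here is a [©] byte";

// Number of UTF-16 code units in the string
const characters = body.length;	// 18

// Number of bytes once encoded as UTF-8 ("©" is U+00A9, which takes two bytes)
const byteLength = new TextEncoder().encode(body).length;	// 19

// A receiver told to expect `characters` bytes would stop one byte early
const shortfall = byteLength - characters;	// 1
```

A shortfall of one byte is consistent with the single 'e' missing from the end of the response in the report.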