build: encode non-ASCII Latin1 characters as one byte in JS2C
Previously we had two encodings for JS files:

1. If a file contains only ASCII characters, encode it as a one-byte string (interpreted as a uint8_t array during loading).
2. If a file contains any character with a code point above 127, encode it as a two-byte string (interpreted as a uint16_t array during loading).

This was done because V8 only supports Latin-1 and UTF-16 as the underlying representations for strings. To store the JS code as external strings, which saves encoding cost and memory overhead, we need to follow the representations supported by V8. Notice that there is a gap in the Latin-1 range (128-255) that we encoded as two-byte, which was an undocumented TODO for a long time. That was fine previously because the files that contained code points beyond the 0-127 range also contained code points >255. Now we have undici, which contains code points in the range 0-255 (minus a replaceable code point >255). So this patch adds handling for the 128-255 range to reduce the size overhead caused by encoding them as two-byte. This could reduce the size of the binary by ~500KB and helps future files with this kind of code point.

Drive-by: replace `’` with `'` in undici.js to make it a Latin-1 only string. That could be removed if undici updates itself to replace this character in the comment.

PR-URL: #51605
Reviewed-By: Daniel Lemire <daniel@lemire.me>
Reviewed-By: Ethan Arrowood <ethan@arrowood.dev>
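The three-way split described in the message (ASCII, Latin-1, two-byte) can be sketched as below. This is a minimal illustration, not the js2c code from this commit: the names `SourceEncoding`, `DecodeUtf8`, `Classify` and `ToLatin1` are hypothetical, the input is assumed to be UTF-8, and the decoder does not reject overlong forms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical names throughout; a sketch of the idea, not the js2c source.
enum class SourceEncoding { kAscii, kLatin1, kTwoByte };

// Decodes UTF-8 into code points. Returns false on truncated input or on a
// malformed lead/continuation byte. Overlong encodings are not checked.
bool DecodeUtf8(const std::string& in, std::vector<uint32_t>* out) {
  size_t i = 0;
  while (i < in.size()) {
    const uint8_t b0 = static_cast<uint8_t>(in[i]);
    uint32_t cp = 0;
    size_t extra = 0;  // number of continuation bytes that follow b0
    if (b0 < 0x80) {
      cp = b0;
    } else if ((b0 & 0xE0) == 0xC0) {
      cp = b0 & 0x1F;
      extra = 1;
    } else if ((b0 & 0xF0) == 0xE0) {
      cp = b0 & 0x0F;
      extra = 2;
    } else if ((b0 & 0xF8) == 0xF0) {
      cp = b0 & 0x07;
      extra = 3;
    } else {
      return false;
    }
    if (i + extra >= in.size()) return false;  // sequence is truncated
    for (size_t k = 1; k <= extra; ++k) {
      const uint8_t b = static_cast<uint8_t>(in[i + k]);
      if ((b & 0xC0) != 0x80) return false;
      cp = (cp << 6) | (b & 0x3F);
    }
    out->push_back(cp);
    i += extra + 1;
  }
  return true;
}

// One byte per character is enough as long as every code point is <= 255.
SourceEncoding Classify(const std::vector<uint32_t>& cps) {
  SourceEncoding result = SourceEncoding::kAscii;
  for (uint32_t cp : cps) {
    if (cp > 0xFF) return SourceEncoding::kTwoByte;
    if (cp > 0x7F) result = SourceEncoding::kLatin1;
  }
  return result;
}

// Narrows code points to Latin-1 bytes; only valid when Classify() did not
// return kTwoByte.
std::vector<uint8_t> ToLatin1(const std::vector<uint32_t>& cps) {
  std::vector<uint8_t> out;
  out.reserve(cps.size());
  for (uint32_t cp : cps) {
    assert(cp <= 0xFF);
    out.push_back(static_cast<uint8_t>(cp));
  }
  return out;
}
```

Under these assumptions, a file containing `é` (U+00E9) classifies as `kLatin1` and stays one byte per character, while a file containing `’` (U+2019) classifies as `kTwoByte`, which is why the drive-by replacement in undici.js matters.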
1 parent c33f860, commit d6e702f
Showing 5 changed files with 212 additions and 63 deletions.
@@ -0,0 +1,33 @@
#include "embedded_data.h"
#include <vector>

namespace node {
std::string ToOctalString(const uint8_t ch) {
  // We can print most printable characters directly. The exceptions are '\'
  // (escape characters), " (would end the string), and ? (trigraphs). The
  // latter may be overly conservative: we compile with C++17 which doesn't
  // support trigraphs.
  if (ch >= ' ' && ch <= '~' && ch != '\\' && ch != '"' && ch != '?') {
    return std::string(1, static_cast<char>(ch));
  }
  // All other characters are blindly output as octal.
  const char c0 = '0' + ((ch >> 6) & 7);
  const char c1 = '0' + ((ch >> 3) & 7);
  const char c2 = '0' + (ch & 7);
  return std::string("\\") + c0 + c1 + c2;
}

std::vector<std::string> GetOctalTable() {
  size_t size = 1 << 8;
  std::vector<std::string> code_table(size);
  for (size_t i = 0; i < size; ++i) {
    code_table[i] = ToOctalString(static_cast<uint8_t>(i));
  }
  return code_table;
}

const std::string& GetOctalCode(uint8_t index) {
  static std::vector<std::string> table = GetOctalTable();
  return table[index];
}
}  // namespace node
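To illustrate how a per-byte escape table like the one above might be consumed, here is a self-contained sketch that turns raw bytes into a quoted C++ string literal. `EscapeByte` mirrors the logic of `ToOctalString`; `LookupEscape` and `EmitStringLiteral` are hypothetical helpers introduced for this example and are not part of the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Same rule as ToOctalString above: printable ASCII is emitted as-is, except
// backslash, double quote and question mark; everything else becomes a
// three-digit octal escape.
std::string EscapeByte(uint8_t ch) {
  if (ch >= ' ' && ch <= '~' && ch != '\\' && ch != '"' && ch != '?') {
    return std::string(1, static_cast<char>(ch));
  }
  const char c0 = static_cast<char>('0' + ((ch >> 6) & 7));
  const char c1 = static_cast<char>('0' + ((ch >> 3) & 7));
  const char c2 = static_cast<char>('0' + (ch & 7));
  return std::string("\\") + c0 + c1 + c2;
}

// Hypothetical helper: builds the 256-entry table once and reuses it.
const std::string& LookupEscape(uint8_t b) {
  static const std::vector<std::string> table = [] {
    std::vector<std::string> t(256);
    for (size_t i = 0; i < t.size(); ++i) {
      t[i] = EscapeByte(static_cast<uint8_t>(i));
    }
    return t;
  }();
  assert(table.size() == 256);  // every uint8_t value is a valid index
  return table[b];
}

// Hypothetical helper: wraps the escaped bytes in double quotes so the result
// can be pasted into generated C++ source as a string literal.
std::string EmitStringLiteral(const std::vector<uint8_t>& bytes) {
  std::string out = "\"";
  for (uint8_t b : bytes) out += LookupEscape(b);
  out += '"';
  return out;
}
```

Always emitting exactly three octal digits matters: a C++ octal escape consumes at most three digits, so a digit character that follows an escaped byte can never be absorbed into the escape.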
@@ -0,0 +1,17 @@
#ifndef SRC_EMBEDDED_DATA_H_
#define SRC_EMBEDDED_DATA_H_

#include <cinttypes>
#include <string>

// This file must not depend on node.h or other code that depends on
// the full Node.js implementation because it is used during the
// compilation of the Node.js implementation itself (especially js2c).

namespace node {

const std::string& GetOctalCode(uint8_t index);

}  // namespace node

#endif  // SRC_EMBEDDED_DATA_H_