feat(parser): utf16 spans #959
I don't think this would be too difficult to implement in the lexer, while it's running through the source character by character anyway. But is the idea that spans always specify positions as UTF16? Or spans specify both UTF8 and UTF16 positions? Or one or the other depending on user choice? As far as I understand, the source map "standard" states positions must be UTF16. But I have no idea what actual tools in the real world do.

By the way, if you operate on source as a slice of bytes:

```rust
/// Get length of string as number of UTF16 characters.
///
/// The implementation below is equivalent to
/// `s.chars().map(|c| c.len_utf16()).sum::<usize>()`
/// but hopefully faster.
///
/// It relies on translation of UTF8 coding to UTF16:
///
/// - 1-byte UTF8 sequence
///   = 1st byte `0xxxxxxx`
///   -> 1 x UTF16 char
///   UTF16 len = UTF8 len
/// - 2-byte UTF8 sequence
///   = 1st byte `110xxxxx`, remaining bytes `10xxxxxx`
///   -> 1 x UTF16 char
///   UTF16 len = UTF8 len - 1
/// - 3-byte UTF8 sequence
///   = 1st byte `1110xxxx`, remaining bytes `10xxxxxx`
///   -> 1 x UTF16 char
///   UTF16 len = UTF8 len - 2
/// - 4-byte UTF8 sequence
///   = 1st byte `11110xxx`, remaining bytes `10xxxxxx`
///   -> 2 x UTF16 chars
///   UTF16 len = UTF8 len - 2
///
/// So UTF16 len =
///   UTF8 len
///   minus count of UTF8 bytes indicating start of a sequence 2 bytes or longer
///   minus count of UTF8 bytes indicating start of a sequence 3 bytes or longer
///
/// See: https://stackoverflow.com/questions/5728045/c-most-efficient-way-to-determine-how-many-bytes-will-be-needed-for-a-utf-16-st
fn utf16_len(s: &str) -> usize {
    s.len()
        - s.bytes()
            .map(|c| (c >= 0xC0) as usize + (c >= 0xE0) as usize)
            .sum::<usize>()
}
```
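A minimal, hypothetical sanity check for the helper above, just to illustrate that the byte-counting trick agrees with the straightforward `chars()`-based computation. The test strings and assertions are mine, not from the original comment, and assume `utf16_len` from the snippet above is in scope:

```rust
fn main() {
    // ASCII, 2-byte (é), 3-byte (€), and 4-byte (🦀, which needs a surrogate pair) cases.
    for s in ["hello", "café", "€100", "🦀 oxc", ""] {
        let expected: usize = s.chars().map(char::len_utf16).sum();
        assert_eq!(utf16_len(s), expected);
    }
    println!("utf16_len agrees with the chars()-based computation");
}
```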
This is going to be a hard one. Here is the actual problem (sorry, I should have stated the actual problem instead of posing an X/Y problem). When the span values are used on the JS side, they need to be converted to UTF-16 spans, otherwise Unicode will blow up everything.

oxc/website/playground/index.js Lines 706 to 708 in 1f3f21d

In source map generation, we need to keep track of line and column indices instead of byte offsets, all in UTF-16. So what I'm actually looking for is an O(1) solution for the UTF-8 -> UTF-16 conversion, and it seemed trivial to get the values from the lexer.

By the way, the tendril library has a great summary around these topics (along with the other Atom issue): https://github.com/servo/tendril But I don't have the time to go down the rabbit hole :-( For example, wtf is WTF-8 lol
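For context, a hedged sketch of what that conversion looks like without any help from the parser: a linear scan over the source for every span offset, which is exactly the cost an O(1) (or precomputed) answer would avoid. The function name and shape are mine, not part of oxc's API:

```rust
/// Convert a UTF-8 byte offset into a UTF-16 code unit offset by rescanning the
/// source. Assumes `utf8_offset` lies on a char boundary, which span offsets do.
fn utf8_offset_to_utf16(source: &str, utf8_offset: usize) -> usize {
    source[..utf8_offset].chars().map(char::len_utf16).sum()
}
```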
Ah. Interesting. I think the 2 problems, although related, can be tackled separately. Span positions (…)
Sorry, I've written you an essay again. If you're short on time, the important question is this one:
I don't think so; everything in Rust land assumes UTF8. For example, when we emit diagnostic code frames, miette will slice into the source text by the spans we provide.
I love these essays and discussions, keep up the good work!
I just found another use case for the line / column problem:

oxc/crates/oxc_language_server/src/linter.rs Lines 392 to 399 in 1957770
These problems are low-hanging fruit; we don't need to tackle them right now. It's just good to know they exist: we code in Rust (UTF8), but these values will slip through to JavaScript (UTF16), which causes a conversion overhead.
Ha. OK, great!
This is actually the nub of my interest. I'd like to find a way to reduce the overhead of passing ASTs from Rust to JavaScript. I tried to solve that problem on SWC, but OXC is actually much more suited to a very performant solution due to its use of arenas. Anyway, I'll open an issue about that at some point.
This will soon be the only blocker for being able to use OXC as a parser for TS. I hope there is a way to do this in a performant way!
@ArnaudBarre Is prettier getting the span positions from these two getters?
Or do we need to rewrite all AST node spans? If it's only getting them from

oxc/website/playground/index.js Line 139 in 1f3f21d

oxc/website/playground/index.js Lines 706 to 708 in 1f3f21d

Or maybe we can add some kind of caching mechanism to the … All I want is to get things running first, even with the slowest …
Yeah, we can do the remapping in these two methods and it will work, but it will probably kill the perf! But I agree, let's first have something working and then profile!
As these ASTs are coming from the parser (not transformed code, which could move things around), you'd expect that while traversing the tree, if the order of operations for each node is:

...then each offset will be larger than the last offset processed. So then you'd only need to … However, I think …

Longer term, could we look at doing this in the Lexer? 2 possible options:

Lookup table

```rust
trait PosConversion {}

struct NoPosConversion;
impl PosConversion for NoPosConversion {}

struct Utf16PosConversion(Vec<(u32, u32)>);
impl PosConversion for Utf16PosConversion {}

struct Lexer<P: PosConversion> {
    pos_lookup_table: P,
    // ...
}
```

If … (could also add …) Building the table in the Lexer would be not too costly, as it's already reading through the source and checking for Unicode bytes. Translating spans would be a binary search (so reasonably fast), but would require traversing the AST in full, either in JS or in Rust.

Span type

```rust
enum SpanType {
    Utf8,
    Utf16,
    LineColumn,
}

struct Lexer<const SPAN_TYPE: SpanType> {
    // ...
}
```

This is more direct. If … This would work perfectly for the use case of getting UTF-16 spans in a JS AST. But maybe less easy to integrate with other parts of OXC.
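To make the lookup-table option a little more concrete, here is a hedged sketch of the binary-search translation it implies. The table layout (`(utf8_offset, utf16_offset)` checkpoints recorded just after each multi-byte character, in source order) and the function name are my assumptions, not anything that exists in oxc:

```rust
/// Translate a UTF-8 byte offset to a UTF-16 code unit offset using a table of
/// (utf8_offset, utf16_offset) checkpoints sorted by UTF-8 offset.
/// Between checkpoints the source is plain ASCII, so the two offsets advance in lockstep.
fn utf8_to_utf16_offset(table: &[(u32, u32)], utf8_offset: u32) -> u32 {
    match table.binary_search_by_key(&utf8_offset, |&(utf8, _)| utf8) {
        // Exactly at a checkpoint.
        Ok(i) => table[i].1,
        // Before the first checkpoint: only ASCII so far, offsets are identical.
        Err(0) => utf8_offset,
        // Otherwise, start from the nearest checkpoint before this offset.
        Err(i) => {
            let (utf8, utf16) = table[i - 1];
            utf16 + (utf8_offset - utf8)
        }
    }
}
```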
I've been struggling with the question of how to deal with this for a while. To summarize my thoughts:
@Boshen you've mentioned the 3rd point a few times, but the thing I'm struggling with is understanding for what purpose external consumers need this. It would really help if you could give some examples.
The TextEncoder thing is working. Didn't yet measure the perf impact; as soon as it was on, I started getting a new interesting diff 😅. Sadly the current one involves a difference of AST for …
Try this https://github.com/sxzz/ast-explorer at https://ast.sxzz.moe/
The linter is our external consumer, although it lives inside this repository. There are also people using the AST to build their own tools. Regarding building an external data structure vs building the thing inside the lexer: I'm drawing inspiration from how browsers do it https://github.com/servo/tendril?tab=readme-ov-file#utf-16-compatibility

parallelized? huge chunks of text all at once? what do they all mean? Anyway, I think it's a lot cleaner to build an external infrastructure first and see its performance impact, it may not be that bad ...
Thanks for coming back Boshen. More questions (sorry!):
Understood. But my question is: for what purpose do these external tools require UTF-8 spans? So, for example, in the linter:
And:
The context for these questions is that I'm wondering along these lines:
If by "external data structure" you mean e.g. a UTF-8 -> UTF-16 lookup table, I don't see these 2 as mutually exclusive. We could build the external data structure in the lexer.
Servo's approach is interesting. Another example of prior art is V8, which I understand has quite a complex implementation for strings, with multiple internal representations (UTF-8/UTF-16 and contiguous/non-contiguous chunks stored as a tree). It uses whichever representation is most performant depending on how the string originated, and then translates from one representation to another internally, depending on what the code using the string does. e.g. the tree representation is good for lots of … However, I'm not sure how similar a browser's use case is to OXC's. What OXC needs to do with strings/offsets may be quite different (or maybe not). You've turned me on to the notion of data-driven design, hence all my questions about what use …
If you're willing, I'd like to try to clarify requirements as much as possible before implementation. Whichever way we go, implementation won't be entirely trivial, so in my view it'd be preferable for the 1st iteration to be hopefully close to what we need in the end.
Boshen did an experiment to see what the performance hit of adding UTF-16 offset fields to `Span` would be. The result was a 2% - 3% regression on parser benchmarks. Not as bad as expected (well, my expectation anyway). Boshen asked there:

Nice idea. Maybe. The theoretical maximum for … We could try some kind of compression:

or:

However, I don't know if the extra operations and branches to compress and uncompress the values would hurt performance more than the simpler solution of just adding more fields to `Span`.
Current usages so far are just slicing text for diagnostics, which are all on the cold path.
Thanks for coming back. Right, that does inform this a lot. Do you envisage the same is true for other external consumers?
I can't think of anything else besides source maps 🤔
OK great, this simplifies things a lot. I have an idea. Can't write it up now, but will try to later in the week.
I have a proposal for how to tackle this. Please feel free to disagree with any of this, but I thought it'd be useful at this stage to make a concrete proposal, as a basis for discussion.

Starting points

Primary use cases for different offset types:

Observations:

Proposal for eventual solution

I propose a flexible solution which prioritizes performance of the main use cases which are on hot paths.

Encoding offsets

Line + column encoding scheme

Line + column encoded as one of these 32-bit bit patterns:

This representation allows internal encoding of all possible line + column pairs for any source which is either:
These 2 should cover almost all real world cases. Any very weird cases (e.g. a few very long lines in a "normal" JS source) would be handled by storing line+column as a pair of `u32`s. Decoding should be cheap - just a single branch and a few bitwise operations for the most common case:

```rust
let (line, column) = match encoded & 0xC0000000 {
    0 => (encoded >> 10, encoded & 0x3FF),
    0x40000000 => (encoded & 0x3F, (encoded >> 6) & 0xFFFFFF),
    _ => lookup_table[encoded as usize & 0x3FFFFFFF],
};
```

(and the branch predictor will learn quickly which way to branch as it goes through a file)
Implications

Short term solution

The only hard part of the above to implement is line+column pairs. So in the short term we could just support UTF-8 and UTF-16 spans, and leave line+column support until later. Another option is to increase the size of `Span`.

Any thoughts?
After going back and forth with all the requirements and solutions, I propose that we don't over-engineer this, keep it simple and accept the performance regression. The tasks will be broken down to:
Removing "P-high" label as this is not a current area of work. |
💡 Utf-8 to Utf-16 conversion is costly for downstream tools; is it doable to offer utf-16 spans from our parser?