
feat(parser): utf16 spans #959

Open
Boshen opened this issue Oct 6, 2023 · 25 comments
@Boshen
Member

Boshen commented Oct 6, 2023

💡 UTF-8 to UTF-16 conversion is costly for downstream tools. Is it doable to offer UTF-16 spans from our parser?

@overlookmotel
Contributor

overlookmotel commented Dec 27, 2023

I don't think this would be too difficult to implement in the lexer, while it's running through the source character by character anyway.

But is the idea that spans always specify positions as UTF16? Or spans specify both UTF8 and UTF16 positions? Or one or the other depending on user choice?

As far as I understand, the source map "standard" states positions must be UTF16. But I have no idea what actual tools in the real world do.

By the way, if you operate on source as a slice of u8 bytes, rather than chars, calculating UTF16 length can be quite fast. Here's some WIP code I wrote a while back while hacking on SWC, which could probably be massaged a bit to take advantage of SIMD:

/// Get length of string as number of UTF16 characters.
///
/// The implementation below is equivalent to
/// `s.chars().map(|c| c.len_utf16()).sum::<usize>()`
/// but hopefully faster.
///
/// It relies on translation of UTF8 coding to UTF16:
///
/// - 1-byte UTF8 sequence
///   = 1st byte `0xxxxxxx`
///   -> 1 x UTF16 char
///   UTF16 len = UTF8 len
/// - 2-byte UTF8 sequence
///   = 1st byte `110xxxxx`, remaining bytes `10xxxxxx`
///   -> 1 x UTF16
///   UTF16 len = UTF8 len - 1
/// - 3-byte UTF8 sequence
///   = 1st byte `1110xxxx`, remaining bytes `10xxxxxx`
///   -> 1 x UTF16
///   UTF16 len = UTF8 len - 2
/// - 4-byte UTF8 sequence
///   = 1st byte `11110xxx`, remaining bytes `10xxxxxx`
///   -> 2 x UTF16
///   UTF16 len = UTF8 len - 2
///
/// So UTF16 len =
///   UTF8 len
///   minus count of UTF8 bytes indicating start of sequence 2 bytes or longer
///   minus count of UTF8 bytes indicating start of sequence 3 bytes or longer
///
/// See: https://stackoverflow.com/questions/5728045/c-most-efficient-way-to-determine-how-many-bytes-will-be-needed-for-a-utf-16-st
fn utf16_len(s: &str) -> usize {
  s.len()
    - s.bytes()
      .map(|c| (c >= 0xC0) as usize + (c >= 0xE0) as usize)
      .sum::<usize>()
}
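
As a quick (hypothetical) sanity check, the byte-counting version can be compared against the naive per-char computation:

#[test]
fn utf16_len_matches_naive() {
    // Mix of ASCII, 2-byte, 3-byte, and 4-byte UTF-8 characters.
    for s in ["", "ascii only", "café", "日本語", "🦀 emoji + ASCII"] {
        let naive: usize = s.chars().map(char::len_utf16).sum();
        assert_eq!(utf16_len(s), naive);
    }
}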

@Boshen
Member Author

Boshen commented Dec 28, 2023

But is the idea that spans always specify positions as UTF16? Or spans specify both UTF8 and UTF16 positions? Or one or the other depending on user choice?

This is going to be a hard one.

Here is the actual problem (sorry, I should have stated the actual problem instead of presenting an X/Y problem).

When span values are used on the JS side, they need to be converted to UTF-16 spans; otherwise Unicode will blow up everything.

// convert utf8 to utf16 span so they show correctly in the editor
start = new TextDecoder().decode(this.sourceTextUtf8.slice(0, start)).length;
end = new TextDecoder().decode(this.sourceTextUtf8.slice(0, end)).length;

In source map generation, we need to keep track of line and column indices instead of byte offsets, all in UTF-16.

So what I'm actually looking for is an O(1) solution for UTF-8 -> UTF-16 conversion, and it seemed trivial to get the values from the lexer.
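
For illustration only (hypothetical names, not OXC's actual lexer), a byte-oriented lexer could keep a running UTF-16 offset alongside the UTF-8 offset while scanning, at almost no extra cost:

// Hypothetical sketch: the lexer already inspects every byte, so it can keep
// a running UTF-16 offset while advancing its UTF-8 offset.
struct OffsetTracker {
    utf8: u32,
    utf16: u32,
}

impl OffsetTracker {
    fn advance(&mut self, byte: u8) {
        self.utf8 += 1;
        // Continuation bytes (0b10xxxxxx) add nothing to the UTF-16 length.
        // A 4-byte sequence (lead byte 0b11110xxx) produces a surrogate pair,
        // so it counts as 2 UTF-16 units; every other lead byte counts as 1.
        if byte < 0x80 || byte >= 0xC0 {
            self.utf16 += if byte >= 0xF0 { 2 } else { 1 };
        }
    }
}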


By the way, the tendril library has a great summary around these topics (along with the other Atom issue): https://github.com/servo/tendril

But I don't have the time to go down the rabbit hole :-( For example, wtf is WTF-8 lol

@overlookmotel
Contributor

overlookmotel commented Dec 29, 2023

Ah. Interesting.

I think the 2 problems, although related, can be tackled separately.

Span positions (start and end)

Should Span's current start and end be changed from UTF8 positions to UTF16 positions?

i.e. Is it just a mistake that they're currently UTF8 positions, and it should be rectified to make them UTF16?

If so, that could be done in the lexer, and that'd solve problem no. 1.

That could be implemented now, without finding a solution to part 2.

But does anything/anyone depend on start and end being UTF8 positions? (e.g. to slice up the source text in Rust).

If absolutely necessary, both could be supported, with a parser option determining whether positions are recorded as UTF16 or UTF8.

Source-map friendly positions

Line+column positions could also be calculated quite easily for every token in the lexer, and then added to Span.

Getting the line/column for any AST node would then be free.

However, storing all the info in each AST node would increase the size of Span from 8 to 24 bytes:

struct LineColumn { line: u32, column: u32 }
struct Span {
  start: u32,
  end: u32,
  start_line_column: LineColumn,
  end_line_column: LineColumn,
}

That's not ideal, as every node has a Span, so it's a lot of memory in aggregate.

If it's acceptable for getting line + column to rely on data outside of the AST node, I can see two ways to reduce size of Span:

Solution 1

struct Span { start: u32, end: u32, start_line: u32, end_line: u32 }

Lexer would also fill a Vec recording the start position (relative to start of file) for each line. e.g. if first line of source is 20 characters, followed by \n, line_positions[0] = 0 & line_positions[1] = 21.

So then to get line + column for a Span:

fn get_columns(span: &Span, line_positions: &[u32]) -> (u32, u32) {
  let start_col = span.start - line_positions[span.start_line as usize];
  let end_col = span.end - line_positions[span.end_line as usize];
  (start_col, end_col)
}

That's O(1) and Span is 16 bytes.

Solution 2

struct Span { start: u32, end: u32 }

Lexer also records Vec of line start positions (as before).

Transforming start to column + line would involve:

  1. Binary search of line_positions to find the line with latest start position which is <= start.
  2. column = start - line_start

This is not O(1), but the binary search is only O(log n) rather than O(n) (see the sketch below). Span is 8 bytes.
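
A minimal sketch of that lookup, assuming line_positions holds the UTF-8 offset of each line start (hypothetical helper, not actual OXC code):

// Map a byte offset to (line, column) by binary-searching the recorded line
// start offsets. `partition_point` counts how many line starts are <= offset;
// the last of those is the containing line.
fn offset_to_line_column(offset: u32, line_positions: &[u32]) -> (u32, u32) {
    // `line_positions[0]` is assumed to be 0, so there is always at least one
    // line start <= offset and the subtraction below cannot underflow.
    let line = line_positions.partition_point(|&start| start <= offset) - 1;
    let column = offset - line_positions[line];
    (line as u32, column)
}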

There's also scope for an optimization: Very often an AST node will have a source position close to its preceding/enclosing node, so when traversing an entire AST to generate a source map, there could be a fast path for "this node is on same line as previous node". So maybe it'd be quite efficient.

I'm not sure which is preferable - O(1), or minimizing size of Span. There may well also be better solutions I've not thought of.

Looks like SWC is doing something like Solution 2:
https://github.com/swc-project/swc/blob/b76dd4635d02b1fc76a6f799d4bc70710025f889/crates/swc_common/src/syntax_pos.rs#L1440-L1445

WTF

I started reading into WTF-8 (!). I can't claim to have got my head around it entirely. But at first glance, I don't think it helps with the source maps line/column problem and, while it might help with UTF16 positions, there are much simpler solutions to hand for that, since the lexer scans source text from start to finish in order anyway.

@overlookmotel
Contributor

Sorry, I've written you an essay again. If you're short on time, the important question is this one:

Should Span's current start and end be changed from UTF8 positions to UTF16 positions?

i.e. Is it just a mistake that they're currently UTF8 positions, and it should be rectified to make them UTF16?

@Boshen
Member Author

Boshen commented Dec 29, 2023

Sorry, I've written you an essay again. If you're short on time, the important question is this one:

Should Span's current start and end be changed from UTF8 positions to UTF16 positions?
i.e. Is it just a mistake that they're currently UTF8 positions, and it should be rectified to make them UTF16?

I don't think so; everything in Rust land assumes UTF-8.

For example when we emit diagnostic code frames, miette will slice into the source text by the spans we provide.

Sorry, I've written you an essay again.

I love these essays and discussions, keep up the good work!

@Boshen
Member Author

Boshen commented Dec 29, 2023

I just found another use case for the line / column problem

#[allow(clippy::cast_possible_truncation)]
fn offset_to_position(offset: usize, source_text: &str) -> Option<Position> {
    let rope = Rope::from_str(source_text);
    let line = rope.try_char_to_line(offset).ok()?;
    let first_char_of_line = rope.try_line_to_char(line).ok()?;
    let column = offset - first_char_of_line;
    Some(Position::new(line as u32, column as u32))
}

@Boshen
Member Author

Boshen commented Dec 29, 2023

These problems are low-hanging fruit; we don't need to tackle them right now.

It's just good to know these problems exist when we code in Rust (UTF8), but these values will slip through to JavaScript (UTF16), which causes a conversion overhead.

@overlookmotel
Contributor

I love these essays and discussions, keep up the good work!

Ha. OK, great!

these values will slip through to JavaScript (UTF16), which causes a conversion overhead.

This is actually the nub of my interest. I'd like to find a way to reduce the overhead of passing ASTs from Rust to JavaScript. I tried to solve that problem on SWC, but OXC is actually much more suited to a very performant solution due to its use of arenas. Anyway, I'll open an issue about that at some point.

@ArnaudBarre
Contributor

This will soon be the only blocker for being able to use OXC as a parser for TS. I hope there is a way to do this in a performant way!

Boshen added the P-high Priority - High label on Mar 4, 2024
@Boshen
Member Author

Boshen commented Mar 4, 2024

This will soon be the only blocker for being able to use OXC as a parser for TS. I hope there is a way to do this in a performant way!

@ArnaudBarre Is prettier getting the span positions from these two getters?

export const parsers: Plugin["parsers"] = {
  typescript: {
    astFormat: "estree",
    parse: (text, options) => oxcParse(text, options.filepath),
    locStart: (node) => node.start, // <---
    locEnd: (node) => node.end,  // <---
    hasPragma: () => false,
  },
};

Or do we need to rewrite all AST node spans?

If it's only getting them from locStart and locEnd, is it possible for us to provide a slightly faster lookup data structure than

this.sourceTextUtf8 = new TextEncoder().encode(sourceText);

// convert utf8 to utf16 span so they show correctly in the editor
start = new TextDecoder().decode(this.sourceTextUtf8.slice(0, start)).length;
end = new TextDecoder().decode(this.sourceTextUtf8.slice(0, end)).length;

Or maybe we can add some kind of caching mechanism to the TextDecoder method.

All I want is to get things running first, even with the slowest TextDecoder approach 😅

@ArnaudBarre
Contributor

Yeah, we can do the remapping in these two methods. It will work, but it will probably kill the perf! But I agree, let's first have something working and then profile!
Yeah, I can try to have a small method that keeps track of the already computed UTF-8/UTF-16 mapping to avoid passing the whole file each time. I'll measure the perf impact of both and report back here, probably tonight!

@overlookmotel
Contributor

overlookmotel commented Mar 4, 2024

As these ASTs are coming from the parser (not transformed code, which could move things around), you'd expect that, while traversing the tree, if the order of operations for each node is:

  1. Convert span.start
  2. Visit node's children
  3. Convert span.end

...then each offset will be larger than the last offset processed. So you'd only need to TextDecoder-ize the bytes between the current offset and the previous offset (a sketch of this idea is below).
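
The same idea sketched in Rust for clarity (the prettier plugin would do this in JS with TextDecoder; all names here are hypothetical):

// Hypothetical incremental converter: if offsets arrive in ascending order
// (as they do when visiting parser output depth-first), only the slice since
// the previous offset needs to be examined.
struct IncrementalConverter<'a> {
    source: &'a str,
    last_utf8: usize,
    last_utf16: usize,
}

impl<'a> IncrementalConverter<'a> {
    fn new(source: &'a str) -> Self {
        Self { source, last_utf8: 0, last_utf16: 0 }
    }

    // Offsets must be fed in ascending order and lie on char boundaries.
    fn to_utf16(&mut self, utf8_offset: usize) -> usize {
        let delta: usize = self.source[self.last_utf8..utf8_offset]
            .chars()
            .map(char::len_utf16)
            .sum();
        self.last_utf8 = utf8_offset;
        self.last_utf16 += delta;
        self.last_utf16
    }
}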

However, I think TextDecoder().decode() involves a call across JS/native boundary. If so, that'll have a sizeable cost in aggregate over 1000s of calls, regardless of how many bytes you decode each time.

Longer term, could we look at doing this in the Lexer?

2 possible options:

Lookup table

trait PosConversion {}

struct NoPosConversion;
impl PosConversion for NoPosConversion {}

struct Utf16PosConversion(Vec<(u32, u32)>);
impl PosConversion for Utf16PosConversion {}

struct Lexer<P: PosConversion> {
  pos_lookup_table: P,
  // ...
}

If P is Utf16PosConversion, lexer methods would populate the table.

(could also add LineColumnPosConversion for a sourcemap-style lookup table)

Building the table in the lexer would not be too costly, as it's already reading through the source and checking for Unicode bytes. Translating spans would be a binary search (so reasonably fast; see the sketch below), but would require traversing the AST in full, either in JS or in Rust.
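
For illustration, a hedged sketch of translating an offset via such a table, assuming the lexer records a (utf8_offset, utf16_offset) checkpoint immediately after each non-ASCII character (hypothetical, not an actual OXC implementation):

// Between two checkpoints every byte is ASCII, so the two offsets move in
// lockstep; for pure-ASCII source the table is empty and the offsets are equal.
// Assumes `utf8_offset` lies on a char boundary, as lexer spans do.
fn utf8_to_utf16_offset(utf8_offset: u32, table: &[(u32, u32)]) -> u32 {
    // Index of the last checkpoint at or before the requested offset.
    let n = table.partition_point(|&(utf8, _)| utf8 <= utf8_offset);
    if n == 0 {
        // Nothing but ASCII before this point: offsets are identical.
        utf8_offset
    } else {
        let (utf8, utf16) = table[n - 1];
        utf16 + (utf8_offset - utf8)
    }
}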

Span type

enum SpanType {
  Utf8,
  Utf16,
  LineColumn,
}

struct Lexer<const SPAN_TYPE: SpanType> {
  // ...
}

This is more direct. If SPAN_TYPE == SpanType::Utf16, span.start and span.end would be UTF-16 offsets. This would be faster in the lexer, as there's no need to build a table, and zero cost on the JS side (no need to traverse).

This would work perfectly for the use case of getting UTF-16 spans in JS AST. But maybe less easy to integrate with other parts of OXC.

@overlookmotel
Contributor

overlookmotel commented Mar 4, 2024

I've been struggling with the question of how to deal with this for a while. To summarize my thoughts:

  • The "Span type" solution above is definitely going to be most performant, and in a lot of ways the simplest.
  • OXC parser (and probably other parts of OXC) currently rely on Span offsets being UTF-8 to slice the source text. But I believe we can find other ways to fulfil that function in OXC, without using Spans for that. So then Span offsets could be UTF-16 (or anything else!) with no problem.
  • External consumers of OXC in Rust may also want UTF-8 offsets.

@Boshen you've mentioned the 3rd point a few times, but the thing I'm struggling with is understanding what purpose external consumers need this for. It would really help if you could give some examples.

@ArnaudBarre
Contributor

The TextEncoder thing is working. I didn't measure the perf impact yet; as soon as it was on I started getting new interesting diffs 😅. Sadly the current one involves a difference in the AST for export import foo = bar between TSESLint 5 and 6. This is not the first time I've been frustrated by astexplorer lagging behind on parser versions, so I'll take my next evenings to build one (I also need this to quickly compare the ASTs of TSESLint & TS and get back to working on my TS linter 😅)

@Boshen
Member Author

Boshen commented Mar 5, 2024

The TextEncoder thing is working. I didn't measure the perf impact yet; as soon as it was on I started getting new interesting diffs 😅. Sadly the current one involves a difference in the AST for export import foo = bar between TSESLint 5 and 6. This is not the first time I've been frustrated by astexplorer lagging behind on parser versions, so I'll take my next evenings to build one (I also need this to quickly compare the ASTs of TSESLint & TS and get back to working on my TS linter 😅)

Try this https://github.com/sxzz/ast-explorer at https://ast.sxzz.moe/

@Boshen
Member Author

Boshen commented Mar 5, 2024

External consumers of OXC in Rust may also want UTF-8 offsets.

The linter is our external consumer, although it lives inside this repository. There are also people using the AST to build their own tools.


Regarding building an external data structure vs building the thing inside the lexer -

I'm drawing inspiration from how browsers do it

https://github.com/servo/tendril?tab=readme-ov-file#utf-16-compatibility

Solution: Use WTF-8 in parsing and in the DOM. Servo will convert to contiguous UTF-16 when necessary. The conversion can easily be parallelized, if we find a practical need to convert huge chunks of text all at once.

parallelized? huge chunks of text all at once? what do they all mean?

Anyway, I think it's a lot cleaner to build an external infrastructure first and see its performance impact; it may not be that bad ...

@overlookmotel
Contributor

overlookmotel commented Mar 5, 2024

Thanks for coming back Boshen. More questions (sorry!):

The linter is our external consumer, although it lives inside this repository. There are also people using the AST to build their own tools.

Understood. But my question is: What purpose do these external tools require UTF-8 spans for? So, for example, in the linter:

  • In what circumstances does it need to use span.start / span.end, and needs them to be UTF-8 offsets?
  • Is this purely for extracting slices of source text? Or are there other uses?
  • How common are such operations?
  • How "hot" are those paths?
  • Maybe what the linter actually needs is neither, exactly? (maybe for displaying lint errors, it really needs the UTF-8 offset of the line start + a UTF-8 column offset?)

And:

  • What other parts of OXC require UTF-8 offsets on a hot path? e.g. in transformer I think it'd only be for error messages (i.e. not hot)?
  • Can you think of any use cases for UTF-8 offsets apart from slicing source text?

The context for these questions is that I'm wondering along these lines:

  • There are common needs for UTF-16 offsets e.g. source maps.
  • There are clearly some needs for UTF-8 offsets (but I'm not sure what they are).
  • If the former is more common/hot than the latter, maybe we should flip it on its head and:
    • Span contains UTF-16 offsets.
    • Provide a mechanism (e.g. lookup table) to convert UTF-16 offsets to UTF-8.
  • Or, if each use case requires only one of UTF-8 or UTF-16, we could satisfy both very performantly with a parameterized Lexer<SpanType> and consumer chooses which they want.

building an external data structure vs building the thing inside the lexer

If by "external data structure" you mean e.g. a UTF-8 -> UTF-16 lookup table, I don't see these 2 as mutually exclusive. We could build the external data structure in the lexer.

I'm drawing inspiration from how browsers do it

Servo's approach is interesting. Another example of prior art is V8, which I understand has quite a complex implementation for strings, with multiple internal representations (UTF-8/UTF-16 and contiguous/non-contiguous chunks stored as a tree). It uses whichever representation is most performant depending on how the string originated, and then translates from one representation to another internally, depending on what the code using the string does. e.g. the tree representation is good for lots of str += more operations, while a contiguous UTF-16 representation is best for str[n].

However, I'm not sure how similar a browser's use case is to OXC's. What OXC needs to do with strings/offsets may be quite different (or maybe not). You've turned me on to the notion of data-driven design, hence all my questions about what use Span offsets are being put to.

Anyway, I think it's a lot cleaner to build an external infrastructure first and see its performance impact, it may not be that bad ...

If you're willing, I'd like to try to clarify requirements as much as possible before implementation. Whichever way we go, implementation won't be entirely trivial, so in my view it'd be preferable for 1st iteration to be hopefully close to what we need in the end.

@overlookmotel
Contributor

overlookmotel commented Mar 6, 2024

Boshen did an experiment to see what the performance hit of adding UTF-16 offset fields to Span would be: #2626

The result was a 2% - 3% regression on parser benchmarks. Not as bad as expected (well, my expectation anyway).

Boshen asked there:

Does it work if we add offsets that are smaller than u32, and do utf16_start = start + offset, utf16_end = end + offset instead?

Nice idea. Maybe. The theoretical maximum for the offset is u32::MAX * 2 / 3: if the input were a 4 GiB JS file where every character is a 3-byte UTF-8 char, each character advances the UTF-8 offset by 3 but the UTF-16 offset by only 1, so the difference grows by 2 for every 3 bytes. That'd be a pretty weird JS file, but legal according to the spec. So handling that would still need a u32.

We could try some kind of compression:

  1. Use a smaller type for UTF8-UTF16 offset fields (e.g. 2 x u16), with u16::MAX as a sentinel value which says "look up the actual value in a table" for weird cases.

or:

  1. Limit source text size to u32::MAX / 2 (2 GiB), which would free up the highest bit in start and end. Use those high bits as flags for "UTF-16 offset is different from UTF-8 offset - look it up in table". Then Span stays same size, and for JS files which are pure ASCII (probably a majority), the table would be empty.

However, I don't know if the extra operations and branches to compress and uncompress the values would hurt performance more than the simpler solution of just adding more fields to Span.
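
A minimal sketch of the second option (field layout and names are assumptions, purely illustrative):

// Hypothetical: `raw` is a span start/end with source size capped at 2 GiB,
// so bit 31 is free to flag "UTF-16 offset differs from UTF-8 offset".
const UTF16_DIFFERS: u32 = 1 << 31;

fn decode_offsets(raw: u32, side_table: &[(u32, u32)]) -> (u32, u32) {
    if raw & UTF16_DIFFERS == 0 {
        // No multi-byte characters before this point: offsets are identical.
        (raw, raw)
    } else {
        // Weird case: look up the real (utf8, utf16) pair in the side table.
        side_table[(raw & !UTF16_DIFFERS) as usize]
    }
}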

@Boshen
Member Author

Boshen commented Mar 6, 2024

What purpose do these external tools require UTF-8 spans for?

Current usages so far are just slicing text for diagnostics, which are all on the cold path.

@overlookmotel
Contributor

Current usages so far are just slicing text for diagnostics, which are all on the cold path.

Thanks for coming back. Right, that does inform this a lot. Do you envisage the same is true for other external consumers?

@Boshen
Member Author

Boshen commented Mar 6, 2024

Do you envisage the same is true for other external consumers?

I can't think of anything else besides source maps 🤔

@overlookmotel
Contributor

OK great, this simplifies things a lot. I have an idea. Can't write it up now, but will try to later in the week.

@overlookmotel
Contributor

overlookmotel commented Mar 7, 2024

I have a proposal for how to tackle this. Please feel free to disagree with any of this, but I thought it'd be useful at this stage to make a concrete proposal, as a basis for discussion.

Starting points

Primary use cases for different offset types:

  • UTF-16 offsets are needed for JS AST (ESTree compat).
  • UTF-8 offsets are needed for slicing source text in Rust for diagnostics.
  • Line + column (column as UTF-16) needed for source maps.

Observations:

  • Source maps are a core feature of JS transpilation/minification, and creating them will be a hot path in oxc_codegen.
  • JS AST is (in my view) an important target.
  • Diagnostics are a cold path.
  • In some use cases, spans aren't needed at all (e.g. if app is just parsing files in order to find what other files they import, it doesn't care where in file the import statements are).
  • Calculating sourcemap-style line+column from UTF-16 offsets is a bit quicker than converting from UTF-8, but still requires significant extra work (building a table and then searching it).

Proposal for eventual solution

I propose a flexible solution which prioritizes performance of the main use cases which are on hot paths.

  • Cater for all use cases by parameterizing Lexer and ParserImpl as Lexer<SpanType> and ParserImpl<SpanType>.
  • Consumer asks the parser for offsets to be UTF-8, UTF-16, or line+column, as they require.
  • Span contains only one offset type - you have to choose.
  • Public Parser API would not need to be generic (i.e. Parser not Parser<SpanType>). User would call e.g. parser::utf16_spans() and then Parser::parse would create a ParserImpl<SpanType::Utf16> internally (see the sketch after this list).
  • Requested offsets are calculated in the lexer, and parser stores them in Spans.
  • Consumer only pays for the conversions they need.
  • Provide an interface to lazily convert UTF-16 offsets or line+column back to UTF-8. This can be used for diagnostics. Zero cost unless it's used.
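
As referenced above, a rough sketch of keeping the public API non-generic while the implementation is parameterized (all names are illustrative, not real oxc_parser items):

use std::marker::PhantomData;

// Hypothetical marker types selecting how spans are recorded.
trait SpanKind {}
struct Utf8Spans;
struct Utf16Spans;
impl SpanKind for Utf8Spans {}
impl SpanKind for Utf16Spans {}

// Generic implementation: lexer + parser get monomorphized per span kind.
struct ParserImpl<S: SpanKind>(PhantomData<S>);

impl<S: SpanKind> ParserImpl<S> {
    fn parse(source: &str) {
        // ... a real implementation would run the lexer/parser here ...
        let _ = source;
    }
}

// Public API stays non-generic; the span kind is a runtime option.
pub struct Parser<'a> {
    source: &'a str,
    utf16_spans: bool,
}

impl<'a> Parser<'a> {
    pub fn new(source: &'a str) -> Self {
        Self { source, utf16_spans: false }
    }

    pub fn utf16_spans(mut self) -> Self {
        self.utf16_spans = true;
        self
    }

    pub fn parse(self) {
        if self.utf16_spans {
            ParserImpl::<Utf16Spans>::parse(self.source);
        } else {
            ParserImpl::<Utf8Spans>::parse(self.source);
        }
    }
}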

Encoding offsets

Span remains 2 x u32s, regardless of offset type. Any of the following can be encoded as a single u32:

  • UTF-8 offset (plain u32)
  • UTF-16 offset (plain u32)
  • Line + column pair (compressed encoding packed into a u32)

Line + column encoding scheme

Line + column encoded as one of these 32-bit bit patterns:

  • 00LLLLLLLLLLLLLLLLLLLLCCCCCCCCCC =
    • L = Line (20 bits = 0 - 1048575 range)
    • C = Column (10 bits = 0 - 1023 range)
  • 01CCCCCCCCCCCCCCCCCCCCCCCCLLLLLL
    • L = Line (6 bits = 0 - 63 range)
    • C = Column (24 bits = 0 - 16777215 range)
  • 1IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Values to be found in external lookup table at index I (31 bits)

This representation allows internal encoding of all possible line + column pairs for any source which is either:

  • Less than 1 million lines, and no lines longer than 1023 chars - almost all "normal" JS sources.
  • Less than 64 lines, and no lines longer than 16 million chars - minified files where entire file is a single line, up to file size 16 MiB.

These 2 should cover almost all real world cases. Any very weird cases (e.g. a few very long lines in a "normal" JS source) would be handled by storing line+column as a pair of u32s in an external table.

Decoding should be cheap - just a single branch and a few bitwise operations for the most common case:

let (line, column) = match encoded & 0xC0000000 {
    0 => (encoded >> 10, encoded & 0x3FF),
    0x40000000 => (encoded & 0x3F, (encoded >> 6) & 0xFFFFFF),
    // High bit set: low 31 bits index into the external lookup table
    _ => lookup_table[(encoded & 0x7FFFFFFF) as usize],
};

(and branch predictor will learn quickly which way to branch as it goes through a file)
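
For symmetry, a hedged sketch of the corresponding encoding step (illustrative only; the lookup_table handling is an assumption):

// Hypothetical encoder matching the scheme above. Falls back to the external
// table only when neither packed form can represent the line + column pair.
fn encode_line_column(line: u32, column: u32, lookup_table: &mut Vec<(u32, u32)>) -> u32 {
    if line < (1 << 20) && column < (1 << 10) {
        // 00 form: 20-bit line, 10-bit column.
        (line << 10) | column
    } else if line < (1 << 6) && column < (1 << 24) {
        // 01 form: 24-bit column, 6-bit line.
        0x4000_0000 | (column << 6) | line
    } else {
        // Rare case: spill the real pair into the external table.
        let index = lookup_table.len() as u32;
        lookup_table.push((line, column));
        0x8000_0000 | index
    }
}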

Implications

  • Span is always 2 x u32s for any of above offset types.
  • No need to parameterize all AST node types to support Program<SpanType> (horrible!).
  • Consumer can choose whichever offset type they need for their use case, and they only pay for the conversions they need.
  • The fastest place to calculate UTF-16 offsets or line+column is going to be the lexer, where it's already spinning through the entire source byte by byte, and branching on Unicode characters and line breaks.
  • Consuming offsets / line+column pairs is then free or very cheap.
  • In particular this will be (I think) as good as we can get on speed for source maps. I believe it'll be a significant speed-up from what we have now.

Short term solution

The only hard part of the above to implement is line+column pairs. So in the short term we could just support UTF-8 and UTF-16 spans, and leave line+column support until later.

Another option is to increase the size of Span to 16 bytes and include both UTF-8 and UTF-16 offsets in it. As Boshen's experiment showed, the size increase is not enormously costly (though the UTF-16 translation will also add a further cost). Personally I prefer the Lexer<SpanType> solution, but Boshen you may disagree.

Any thoughts?

@Boshen
Member Author

Boshen commented Mar 7, 2024

After going back and forth with all the requirements and solutions, I propose that we don't over-engineer this, keep it simple and accept the performance regression.

The tasks will be broken down to:

  • add the utf16 spans to the current Span and make prettier work
  • build a line + column external data structure within the lexer and remove the line column generator from codegen

@overlookmotel
Contributor

Removing "P-high" label as this is not a current area of work.
