
Matching Python lexer character offsets #250

Open
dwoznicki opened this issue Jul 4, 2024 · 0 comments
dwoznicki commented Jul 4, 2024

Problem

ಠ is text that, when rendered, looks like a single character. Rendered characters are called graphemes. To us, it looks like it has a width of 1. However, when stored as UTF-8, it takes 3 bytes. In Rust, each char is a Unicode code point, not a grapheme, and string offsets are byte offsets. So when we call something like

fn peek(&self) -> Option<char> {
    self.source[self.current as usize..].chars().next()
}

we're actually getting a code point, not a grapheme. This is fine because inside Lexer::next we increment the current position by the character's UTF-8 byte length.

self.current += c.len_utf8() as u32;

In the case of ಠ, len_utf8 returns 3. If we were to only do self.current += 1;, for example, we'd get a panic when trying to read the next character, because we'd be slicing the source in the middle of a multi-byte character.
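A quick standalone check of the two counts (plain Rust standard library, not enderpy code):

fn main() {
    let s = "ಠ";
    assert_eq!(s.len(), 3);           // 3 bytes when encoded as UTF-8
    assert_eq!(s.chars().count(), 1); // 1 char, i.e. 1 Unicode code point
    // Advancing by 1 instead of len_utf8() would leave us in the middle of
    // the character, and slicing there panics:
    // let _ = &s[1..]; // panic: byte index 1 is not a char boundary
}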

The problem I'm having is that the Python lexer appears to track offsets using characters as Python counts them (what len returns), not bytes.

>>> len("ಠ")
1

This means that there's a fundamental difference in offsets between enderpy and the official Python tokenize library. Ruff has the same kind of difference, since it doesn't count characters the way Python's len does (probably because it's written in Rust, where byte and code point offsets are easier to work with). In fact, there have been issues raised in the Ruff repo because it "miscounts" the number of Japanese/Chinese/Korean characters in a line and warns about a line being too long too early.
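To make the difference concrete, take the line ಠ = 1. Python's tokenize should report the = token starting at column 2, while our current byte-based offset for it would be 4 (plain Rust sketch, not enderpy code):

fn main() {
    let source = "ಠ = 1";
    // byte offset of '=' (what self.current tracks today)
    let byte_offset = source.find('=').unwrap();
    // character offset of '=' (what Python's tokenize reports as the column)
    let char_offset = source[..byte_offset].chars().count();
    assert_eq!(byte_offset, 4); // ಠ is 3 bytes, plus 1 for the space
    assert_eq!(char_offset, 2); // ಠ is 1 character, plus 1 for the space
}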

Solution

The solution is to track a character (grapheme) offset in addition to the byte offset. We only need this for testing against the official tokenize output, so we'd want to put it behind a feature flag (see the sketch at the bottom of this issue).

parser/Cargo.toml

[dependencies]
unicode-width = "0.1" # new

parser/src/lexer/mod.rs

use unicode_id_start::{is_id_continue, is_id_start};
use unicode_width::UnicodeWidthChar; // new

pub struct Lexer<'a> {
    // ...existing fields...
    current_grapheme: u32, // new
}

impl<'a> Lexer<'a> {
    // (new: initialize current_grapheme to 0 wherever the Lexer is constructed)

    pub fn next_token(&mut self) -> Token {
        if self.next_token_is_dedent > 0 {
            self.next_token_is_dedent -= 1;
            return Token {
                kind: Kind::Dedent,
                value: TokenValue::None,
                start: self.current,
                end: self.current,
                grapheme_start: self.current_grapheme, // new
                grapheme_end: self.current_grapheme, // new
            };
        }

        let start = self.current;
        let grapheme_start = self.current_grapheme; // new

        // ...

        let value = self.parse_token_value(kind, start);
        let end = self.current;
        let grapheme_end = self.current_grapheme;

        Token {
            kind,
            value,
            start,
            end,
            grapheme_start,
            grapheme_end,
        }
    }

    fn next(&mut self) -> Option<char> {
        let c = self.peek();
        if let Some(c) = c {
            self.current += c.len_utf8() as u32;
            // new: width_cjk() is None for control characters like '\n', so count those as 1
            self.current_grapheme += UnicodeWidthChar::width_cjk(c).unwrap_or(1) as u32;
        }
        c
    }
}
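One caveat with unicode-width: width_cjk reports display width, which is 2 for wide CJK characters, while Python's len counts each of them as 1. If we want to match tokenize's character offsets exactly, incrementing current_grapheme by 1 per char (i.e. one per code point) might be the closer fit. Quick check of where the two disagree:

use unicode_width::UnicodeWidthChar;

fn main() {
    for c in ['ಠ', '漢', 'a'] {
        println!(
            "{c}: len_utf8 = {}, width_cjk = {:?}",
            c.len_utf8(),
            UnicodeWidthChar::width_cjk(c)
        );
    }
    // ಠ: len_utf8 = 3, width_cjk = Some(1)
    // 漢: len_utf8 = 3, width_cjk = Some(2)  <- but len("漢") == 1 in Python
    // a: len_utf8 = 1, width_cjk = Some(1)
}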

parser/src/token.rs

pub struct Token {
    pub kind: Kind,
    // Value might be deleted in the future
    pub value: TokenValue,
    pub start: u32,
    pub end: u32,
    pub grapheme_start: u32, // new
    pub grapheme_end: u32, // new
}
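For the feature-flag part, the new fields (and the current_grapheme bookkeeping in the lexer) could be gated with cfg attributes; python-offsets below is just a placeholder name for a feature declared under [features] in parser/Cargo.toml:

pub struct Token {
    pub kind: Kind,
    pub value: TokenValue,
    pub start: u32,
    pub end: u32,
    #[cfg(feature = "python-offsets")]
    pub grapheme_start: u32,
    #[cfg(feature = "python-offsets")]
    pub grapheme_end: u32,
}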