
Matching Python lexer character offsets #250

Open
dwoznicki opened this issue Jul 4, 2024 · 0 comments
dwoznicki commented Jul 4, 2024

Problem

ಠ is text that, when rendered, looks like a single character. Rendered characters are called graphemes. To us, it looks like it has a width of 1. However, when stored as UTF-8, it takes 3 bytes. In Rust, each char is a Unicode code point, not a grapheme, and string offsets are byte offsets. So when we call something like

fn peek(&self) -> Option<char> {
    self.source[self.current as usize..].chars().next()
}

we're actually getting a code point, not a grapheme. This is fine because inside Lexer::next we increment the current position by the character's UTF-8 byte length.

self.current += c.len_utf8() as u32;

In the case of ಠ, len_utf8 returns 3. If we were to only do self.current += 1;, for example, we'd get a panic when trying to read the next character, because we'd be slicing the source in the middle of a multi-byte character.
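A quick standalone check of the two counts (plain Rust standard library, not enderpy code):

fn main() {
    let s = "ಠ";
    assert_eq!(s.len(), 3);           // 3 bytes when encoded as UTF-8
    assert_eq!(s.chars().count(), 1); // 1 char, i.e. 1 Unicode code point
    // Advancing by 1 instead of len_utf8() would leave us in the middle of
    // the character, and slicing there panics:
    // let _ = &s[1..]; // panic: byte index 1 is not a char boundary
}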

The problem I'm having is that the Python lexer appears to track offsets using characters as Python counts them (what len returns), not bytes.

>>> len("ಠ")
1

This means that there's a fundamental difference in offsets between enderpy and the official Python tokenize library. Ruff has the same kind of difference, since it doesn't count characters the way Python's len does (probably because it's written in Rust, where byte and code point offsets are easier to work with). In fact, there have been issues raised in the Ruff repo because it "miscounts" the number of Japanese/Chinese/Korean characters in a line and warns about a line being too long too early.
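To make the difference concrete, take the line ಠ = 1. Python's tokenize should report the = token starting at column 2, while our current byte-based offset for it would be 4 (plain Rust sketch, not enderpy code):

fn main() {
    let source = "ಠ = 1";
    // byte offset of '=' (what self.current tracks today)
    let byte_offset = source.find('=').unwrap();
    // character offset of '=' (what Python's tokenize reports as the column)
    let char_offset = source[..byte_offset].chars().count();
    assert_eq!(byte_offset, 4); // ಠ is 3 bytes, plus 1 for the space
    assert_eq!(char_offset, 2); // ಠ is 1 character, plus 1 for the space
}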

Solution

The solution is to track a character (grapheme) offset in addition to the byte offset. We only need this for testing against the official tokenize output, so we'd want to put it behind a feature flag (see the sketch at the bottom of this issue).

parser/Cargo.toml

[dependencies]
unicode-width = "0.1" # new

parser/src/lexer/mod.rs

use unicode_id_start::{is_id_continue, is_id_start};
use unicode_width::UnicodeWidthChar; // new

pub struct Lexer<'a> {
    // ...existing fields...
    current_grapheme: u32, // new
}

impl<'a> Lexer<'a> {
    // (new: initialize current_grapheme to 0 wherever the Lexer is constructed)

    pub fn next_token(&mut self) -> Token {
        if self.next_token_is_dedent > 0 {
            self.next_token_is_dedent -= 1;
            return Token {
                kind: Kind::Dedent,
                value: TokenValue::None,
                start: self.current,
                end: self.current,
                grapheme_start: self.current_grapheme, // new
                grapheme_end: self.current_grapheme, // new
            };
        }

        let start = self.current;
        let grapheme_start = self.current_grapheme; // new

        // ...

        let value = self.parse_token_value(kind, start);
        let end = self.current;
        let grapheme_end = self.current_grapheme;

        Token {
            kind,
            value,
            start,
            end,
            grapheme_start,
            grapheme_end,
        }
    }

    fn next(&mut self) -> Option<char> {
        let c = self.peek();
        if let Some(c) = c {
            self.current += c.len_utf8() as u32;
            // new: width_cjk() is None for control characters like '\n', so count those as 1
            self.current_grapheme += UnicodeWidthChar::width_cjk(c).unwrap_or(1) as u32;
        }
        c
    }
}
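One caveat with unicode-width: width_cjk reports display width, which is 2 for wide CJK characters, while Python's len counts each of them as 1. If we want to match tokenize's character offsets exactly, incrementing current_grapheme by 1 per char (i.e. one per code point) might be the closer fit. Quick check of where the two disagree:

use unicode_width::UnicodeWidthChar;

fn main() {
    for c in ['ಠ', '漢', 'a'] {
        println!(
            "{c}: len_utf8 = {}, width_cjk = {:?}",
            c.len_utf8(),
            UnicodeWidthChar::width_cjk(c)
        );
    }
    // ಠ: len_utf8 = 3, width_cjk = Some(1)
    // 漢: len_utf8 = 3, width_cjk = Some(2)  <- but len("漢") == 1 in Python
    // a: len_utf8 = 1, width_cjk = Some(1)
}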

parser/src/token.rs

pub struct Token {
    pub kind: Kind,
    // Value might be deleted in the future
    pub value: TokenValue,
    pub start: u32,
    pub end: u32,
    pub grapheme_start: u32, // new
    pub grapheme_end: u32, // new
}
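For the feature-flag part, the new fields (and the current_grapheme bookkeeping in the lexer) could be gated with cfg attributes; python-offsets below is just a placeholder name for a feature declared under [features] in parser/Cargo.toml:

pub struct Token {
    pub kind: Kind,
    pub value: TokenValue,
    pub start: u32,
    pub end: u32,
    #[cfg(feature = "python-offsets")]
    pub grapheme_start: u32,
    #[cfg(feature = "python-offsets")]
    pub grapheme_end: u32,
}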