
[\u{0}-\u{10FFFF}] matches any byte instead of any unicode character #202

Open
eduardosm opened this issue Feb 17, 2021 · 4 comments

@eduardosm

Example:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[error]
    Error,
    
    #[regex(r"[\u{0}-\u{10FFFF}]")]
    AnyChar,
}

fn main() {
    let mut lex = Token::lexer("Ω");

    assert_eq!(lex.next(), Some(Token::AnyChar));
    assert_eq!(lex.span(), 0..2); // length of Ω in utf8
    assert_eq!(lex.slice(), "Ω");
}

The second assert fails:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `0..1`,
 right: `0..2`', src/main.rs:16:5

The derive(Logos) is expanded as:

impl<'s> ::logos::Logos<'s> for Token {
    type Extras = ();
    type Source = str;
    const ERROR: Self = Token::Error;
    fn lex(lex: &mut ::logos::Lexer<'s, Self>) {
        use ::logos::internal::{LexerInternal, CallbackResult};
        type Lexer<'s> = ::logos::Lexer<'s, Token>;
        fn _end<'s>(lex: &mut Lexer<'s>) {
            lex.end()
        }
        fn _error<'s>(lex: &mut Lexer<'s>) {
            lex.bump_unchecked(1);
            lex.error();
        }
        #[inline]
        fn goto1_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Token::AnyChar);
        }
        #[inline]
        fn goto2<'s>(lex: &mut Lexer<'s>) {
            let byte = match lex.read::<u8>() {
                Some(byte) => byte,
                None => return _end(lex),
            };
            match byte {
                0u8..=255u8 => {
                    lex.bump_unchecked(1usize);
                    goto1_x(lex)
                }
                _ => _error(lex),
            }
        }
        goto2(lex)
    }
}

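It looks like [\u{0}-\u{10FFFF}] matches any single raw byte instead of any Unicode character: the generated goto2 accepts every byte in 0u8..=255u8 and bumps the lexer by only one byte.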
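
For comparison, here is a minimal sketch of what advancing by one whole code point (rather than one byte) could look like on a str source. It is not the code Logos generates or should generate; utf8_seq_len and first_char_span are hypothetical helpers, and the lead-byte ranges are the standard UTF-8 ones.

```rust
use std::ops::Range;

/// Hypothetical helper: length in bytes of the UTF-8 sequence that starts
/// with `lead`, or `None` if `lead` cannot start a sequence
/// (continuation bytes and bytes that never occur in UTF-8).
fn utf8_seq_len(lead: u8) -> Option<usize> {
    match lead {
        0x00..=0x7F => Some(1), // ASCII
        0xC2..=0xDF => Some(2), // two-byte sequences, e.g. Ω (0xCE 0xA9)
        0xE0..=0xEF => Some(3), // three-byte sequences
        0xF0..=0xF4 => Some(4), // four-byte sequences, up to U+10FFFF
        _ => None,
    }
}

/// Hypothetical helper: byte span of the first code point in `input`,
/// or `None` if `input` is empty.
fn first_char_span(input: &str) -> Option<Range<usize>> {
    let lead = *input.as_bytes().first()?;
    // `input` is a `str`, so its first byte is always a valid lead byte.
    let len = utf8_seq_len(lead)?;
    Some(0..len)
}
```

Under these assumptions, first_char_span("Ω") yields 0..2, which is the span the second assert in the example expects.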

@eduardosm
Author

Additionally, adding println!("{:?}", lex.slice().as_bytes()); prints [206], which means lex.slice() is returning a string that is not valid UTF-8: 206 (0xCE) is only the first byte of the two-byte encoding of Ω.
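
The invalid slice can be checked with the standard library alone, independently of Logos. This is a small sketch in which span_is_valid is a hypothetical helper, not part of any crate:

```rust
use std::ops::Range;

/// Hypothetical helper: true if `span` starts and ends on UTF-8 character
/// boundaries of `source`, i.e. slicing `source` with it yields a valid `str`.
fn span_is_valid(source: &str, span: &Range<usize>) -> bool {
    span.start <= span.end
        && source.is_char_boundary(span.start)
        && source.is_char_boundary(span.end)
}

/// True if `bytes` on their own form valid UTF-8.
fn bytes_are_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

With these, span_is_valid("Ω", &(0..1)) is false and bytes_are_utf8(&[206]) is false, while the expected span 0..2 and the full byte sequence [206, 169] both pass.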

@AshtonSnapp

This is interesting. I've never tried to use \u in my regular expressions, but then again my project is an assembler for a 16-bit processor, so ASCII is more likely to be used than Unicode. I would like to know whether I could use this to match every extended ASCII character (\x00 through \xFF instead of just \x00 through \x7F) by writing, say, \u{0} through \u{FF}.
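
One thing to keep in mind for that question: in a str source, the code points U+0080 through U+00FF are not the single bytes 0x80 through 0xFF, because UTF-8 encodes them as two bytes each. The sketch below shows this with the standard library only; utf8_bytes is a hypothetical helper, and how Logos itself treats \u{0}-\u{FF} is not something this thread confirms.

```rust
/// Hypothetical helper: the UTF-8 encoding of `c` as a byte vector.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // a code point needs at most four bytes
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

For example, utf8_bytes('\u{7F}') is the single byte 0x7F, but utf8_bytes('\u{FF}') is the two bytes 0xC3 0xBF, so a Unicode-aware match of \u{FF} would have to consume two bytes.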

@kiranshila

Just wanted to give this a bump, as this is still an issue.

@AshtonSnapp

Just had a potentially dumb thought, but would changing the regex to r"[\u{0}-\u{10FFFE}\u{10FFFF}]" circumvent this?

pfoerster added a commit to latex-lsp/texlab that referenced this issue Mar 9, 2023