[\u{0}-\u{10FFFF}] matches any byte instead of any unicode character #202

eduardosm · 2021-02-17T21:39:04Z

Example:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[error]
    Error,
    
    #[regex(r"[\u{0}-\u{10FFFF}]")]
    AnyChar,
}

fn main() {
    let mut lex = Token::lexer("Ω");

    assert_eq!(lex.next(), Some(Token::AnyChar));
    assert_eq!(lex.span(), 0..2); // length of Ω in utf8
    assert_eq!(lex.slice(), "Ω");
}

The second assert fails:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `0..1`,
 right: `0..2`', src/main.rs:16:5

The derive(Logos) is expanded as:

impl<'s> ::logos::Logos<'s> for Token {
    type Extras = ();
    type Source = str;
    const ERROR: Self = Token::Error;
    fn lex(lex: &mut ::logos::Lexer<'s, Self>) {
        use ::logos::internal::{LexerInternal, CallbackResult};
        type Lexer<'s> = ::logos::Lexer<'s, Token>;
        fn _end<'s>(lex: &mut Lexer<'s>) {
            lex.end()
        }
        fn _error<'s>(lex: &mut Lexer<'s>) {
            lex.bump_unchecked(1);
            lex.error();
        }
        #[inline]
        fn goto1_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Token::AnyChar);
        }
        #[inline]
        fn goto2<'s>(lex: &mut Lexer<'s>) {
            let byte = match lex.read::<u8>() {
                Some(byte) => byte,
                None => return _end(lex),
            };
            match byte {
                0u8..=255u8 => {
                    lex.bump_unchecked(1usize);
                    goto1_x(lex)
                }
                _ => _error(lex),
            }
        }
        goto2(lex)
    }
}

It looks like [\u{0}-\u{10FFFF}] matches any raw byte instead of any Unicode character: the generated match arm 0u8..=255u8 accepts every byte and bumps the lexer by exactly one byte.
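For comparison, here is a minimal sketch of what a character-aware match would have to do: look at the lead byte and advance by the length of the whole UTF-8 sequence rather than by 1. The names utf8_len and first_char_span are hypothetical helpers written for this illustration; they are not part of logos and do not reflect how its code generator is implemented.

```rust
use std::ops::Range;

/// Hypothetical helper (not a logos API): number of bytes in the UTF-8
/// sequence that starts with `lead`, or None if `lead` cannot start one.
fn utf8_len(lead: u8) -> Option<usize> {
    match lead {
        0x00..=0x7F => Some(1), // ASCII
        0xC2..=0xDF => Some(2), // e.g. Ω (0xCE 0xA9)
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None, // continuation bytes and bytes that never appear in UTF-8
    }
}

/// Hypothetical helper: byte span of the first character in `source`.
/// Because `source` is a &str, its first byte is always a valid lead byte.
fn first_char_span(source: &str) -> Option<Range<usize>> {
    let lead = *source.as_bytes().first()?;
    utf8_len(lead).map(|len| 0..len)
}
```

With this, first_char_span("Ω") yields the 0..2 span that the example above expects, whereas the generated code always advances by a single byte.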

eduardosm · 2021-02-17T21:41:22Z

Additionally, adding println!("{:?}", lex.slice().as_bytes()); prints [206], which is only the first byte (0xCE) of the two-byte UTF-8 encoding of Ω. This means slice() is returning an invalid UTF-8 string.
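The same observation can be reproduced with the standard library alone. This is a sketch, assuming the lexer's span is 0..1 as in the failing assert; is_char_aligned and lone_lead_byte_is_invalid are hypothetical names for this illustration, not logos functions.

```rust
use std::ops::Range;

/// Hypothetical check (not a logos API): true if `span` starts and ends on
/// UTF-8 character boundaries of `source`, so slicing it yields a valid &str.
fn is_char_aligned(source: &str, span: Range<usize>) -> bool {
    // is_char_boundary returns false for indices past the end of the string.
    source.is_char_boundary(span.start) && source.is_char_boundary(span.end)
}

/// True if the first byte of "Ω" (206, i.e. 0xCE) fails to decode on its own.
fn lone_lead_byte_is_invalid() -> bool {
    let bytes = "Ω".as_bytes(); // the two bytes 0xCE 0xA9
    std::str::from_utf8(&bytes[..1]).is_err()
}
```

Here is_char_aligned("Ω", 0..1) is false and is_char_aligned("Ω", 0..2) is true, which matches the span mismatch reported above.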

AshtonSnapp · 2021-10-29T02:20:00Z

This is interesting. I've never tried to use \u in my regular expressions, but then again my project is an assembler for a 16-bit processor, so ASCII is more likely to be used than Unicode. I'd like to know whether I could use this to match every extended ASCII character (i.e. \x00 through \xFF instead of just \x00 through \x7F) by writing, say, \u{0} through \u{FF}.
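One thing worth keeping in mind for this question: in a UTF-8 source, the code points \u{80} through \u{FF} are not stored as the single bytes 0x80 through 0xFF. The sketch below uses only the standard library to show the encoding; utf8_bytes is a hypothetical helper for this illustration, and it says nothing about how logos itself treats such a range.

```rust
/// Hypothetical helper: the UTF-8 encoding of `c` as a byte vector.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// True if every code point in U+0080..=U+00FF needs two bytes in UTF-8.
fn latin1_upper_half_is_two_bytes() -> bool {
    ('\u{80}'..='\u{FF}').all(|c| c.len_utf8() == 2)
}
```

For example, utf8_bytes('\u{7F}') is the single byte 0x7F, while utf8_bytes('\u{FF}') is the pair 0xC3 0xBF. Matching raw bytes 0x80 through 0xFF and matching the characters \u{80} through \u{FF} are therefore different things when the input is a str.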

kiranshila · 2022-01-25T03:29:32Z

Just wanted to give this a bump, as this is still an issue.

AshtonSnapp · 2022-02-20T01:06:10Z

Just had a potentially dumb thought, but would changing the regex to r"[\u{0}-\u{10FFFE}\u{10FFFF}]" circumvent this?

pfoerster added a commit to latex-lsp/texlab that referenced this issue Mar 9, 2023

Fix lexing commands with multi-byte characters

51f7179

Add a workaround for maciejhirsz/logos#202. See #857.
