-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parser: remove Regexes from whitespace parser #1008
Conversation
cc @orf @MichaReiser as recent relevant contributors |
Codecov ReportPatch and project coverage have no change.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #1008 +/- ##
=======================================
Coverage 91.04% 91.04%
=======================================
Files 255 255
Lines 26366 26366
=======================================
Hits 24004 24004
Misses 2362 2362 ☔ View full report in Codecov by Sentry. |
newline_str.chars().count(), | ||
newline_str.len(), | ||
)?; | ||
let len = match newline_after.as_bytes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was interested in the performance characteristics of this vs other ways of checking a string prefix. With this benchmark script,
match &thing.get(..2) {
Some("\n") => 1,
Some("\r\n") => 2,
Some("\r") => 1,
_ => 0,
}
is a bit faster in all cases: 914
picoseconds vs 680
picoseconds for the \r\n
case. Not that it makes much difference, but it is interesting because I would assume they optimize to the same code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, that's not exactly equivalent, it would return None
on single digit characters. Ignore me.
This looks good, but I was just thinking about something. It might be a dumb idea, but wouldn't it be more efficient to work out the byte indexes of whitespaces in one pass and re-use that? let regex = regex::bytes::Regex::new(r#"(([ \t\x0c]|(\r\n?|\n))+)"#,).unwrap();
let whitespace_indexes: Vec<Range<usize>> = regex.find_iter(bytes).map(|r| r.range()).collect(); This gives us a vec of the longest whitespace runs in the input in a single pass, which has got to be more efficient than continually advancing step by step? There are better datastructures than a vec but if we have this and a byte offset we know if we're in a range of whitespace and what the end position for that whitespace range is? I dunno, something like: let line_ws = config.ws_by_line[state.line];
for range in line_ws {
if range.contains(state.byte_offset) {
let ws = &config.input[state.byte_offset..range.end]
return Ok(SimpleWhitespace(ws))
}
} |
It depends on how frequently this is called, but searching a string is probably more efficient than running a regex, allocating and writing all the positions, and then searching for it (even if it is a binary search). I'll review the code changes tomorrow or on Monday |
let line = config.get_line_after_column(line, col)?; | ||
let bytes = line.as_bytes(); | ||
let mut idx = 0; | ||
while idx < bytes.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An abstraction that worked really well for Ruff (and Rust) is to use the following Cursor
abstraction to lex strings. It avoids the need for these manual match
statements.
removing Regexes from whitespace parser allows ditching of thread local storage + lazy initialization cost This shows a modest 2% improvement in overall parse time (inflate is improved by 10%)
parser: remove Regexes from whitespace parser
removing Regexes from whitespace parser allows ditching of thread local storage + lazy initialization cost
This shows a modest 2% improvement in overall parse time (inflate is improved by 10%)