# Add support for the new f-string tokens per PEP 701 (#6659)
## Summary

This PR adds support in the lexer for the new f-string tokens per PEP 701. The following new tokens are added:

* `FStringStart`: Token value for the start of an f-string. This includes the `f`/`F`/`fr` prefix and the opening quote(s).
* `FStringMiddle`: Token value that includes the portion of text inside the f-string that's not part of the expression part and isn't an opening or closing brace.
* `FStringEnd`: Token value for the end of an f-string. This includes the closing quote(s).

Additionally, a new `Exclamation` token is added for the conversion marker (`f"{foo!s}"`), as that's part of an expression.

## Test Plan

New test cases are added for the various possibilities using snapshot testing. The output has been verified against python/cpython@f2cc00527e.

## Benchmarks

_The number of f-strings in each file is given in parentheses after the file name._

```
lexer/large/dataset.py (1)     1.05   612.6±91.60µs    66.4 MB/sec    1.00   584.7±33.72µs    69.6 MB/sec
lexer/numpy/ctypeslib.py (0)   1.01   131.8±3.31µs    126.3 MB/sec    1.00   130.9±5.37µs    127.2 MB/sec
lexer/numpy/globals.py (1)     1.02    13.2±0.43µs    222.7 MB/sec    1.00    13.0±0.41µs    226.8 MB/sec
lexer/pydantic/types.py (8)    1.13   285.0±11.72µs    89.5 MB/sec    1.00   252.9±10.13µs   100.8 MB/sec
lexer/unicode/pypinyin.py (0)  1.03    32.9±1.92µs    127.5 MB/sec    1.00    31.8±1.25µs    132.0 MB/sec
```

It seems that overall the lexer has regressed. I profiled every file mentioned above and found one improvement, which is done in 098ee5d, but otherwise I don't see anything else. A few notes from isolating the f-string part in the profile:

* As we're adding new tokens and the functionality to emit them, I expect the lexer to take more time because of the additional code.
* `lex_fstring_middle_or_end` takes the most time, followed by the `current_mut` line when lexing the `:` token. The latter checks whether we're at the start of a format spec or not.
* In an f-string-heavy file such as https://github.com/python/cpython/blob/main/Lib/test/test_fstring.py[^1] (293 f-strings), most of the time in `lex_fstring_middle_or_end` is accounted for by string allocation for the string literal part of the `FStringMiddle` token (https://share.firefox.dev/3ErEa1W).

I don't see anything out of the ordinary in the `pydantic/types` profile (https://share.firefox.dev/45XcLRq).

fixes: #7042

[^1]: We could add this to the lexer and parser benchmarks.
1 parent 04183b0 · commit 9820c04 · Showing 24 changed files with 2,317 additions and 11 deletions.
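Before the file diffs, here is a hedged sketch (not part of this commit) of the token stream the new lexer is expected to produce for a simple f-string, written in the style of the snapshot tests added below. The test name and exact spans are illustrative; `lex_source` and the `insta` snapshot macro follow the helpers referenced in the new snapshot files.

```rust
// Hedged sketch, not from the commit: expected tokens for a simple f-string
// under PEP 701, mirroring the snapshot format used by the new tests.
#[test]
fn simple_fstring_tokens_sketch() {
    // f"hello {name}" is expected to lex roughly as:
    //   FStringStart   `f"`        0..2
    //   FStringMiddle  "hello "    2..8   (is_raw: false)
    //   Lbrace         `{`         8..9
    //   Name           `name`      9..13
    //   Rbrace         `}`         13..14
    //   FStringEnd     `"`         14..15
    let source = r#"f"hello {name}""#;
    insta::assert_debug_snapshot!(lex_source(source));
}
```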
@@ -0,0 +1,158 @@
```rust
use bitflags::bitflags;

use ruff_text_size::TextSize;

bitflags! {
    #[derive(Debug)]
    pub(crate) struct FStringContextFlags: u8 {
        /// The current f-string is a triple-quoted f-string i.e., the number of
        /// opening quotes is 3. If this flag is not set, the number of opening
        /// quotes is 1.
        const TRIPLE = 1 << 0;

        /// The current f-string is a double-quoted f-string. If this flag is not
        /// set, the current f-string is a single-quoted f-string.
        const DOUBLE = 1 << 1;

        /// The current f-string is a raw f-string i.e., prefixed with `r`/`R`.
        /// If this flag is not set, the current f-string is a normal f-string.
        const RAW = 1 << 2;
    }
}

/// The context representing the current f-string that the lexer is in.
#[derive(Debug)]
pub(crate) struct FStringContext {
    flags: FStringContextFlags,

    /// The level of nesting for the lexer when it entered the current f-string.
    /// The nesting level includes all kinds of parentheses i.e., round, square,
    /// and curly.
    nesting: u32,

    /// The current depth of format spec for the current f-string. This is because
    /// there can be multiple format specs nested for the same f-string.
    /// For example, `{a:{b:{c}}}` has 3 format specs.
    format_spec_depth: u32,
}

impl FStringContext {
    pub(crate) const fn new(flags: FStringContextFlags, nesting: u32) -> Self {
        Self {
            flags,
            nesting,
            format_spec_depth: 0,
        }
    }

    pub(crate) const fn nesting(&self) -> u32 {
        self.nesting
    }

    /// Returns the quote character for the current f-string.
    pub(crate) const fn quote_char(&self) -> char {
        if self.flags.contains(FStringContextFlags::DOUBLE) {
            '"'
        } else {
            '\''
        }
    }

    /// Returns the number of quotes for the current f-string.
    pub(crate) const fn quote_size(&self) -> TextSize {
        if self.is_triple_quoted() {
            TextSize::new(3)
        } else {
            TextSize::new(1)
        }
    }

    /// Returns the triple quotes for the current f-string if it is a triple-quoted
    /// f-string, `None` otherwise.
    pub(crate) const fn triple_quotes(&self) -> Option<&'static str> {
        if self.is_triple_quoted() {
            if self.flags.contains(FStringContextFlags::DOUBLE) {
                Some(r#"""""#)
            } else {
                Some("'''")
            }
        } else {
            None
        }
    }

    /// Returns `true` if the current f-string is a raw f-string.
    pub(crate) const fn is_raw_string(&self) -> bool {
        self.flags.contains(FStringContextFlags::RAW)
    }

    /// Returns `true` if the current f-string is a triple-quoted f-string.
    pub(crate) const fn is_triple_quoted(&self) -> bool {
        self.flags.contains(FStringContextFlags::TRIPLE)
    }

    /// Calculates the number of open parentheses for the current f-string
    /// based on the current level of nesting for the lexer.
    const fn open_parentheses_count(&self, current_nesting: u32) -> u32 {
        current_nesting.saturating_sub(self.nesting)
    }

    /// Returns `true` if the lexer is in a f-string expression i.e., between
    /// two curly braces.
    pub(crate) const fn is_in_expression(&self, current_nesting: u32) -> bool {
        self.open_parentheses_count(current_nesting) > self.format_spec_depth
    }

    /// Returns `true` if the lexer is in a f-string format spec i.e., after a colon.
    pub(crate) const fn is_in_format_spec(&self, current_nesting: u32) -> bool {
        self.format_spec_depth > 0 && !self.is_in_expression(current_nesting)
    }

    /// Returns `true` if the context is in a valid position to start format spec
    /// i.e., at the same level of nesting as the opening parentheses token.
    /// Increments the format spec depth if it is.
    ///
    /// This assumes that the current character for the lexer is a colon (`:`).
    pub(crate) fn try_start_format_spec(&mut self, current_nesting: u32) -> bool {
        if self
            .open_parentheses_count(current_nesting)
            .saturating_sub(self.format_spec_depth)
            == 1
        {
            self.format_spec_depth += 1;
            true
        } else {
            false
        }
    }

    /// Decrements the format spec depth unconditionally.
    pub(crate) fn end_format_spec(&mut self) {
        self.format_spec_depth = self.format_spec_depth.saturating_sub(1);
    }
}

/// The f-strings stack is used to keep track of all the f-strings that the
/// lexer encounters. This is necessary because f-strings can be nested.
#[derive(Debug, Default)]
pub(crate) struct FStrings {
    stack: Vec<FStringContext>,
}

impl FStrings {
    pub(crate) fn push(&mut self, context: FStringContext) {
        self.stack.push(context);
    }

    pub(crate) fn pop(&mut self) -> Option<FStringContext> {
        self.stack.pop()
    }

    pub(crate) fn current(&self) -> Option<&FStringContext> {
        self.stack.last()
    }

    pub(crate) fn current_mut(&mut self) -> Option<&mut FStringContext> {
        self.stack.last_mut()
    }
}
```
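To make the nesting and format-spec bookkeeping above concrete, here is a minimal sketch (not part of the commit) of how the context is intended to behave while lexing `f"{a:{b}}"`. Since the type is crate-private, such a check would live next to it, e.g. as a unit test; the test name is hypothetical.

```rust
// Hedged illustration, not from the commit. Nesting counts all kinds of
// parentheses, measured relative to the level at which the f-string started.
#[test]
fn format_spec_bookkeeping_sketch() {
    // The lexer entered the f-string `f"{a:{b}}"` at nesting level 0.
    let mut ctx = FStringContext::new(FStringContextFlags::DOUBLE, 0);

    // After the outer `{`, the lexer's nesting level is 1: expression part.
    assert!(ctx.is_in_expression(1));

    // At the `:`, the open-parentheses count (1) minus the spec depth (0) is 1,
    // so a format spec starts and the depth becomes 1.
    assert!(ctx.try_start_format_spec(1));
    assert!(ctx.is_in_format_spec(1));

    // The nested `{` of `{b}` raises the nesting level to 2: expression again.
    assert!(ctx.is_in_expression(2));

    // Its matching `}` drops the level back to 1: format spec again.
    assert!(ctx.is_in_format_spec(1));

    // The outer `}` ends the replacement field; the spec is closed explicitly.
    ctx.end_format_spec();
    assert!(!ctx.is_in_format_spec(0));
}
```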
...es/ruff_python_parser/src/snapshots/ruff_python_parser__lexer__tests__empty_fstrings.snap (66 additions, 0 deletions)
@@ -0,0 +1,66 @@
```
---
source: crates/ruff_python_parser/src/lexer.rs
expression: lex_source(source)
---
[
    (
        FStringStart,
        0..2,
    ),
    (
        FStringEnd,
        2..3,
    ),
    (
        String {
            value: "",
            kind: String,
            triple_quoted: false,
        },
        4..6,
    ),
    (
        FStringStart,
        7..9,
    ),
    (
        FStringEnd,
        9..10,
    ),
    (
        FStringStart,
        11..13,
    ),
    (
        FStringEnd,
        13..14,
    ),
    (
        String {
            value: "",
            kind: String,
            triple_quoted: false,
        },
        15..17,
    ),
    (
        FStringStart,
        18..22,
    ),
    (
        FStringEnd,
        22..25,
    ),
    (
        FStringStart,
        26..30,
    ),
    (
        FStringEnd,
        30..33,
    ),
    (
        Newline,
        33..33,
    ),
]
```
crates/ruff_python_parser/src/snapshots/ruff_python_parser__lexer__tests__fstring.snap (88 additions, 0 deletions)
@@ -0,0 +1,88 @@
```
---
source: crates/ruff_python_parser/src/lexer.rs
expression: lex_source(source)
---
[
    (
        FStringStart,
        0..2,
    ),
    (
        FStringMiddle {
            value: "normal ",
            is_raw: false,
        },
        2..9,
    ),
    (
        Lbrace,
        9..10,
    ),
    (
        Name {
            name: "foo",
        },
        10..13,
    ),
    (
        Rbrace,
        13..14,
    ),
    (
        FStringMiddle {
            value: " {another} ",
            is_raw: false,
        },
        14..27,
    ),
    (
        Lbrace,
        27..28,
    ),
    (
        Name {
            name: "bar",
        },
        28..31,
    ),
    (
        Rbrace,
        31..32,
    ),
    (
        FStringMiddle {
            value: " {",
            is_raw: false,
        },
        32..35,
    ),
    (
        Lbrace,
        35..36,
    ),
    (
        Name {
            name: "three",
        },
        36..41,
    ),
    (
        Rbrace,
        41..42,
    ),
    (
        FStringMiddle {
            value: "}",
            is_raw: false,
        },
        42..44,
    ),
    (
        FStringEnd,
        44..45,
    ),
    (
        Newline,
        45..45,
    ),
]
```