-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change loop detection to work like Sublime Text #146
Conversation
This fixes issue #127. There's now 575 fewer failing assertions when running syntest. The following test files no longer have any failing assertions: * syntax_test_bash.sh * syntax_test_c#.cs * syntax_test_C#7.cs * syntax_test_GeneralStructure.cs * syntax_test_Generics.cs * syntax_test_Operators.cs * syntax_test_Using.cs See the added comment before ParseState for more details.
Here's a gist with before/after and the diff for syntest 🎉 : https://gist.github.com/robinst/df79211f81694dce8a913fd5990a3b51 |
pop_would_loop = check_pop_loop && !consuming && match match_pat.operation { | ||
MatchOperation::Pop => true, | ||
_ => false, | ||
}; | ||
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this pull request, but I think there's some potential for avoiding some unnecessary matching here. If we found a match with match_start == *start
(and it's not a looping pop), we can stop trying more patterns, as we'll not find a better match. Or am I missing something? I haven't tried it yet but I'll experiment with it and see if it makes a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah that sounds reasonable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I tried this and it improves time cargo run --release --example syntest > /dev/null
significantly:
- 5.59 real 5.30 user 0.24 sys
+ 2.21 real 1.92 user 0.24 sys
cargo bench is less clear, but I guess it depends heavily on how the syntax definition is written, e.g. for jquery.js:
jquery.js time: [653.05 ms 667.12 ms 681.12 ms]
change: [-13.724% -11.616% -9.5689%] (p = 0.00 < 0.05)
Performance has improved.
It would be very interesting to have timings per syntax test in syntest, then we could see improvements per language.
I'm gonna raise another PR with this change after this one is merged.
// advances one character and tries again, thus preventing the | ||
// loop. | ||
|
||
// println!("pop_would_loop for match {:?}, start {}", reg_match, *start); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and other printlns helped me understand why some things didn't work on the way to this solution. But I wonder if we should use something else here that's simpler to toggle, such as logging that can be enabled/disabled at compile-time (so there's no runtime overhead).
Also, it would help a lot if the context structs remembered their names (if they have one), that would make it much easier to understand where a match is coming from.. Was that considered or is the thinking that it would blow up the size of the structs for something that's only used when debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with these printlns and don't feel a huge need to switch to a logging solution, since I like being able to only turn on the individual prints I care about so as not to be drowned by noise, and that's harder with a logging solution than having an uncomment-line command.
The reason I don't have names in context structs is because the serialization of contexts to dumps is quite direct, so putting names in them would bloat dump sizes, and thus also binaries with the default dumps. It might be possible to have a cfg for names, or a low-overhead Option<Rc<String>>
that gets populated after loading for debugging or something like that though. Or a u32
context ID that gets looked up in a table or something. I dunno, I definitely agree it makes debugging hard, and I'm willing to sacrifice a little performance/space to make it happen, but not that much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, thanks for your thoughts.. Option<String>
also occurred to me, might be a nice way to do it. Regarding the context ID, I have the feeling that this might be the straightforward way to do it if we had #83.
@@ -220,24 +348,18 @@ impl ParseState { | |||
match_pat: &mut MatchPattern, | |||
captures: Option<&(Region, String)>, | |||
search_cache: &mut SearchCache, | |||
matched: &mut MatchedPatterns, | |||
regions: &mut Region) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it so expensive to create a new Region
that it's worth it to pass in an existing one to use? In case there was a match, it gets cloned later anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah I think I remember profiling this and finding it was a significant expense. That might even be in a commit message in the history somewhere.
@@ -701,6 +889,320 @@ contexts: | |||
expect_scope_stacks(&line, &expect, syntax); | |||
} | |||
|
|||
#[test] | |||
fn can_parse_issue120() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one was just moved from the bottom.
} | ||
|
||
#[test] | ||
fn can_parse_non_consuming_pop_that_would_loop() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might seem like a lot of test cases, but most of them came out of running syntest on Sublime's syntax tests, looking at a failure, finding the problem and reducing it to a minimal test case in here.
@keith-hall: I wonder if some of these would be useful to have in sublimehq/Packages somewhere, to help people understand how matching works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably it would make sense for SublimeHQ to document it at http://www.sublimetext.com/docs/3/syntax.html - or, hopefully, they will open source some documentation pages soon for us to submit PRs to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these, much better that everyone doesn't have to minimize cases out of syntax tests themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it, we could maybe post some test cases at sublimehq/Packages#757 to help people understand how the matching works, or depending on the outcome of sublimehq/Packages#1522, on the wiki pages of that repo. My understanding is that, because Sublime Text directly bundles the contents of that repository with it's releases, the test cases would be less welcome in the repository code base directly, but I could be wrong.
@keith-hall It would be cool if you could have a look to see if the changes make sense. It seems like you have the most experience when it comes to understanding how Sublime Text's matching works :). |
Looks good to me, nice work @robinst ! You even added test cases with multiple pushes - when I logged the issue reporting the differences, my biggest concern was that this could get broken when our implementation changes, but you've covered it perfectly :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this, especially all the commenting and testing.
I have one significant change I'd like made to the way state is stored first so caching doesn't break.
src/parsing/parser.rs
Outdated
// | ||
// * If there's another rule that matches at the same position and does not | ||
// result in a loop, use that instead. | ||
// * Otherwise, go to the next position and continue matching in the current |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I'm understanding this correctly, the only difference between this and pretending the looping rule didn't match, is that when you go to the next position the looping rule could match again properly.
If so might be worth adding that to the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe what I wrote "continue matching in the current context" is confusing, I meant "go through the rules in the current context again". So yes, if we get to the same "pop" at that point, it's no longer looping and we use it as normal.
I'll amend the text to make that more clear.
pop_would_loop = check_pop_loop && !consuming && match match_pat.operation { | ||
MatchOperation::Pop => true, | ||
_ => false, | ||
}; | ||
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah that sounds reasonable
// advances one character and tries again, thus preventing the | ||
// loop. | ||
|
||
// println!("pop_would_loop for match {:?}, start {}", reg_match, *start); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with these printlns and don't feel a huge need to switch to a logging solution, since I like being able to only turn on the individual prints I care about so as not to be drowned by noise, and that's harder with a logging solution than having an uncomment-line command.
The reason I don't have names in context structs is because the serialization of contexts to dumps is quite direct, so putting names in them would bloat dump sizes, and thus also binaries with the default dumps. It might be possible to have a cfg for names, or a low-overhead Option<Rc<String>>
that gets populated after loading for debugging or something like that though. Or a u32
context ID that gets looked up in a table or something. I dunno, I definitely agree it makes debugging hard, and I'm willing to sacrifice a little performance/space to make it happen, but not that much.
@@ -220,24 +348,18 @@ impl ParseState { | |||
match_pat: &mut MatchPattern, | |||
captures: Option<&(Region, String)>, | |||
search_cache: &mut SearchCache, | |||
matched: &mut MatchedPatterns, | |||
regions: &mut Region) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah I think I remember profiling this and finding it was a significant expense. That might even be in a commit message in the history somewhere.
src/parsing/parser.rs
Outdated
// See issue #101. Contains indices of frames pushed by `with_prototype`s. | ||
// Doesn't look at `with_prototype`s below top of stack. | ||
proto_starts: Vec<usize>, | ||
// The line being parsed (starting at 0) | ||
line: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this doesn't fit the parsing state model I want to preserve. I want editors to be able to cache parse states allowing things like inserting a line to not have to re-parse the rest of a file, storing the line number negates that optimization. I know at least Xi plans on using caching like this (and they might already). The ParseState
and StateLevel
s should only hold things necessary between lines.
Luckily, as far as I can tell it's not necessary anyhow, and neither is the non_consuming_push
in the state levels.
Instead what you can do is have a line-local state like matched
that's an Option<(usize, usize)>
holding a column and a state stack depth. Then just compare the current column and depth before popping that depth at that level. I spent a bit of time thinking and I think this should work just as well but is more space-efficient and it keeps the ability to cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right! I actually had that at some point while trying to figure this thing out, but it didn't work because I didn't get it quite right (and I didn't fully understand how ST worked yet). But I went back now and redid it like that, and it's even simpler now :).
} | ||
|
||
#[test] | ||
fn can_parse_non_consuming_pop_that_would_loop() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these, much better that everyone doesn't have to minimize cases out of syntax tests themselves.
This makes the code simpler and it's actually not needed to store more than one of these states (as it was before), because once we consume a character, we don't need any earlier state anymore. It also has the benefit of allowing caching of the parse state and re-parsing the same line multiple times, which would not have made sense with the line number before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback! I've addressed all the comments, and as a result the solution is now simpler too :).
src/parsing/parser.rs
Outdated
// | ||
// * If there's another rule that matches at the same position and does not | ||
// result in a loop, use that instead. | ||
// * Otherwise, go to the next position and continue matching in the current |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe what I wrote "continue matching in the current context" is confusing, I meant "go through the rules in the current context again". So yes, if we get to the same "pop" at that point, it's no longer looping and we use it as normal.
I'll amend the text to make that more clear.
src/parsing/parser.rs
Outdated
// See issue #101. Contains indices of frames pushed by `with_prototype`s. | ||
// Doesn't look at `with_prototype`s below top of stack. | ||
proto_starts: Vec<usize>, | ||
// The line being parsed (starting at 0) | ||
line: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right! I actually had that at some point while trying to figure this thing out, but it didn't work because I didn't get it quite right (and I didn't fully understand how ST worked yet). But I went back now and redid it like that, and it's even simpler now :).
// advances one character and tries again, thus preventing the | ||
// loop. | ||
|
||
// println!("pop_would_loop for match {:?}, start {}", reg_match, *start); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, thanks for your thoughts.. Option<String>
also occurred to me, might be a nice way to do it. Regarding the context ID, I have the feeling that this might be the straightforward way to do it if we had #83.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me as is, just some comments that could make it cleaner or less clean depending on your taste.
src/parsing/parser.rs
Outdated
// The match consumes some characters. So update the position | ||
// and clear state we use for checking for loops. | ||
*start = match_end; | ||
*non_consuming_push_at = None; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is strictly necessary, since any future checks against it won't succeed since it will have an earlier start position. But it's totally fine to leave it, maybe makes it easier to think about.
I also realized that you can even avoid the logic of it being an Option
because (0,0)
is a starting state that should act correctly. Unsure of whether I prefer the cleaner logic of doing that or the perhaps more conceptually nice way of using an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, all good ideas :). Done!
src/parsing/parser.rs
Outdated
// because a non-consuming "set" could also result in a loop. | ||
let context = reg_match.context.borrow(); | ||
let match_pattern = context.match_at(reg_match.pat_index); | ||
match match_pattern.operation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be an if let
src/parsing/parser.rs
Outdated
if consuming { | ||
// The match consumes some characters. So update the position | ||
// and clear state we use for checking for loops. | ||
*start = match_end; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be done unconditionally, combined with the next comment, this branch of the if is unnecessary. If you think it's easier to understand this way then I'm fine with it though.
src/parsing/parser.rs
Outdated
@@ -321,11 +317,31 @@ impl ParseState { | |||
return false; | |||
} | |||
*start += 1; | |||
*non_consuming_push_at = None; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also unnecessary, see other similar comment for more.
We don't need an option, and we don't need to set it to None, as it won't match if the position changes anyway.
All done! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for doing all this.
Cool! Would you mind creating a release with this and the other changes? |
Actually, let's first get the change in to abort early on matching, I'll raise a PR. |
This fixes issue #127. There's now 575 fewer failing assertions when
running syntest. The following test files no longer have any failing
assertions:
See the added comment before ParseState for more details.