Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change loop detection to work like Sublime Text #146

Merged
merged 4 commits into from
Apr 27, 2018

Conversation

robinst
Copy link
Collaborator

@robinst robinst commented Apr 24, 2018

This fixes issue #127. There's now 575 fewer failing assertions when
running syntest. The following test files no longer have any failing
assertions:

  • syntax_test_bash.sh
  • syntax_test_c#.cs
  • syntax_test_C#7.cs
  • syntax_test_GeneralStructure.cs
  • syntax_test_Generics.cs
  • syntax_test_Operators.cs
  • syntax_test_Using.cs

See the added comment before ParseState for more details.

This fixes issue #127. There's now 575 fewer failing assertions when
running syntest. The following test files no longer have any failing
assertions:

* syntax_test_bash.sh
* syntax_test_c#.cs
* syntax_test_C#7.cs
* syntax_test_GeneralStructure.cs
* syntax_test_Generics.cs
* syntax_test_Operators.cs
* syntax_test_Using.cs

See the added comment before ParseState for more details.
@robinst
Copy link
Collaborator Author

robinst commented Apr 24, 2018

Here's a gist with before/after and the diff for syntest 🎉 : https://gist.github.com/robinst/df79211f81694dce8a913fd5990a3b51

pop_would_loop = check_pop_loop && !consuming && match match_pat.operation {
MatchOperation::Pop => true,
_ => false,
};
}
}
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this pull request, but I think there's some potential for avoiding some unnecessary matching here. If we found a match with match_start == *start (and it's not a looping pop), we can stop trying more patterns, as we'll not find a better match. Or am I missing something? I haven't tried it yet but I'll experiment with it and see if it makes a difference.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah that sounds reasonable

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I tried this and it improves time cargo run --release --example syntest > /dev/null significantly:

-        5.59 real         5.30 user         0.24 sys
+        2.21 real         1.92 user         0.24 sys

cargo bench is less clear, but I guess it depends heavily on how the syntax definition is written, e.g. for jquery.js:

jquery.js               time:   [653.05 ms 667.12 ms 681.12 ms]
                        change: [-13.724% -11.616% -9.5689%] (p = 0.00 < 0.05)
                        Performance has improved.

It would be very interesting to have timings per syntax test in syntest, then we could see improvements per language.

I'm gonna raise another PR with this change after this one is merged.

// advances one character and tries again, thus preventing the
// loop.

// println!("pop_would_loop for match {:?}, start {}", reg_match, *start);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and other printlns helped me understand why some things didn't work on the way to this solution. But I wonder if we should use something else here that's simpler to toggle, such as logging that can be enabled/disabled at compile-time (so there's no runtime overhead).

Also, it would help a lot if the context structs remembered their names (if they have one), that would make it much easier to understand where a match is coming from.. Was that considered or is the thinking that it would blow up the size of the structs for something that's only used when debugging?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with these printlns and don't feel a huge need to switch to a logging solution, since I like being able to only turn on the individual prints I care about so as not to be drowned by noise, and that's harder with a logging solution than having an uncomment-line command.

The reason I don't have names in context structs is because the serialization of contexts to dumps is quite direct, so putting names in them would bloat dump sizes, and thus also binaries with the default dumps. It might be possible to have a cfg for names, or a low-overhead Option<Rc<String>> that gets populated after loading for debugging or something like that though. Or a u32 context ID that gets looked up in a table or something. I dunno, I definitely agree it makes debugging hard, and I'm willing to sacrifice a little performance/space to make it happen, but not that much.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, thanks for your thoughts.. Option<String> also occurred to me, might be a nice way to do it. Regarding the context ID, I have the feeling that this might be the straightforward way to do it if we had #83.

@@ -220,24 +348,18 @@ impl ParseState {
match_pat: &mut MatchPattern,
captures: Option<&(Region, String)>,
search_cache: &mut SearchCache,
matched: &mut MatchedPatterns,
regions: &mut Region)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it so expensive to create a new Region that it's worth it to pass in an existing one to use? In case there was a match, it gets cloned later anyway.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah I think I remember profiling this and finding it was a significant expense. That might even be in a commit message in the history somewhere.

@@ -701,6 +889,320 @@ contexts:
expect_scope_stacks(&line, &expect, syntax);
}

#[test]
fn can_parse_issue120() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one was just moved from the bottom.

}

#[test]
fn can_parse_non_consuming_pop_that_would_loop() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might seem like a lot of test cases, but most of them came out of running syntest on Sublime's syntax tests, looking at a failure, finding the problem and reducing it to a minimal test case in here.

@keith-hall: I wonder if some of these would be useful to have in sublimehq/Packages somewhere, to help people understand how matching works?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably it would make sense for SublimeHQ to document it at http://www.sublimetext.com/docs/3/syntax.html - or, hopefully, they will open source some documentation pages soon for us to submit PRs to.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these, much better that everyone doesn't have to minimize cases out of syntax tests themselves.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, we could maybe post some test cases at sublimehq/Packages#757 to help people understand how the matching works, or depending on the outcome of sublimehq/Packages#1522, on the wiki pages of that repo. My understanding is that, because Sublime Text directly bundles the contents of that repository with it's releases, the test cases would be less welcome in the repository code base directly, but I could be wrong.

@robinst
Copy link
Collaborator Author

robinst commented Apr 24, 2018

@keith-hall It would be cool if you could have a look to see if the changes make sense. It seems like you have the most experience when it comes to understanding how Sublime Text's matching works :).

@keith-hall
Copy link
Collaborator

Looks good to me, nice work @robinst ! You even added test cases with multiple pushes - when I logged the issue reporting the differences, my biggest concern was that this could get broken when our implementation changes, but you've covered it perfectly :)

Copy link
Owner

@trishume trishume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, especially all the commenting and testing.

I have one significant change I'd like made to the way state is stored first so caching doesn't break.

//
// * If there's another rule that matches at the same position and does not
// result in a loop, use that instead.
// * Otherwise, go to the next position and continue matching in the current
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I'm understanding this correctly, the only difference between this and pretending the looping rule didn't match, is that when you go to the next position the looping rule could match again properly.

If so might be worth adding that to the comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe what I wrote "continue matching in the current context" is confusing, I meant "go through the rules in the current context again". So yes, if we get to the same "pop" at that point, it's no longer looping and we use it as normal.

I'll amend the text to make that more clear.

pop_would_loop = check_pop_loop && !consuming && match match_pat.operation {
MatchOperation::Pop => true,
_ => false,
};
}
}
}
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah that sounds reasonable

// advances one character and tries again, thus preventing the
// loop.

// println!("pop_would_loop for match {:?}, start {}", reg_match, *start);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with these printlns and don't feel a huge need to switch to a logging solution, since I like being able to only turn on the individual prints I care about so as not to be drowned by noise, and that's harder with a logging solution than having an uncomment-line command.

The reason I don't have names in context structs is because the serialization of contexts to dumps is quite direct, so putting names in them would bloat dump sizes, and thus also binaries with the default dumps. It might be possible to have a cfg for names, or a low-overhead Option<Rc<String>> that gets populated after loading for debugging or something like that though. Or a u32 context ID that gets looked up in a table or something. I dunno, I definitely agree it makes debugging hard, and I'm willing to sacrifice a little performance/space to make it happen, but not that much.

@@ -220,24 +348,18 @@ impl ParseState {
match_pat: &mut MatchPattern,
captures: Option<&(Region, String)>,
search_cache: &mut SearchCache,
matched: &mut MatchedPatterns,
regions: &mut Region)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah I think I remember profiling this and finding it was a significant expense. That might even be in a commit message in the history somewhere.

// See issue #101. Contains indices of frames pushed by `with_prototype`s.
// Doesn't look at `with_prototype`s below top of stack.
proto_starts: Vec<usize>,
// The line being parsed (starting at 0)
line: usize,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this doesn't fit the parsing state model I want to preserve. I want editors to be able to cache parse states allowing things like inserting a line to not have to re-parse the rest of a file, storing the line number negates that optimization. I know at least Xi plans on using caching like this (and they might already). The ParseState and StateLevels should only hold things necessary between lines.

Luckily, as far as I can tell it's not necessary anyhow, and neither is the non_consuming_push in the state levels.

Instead what you can do is have a line-local state like matched that's an Option<(usize, usize)> holding a column and a state stack depth. Then just compare the current column and depth before popping that depth at that level. I spent a bit of time thinking and I think this should work just as well but is more space-efficient and it keeps the ability to cache.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! I actually had that at some point while trying to figure this thing out, but it didn't work because I didn't get it quite right (and I didn't fully understand how ST worked yet). But I went back now and redid it like that, and it's even simpler now :).

}

#[test]
fn can_parse_non_consuming_pop_that_would_loop() {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these, much better that everyone doesn't have to minimize cases out of syntax tests themselves.

This makes the code simpler and it's actually not needed to store more
than one of these states (as it was before), because once we consume a
character, we don't need any earlier state anymore.

It also has the benefit of allowing caching of the parse state and
re-parsing the same line multiple times, which would not have made sense
with the line number before.
Copy link
Collaborator Author

@robinst robinst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! I've addressed all the comments, and as a result the solution is now simpler too :).

//
// * If there's another rule that matches at the same position and does not
// result in a loop, use that instead.
// * Otherwise, go to the next position and continue matching in the current
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe what I wrote "continue matching in the current context" is confusing, I meant "go through the rules in the current context again". So yes, if we get to the same "pop" at that point, it's no longer looping and we use it as normal.

I'll amend the text to make that more clear.

// See issue #101. Contains indices of frames pushed by `with_prototype`s.
// Doesn't look at `with_prototype`s below top of stack.
proto_starts: Vec<usize>,
// The line being parsed (starting at 0)
line: usize,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! I actually had that at some point while trying to figure this thing out, but it didn't work because I didn't get it quite right (and I didn't fully understand how ST worked yet). But I went back now and redid it like that, and it's even simpler now :).

// advances one character and tries again, thus preventing the
// loop.

// println!("pop_would_loop for match {:?}, start {}", reg_match, *start);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, thanks for your thoughts.. Option<String> also occurred to me, might be a nice way to do it. Regarding the context ID, I have the feeling that this might be the straightforward way to do it if we had #83.

Copy link
Owner

@trishume trishume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me as is, just some comments that could make it cleaner or less clean depending on your taste.

// The match consumes some characters. So update the position
// and clear state we use for checking for loops.
*start = match_end;
*non_consuming_push_at = None;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is strictly necessary, since any future checks against it won't succeed since it will have an earlier start position. But it's totally fine to leave it, maybe makes it easier to think about.

I also realized that you can even avoid the logic of it being an Option because (0,0) is a starting state that should act correctly. Unsure of whether I prefer the cleaner logic of doing that or the perhaps more conceptually nice way of using an option.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, all good ideas :). Done!

// because a non-consuming "set" could also result in a loop.
let context = reg_match.context.borrow();
let match_pattern = context.match_at(reg_match.pat_index);
match match_pattern.operation {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be an if let

if consuming {
// The match consumes some characters. So update the position
// and clear state we use for checking for loops.
*start = match_end;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be done unconditionally, combined with the next comment, this branch of the if is unnecessary. If you think it's easier to understand this way then I'm fine with it though.

@@ -321,11 +317,31 @@ impl ParseState {
return false;
}
*start += 1;
*non_consuming_push_at = None;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also unnecessary, see other similar comment for more.

We don't need an option, and we don't need to set it to None, as it
won't match if the position changes anyway.
@robinst
Copy link
Collaborator Author

robinst commented Apr 26, 2018

All done!

Copy link
Owner

@trishume trishume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for doing all this.

@trishume trishume merged commit 128666e into master Apr 27, 2018
@trishume trishume deleted the issue-127-make-loop-detection-follow-st branch April 27, 2018 16:34
@robinst
Copy link
Collaborator Author

robinst commented Apr 28, 2018

Cool! Would you mind creating a release with this and the other changes?

@robinst
Copy link
Collaborator Author

robinst commented Apr 28, 2018

Actually, let's first get the change in to abort early on matching, I'll raise a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants