Change loop detection to work like Sublime Text #146

robinst · 2018-04-24T07:02:55Z

This fixes issue #127. There are now 575 fewer failing assertions when
running syntest. The following test files no longer have any failing
assertions:

syntax_test_bash.sh
syntax_test_c#.cs
syntax_test_C#7.cs
syntax_test_GeneralStructure.cs
syntax_test_Generics.cs
syntax_test_Operators.cs
syntax_test_Using.cs

See the added comment before ParseState for more details.


robinst · 2018-04-24T07:05:51Z

Here's a gist with before/after and the diff for syntest 🎉 : https://gist.github.com/robinst/df79211f81694dce8a913fd5990a3b51

robinst · 2018-04-24T12:25:46Z

src/parsing/parser.rs

+                            pop_would_loop = check_pop_loop && !consuming && match match_pat.operation {
+                                MatchOperation::Pop => true,
+                                _ => false,
+                            };
                        }
                    }
                }
            }


Unrelated to this pull request, but I think there's some potential for avoiding some unnecessary matching here. If we found a match with match_start == *start (and it's not a looping pop), we can stop trying more patterns, as we'll not find a better match. Or am I missing something? I haven't tried it yet but I'll experiment with it and see if it makes a difference.

Yah that sounds reasonable

So I tried this and it improves time cargo run --release --example syntest > /dev/null significantly:

-        5.59 real         5.30 user         0.24 sys
+        2.21 real         1.92 user         0.24 sys

cargo bench is less clear, but I guess it depends heavily on how the syntax definition is written, e.g. for jquery.js:

jquery.js   time:   [653.05 ms 667.12 ms 681.12 ms]
            change: [-13.724% -11.616% -9.5689%] (p = 0.00 < 0.05)
            Performance has improved.

It would be very interesting to have timings per syntax test in syntest, then we could see improvements per language.

I'm gonna raise another PR with this change after this one is merged.
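
The early-exit idea from this thread can be sketched roughly as below. This is a simplified illustration, not the parser's actual code: `Candidate`, `find_best_match`, and the `try_pattern` closure (standing in for one regex search from `start`) are hypothetical names. It assumes the earliest match wins, that an earlier pattern wins ties, and that a looping pop may be replaced by a later rule matching at the same position.

```rust
/// One pattern's match result (hypothetical, simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Candidate {
    pat_index: usize,
    match_start: usize,
    /// True if this is a non-consuming pop that would cause a loop.
    would_loop: bool,
}

/// Returns the best candidate plus the number of patterns actually tried.
fn find_best_match<F>(
    num_patterns: usize,
    start: usize,
    mut try_pattern: F,
) -> (Option<Candidate>, usize)
where
    F: FnMut(usize) -> Option<Candidate>,
{
    let mut best: Option<Candidate> = None;
    let mut tried = 0;
    for pat_index in 0..num_patterns {
        tried += 1;
        if let Some(c) = try_pattern(pat_index) {
            // Better if it starts earlier, or if the current best is a
            // looping pop at the same position.
            let better = best.map_or(true, |b| {
                c.match_start < b.match_start
                    || (c.match_start == b.match_start && b.would_loop)
            });
            if better {
                best = Some(c);
                // Nothing can start before `start`, so a non-looping match
                // there cannot be beaten: stop searching.
                if c.match_start == start && !c.would_loop {
                    break;
                }
            }
        }
    }
    (best, tried)
}
```

How much this saves presumably depends on how often an early rule in a context matches right at the current position, which would fit the observation that the gain varies by syntax definition.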

robinst · 2018-04-24T12:27:48Z

src/parsing/parser.rs

+                // advances one character and tries again, thus preventing the
+                // loop.
+
+                // println!("pop_would_loop for match {:?}, start {}", reg_match, *start);


This and other printlns helped me understand why some things didn't work on the way to this solution. But I wonder if we should use something else here that's simpler to toggle, such as logging that can be enabled/disabled at compile-time (so there's no runtime overhead).

Also, it would help a lot if the context structs remembered their names (if they have one), that would make it much easier to understand where a match is coming from.. Was that considered or is the thinking that it would blow up the size of the structs for something that's only used when debugging?

I'm okay with these printlns and don't feel a huge need to switch to a logging solution, since I like being able to only turn on the individual prints I care about so as not to be drowned by noise, and that's harder with a logging solution than having an uncomment-line command.

The reason I don't have names in context structs is because the serialization of contexts to dumps is quite direct, so putting names in them would bloat dump sizes, and thus also binaries with the default dumps. It might be possible to have a cfg for names, or a low-overhead Option<Rc<String>> that gets populated after loading for debugging or something like that though. Or a u32 context ID that gets looked up in a table or something. I dunno, I definitely agree it makes debugging hard, and I'm willing to sacrifice a little performance/space to make it happen, but not that much.

Yeah, thanks for your thoughts.. Option<String> also occurred to me, might be a nice way to do it. Regarding the context ID, I have the feeling that this might be the straightforward way to do it if we had #83.
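
For illustration, the "context ID that gets looked up in a table" variant might look something like the sketch below. Everything here is hypothetical (`Context`, `ContextNames`, `register`, `describe`); the point is only that the serialized struct carries a small integer while the names live in a side table that is used for debugging.

```rust
use std::collections::HashMap;

/// Stand-in for a serialized context: only a compact id, no name.
#[derive(Debug)]
struct Context {
    id: u32,
}

/// Side table with context names, kept separate from the contexts themselves.
#[derive(Debug, Default)]
struct ContextNames {
    names: HashMap<u32, String>,
}

impl ContextNames {
    /// Remembers the name of a named context.
    fn register(&mut self, id: u32, name: &str) {
        self.names.insert(id, name.to_string());
    }

    /// Label for debug output; anonymous contexts fall back to their id.
    fn describe(&self, context: &Context) -> String {
        match self.names.get(&context.id) {
            Some(name) => format!("{} (#{})", name, context.id),
            None => format!("<anonymous #{}>", context.id),
        }
    }
}
```

A debug print could then call `names.describe(&context)` instead of dumping the whole struct.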

robinst · 2018-04-24T12:32:36Z

src/parsing/parser.rs

@@ -220,24 +348,18 @@ impl ParseState {
              match_pat: &mut MatchPattern,
              captures: Option<&(Region, String)>,
              search_cache: &mut SearchCache,
-              matched: &mut MatchedPatterns,
              regions: &mut Region)


Is it so expensive to create a new Region that it's worth it to pass in an existing one to use? In case there was a match, it gets cloned later anyway.

Yah I think I remember profiling this and finding it was a significant expense. That might even be in a commit message in the history somewhere.

robinst · 2018-04-24T12:33:22Z

src/parsing/parser.rs

@@ -701,6 +889,320 @@ contexts:
        expect_scope_stacks(&line, &expect, syntax);
    }

+    #[test]
+    fn can_parse_issue120() {


This one was just moved from the bottom.

robinst · 2018-04-24T12:48:33Z

src/parsing/parser.rs

+    }
+
+    #[test]
+    fn can_parse_non_consuming_pop_that_would_loop() {


This might seem like a lot of test cases, but most of them came out of running syntest on Sublime's syntax tests, looking at a failure, finding the problem and reducing it to a minimal test case in here.

@keith-hall: I wonder if some of these would be useful to have in sublimehq/Packages somewhere, to help people understand how matching works?

Probably it would make sense for SublimeHQ to document it at http://www.sublimetext.com/docs/3/syntax.html - or, hopefully, they will open source some documentation pages soon for us to submit PRs to.

Thanks for these, much better that everyone doesn't have to minimize cases out of syntax tests themselves.

Thinking about it, we could maybe post some test cases at sublimehq/Packages#757 to help people understand how the matching works, or depending on the outcome of sublimehq/Packages#1522, on the wiki pages of that repo. My understanding is that, because Sublime Text directly bundles the contents of that repository with its releases, the test cases would be less welcome in the repository code base directly, but I could be wrong.

robinst · 2018-04-24T12:50:14Z

@keith-hall It would be cool if you could have a look to see if the changes make sense. It seems like you have the most experience when it comes to understanding how Sublime Text's matching works :).

keith-hall · 2018-04-24T14:32:11Z

Looks good to me, nice work @robinst ! You even added test cases with multiple pushes - when I logged the issue reporting the differences, my biggest concern was that this could get broken when our implementation changes, but you've covered it perfectly :)

trishume

Thanks for this, especially all the commenting and testing.

I have one significant change I'd like made to the way state is stored first so caching doesn't break.

trishume · 2018-04-24T18:14:57Z

src/parsing/parser.rs

+//
+// * If there's another rule that matches at the same position and does not
+//   result in a loop, use that instead.
+// * Otherwise, go to the next position and continue matching in the current


So if I'm understanding this correctly, the only difference between this and pretending the looping rule didn't match, is that when you go to the next position the looping rule could match again properly.

If so might be worth adding that to the comment.

Maybe what I wrote "continue matching in the current context" is confusing, I meant "go through the rules in the current context again". So yes, if we get to the same "pop" at that point, it's no longer looping and we use it as normal.

I'll amend the text to make that more clear.

trishume · 2018-04-24T18:21:19Z

src/parsing/parser.rs

+                            pop_would_loop = check_pop_loop && !consuming && match match_pat.operation {
+                                MatchOperation::Pop => true,
+                                _ => false,
+                            };
                        }
                    }
                }
            }


Yah that sounds reasonable

trishume · 2018-04-24T18:27:40Z

src/parsing/parser.rs

+                // advances one character and tries again, thus preventing the
+                // loop.
+
+                // println!("pop_would_loop for match {:?}, start {}", reg_match, *start);


I'm okay with these printlns and don't feel a huge need to switch to a logging solution, since I like being able to only turn on the individual prints I care about so as not to be drowned by noise, and that's harder with a logging solution than having an uncomment-line command.

The reason I don't have names in context structs is because the serialization of contexts to dumps is quite direct, so putting names in them would bloat dump sizes, and thus also binaries with the default dumps. It might be possible to have a cfg for names, or a low-overhead Option<Rc<String>> that gets populated after loading for debugging or something like that though. Or a u32 context ID that gets looked up in a table or something. I dunno, I definitely agree it makes debugging hard, and I'm willing to sacrifice a little performance/space to make it happen, but not that much.

trishume · 2018-04-24T18:28:55Z

src/parsing/parser.rs

@@ -220,24 +348,18 @@ impl ParseState {
              match_pat: &mut MatchPattern,
              captures: Option<&(Region, String)>,
              search_cache: &mut SearchCache,
-              matched: &mut MatchedPatterns,
              regions: &mut Region)


Yah I think I remember profiling this and finding it was a significant expense. That might even be in a commit message in the history somewhere.

trishume · 2018-04-24T19:02:29Z

src/parsing/parser.rs

    // See issue #101. Contains indices of frames pushed by `with_prototype`s.
    // Doesn't look at `with_prototype`s below top of stack.
    proto_starts: Vec<usize>,
+    // The line being parsed (starting at 0)
+    line: usize,


So this doesn't fit the parsing state model I want to preserve. I want editors to be able to cache parse states allowing things like inserting a line to not have to re-parse the rest of a file, storing the line number negates that optimization. I know at least Xi plans on using caching like this (and they might already). The ParseState and StateLevels should only hold things necessary between lines.

Luckily, as far as I can tell it's not necessary anyhow, and neither is the non_consuming_push in the state levels.

Instead what you can do is have a line-local state like matched that's an Option<(usize, usize)> holding a column and a state stack depth. Then just compare the current column and depth before popping that depth at that level. I spent a bit of time thinking and I think this should work just as well but is more space-efficient and it keeps the ability to cache.

You're right! I actually had that at some point while trying to figure this thing out, but it didn't work because I didn't get it quite right (and I didn't fully understand how ST worked yet). But I went back now and redid it like that, and it's even simpler now :).
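
A minimal sketch of the line-local state suggested above, assuming it is created fresh for each line and never stored in the cached parse state. `LoopCheck`, `record_push`, and `pop_would_loop` are made-up names for illustration only.

```rust
/// Line-local loop-detection state (hypothetical helper).
/// Holds the column and stack depth right after the most recent
/// non-consuming push, if there was one.
#[derive(Debug, Default)]
struct LoopCheck {
    non_consuming_push_at: Option<(usize, usize)>,
}

impl LoopCheck {
    /// Call after a push that matched zero characters at `column`.
    fn record_push(&mut self, column: usize, depth_after_push: usize) {
        self.non_consuming_push_at = Some((column, depth_after_push));
    }

    /// A non-consuming pop would loop if it undoes that push at the same
    /// column and the same stack depth.
    fn pop_would_loop(&self, column: usize, depth: usize, consuming: bool) -> bool {
        !consuming && self.non_consuming_push_at == Some((column, depth))
    }
}
```

Since nothing here refers to a line number, a parse state cached at the end of one line stays valid when lines are inserted or removed before it.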

trishume · 2018-04-24T19:03:18Z

src/parsing/parser.rs

+    }
+
+    #[test]
+    fn can_parse_non_consuming_pop_that_would_loop() {


Thanks for these, much better that everyone doesn't have to minimize cases out of syntax tests themselves.

This makes the code simpler. There is also no need to store more than one of these states (as there was before), because once we consume a character, we no longer need any earlier state. It also has the benefit of allowing caching of the parse state and re-parsing the same line multiple times, which would not have made sense with the line number before.

robinst

Thanks for the feedback! I've addressed all the comments, and as a result the solution is now simpler too :).

robinst · 2018-04-26T01:23:11Z

src/parsing/parser.rs

+//
+// * If there's another rule that matches at the same position and does not
+//   result in a loop, use that instead.
+// * Otherwise, go to the next position and continue matching in the current


Maybe what I wrote "continue matching in the current context" is confusing, I meant "go through the rules in the current context again". So yes, if we get to the same "pop" at that point, it's no longer looping and we use it as normal.

I'll amend the text to make that more clear.

robinst · 2018-04-26T01:24:48Z

src/parsing/parser.rs

    // See issue #101. Contains indices of frames pushed by `with_prototype`s.
    // Doesn't look at `with_prototype`s below top of stack.
    proto_starts: Vec<usize>,
+    // The line being parsed (starting at 0)
+    line: usize,


You're right! I actually had that at some point while trying to figure this thing out, but it didn't work because I didn't get it quite right (and I didn't fully understand how ST worked yet). But I went back now and redid it like that, and it's even simpler now :).

robinst · 2018-04-26T01:28:23Z

src/parsing/parser.rs

+                // advances one character and tries again, thus preventing the
+                // loop.
+
+                // println!("pop_would_loop for match {:?}, start {}", reg_match, *start);


Yeah, thanks for your thoughts.. Option<String> also occurred to me, might be a nice way to do it. Regarding the context ID, I have the feeling that this might be the straightforward way to do it if we had #83.

trishume

This looks good to me as is, just some comments that could make it cleaner or less clean depending on your taste.

trishume · 2018-04-26T18:36:15Z

src/parsing/parser.rs

+                // The match consumes some characters. So update the position
+                // and clear state we use for checking for loops.
+                *start = match_end;
+                *non_consuming_push_at = None;


I don't think this is strictly necessary, since any future checks against it won't succeed since it will have an earlier start position. But it's totally fine to leave it, maybe makes it easier to think about.

I also realized that you can even avoid the logic of it being an Option because (0,0) is a starting state that should act correctly. Unsure of whether I prefer the cleaner logic of doing that or the perhaps more conceptually nice way of using an option.

Yes, all good ideas :). Done!
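
To make the two points above concrete, here is a small sketch of the check without the Option. It assumes the context stack always holds at least one context, so the `(0, 0)` starting value can never equal a real `(column, depth)` pair, and that the column only moves forward within a line. The function name and signature are hypothetical.

```rust
/// `push_at` is the (column, stack depth) recorded by the most recent
/// non-consuming push, starting out as (0, 0) for every line.
fn pop_would_loop(push_at: (usize, usize), column: usize, depth: usize) -> bool {
    // A stale entry has an earlier column, so it never compares equal and
    // does not have to be cleared after consuming characters.
    push_at == (column, depth)
}
```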

trishume · 2018-04-26T18:36:47Z

src/parsing/parser.rs

+                // because a non-consuming "set" could also result in a loop.
+                let context = reg_match.context.borrow();
+                let match_pattern = context.match_at(reg_match.pat_index);
+                match match_pattern.operation {


This could be an if let
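
As a rough illustration of that suggestion, with a simplified stand-in for the operation enum (the payload types and both function names are assumptions, not the real definitions): when only one variant matters, `if let` drops the catch-all arm.

```rust
/// Simplified stand-in for the operation enum (payloads are hypothetical).
#[derive(Debug)]
enum MatchOperation {
    Push(Vec<String>),
    Set(Vec<String>),
    Pop,
    None,
}

/// `match` form: the arm that does nothing still has to be written.
fn record_with_match(op: &MatchOperation, at: (usize, usize), slot: &mut (usize, usize)) {
    match op {
        MatchOperation::Push(_) => *slot = at,
        _ => {}
    }
}

/// `if let` form: same behavior, only the interesting case is spelled out.
fn record_with_if_let(op: &MatchOperation, at: (usize, usize), slot: &mut (usize, usize)) {
    if let MatchOperation::Push(_) = op {
        *slot = at;
    }
}
```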

trishume · 2018-04-26T18:37:58Z

src/parsing/parser.rs

+            if consuming {
+                // The match consumes some characters. So update the position
+                // and clear state we use for checking for loops.
+                *start = match_end;


This can be done unconditionally. Combined with the next comment, this branch of the if becomes unnecessary. If you think it's easier to understand this way then I'm fine with it though.

trishume · 2018-04-26T18:39:15Z

src/parsing/parser.rs

@@ -321,11 +317,31 @@ impl ParseState {
                    return false;
                }
                *start += 1;
+                *non_consuming_push_at = None;


This is also unnecessary, see other similar comment for more.

We don't need an option, and we don't need to set it to None, as it won't match if the position changes anyway.

robinst · 2018-04-26T23:59:52Z

All done!

trishume

Awesome, thanks for doing all this.

robinst · 2018-04-28T02:10:48Z

Cool! Would you mind creating a release with this and the other changes?

robinst · 2018-04-28T08:01:17Z

Actually, let's first get the change in to abort early on matching, I'll raise a PR.

robinst added 2 commits April 26, 2018 11:29

Clarify comment about skipping a character

0675299

robinst mentioned this pull request Apr 26, 2018

Difference in highlighting some TypeScript React files (Sublime vs. Syntect) #97

Closed

trishume approved these changes Apr 26, 2018

Simplify code some more

675a49c

We don't need an option, and we don't need to set it to None, as it won't match if the position changes anyway.

trishume approved these changes Apr 27, 2018

trishume merged commit 128666e into master Apr 27, 2018

trishume deleted the issue-127-make-loop-detection-follow-st branch April 27, 2018 16:34

robinst mentioned this pull request Apr 28, 2018

Stop matching as soon as we found the best possible match #150

Merged

keith-hall mentioned this pull request May 3, 2018

[WIP] Kinda-working fancy-regex support #34

Closed

7 tasks

keith-hall mentioned this pull request Jun 11, 2018

Java syntax highlighting wrong (JavaDoc does not end) #160

Closed
