fancy-regex: problems with patterns in some syntaxes #287

Open
sharkdp opened this issue Mar 30, 2020 · 15 comments
@sharkdp
Contributor

sharkdp commented Mar 30, 2020

I just tried the fancy-regex version of syntect for the first time (thank you to everyone involved!).

Here are a few of the problems I have encountered so far:

  • The PHP syntax in sublimehq/Packages (master) uses some sort of inline comment:
        - match: |-
          (?x)/\*\*(?:
            (?:\#@\+)?\s*$ (?# multi-line doc )
            |
            (?=\s+@.*\s\*/\s*$) (?# inline doc )
          )
    which causes the following error:
    Error while compiling regex '…': Unknown group flag
    
  • The AsciiDoc syntax (https://github.com/SublimeText/AsciiDoc/blob/master/AsciiDoc.tmLanguage) fails with:
    Error while compiling regex '(?x)^
    (?= ([/+-.*_=]{4,})\s*(?m:$)
    | ([ \t]{1,})
    | [=]{1,6}\s*+
    | [ ]{,3}(?<marker>[-*_])([ ]{,2}\k<marker>){2,}[ \t]*+(?m:$)
    )': Unknown group flag
    
    (note that this is after .sublime-syntax conversion and possibly regex rewriting in syntect).
  • The ARM Assembly syntax (https://github.com/tvi/Sublime-ARM-Assembly) fails with a similar error:
    Error while compiling regex '(?x)
    ^\s*\#\s*(define)\s+             # define
    ((?<id>[a-zA-Z_][a-zA-Z0-9_]*))  # macro name
    (?:                              # and optionally:
        (\()                         # an open parenthesis
            (
                \s* \g<id> \s*       # first argument
                ((,) \s* \g<id> \s*)*  # additional arguments
                (?:\.\.\.)?          # varargs ellipsis?
            )
        (\))                         # a close parenthesis
    )?': Unknown group flag
    
  • The Haskell/Cabal syntax (https://github.com/SublimeHaskell/SublimeHaskell) fails with:
    Error while compiling regex '(=>|\u21D2)\s+([A-Z][\w']*)': Invalid escape
    
  • Elixir (https://github.com/princemaple/elixir-sublime-syntax/) fails with a similar error:
    Error while compiling regex '(?x)
    \\g(?:
      <( ((?>-[1-9]\d*|\d+) | [a-zA-Z_][a-zA-Z_\d]{,31}) | \g<-1>?([^\[\\(){}|^$.?*+\n]+) )> | '\g<1>' |
      (<([^\[\\(){}|^$.?*+\n]+*)>? | '\g<-1>'?) )': Invalid escape
    
  • Same for JavaScript/Babel (https://github.com/babel/babel-sublime):
    Error while compiling regex '(?x)
    (?:([_$a-zA-Z][$\w]*)\s*(=)\s*)?
    (?:\b(async)\s+)?
    (?=(\((?>(?>[^()]+)|\g<-1>)*\))\s*(=>))': Invalid escape
    
  • The SLS syntax fails with
    Error while compiling regex '(?x)
    (?: ^ [ \t]* | [ \t]+ )
    (?:(\#) \p{Print}* )?
    (\n|\z)
    ': Regex error: regex parse error:
        (?:^[ \t]*|[ \t]+)(?:(\#)\p{Print}*)?(\n|\z)
                                 ^^^^^^^^^
    error: Unicode property not found
    

This list continues for quite some time, but I'm not sure it's worth listing them all. Most of them seem to be related to "unknown group flag" or "invalid escape".
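To illustrate the first failure (the PHP `(?# ... )` inline comments), here is a minimal sketch of how a pattern rewriter could drop such comment groups before the pattern reaches the regex engine. The name `strip_comment_groups` is hypothetical; this is not syntect's or fancy-regex's actual code. It assumes a comment ends at the first `)` and does not track character classes.

```rust
/// Remove Oniguruma-style `(?# ... )` comment groups from a pattern.
///
/// Simplifications (assumptions of this sketch): a comment ends at the
/// first `)` after `(?#`, and character classes such as `[(?#]` are not
/// treated specially.
fn strip_comment_groups(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut rest = pattern;
    while !rest.is_empty() {
        if rest.starts_with('\\') {
            // Copy an escape sequence verbatim so that `\(?#` is never
            // mistaken for the start of a comment group.
            let mut chars = rest.chars();
            out.push(chars.next().unwrap());
            if let Some(escaped) = chars.next() {
                out.push(escaped);
            }
            rest = chars.as_str();
        } else if rest.starts_with("(?#") {
            match rest.find(')') {
                // Skip everything up to and including the closing paren.
                Some(end) => rest = &rest[end + 1..],
                // Unterminated comment: keep the text and let the engine report it.
                None => {
                    out.push_str(rest);
                    rest = "";
                }
            }
        } else {
            // Ordinary character: copy it through unchanged.
            let mut chars = rest.chars();
            out.push(chars.next().unwrap());
            rest = chars.as_str();
        }
    }
    out
}
```

Applied to the line `(?:\#@\+)?\s*$ (?# multi-line doc )` from the PHP pattern, this would return the same text with the trailing comment group removed.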

Note: I just wanted to try this out within bat, there is absolutely no "pressure" to get this fixed (as always, of course 😄). I was just curious and thought this might help.

@trishume
Owner

cc @robinst any thoughts on what might need to be added to either the rewriter or fancy-regex to fix these and guesses as to how much work it would be?

@Keats
Contributor

Keats commented Apr 12, 2020

Ah damn I was about to try it in Zola and rust-onig still hasn't shipped a new version with optional bindgen :(

@sharkdp how do you feel about creating a repository to contain all .sublime-syntaxes? We are kind of duplicating work in Zola and bat (and potentially other tools) having our own repo of syntaxes.

@sharkdp
Contributor Author

sharkdp commented Apr 12, 2020

> @sharkdp how do you feel about creating a repository to contain all .sublime-syntaxes? We are kind of duplicating work in Zola and bat (and potentially other tools) having our own repo of syntaxes.

Sounds like a great idea. Should we move this discussion to bat's issue tracker? There are probably a lot of details which would need to be figured out (what to include? what not to include? how to deal with temporary patches, as bat currently does? what kind of tooling do we want to share? etc.)

Update: see sharkdp/bat#919

robinst added a commit to fancy-regex/fancy-regex that referenced this issue Apr 13, 2020
Same as Oniguruma. See trishume/syntect#287
where the lack of support for this is a problem.
@robinst
Collaborator

robinst commented Apr 14, 2020

> cc @robinst any thoughts on what might need to be added to either the rewriter or fancy-regex to fix these and guesses as to how much work it would be?

Yeah, I think some of these should be easy to fix in fancy-regex, e.g. (?# ...) and \u, I'll work on those first. Others are a bit more work like named capture groups, but on the radar: fancy-regex/fancy-regex#34
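Until an engine understands `\u` natively, one possible stopgap is to rewrite the escape on the way in. The sketch below turns a four-digit `\uXXXX` escape into the braced form `\x{XXXX}`. The function name `rewrite_u_escapes` is hypothetical, and the sketch assumes the target engine accepts `\x{...}`; it is not how fancy-regex itself handles `\u`.

```rust
/// Rewrite `\uXXXX` (exactly four hex digits) into `\x{XXXX}`.
///
/// Assumption of this sketch: the target regex engine accepts the
/// braced `\x{...}` form. All other escapes are copied unchanged.
fn rewrite_u_escapes(pattern: &str) -> String {
    let chars: Vec<char> = pattern.chars().collect();
    let mut out = String::with_capacity(pattern.len());
    let mut i = 0;
    while i < chars.len() {
        if chars[i] == '\\' && i + 1 < chars.len() {
            // A `\u` escape needs a backslash, `u`, and four hex digits.
            let is_u4 = chars[i + 1] == 'u'
                && i + 6 <= chars.len()
                && chars[i + 2..i + 6].iter().all(|c| c.is_ascii_hexdigit());
            if is_u4 {
                out.push_str("\\x{");
                out.extend(&chars[i + 2..i + 6]);
                out.push('}');
                i += 6;
            } else {
                // Any other escape (including `\\`) is copied as a pair,
                // so an escaped backslash followed by `u` is left alone.
                out.push(chars[i]);
                out.push(chars[i + 1]);
                i += 2;
            }
        } else {
            out.push(chars[i]);
            i += 1;
        }
    }
    out
}
```

With this, the Haskell/Cabal pattern `(=>|\u21D2)\s+([A-Z][\w']*)` would become `(=>|\x{21D2})\s+([A-Z][\w']*)`.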

@Keats
Contributor

Keats commented Apr 24, 2020

@robinst any chance you can release a new version of fancy-regex? I want to update some syntaxes and want to know exactly which ones are failing after your latest patches.

@robinst
Collaborator

robinst commented Apr 28, 2020

@Keats Published version 0.3.4 now: https://github.com/fancy-regex/fancy-regex/blob/master/CHANGELOG.md#034---2020-04-28

Note that it's unlikely that I'll implement \g<...> for subexp calls in the near future. I would recommend replacing them with {{variable}} references instead. That might even work better for Sublime Text itself, as I don't think sregex implements that syntax either.

@robinst
Collaborator

robinst commented Apr 28, 2020

Also created a PR to make error messages include what the unknown flag/escape is: fancy-regex/fancy-regex#46

@Keats
Contributor

Keats commented Apr 28, 2020

Thanks a lot @robinst , I'll give it a try asap

@robinst
Collaborator

robinst commented Apr 28, 2020

Released fancy-regex/fancy-regex#46 as 0.3.5

@Keats
Contributor

Keats commented Apr 30, 2020

It looks like a lot of syntaxes use \G (e.g. TypeScript: https://github.com/getzola/zola/blob/bc496e61010be1094a9192003ea59506c14d9397/sublime/syntaxes/TypeScript.sublime-syntax#L518) and \g (e.g. Elixir Regex: https://github.com/princemaple/elixir-sublime-syntax/blob/master/Regular%20Expressions%20(Elixir).sublime-syntax#L52 and probably a few others). I had never heard of subexp calls before. Could the linked Elixir file be changed to not use \g?

@robinst
Collaborator

robinst commented May 12, 2020

\G is an interesting one: it anchors the match at the position where the current search begins. I wonder if that's necessary for correctness in those syntaxes or if they do it for performance (though I'm not sure it would actually make things faster). @keith-hall maybe you have some insight into that?

For \g, it should be possible to replace by repeating the pattern or using variables. E.g. this:

<({{capture_name}})>|'\g<-1>'|\g<-1>

Should be the same as:

<({{capture_name}})>|'({{capture_name}})'|({{capture_name}})

(But captures might need to be extended.)
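To make the variable-based form concrete, here is a minimal sketch of `{{name}}` substitution, showing how the rewritten pattern turns into a plain regex with no `\g` calls. The function `expand_variables` and the value used for `capture_name` in the note below are hypothetical, and this single, non-recursive pass is a simplification rather than syntect's actual variable handling.

```rust
use std::collections::HashMap;

/// Replace every `{{name}}` in `pattern` with its value from `vars`.
///
/// Simplifications (assumptions of this sketch): values are inserted
/// as-is without being expanded again, and unknown or unterminated
/// references are left untouched.
fn expand_variables(pattern: &str, vars: &HashMap<&str, &str>) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut rest = pattern;
    while let Some(start) = rest.find("{{") {
        // Copy the text before the reference.
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find("}}") {
            Some(end) => {
                let name = &after[..end];
                match vars.get(name) {
                    Some(value) => out.push_str(value),
                    None => {
                        // Unknown variable: keep the reference verbatim.
                        out.push_str("{{");
                        out.push_str(name);
                        out.push_str("}}");
                    }
                }
                rest = &after[end + 2..];
            }
            None => {
                // No closing `}}`: keep the remainder unchanged.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

If `capture_name` were defined as `[a-zA-Z_]\w*` (a made-up value for illustration), the second pattern above would expand to `<([a-zA-Z_]\w*)>|'([a-zA-Z_]\w*)'|([a-zA-Z_]\w*)`, which has three capture groups instead of one. That is why the captures might need to be extended.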

@keith-hall
Collaborator

\G is needed for correctness. It was often used in tmLanguage grammars, where there was less functionality for working with contexts. I'm pretty sure it wouldn't have much impact on performance, or it would slow it down if no matches were found and all the patterns in the context use \G, as it would move the string pointer one char along and try again until it finds a match.
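The difference between an ordinary search and a search anchored at its start position can be sketched with a toy matcher. Here a literal substring stands in for a full regex, and `find_from` is a hypothetical helper, not an API of any regex crate.

```rust
/// Search for `needle` in `haystack`, starting at byte offset `start`.
///
/// With `anchored` set (the `\G`-like behaviour), only a match that
/// begins exactly at `start` counts. Without it, the search may move
/// forward until it finds a match. Returns the byte offset of the match.
fn find_from(haystack: &str, start: usize, needle: &str, anchored: bool) -> Option<usize> {
    // `get` returns None if `start` is out of range or not on a char boundary.
    let tail = haystack.get(start..)?;
    if anchored {
        if tail.starts_with(needle) {
            Some(start)
        } else {
            None
        }
    } else {
        tail.find(needle).map(|offset| start + offset)
    }
}
```

For example, searching `"foo bar"` for `"bar"` from offset 0 succeeds at offset 4 when unanchored, but fails when anchored. An engine that silently ignored \G would therefore accept matches that the grammar author meant to reject.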

@robinst
Collaborator

robinst commented May 13, 2020

Hmm so do you know why none of the built-in syntaxes use \G at all? Seems strange. I wonder if sregex doesn't support it either and so they avoid it.

@keith-hall
Collaborator

I believe that is correct, yes.

@robinst
Collaborator

robinst commented Sep 27, 2020

fancy-regex 0.4.0 now supports named groups and backrefs, see changelog. \g and \G are not yet supported.
