Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does not support (?!...) negative lookahead assertion? #127

Closed
messense opened this issue Nov 7, 2015 · 21 comments
Closed

Does not support (?!...) negative lookahead assertion? #127

messense opened this issue Nov 7, 2015 · 21 comments

Comments

@messense
Copy link

messense commented Nov 7, 2015

Python re module supports (?!...) syntax, see https://docs.python.org/2/library/re.html#regular-expression-syntax

The code below compiles but paniced at runtime:

extern crate regex;

fn main() {
    let re = regex::Regex::new(r"Isaac (?!Asimov)").unwrap();
    println!("{}", re.is_match("Isaac "));
    println!("{}", re.is_match("Isaac Asimov"));
}
thread '<main>' panicked at 'called `Result::unwrap()` on an `Err` value:
Syntax(Error { pos: 8, surround: "ac (?!Asim", kind: UnrecognizedFlag('!') })', 
../src/libcore/result.rs:738

The pattern works fine in Python:

import re

pt = re.compile(r"Isaac (?!Asimov)")
pt.match("Isaac ")
pt.match("Isaac Asimov")
@messense messense changed the title Does not support ?!... negative lookahead assertion? Does not support (?!...) negative lookahead assertion? Nov 7, 2015
@BurntSushi
Copy link
Member

Those aren't supported. From the documentation:

Notably, backreferences and arbitrary lookahead/lookbehind assertions are not provided. In return, regular expression searching provided by this package has excellent worst-case performance. The specific syntax supported is documented further down.

The reason why your code panicked is that the regular expression is invalid and your code called unwrap on it.

@messense
Copy link
Author

messense commented Nov 7, 2015

So when I need to use those unsupported syntax, is there anything I can turn to?

@BurntSushi
Copy link
Member

I don't know because I don't know what problem you're trying to solve. Any of the following things might work depending on your situation:

  1. Reformulate your regex into one that does not require arbitrary look-ahead in the input.
  2. Post process your matches to exclude matches that would have been excluded by the negative assertion.
  3. Use rust-pcre (I'm not sure if that's being maintained or if it works.)

@obskyr
Copy link

obskyr commented Jun 21, 2017

Does implementing lookahead / lookbehind worsen performance for expressions that don't use them?

@BurntSushi
Copy link
Member

No.

@obskyr
Copy link

obskyr commented Jun 21, 2017

Hmm... Maybe it'd make sense to implement them, then? To achieve compatibility with other regex engines and with people's expectations.

@BurntSushi
Copy link
Member

There is a reasonable case that someone could make for that, sure. But it's mostly theoretical. You also need to find someone willing to do it. (And by that, I mean, "someone willing to write the code, maintain it and do the necessary API design work." It is significant.) You also need to provide a compelling argument that the added complexity is worth it. For example, "If I want to run a regular expression supplied by an end user and I want to be sure that it completes in a reasonable amount of time, how can I do that?" The answer today is, "that's already true."

I probably won't comment much more on this. Plenty of other people have made arguments for and against these features on the Internet. Both sides are reasonable.

@obskyr
Copy link

obskyr commented Jun 21, 2017

Ah, right, regexes supplied by users. You're right, hat's a consideration. One possibility is to provide an option for turning off features that may increase run time significantly, or a feature to detect whether they're present beforehand.

As for "maintain it"... there's already someone who maintains the regex package, I hope. And as for API design work, that's not really significant at all, is it? Sure, deciding whether and if so how to handle shutting off the "slow" features might take a bit of consideration, but beyond that there's no extra API design to do at all. The main (only?) real point is that you have to find someone to write the code. Which of course isn't trivial at all, but I don't think entirely foregoing very common features (I'd even dare to say they're expected) for performance is the right choice.

@BurntSushi
Copy link
Member

there's already someone who maintains the regex package

Yes. That's me. :-)

And as for API design work, that's not really significant at all, is it?

I included a lot more than API design work in my previous comment.

(I'd even dare to say they're expected) for performance is the right choice.

I disagree. So do a lot of people. Predictable performance is important.

I don't think we're going to get very far with this. Here are the facts on the ground:

  1. There are no plans to add lookaround or any other features that lead to hard-to-predict performance. Even if there were, the time scale for that happening would be "years."
  2. If you want fancier features, you might elect to just use PCRE2, or, if you want a pure Rust solution, you could use the fancy-regex crate (which is built on this one).

@tgross35
Copy link

@BurntSushi Complete hypothetical here - if the maintainers of fancy_regex (repo here) were to help & support, would it maybe be suitable to merge into this project?

I am not associated with that repo in any way, but it is actively maintained since 2016 and is more or less a drop-in for this Regex crate. Looks like it uses this crate as the default implementation, but just switches to its implementation if backrefs or lookarounds are involved. Overall it seems like the maintainers are very performance-competent.

A builder option .full_pcre(false) so crates have the option of disabling/enabling it if they expect errors and don't want PCRE (I'd expect these cases are likely rare so defaulting true would likely be OK)

Just a thought, curious to see what you think.

@BurntSushi
Copy link
Member

Nope. Just use fancy_regex.

@finalclass
Copy link

In some cases using this could work:

[x[^xyz]]     Nested/grouping character class (matching any character except y and z)

See: https://docs.rs/regex/latest/regex/#character-classes

@tgross35
Copy link

Thanks for the tip. Usually you can refactor to kind of fake the positive lookarounds just using capture groups, but the negative ones are more difficult. For example:

# Test strings
foobar foobaz fuubar not foo but foo

# PCRE expression ->   workaround
foo(?=bar)        ->   (foo)bar
foo(?!bar)        ->   (foo)(?:[b][a][^z]|[^b][^a][^z]|$). # Not a "true" workaround
(?<=foo)bar       ->   foo(bar)
(?<!not )foo      ->   no "true" workaround, could do similar to ?!

There just really isn't a solid concept of "not" outside of full PCRE, so these negatives are limitations without good refactoring, where you kind of have to know all possibilities you might get. Consumption is the other issue - i.e., the first (foo)bar will consume the entire foobar, making bar unavailable for further matching (bad example here, this matters more when your start and end delimiters are the same).

@mrbsvf
Copy link

mrbsvf commented Nov 11, 2022

I have trouble converting a regex that uses negative lookahead to rust's regex.
this is what I'm trying to convert:
((https?://)?\b(?:(?!tenor|giphy|discord|clips|twitch|cdn|discordapp)\w)+\b\.[\w][^\s]+)

this is my use case if anyone is interested:
Discord is testing with a new AutoMod system and it accepts regex for filtering messages. I have come up with this regex which filters out URLs: ((https?://)?[^\s.]+\.[\w][^\s]+)
This works fine, but it also filters the GIF URLs too. so I want to exclude a few domains (tenor.com, giphy.com, discord.com, discordapp.net, cdn.discordapp.com, clips.twitch.tv) so I came up with the one I'm trying to convert.

Is there a way to convert this? I don't know how to approach this problem.

@BurntSushi
Copy link
Member

Ideally the filter system would let you specify another regex to whitelist your gif URLs. It will otherwise be difficult to convert small convenient look-arounds.

And it's good that Discord doesn't support look-arounds, because then otherwise they might not exist the filter system at all, for fear of easy ReDoS.

@mrbsvf
Copy link

mrbsvf commented Nov 12, 2022

@BurntSushi Unfortunately, it doesn't have a whitelist, hence having this problem in the first place. I had no luck finding an alternative in one of Discord's official servers (Discord Admin Community) other than using a third-party bot (We're using Dyno) but we want to migrate everything to Discord's AutoMod solution.
I honestly don't care if it turns out to be a [very] long regex, but to just make it work.
I've heard converting it is possible, but it'll make it look complicated and long... and I have no experience with regex, and even StackOverflow didn't give me an answer other than a link to here, so here I am.

@BurntSushi
Copy link
Member

BurntSushi commented Nov 12, 2022

You might have more luck at a general help forum. I'm generally the only one answering questions here, and I don't really have time to convert look-arounds to non-look-arounds. Maybe reddit.com/r/regex?

But if you're right, then Discord's regex filtering sounds pretty limited. You might not be able to make it far if you require more sophisticated filtering.

On top of that, Discord likely has limits on the size of regexes they allow, so you can't just go as big as you want.

@BurntSushi
Copy link
Member

And in the future, it would be helpful if you post a Discussion question instead of bumping an old issue that is only tangentially related to the problem you're trying to solve.

@rain2307
Copy link

In some cases using this could work:

[x[^xyz]]     Nested/grouping character class (matching any character except y and z)

See: https://docs.rs/regex/latest/regex/#character-classes

Thanks , it works.
We don't need other crate.

@spacesynth

This comment was marked as abuse.

@BurntSushi
Copy link
Member

BurntSushi commented Apr 29, 2024

I think this issue has run its course. I'm locking it. If you have questions about the regex crate, please open a new Discussion question. If you need help converting a regex that uses look-around to one that doesn't (which may not be possible or may require more than one regex), then I think a general help forum would be more appropriate. Failing that, you'll want to use either fancy-regex (which wraps around the regex crate and provides fancier features such as look-around) or a different regex engine entirely, such as PCRE2.

@rust-lang rust-lang locked as resolved and limited conversation to collaborators Apr 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants