Support for quantifiers and the shuffle operator in regular expressions #109

EduardoGoulart1 · 2022-12-02T10:42:52Z

Is there a specific reason why quantifiers for regular expressions are not supported? (so things like a{2,10}) If there is no specific reason, would you be open to a PR that implements it?

Also, what do you think of adding a special regex shuffle operator? This would be very handy to express languages like "all permutations of a,b,c". My proposal would be to use the ~ character and implement a separate parser for that

The text was updated successfully, but these errors were encountered:

caleb531 · 2022-12-02T17:53:32Z

@EduardoGoulart1 I don't recall a specific reason, but I am certainly open to any PRs you would be willing to submit!

@eliotwrobson would quantifiers (like Eduardo is describing) fit within the theoretical regular expressions this library implements? I only ask because it seems like Perl/PCRE regex and the automata-based regex syntaxes are not the same, otherwise we would be using the stdlib's re module in more places.

EduardoGoulart1 · 2022-12-02T18:11:29Z

@caleb531 the quantifiers would be only syntactic sugar to avoid concatenating the same expression repeatedly. So a^{n,m} would get translated to something like a..a_n(a_n+1)?...(a_m)?

The shuffle operator might require some rework on the parser. We would need to construct the NFAs and compute their products explicitly.

eliotwrobson · 2022-12-02T18:12:55Z

So I think quantifiers do fit (as they correspond to just concatenating a certain number of copies of an NFA with itself). I actually thought about adding this in before, but it causes some issues with the way that the default alphabet is set for regexes. If no objections to changing this though, I think it's very doable.

I also think that the shuffle operator is easy to do. I think this can actually be done with the same logic as the NFA shuffle_product function. I'm not sure about the ~ symbol, since this is usually used to denote compliment. In the literature, the Cyrillic sha ш is typically used (see https://link.springer.com/chapter/10.1007/978-3-031-19685-0_3#Sec2).

EduardoGoulart1 · 2022-12-02T18:15:23Z

@eliotwrobson but we still want the parser to be based on ASCII characters, right? I can look up for what other characters would better suit it

eliotwrobson · 2022-12-02T18:17:13Z

@EduardoGoulart1 yes, sticking to ascii characters is best I think. Perhaps the @ symbol? I'm staring at my keyboard and unfortunately nothing seems to resemble the sha 😅 though we should definitely stick to something that's easy for users to input and doesn't require typing in anything special.

With respect to implementation, the shuffle product is pretty easy to integrate with the existing parser. However, the quantifiers requires some changes to what gets passed in to factory functions for the Token object. The token object will need to be passed the whole match object, not just the string that was matched on. I think this is straightforward to implement but requires changing the way some of the tests are written (which is why I didn't do it before).

EduardoGoulart1 · 2022-12-02T18:28:46Z

@eliotwrobson ok thank you, I'll give it a try. Just to understand the design choice: You opted for writing the Lexer following the structure of the tdparser. Was that just to simplify the implementation? I see pyparsing is added a dependency but not used

eliotwrobson · 2022-12-02T18:33:27Z

@EduardoGoulart1 This was mainly as a guide to myself while writing. The parser here is actually a port of one in another project, and there we aren't allowed to add external dependencies, so everything had to be written from scratch. I didn't realize that pyparsing was a dependency, that may have been used as part of the old parser. Let me know if you have any issues or want some input.

EDIT: @caleb531 do you know why pyparsing is a dependency here? It doesn't look like it's getting used anywhere in the library right now.

caleb531 · 2022-12-02T21:03:39Z

@eliotwrobson pyparsing is a dependency of pandoc, which I used to compile my Markdown README to RST for PyPI (see c669875). I removed pandoc a while back since PyPI now has native Markdown support, but I guess I forgot to remove pyparsing as well.

@EduardoGoulart1 Looking forward to your PR! Please remember to add the appropriate unit tests and documentation (under docs/regular-expressions.md) as well.

caleb531 · 2022-12-07T00:03:15Z

@eliotwrobson @EduardoGoulart1 What's left to do before this issue can be closed? The regex shuffle operator is now in, don't quantifiers still need to be implemented?

eliotwrobson · 2022-12-07T00:15:29Z

@caleb531 Yeah, still missing quantifiers. Discussed a bit above, but that requires a bit of a refactor to the regex engine. The token class now needs to accept a match object instead of just the text that was matched (so that items from specific match groups can be read). This refactor isn't that bad in terms of the code changes that need to be made, but there are a lot of tests that need to be modified.

EduardoGoulart1 · 2022-12-07T00:38:18Z

@eliotwrobson I'd like to take over that one if it's not too urgent to close the ticket. It would be a good opportunity to get into the code a bit more...

caleb531 · 2022-12-07T00:40:39Z

@EduardoGoulart1 Nope, it's not urgent at all. Just wanted to confirm what was still outstanding. 🙂

eliotwrobson · 2022-12-07T00:49:25Z

@EduardoGoulart1 go for it! As I mentioned #115, this will be a good step towards that refactor (and I'm a bit occupied next week). Let me know if you want any input.

EduardoGoulart1 · 2022-12-20T10:04:18Z

Closed because turns out I'll be a bit busy the next days/weeks

eliotwrobson · 2022-12-20T22:16:41Z

@EduardoGoulart1 @caleb531 I can actually pick this up. I was going to ask about the status since I want to finish out some work for my side-projects by the end of the year. Should be good to reopen.

caleb531 · 2022-12-20T22:17:30Z

@eliotwrobson Very well! Issue reopened. 🙂

EduardoGoulart1 · 2023-04-03T09:35:06Z

@caleb531 we close this as both were implemented, right?

caleb531 · 2023-04-04T00:18:43Z

@EduardoGoulart1 Thank you for the nudge—yes, both quantifiers and the shuffle operator are merged, so we can close this issue now. cc @eliotwrobson

EduardoGoulart1 changed the title ~~Support for quantifiers~~ Support for quantifiers and the shuffle operator in regular expressions Dec 2, 2022

caleb531 added the enhancement label Dec 2, 2022

eliotwrobson mentioned this issue Dec 5, 2022

Shuffle regex #112

Merged

eliotwrobson mentioned this issue Dec 7, 2022

Allow non-alphanumeric literal characters in regular expressions #115

Open

EduardoGoulart1 closed this as completed Dec 20, 2022

EduardoGoulart1 closed this as not planned Won't fix, can't repro, duplicate, stale Dec 20, 2022

caleb531 reopened this Dec 20, 2022

eliotwrobson mentioned this issue Mar 29, 2023

Add support for regex quantifiers #121

Merged

caleb531 closed this as completed Apr 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for quantifiers and the shuffle operator in regular expressions #109

Support for quantifiers and the shuffle operator in regular expressions #109

EduardoGoulart1 commented Dec 2, 2022 •

edited

Loading

caleb531 commented Dec 2, 2022

EduardoGoulart1 commented Dec 2, 2022 •

edited

Loading

eliotwrobson commented Dec 2, 2022

EduardoGoulart1 commented Dec 2, 2022

eliotwrobson commented Dec 2, 2022 •

edited

Loading

EduardoGoulart1 commented Dec 2, 2022 •

edited

Loading

eliotwrobson commented Dec 2, 2022 •

edited

Loading

caleb531 commented Dec 2, 2022

caleb531 commented Dec 7, 2022

eliotwrobson commented Dec 7, 2022

EduardoGoulart1 commented Dec 7, 2022

caleb531 commented Dec 7, 2022

eliotwrobson commented Dec 7, 2022

EduardoGoulart1 commented Dec 20, 2022 •

edited

Loading

eliotwrobson commented Dec 20, 2022

caleb531 commented Dec 20, 2022

EduardoGoulart1 commented Apr 3, 2023

caleb531 commented Apr 4, 2023

Support for quantifiers and the shuffle operator in regular expressions #109

Support for quantifiers and the shuffle operator in regular expressions #109

Comments

EduardoGoulart1 commented Dec 2, 2022 • edited Loading

caleb531 commented Dec 2, 2022

EduardoGoulart1 commented Dec 2, 2022 • edited Loading

eliotwrobson commented Dec 2, 2022

EduardoGoulart1 commented Dec 2, 2022

eliotwrobson commented Dec 2, 2022 • edited Loading

EduardoGoulart1 commented Dec 2, 2022 • edited Loading

eliotwrobson commented Dec 2, 2022 • edited Loading

caleb531 commented Dec 2, 2022

caleb531 commented Dec 7, 2022

eliotwrobson commented Dec 7, 2022

EduardoGoulart1 commented Dec 7, 2022

caleb531 commented Dec 7, 2022

eliotwrobson commented Dec 7, 2022

EduardoGoulart1 commented Dec 20, 2022 • edited Loading

eliotwrobson commented Dec 20, 2022

caleb531 commented Dec 20, 2022

EduardoGoulart1 commented Apr 3, 2023

caleb531 commented Apr 4, 2023

EduardoGoulart1 commented Dec 2, 2022 •

edited

Loading

EduardoGoulart1 commented Dec 2, 2022 •

edited

Loading

eliotwrobson commented Dec 2, 2022 •

edited

Loading

EduardoGoulart1 commented Dec 2, 2022 •

edited

Loading

eliotwrobson commented Dec 2, 2022 •

edited

Loading

EduardoGoulart1 commented Dec 20, 2022 •

edited

Loading