Flavour of RegEx allowing negation #349

Open
atomczak opened this issue Sep 26, 2024 Discussed in #347 · 8 comments
Labels
discuss & decide documentation Improvements or additions to documentation improvement New feature or request

@atomczak
Contributor

Discussed in #347

Originally posted by atomczak September 25, 2024
I want a requirement that applies to all walls that don't have the "FireRating" property. In other words, I want an applicability that includes the IFCWALL entity and a property allowing any name/pset except "FireRating" and "Pset_WallCommon". I thought I could do that with a pattern, but I see that the XML (XSD) flavour doesn't support negative lookahead. Has anyone found a way to express it?
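
To illustrate what is being asked for, here is a minimal sketch of the negation using Python's `re` module, which does support negative lookahead. The pattern and the helper name `name_allowed` are illustrative assumptions, not part of IDS; `re.fullmatch` is used to mimic the whole-value matching of XSD patterns.

```python
import re

# Hypothetical pattern: any property name except the literal "FireRating".
# (?!FireRating$) is a negative lookahead, a construct xs:pattern lacks.
NOT_FIRE_RATING = re.compile(r"(?!FireRating$).*")


def name_allowed(name: str) -> bool:
    """True when the whole name matches, i.e. it is anything but "FireRating"."""
    return NOT_FIRE_RATING.fullmatch(name) is not None


for candidate in ["FireRating", "FireRatingExtra", "IsExternal"]:
    print(candidate, name_allowed(candidate))
```

The same pattern is rejected or misread by an engine limited to the XSD flavour, which is the gap this issue is about.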

Use Case:

Suppose we have two properties, “ID 1” and “ID 2”. The BIM team is required to assign either one of the two. It is difficult to write the “ids:applicability” section for such cases, e.g.:

<ids:applicability>
ID 1 property does not exist
</ids:applicability>
<ids:requirements>
ID 2 is mandatory
</ids:requirements>

As a solution, I propose that we reconsider agreeing on a RegEx flavour other than XSD, such as PCRE, JavaScript or Python. We could also agree on an explicit IDS flavour, but that would make implementation harder.
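
The intended semantics of the “ID 1” / “ID 2” use case can be sketched in plain Python. This is only an illustration of the rule a checker would need to evaluate; the function names and the dictionary representation of an element's properties are assumptions.

```python
def is_applicable(properties: dict[str, str]) -> bool:
    """Applicability: the element does NOT have an "ID 1" property."""
    return "ID 1" not in properties


def passes(properties: dict[str, str]) -> bool:
    """Elements without "ID 1" must carry "ID 2"; all others are out of scope."""
    if not is_applicable(properties):
        return True
    return "ID 2" in properties


print(passes({"ID 1": "A-001"}))   # has ID 1, so the requirement does not apply
print(passes({"ID 2": "B-001"}))   # no ID 1, but ID 2 is present
print(passes({"Name": "Wall"}))    # neither property is present
```

The difficulty raised here is that the first function, a "property is absent" test, is what cannot currently be written in the applicability section.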

@atomczak atomczak added documentation Improvements or additions to documentation improvement New feature or request decide labels Sep 26, 2024
@atomczak atomczak added this to the 1.0 milestone Sep 26, 2024
@NickNisbet

NickNisbet commented Sep 26, 2024 via email

@atomczak
Contributor Author

Yes, in RASE terms that would be a Selection. Choosing a RegEx flavour that supports negation would enable Selection/Exception use cases without changing the IDS schema.

@andyward
Contributor

Worth referencing #29 for the origins of the decision to only target xs:pattern regexes, and #177 for the wider Selection/Exclusion topic with regex.

I also found a useful resource comparing the different Regex flavours https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816 (with a handy table in the comments)

It feels like we should be able to find a baseline set of features that's above xsd's limited pattern that is widely enough supported to give a useful trade-off between functionality and breadth of implementor support.

Bearing in mind that almost all tech platforms can access 3rd party regex engine implementations, the decision to constrain IDS to a small subset just because, say, Golang doesn't support negative lookaheads in its standard library seems limiting.

Given the ubiquity of JavaScript and its standardisation, I'd support basing features on the ECMA feature set. We could always subtract a feature if it looked like a major tech segment would be hindered by its inclusion.

@aothms

aothms commented Sep 26, 2024

I think the concern here is not so much the features themselves, but that the complex interaction of such features will give rise to differences between implementations. Even if ports of well-known libraries exist, taking @andyward's example of Go and PCRE (which I think is the closest to a de facto standard and is used as a reference by other implementations), it would mean being stuck with a package last updated 8 years ago and used 26 times: https://pkg.go.dev/github.com/gijsbers/go-pcre. Maybe I'm unlucky in my example, but I don't think this is the best way forward. I prefer to keep regexes as simple pattern matches and use proper semantic structures where needed.

@atomczak
Contributor Author

For a moment, I thought (test){0} would work, but it looks like instead of meaning "there must be no 'test'", it works like "no 'test' is fine".
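
A quick way to see why `(test){0}` does not act as a negation is shown below, using Python's `re` as a stand-in engine (an assumption; other engines should behave alike for this construct). The quantifier `{0}` means "repeat the group zero times", so the pattern only ever matches the empty string.

```python
import re

# {0} repeats the group exactly zero times, so the pattern is equivalent to
# an empty pattern; it never asserts that "test" is absent from the value.
PATTERN = re.compile(r"(test){0}")


def whole_value_matches(value: str) -> bool:
    """Whole-value matching, as XSD patterns are applied."""
    return PATTERN.fullmatch(value) is not None


def matches_somewhere(value: str) -> bool:
    """Substring-style matching, as many engines do by default."""
    return PATTERN.search(value) is not None


for value in ["", "test", "other"]:
    print(repr(value), whole_value_matches(value), matches_somewhere(value))
```

With whole-value matching only the empty string is accepted, and with substring matching everything is accepted, including values that contain "test". Neither behaviour excludes "test" specifically.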

How about, instead of choosing one flavour, we select only the shared aspects of popular regexes, based on the table shared by @andyward? This way we could make sure those are supported by most languages.

Engines compared: .NET, Java, PCRE, Python, XML (per-engine support marks not preserved).

| Category | Feature |
| --- | --- |
| Characters | Backslash escapes one metacharacter |
| Characters | \n (LF), \r (CR) and \t (tab) |
| Character Classes or Character Sets [abc] | [abc] character class |
| Character Classes or Character Sets [abc] | [^abc] negated character class |
| Character Classes or Character Sets [abc] | [a-z] character class range |
| Character Classes or Character Sets [abc] | Backslash escapes one character class metacharacter |
| Character Classes or Character Sets [abc] | \D, \W and \S shorthand negated character classes |
| Dot | . (dot; any character except line break) |
| Alternation | \| (alternation) |
| Quantifiers | ? (0 or 1) |
| Quantifiers | * (0 or more) |
| Quantifiers | + (1 or more) |
| Quantifiers | {n} (exactly n) |
| Quantifiers | {n,m} (between n and m) |
| Quantifiers | {n,} (n or more) |
| Grouping and Backreferences | (regex) (numbered capturing group) |
| Characters | \x00 through \xFF (ASCII character) |
| Characters | \f (form feed) and \v (vtab) |
| Characters | \a (bell) |
| Character Classes or Character Sets [abc] | [\b] backspace |
| Anchors | ^ (start of string/line) |
| Anchors | $ (end of string/line) |
| Anchors | \A (start of string) |
| Quantifiers | ? after any of the above quantifiers to make it "lazy" |
| Grouping and Backreferences | (?:regex) (non-capturing group) |
| Grouping and Backreferences | \1 through \9 (backreferences) |
| Modifiers | (?i) (case insensitive) |
| Modifiers | (?s) (dot matches newlines) |
| Modifiers | (?m) (^ and $ match at line breaks) |
| Modifiers | (?x) (free-spacing mode) |
| Lookaround | (?=regex) (positive lookahead) |
| Lookaround | (?!regex) (negative lookahead) |
| Free-Spacing Syntax | Free-spacing syntax supported |
| Grouping and Backreferences | \10 through \99 (backreferences) |
| Grouping and Backreferences | Backreferences to non-existent groups are an error |
| Grouping and Backreferences | Backreferences to failed groups also fail |
| Free-Spacing Syntax | # starts a comment |

We could also add regex test cases making sure that each implementation interprets regex features the same way.
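
Such shared test cases could take the shape sketched below: a list of (pattern, value, expected) triples that each implementation runs through its own engine. The cases, the `RegexCase` type and the `run_cases` helper are hypothetical examples, not an agreed IDS test suite; Python's `re.fullmatch` stands in for one implementation's matcher.

```python
import re
from typing import Callable, NamedTuple


class RegexCase(NamedTuple):
    pattern: str
    value: str
    expected: bool


# Hypothetical cases; a real suite would be agreed on by the implementers.
CASES = [
    RegexCase(r"[A-Z]{2}[0-9]+", "EW01", True),
    RegexCase(r"[A-Z]{2}[0-9]+", "ew01", False),
    RegexCase(r"(Wall|Slab)-\d{3}", "Wall-001", True),
    RegexCase(r"(Wall|Slab)-\d{3}", "Wall-01", False),
    RegexCase(r"(?!FireRating$).*", "FireRating", False),
    RegexCase(r"(?!FireRating$).*", "IsExternal", True),
]


def run_cases(matcher: Callable[[str, str], bool], cases=CASES) -> list[RegexCase]:
    """Return the cases on which `matcher` disagrees with the expected result."""
    return [case for case in cases if matcher(case.pattern, case.value) != case.expected]


def python_matcher(pattern: str, value: str) -> bool:
    """One implementation under test: whole-value matching with Python's re."""
    return re.fullmatch(pattern, value) is not None


failures = run_cases(python_matcher)
print("disagreements:", failures)
```

Each implementation would plug in its own `matcher`; any non-empty result points at a feature that is interpreted differently.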

@andyward
Contributor

I prefer to keep regexes as simple pattern matches and use proper semantic structures where needed.

Totally agree. The old jwz quip about "having a problem ... using regex and now having two problems" comes to mind. But given the design choices in IDS 1.0, patterns are often the only 'trap door' we have available to implement some of the more complex requirements. In particular, the lack of 'exclusions' is a blocker.

But the point about Go and its patchy PCRE regex support kind of backs up my point. In our small niche, I'd wager every single IDS solution out there, whether commercial or open source, is built in one of Java, Python, .NET, PHP or JavaScript. While I know they are all great languages, I'm unaware of any Go, Haskell or indeed Fortran 77 implementations of IFC (which is a prerequisite for IDS model checking), but we're concerning ourselves with how IDS could be supported in languages that have no penetration in our problem space. I feel like we maybe need to apply a bit of the Pareto principle?

@andyward
Contributor

andyward commented Sep 26, 2024

How about, instead of choosing one flavour, we select only the shared aspects of popular regexes, based on the table shared by @andyward? This way we could make sure those are supported by most languages.

Great, I was doing something similar. There's significant commonality amongst those 4 mainstream engines (and seemingly ECMA too).

I agree on the test cases - this would help baseline what I suspect is a lot of different behaviour across implementors. There's probably only 5-6 of those features that are ever going to be used so that may limit the testing. (Anchors, Lookaround and maybe the modifiers)
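
Anchoring is a good example of where behaviour can silently differ. XSD patterns are applied to the whole value, while most general-purpose engines look for a match anywhere in the string unless anchors are written explicitly. The sketch below shows the two behaviours side by side with Python's `re`; the helper names and sample values are illustrative assumptions.

```python
import re


def matches_unanchored(pattern: str, value: str) -> bool:
    """Substring-style matching: the default 'search' behaviour of many engines."""
    return re.search(pattern, value) is not None


def matches_whole_value(pattern: str, value: str) -> bool:
    """Whole-value matching: how an XSD-style pattern facet is applied."""
    return re.fullmatch(pattern, value) is not None


value = "CurtainWall-01"
print(matches_unanchored(r"Wall", value))       # "Wall" occurs inside the value
print(matches_whole_value(r"Wall", value))      # the value is not exactly "Wall"
print(matches_whole_value(r".*Wall.*", value))  # explicit wildcards restore the match
```

A shared test case that pairs a pattern such as `Wall` with a longer value would catch an implementation that forgot to anchor.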

@atomczak
Contributor Author

atomczak commented Sep 26, 2024

There you go, I added ECMA (and JGSoft), resulting in excluding ten more rows (good!):

Engines compared: .NET, Java, PCRE, Python, JGsoft, ECMA, XML (per-engine support marks not preserved).

| Category | Feature |
| --- | --- |
| Characters | Backslash escapes one metacharacter |
| Characters | \n (LF), \r (CR) and \t (tab) |
| Character Classes/Sets | [abc] character class |
| Character Classes/Sets | [^abc] negated character class |
| Character Classes/Sets | [a-z] character class range |
| Character Classes/Sets | Backslash escapes one character class metacharacter |
| Character Classes/Sets | \D, \W and \S shorthand negated character classes |
| Dot | . (dot; any character except line break) |
| Alternation | \| (alternation) |
| Quantifiers | ? (0 or 1) |
| Quantifiers | * (0 or more) |
| Quantifiers | + (1 or more) |
| Quantifiers | {n} (exactly n) |
| Quantifiers | {n,m} (between n and m) |
| Quantifiers | {n,} (n or more) |
| Grouping and Backreferences | (regex) (numbered capturing group) |
| Characters | \x00 through \xFF (ASCII character) |
| Characters | \f (form feed) and \v (vtab) |
| Character Classes/Sets | [\b] backspace |
| Anchors | ^ (start of string/line) |
| Anchors | $ (end of string/line) |
| Quantifiers | ? after any of the above quantifiers to make it "lazy" |
| Grouping and Backreferences | (?:regex) (non-capturing group) |
| Grouping and Backreferences | \1 through \9 (backreferences) |
| Lookaround | (?=regex) (positive lookahead) |
| Lookaround | (?!regex) (negative lookahead) |
| Grouping and Backreferences | \10 through \99 (backreferences) |
| Characters | \a (bell) |
| Anchors | \A (start of string) |
| Modifiers | (?i) (case insensitive) |
| Modifiers | (?s) (dot matches newlines) |
| Modifiers | (?m) (^ and $ match at line breaks) |
| Modifiers | (?x) (free-spacing mode) |
| Free-Spacing Syntax | Free-spacing syntax supported |
| Grouping and Backreferences | Backreferences to non-existent groups are an error |
| Grouping and Backreferences | Backreferences to failed groups also fail |
| Free-Spacing Syntax | # starts a comment |
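
If a shared subset were agreed on, tools could warn authors when a pattern uses a construct outside it. The sketch below is a deliberately naive text scan; the list of excluded constructs (inline modifiers, `\A`, `\a`) is assumed here purely for illustration and is not a decision recorded in this thread.

```python
import re

# Assumed for illustration only: constructs treated as outside the shared subset.
UNSUPPORTED = {
    "inline modifier": re.compile(r"\(\?[imsx]+\)"),
    "start-of-string anchor": re.compile(r"\\A"),
    "bell escape": re.compile(r"\\a"),
}


def unsupported_constructs(pattern: str) -> list[str]:
    """Names of assumed-unsupported constructs found in `pattern`.

    Naive scan: it does not parse the regex, so an escaped backslash
    followed by 'A' or 'a' would be reported as a false positive.
    """
    return [name for name, finder in UNSUPPORTED.items() if finder.search(pattern)]


for candidate in [r"(?i)wall", r"\AWall", r"(?!FireRating$).*"]:
    print(candidate, unsupported_constructs(candidate))
```

A production check would need a real regex parser, but even a scan like this could flag most accidental uses of engine-specific syntax in an IDS file.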

4 participants