-
Notifications
You must be signed in to change notification settings - Fork 307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regular expression evaluation should be fixed. #243
Conversation
👍 , while java and python inbuilt regex both have minor differences to perl, they could both use libpcre. Alternative would be posix regex, but I don't know that the libraries are as well supported. |
I would really prefer to see this standardized on POSIX basic regexes but I think this is a great thing to do in general |
Note that examples should take heed of the universal string escaping rules in https://github.com/openwdl/wdl/blob/master/versions/development/SPEC.md#whitespace-strings-identifiers-constants which should happen before we being the regex parsing itself EDIT:
|
Not a particularly strong opinion but I prefer perl over posix because it uses "escape to suppress metacharacters" rather than "escape to enable metacharacters" which is (a) more natural to my way of thinking and (b) closer to the current Cromwell behavior. I realize that "what we happen to do" is a pretty weak argument, but there it is... 😄 |
The reason I prefer posix basic is that it is more standard and more common, i.e. less surprising to people. And you're right that "what we happen to do [in one implementation]" is a pretty weak argument :) |
After a little looking into this, I'd be:
I'd push back a little on "less surprising" since I personally find it much more surprising to have to use
FWIW in case anyone else is trying to find one, while looking around I found a nice page on the POSIX ERE vs BRE differences: https://www.regular-expressions.info/posix.html |
Heads up that I'm going to vote against anything other than basic POSIX, but I'm just one vote. |
I agree with @cjllanwarne that POSIX ERE would be the way to go. The syntax is most similar to the way regex works in languages I know (Python, Scala) and considering the users come from a scientific background they probably have some experience with python regexes. Just my 2 cents. |
I also agree the POSIX ERE is the best option. Having to escape things like group delimiters seems counter-intuitive for me (although people used to using other REGEX Engines find it fine). Extending on what @rhpvorderman mentioned, other comonly used tools, like egrep, are quite similar in syntax to ERE, making it I think the best choice moving forward |
Since we seem to be gravitating towards POSIX ERE (or BRE) I have changed the PR to reflect this. |
Changing my tune re my hard stance on BRE, I'm cool with ERE |
Wow, it seems everyone is in favor of POSIX ERE now. So are there any changes that need to be made to the contents of the pull request before we can start voting? |
@@ -2998,7 +2998,8 @@ Varieties of the `size` function also exist for the following compound types. Th | |||
## String sub(String, String, String) | |||
|
|||
Given 3 String parameters `input`, `pattern`, `replace`, this function will replace any occurrence matching `pattern` in `input` by `replace`. | |||
`pattern` is expected to be a [regular expression](https://en.wikipedia.org/wiki/Regular_expression). Details of regex evaluation will depend on the execution engine running the WDL. | |||
`pattern` is expected to be a [regular expression](https://en.wikipedia.org/wiki/Regular_expression). | |||
The regular expression will be evaluated as a [POSIX Extended Regular Expression (ERE)](https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended). | |||
|
|||
Example 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not already the case can you make sure the examples are valid ERE. And if nothing exercises particulars of ERE, perhaps throw in an example of that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the existing examples and threw in some new ones. I hope they are up to standards.
Cool, since I was the only real dissenter anyways and that's cleared up let's open up for voting |
👍 can anyone comment on the current cromwell compliance? |
👍 |
ERE |
👍 |
2 similar comments
👍 |
👍 |
Passes unanimously and is waiting for implementation |
String chocoearlylate = sub(chocolike, "[^ ]late", "early") # I like chocoearly when it's late | ||
String choco4 = sub(chocolike, " [:alpha:]{4} ", " 4444 ") # I 4444 chocolate 4444 it's late |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this has already passed voting, but I briefly looked into how this might be implemented in miniwdl and noticed that there are some typos in these examples. They should look like the following:
String chocoearlylate = sub(chocolike, "[^ ]late", "early") # I like chocearly when it's late
The second o
will also be replaced, due to the [^ ]
.
String choco4 = sub(chocolike, " [[:alpha:]]{4} ", " 4444 ") # I 4444 chocolate 4444 it's late
Character classes require and additional set of []
.
This is now implemented on miniwdl's master branch (through chanzuckerberg/miniwdl#321). |
To pick low-hanging fruit, I think this PR can merged along with an appropriate entry in CHANGELOG. Any objections? |
I have no objections because it meets the criteria for merge (one implemented engine). What is the status of this in other engines like Cromwell or DnaNexus? |
Regex evaluation has been stable for quite some time in Cromwell. If it is correct according to the spec I don't know. But @cjllanwarne or @aednichols probably knows more. |
I've heard (out-of-band) a concern about lack of trustworthy POSIX ERE implementation for JVM stacks. I'm not too familiar with the ecosystem myself, does anyone have a suggestion? |
If that is the case we can still change the spec 😇 ... The point of this change is to have all execution use the same regex evaluation. Which regex evaluation is of much lesser importance, as long as it is something easy to get into. If POSIX ERE is hard to implement in scala/java code we can look for an alternative that is easy in Python, Java and Scala. (I do not know of execution engines in other languages as of yet.) |
Thanks @mlin and @patmagee. Looking back at the history of this thread, I think perl was the other main contender (and indeed, a library was mooted: It's not quite as nice-looking as POSIX ERE for the reasons given further up in the thread, but I agree that "standard across all languages" is a bigger plus than specific syntax IMO (unless we want to get into the business or implementing our own regex support, which I think we really probably don't...) |
@cjllanwarne @mlin @patmagee The speed is not necessary for WDL necessarily given the amount of regexes in the average
Golang's regexp library implements re2's syntax: https://golang.org/pkg/regexp/. It seems like re2 has got all our bases covered as the Java library can be easily implemented in the current Scala implementations and there is a Python library available that provides python bindings for re2, which works for the current python implementations. Furthermore this will create no barriers for people wanting to implement a new engine or wdl-syntax tool in C, C++, Python, Java, Scala, Golang, R, Ruby, Node.js or Perl. I think that is a wide enough language base. EDIT: The syntax for re2 can be found here: https://github.com/google/re2/wiki/Syntax |
Thank you @rhpvorderman for doing the research. This sounds promising indeed. I will try to do a little due diligence on the re2 Python bindings as there seem to be numerous versions of it. Even with appropriate language bindings I presume all users would need to install the re2 OS package (apt/yum/brew/etc.), which might present a small roadbump compared to the ideal of standard syntax implemented natively in each language. Of course that ideal might be unattainable here, making the comparison quite moot 😅 |
@mlin. Damn, you are correct. Unfortunately the same holds true for PCRE bindings. But re2 is much more supported. |
This is implemented in wdlTools |
This is in v1.1 and will be propagated to development. |
See broadinstitute/cromwell#3990 .
Even within the same execution engine, not having specified how the regex should be evaluated leads to portability problems. Not to mention between execution engines.
This should be fixed to a certain standard.
I opted for PERL-style here because PERL is supposed to be the king of regex. But It does not really matter which standard is chosen, as long as there is a standard.Given the discussion below POSIX ERE are now set as standard.