Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make startswith, endswith work with Regex #29790

Merged
merged 17 commits into from
Feb 1, 2019

Conversation

dalum
Copy link
Contributor

@dalum dalum commented Oct 24, 2018

The rationale behind this PR is two-fold.

The first is motivated by a real use–case of checking the file extension of a file, where something like occursin(r"\.skf$"i, filename) seems to employ a "reverse query" logic to me that reduces readability: does this pattern, which must be at the end, occur in filename? Compared with: endswith(filename, r"\.skf"i): does filename have this pattern at the end?

The second is an argument about genericity, where a function such as occurs_not_endswith(pat, s) = occursin(pat, s) && !endswith(s, pat) presently wouldn't work for pat::Regex. I don't have a use for this, but it seems to me that it should work.

As an aside, the implementation here allows use such as endswith("abc\ndef", r"c"m) to check if any line in the string ends with "c".

The implementation of course has the slight overhead that if passed an already compiled Regex, this will generate and compile a new Regex on every call, but I would think that the cases where this is a real issue are not that frequent. And users who need this little bit of extra speed could probably be expected to understand the caveats of using a convenience function like the ones introduced here. 🙂

@dalum
Copy link
Contributor Author

dalum commented Oct 25, 2018

From suggestion by @ScottPJones, I changed the methods to use ANCHORED and ENDANCHORED in the compile options instead of manipulating the pattern. I had to add ENDANCHORED to the allowed COMPILE_MASK in pcre.jl, and I have no clue if there was a reason it was left out or not. At least the tests are passing locally, so hopefully there is no deeper meaning to its omission.

test/regex.jl Outdated Show resolved Hide resolved
Copy link
Member

@StefanKarpinski StefanKarpinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. I like the version with anchor options very much.

@dalum
Copy link
Contributor Author

dalum commented Oct 25, 2018

The Windows 32 build is failing over something related to profiling. Can't quite figure out if it is related to this PR. 🤷‍♀️

Copy link
Member

@fredrikekre fredrikekre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some simple docstrings?

@fredrikekre fredrikekre added the strings "Strings!" label Oct 25, 2018
base/regex.jl Outdated Show resolved Hide resolved
@stev47
Copy link
Contributor

stev47 commented Oct 26, 2018

Isn't the weirdness of occursin(r"\.skf$"i, filename) mostly a naming problem? A pattern doesn't "occur" in a string, it "matches" a string.
Extending methods startswith and endswith (that are used since the regex counterpart would be slower) to incorporate functionality that regex already delivers on might be unnecessary.
Thinking out loudly: somebody reading a regex pattern will also understand that one extra character ^ or $.

Copy link
Member

@KristofferC KristofferC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to address the 100x slowdown over occursin.

chethega and others added 2 commits October 28, 2018 10:21
Co-Authored-By: dalum <17059936+dalum@users.noreply.github.com>
Co-Authored-By: dalum <17059936+dalum@users.noreply.github.com>
base/regex.jl Outdated Show resolved Hide resolved
@dalum
Copy link
Contributor Author

dalum commented Oct 30, 2018

With the latest changes thanks to @chethega, this version is now only 1.3x slower than using occursin on my machine when the pattern doesn't match, and about 2x slower when the pattern matches. I will add a note to the docstrings mentioning that occursin should be preferred in performance critical situations.

base/regex.jl Outdated Show resolved Hide resolved
base/regex.jl Outdated Show resolved Hide resolved
@KristofferC KristofferC dismissed their stale review October 30, 2018 16:18

1.3x seems good enough :)

@dalum
Copy link
Contributor Author

dalum commented Feb 1, 2019

Error messages during bootstrap are obscure. This bit meant that it failed to interpolate a $ inside a docstring: 😕

Core.CodeInfo(code=Array{Any, (2,)}[
  Expr(:call, :include, "regex.jl"),
  Expr(:return, nothing)], codelocs=Array{Int32, (2,)}[1, 1], method_for_inference_limit_heuristics=nothing, ssavaluetypes=2, linetable=Array{Any, (1,)}[Core.LineInfoNode(mod=Base, method=Symbol("top-level scope"), file=Symbol("Base.jl"), line=207, inlined_at=0)], ssaflags=Array{UInt8, (0,)}[], slotflags=Array{UInt8, (0,)}[], slotnames=Array{Any, (0,)}[], inferred=false, inlineable=false, propagate_inbounds=false, pure=false)Core.CodeInfo(code=Array{Any, (3,)}[
  Expr(:call, Core.getproperty, :Core, :(:include)),
  Expr(:call, SSAValue(1), :Main, "Base.jl"),
  Expr(:return, nothing)], codelocs=Array{Int32, (3,)}[1, 1, 1], method_for_inference_limit_heuristics=nothing, ssavaluetypes=3, linetable=Array{Any, (1,)}[Core.LineInfoNode(mod=Main, method=Symbol("top-level scope"), file=Symbol("sysimg.jl"), line=3, inlined_at=0)], ssaflags=Array{UInt8, (0,)}[], slotflags=Array{UInt8, (0,)}[], slotnames=Array{Any, (0,)}[], inferred=false, inlineable=false, propagate_inbounds=false, pure=false)

@dalum
Copy link
Contributor Author

dalum commented Feb 1, 2019

@StefanKarpinski @fredrikekre Ping! Tests are passing. I think it's ready now. 🎉

@StefanKarpinski StefanKarpinski merged commit 2b4f003 into JuliaLang:master Feb 1, 2019
@dalum dalum deleted the dalum/regex branch February 2, 2019 17:55
@fredrikekre fredrikekre added needs news A NEWS entry is required for this change needs compat annotation Add !!! compat "Julia x.y" to the docstring labels Feb 5, 2019
fredrikekre added a commit that referenced this pull request Feb 6, 2019
fredrikekre added a commit that referenced this pull request Feb 6, 2019
@fredrikekre fredrikekre mentioned this pull request Feb 6, 2019
@fredrikekre fredrikekre removed needs compat annotation Add !!! compat "Julia x.y" to the docstring needs news A NEWS entry is required for this change labels Feb 6, 2019
fredrikekre added a commit that referenced this pull request Feb 6, 2019
compat annotations, news and manual updates for #29790, #30496 and #30915.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants