Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Remove RevString #23612

Closed
wants to merge 2 commits into from
Closed

WIP: Remove RevString #23612

wants to merge 2 commits into from

Conversation

ararslan
Copy link
Member

@ararslan ararslan commented Sep 7, 2017

This PR removes RevString from Base. It's the companion to JuliaStrings/LegacyStrings.jl#22, which adds RevString to LegacyStrings.

Fixes #22611.

@ararslan ararslan added deprecation This change introduces or involves a deprecation strings "Strings!" labels Sep 7, 2017
@ararslan
Copy link
Member Author

ararslan commented Sep 7, 2017

@nanosoldier runbenchmarks("strings", vs=":master")

@ararslan ararslan added this to the 1.0 milestone Sep 7, 2017
@nanosoldier
Copy link
Collaborator

Your benchmark job has completed, but no benchmarks were actually executed. Perhaps your tag predicate contains misspelled tags? cc @ararslan

@ararslan
Copy link
Member Author

ararslan commented Sep 7, 2017

Whoops. @nanosoldier runbenchmarks("string", vs=":master")

@ararslan ararslan changed the title Remove RevString WIP: Remove RevString Sep 7, 2017
@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@@ -99,32 +99,18 @@ end

## reversed strings without data movement ##
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should be removed as well?

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It's hard to check that all computations of indices are correct, but this looks good and tests pass...

@@ -239,18 +238,19 @@ function find_start_brace(s::AbstractString; c_start='(', c_end=')')
in_back_ticks = true
end
else
if !in_back_ticks && !in_double_quotes && c == '\'' && !done(r, i) && next(r, i)[1]!='\\'
if !in_back_ticks && !in_double_quotes && c == '\'' && i > 0 && s[prevind(s, i)] != '\\'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make sense to save the result of prevind(s, i) since it's also used below.

e-j+1
function rsearch(s::AbstractString, c::Chars, i::Integer=start(s))
if isempty(c)
return 1 <= i <= nextind(s, endof(s)) ? i : throw(BoundsError(s, i))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think indices beyond endof(s) should be allowed. See also #22548 and #22572. You could probably use checkbounds for that? Also, it looks like you could first check the bounds using the block below, and only then handle the special case of empty c?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

endof(s) is the last valid index, so 1 <= i <= endof(s) should be sufficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I was just mimicking the structure of the search code here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK. So maybe better keep the same (weird) approach as search and fix that later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like I might as well go with the better approach now and update search to match.

end

function _rsearchindex(s, t, i)
if isempty(t)
return 1 <= i <= nextind(s,endof(s)) ? i :
throw(BoundsError(s, i))
end
t = RevString(t)
rs = RevString(s)
t = reverse(t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not implement the reversed algorithm just like above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't sure how best to do it. Suggestions welcome.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know... Doesn't changing every next into prevind work like you did above? I admit that method looks a bit scary. Or just keep RevString as an internal helper type just for that method. ;-)

@@ -1,6 +1,6 @@
# This file is a part of Julia. License is MIT: https://julialang.org/license

# SubString and RevString types
# SubString type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is kind of inconsistent now. Maybe better rename it to substring.jl and move unrelated functions elsewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Once I'm done with implementations and cleanup here I'll make that a separate commit.

@@ -99,32 +99,18 @@ end

## reversed strings without data movement ##
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"without data movement"?

@@ -158,7 +158,6 @@
<item> RegexMatch </item>
<item> RegexMatchIterator </item>
<item> RepString </item>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could remove RepString too while you're at it (probably in other files too; but are they maintained?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that must have been an oversight in #19867. I don't even know what this file is for.

@@ -174,11 +165,6 @@ for T in (String, GenericString)
s = convert(T, string(prefix, c, suffix))
r = reverse(s)
ri = search(r, c)
@test r == RevString(s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, you don't check that r is correct now. Should probably hardcode the results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have more extensive tests for the correctness of reverse in test/strings/unicode.jl; I assume those would be sufficient.

@ararslan
Copy link
Member Author

ararslan commented Sep 8, 2017

@nanosoldier runbenchmarks("string", vs=":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here. cc @ararslan

@ararslan
Copy link
Member Author

ararslan commented Sep 8, 2017

To remove RevString we need a type for the general reverse(::AbstractString) case, and the most natural choice is String. However, there is a difference in indexing behavior, which is proving problematic:

julia> s = "X\U0001d4c1β\U0001d6a4\U0001d4c1β\U0001d6a4"
"X𝓁β𝚤𝓁β𝚤"

julia> b = SubString(s, 1, 8)
"X𝓁β𝚤"

julia> r = reverse(b)
"𝚤β𝓁X"

julia> search(r, 'X')
8

julia> search(String(r), 'X')
11

These results are both valid indices for their inputs and indeed they correspond to the same character (i.e. r[8] == String(r)[11] == 'X'). However, the difference in indexing implies that s[reverseind(s, i)] isn't necessarily valid, nor is it necessarily the same as reverse(s)[i]. This is causing the tests to fail. (And in fact, marking them as broken also causes the tests to fail, since in some cases it does work as expected.)

@ararslan
Copy link
Member Author

Bump. Any ideas about the indexing? It would also be good to hear from @stevengj, who introduced reverseind IIRC.

@nalimilan
Copy link
Member

Just to restate my position: let's get rid of reversind, which is used in only one place in Base.

@ararslan ararslan force-pushed the aa/no-revstring branch 2 times, most recently from ff9744e to 679e7ee Compare September 10, 2017 23:42
@ararslan
Copy link
Member Author

I'd be okay keeping reverseind if there's a potential resolution to the indexing issue, e.g. deprecating reverseind(s::AbstractString, i) in favor of reverseind(convert(String, s), i) or something so that the indexing matches, or revamping the implementation of reverseind so that it's aware that the input may have different indexing than reverse.

@ararslan ararslan requested a review from stevengj September 13, 2017 20:23
@nalimilan
Copy link
Member

BTW, I've noticed the following bug on current master:

julia> rsearch(SubString("", 1, 0), "")
ERROR: BoundsError: attempt to access ""
  at index [0]
Stacktrace:
 [1] _rsearchindex(::SubString{String}, ::String, ::Int64) at ./strings/search.jl:219
 [2] _rsearch(::SubString{String}, ::String, ::Int64) at ./strings/search.jl:357
 [3] rsearch(::SubString{String}, ::String) at ./strings/search.jl:365
 [4] macro expansion at ./REPL.jl:97 [inlined]
 [5] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73

@stevengj
Copy link
Member

The basic motivation for reverseind (from #9249) hasn't changed: the only reason we have Base.reverse is to support reversed regex searching and other reversed processing by external libraries (#6165). As long as we export reverse for such uses, we need to export reverseind as otherwise the index mapping is extremely error prone for UTF-8.

@ararslan
Copy link
Member Author

That's fine. What do you propose I do in this PR then, given the difference in indexing?

@stevengj
Copy link
Member

stevengj commented Sep 19, 2017

Remove reverse(::AbstractString)?

@ararslan
Copy link
Member Author

Yeah, I guess that would be okay. It'd be a pretty breaking change without providing a deprecation though. (Though I suppose not all that much more breaking than silently changing the behavior to convert to String.) Any idea what a sensible deprecation would look like?

@stevengj
Copy link
Member

Deprecate to reverse(String(s)) so that the conversion is explicit

@ararslan
Copy link
Member Author

Hm, I just noticed that in deprecating RepString, we didn't deprecate repeat(::AbstractString) but instead defined it to convert to String first. Should we be consistent with reverse and do the conversion (breaking change) rather than the deprecation?

matched = false
break
end
c, k = next(rs,k)
d, j = next(t,j)
# Using `reverseind` with `prevind` mimics `next` but for iteration over
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still wonder whether it wouldn't be more efficient to work with indices in the original string (using prevind), and only call reverseind once the match is found. Or maybe I'm missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, though this approach didn't introduce any regressions as far as Nanosoldier could tell. My problem with changing it is that it took me quite a while to figure out how to even get to this; translating this algorithm to be in reverse broke my brain a bit. I'd be happy to change it with some guidance though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah maybe just leave a TODO so that we can find it when looking for possible performance enhancements.

If you want to try changing it, I'd say you should just have to call prevind on both rs and t, extract the chars and compare them as currently? Actually I don't understand why reversind is needed here, nor how it could be correct. In the first iteration, k contains the index in s, not in reverse(s), so how can that work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish I knew.

@nalimilan
Copy link
Member

Hm, I just noticed that in deprecating RepString, we didn't deprecate repeat(::AbstractString) but instead defined it to convert to String first. Should we be consistent with reverse and do the conversion (breaking change) rather than the deprecation?

Yeah, why not. Converting to String is a reasonable fallback, and anyway by calling reverse you accepted (after this PR) the allocation of a string of the same length already.

@ararslan ararslan added the breaking This change will break code label Sep 21, 2017
@ararslan ararslan added the excision Removal of code from Base or the repository label Sep 29, 2017
@ararslan ararslan changed the title WIP: Remove RevString Remove RevString Oct 7, 2017
@ararslan
Copy link
Member Author

ararslan commented Oct 7, 2017

I think things should be all sorted out now, if folks want to have another look.

@nalimilan
Copy link
Member

This would look good... except that CI doesn't pass. :-p

@ararslan
Copy link
Member Author

ararslan commented Oct 7, 2017

Okay, well that's what I get for only running a subset of the tests locally. 😑

@ararslan ararslan changed the title Remove RevString WIP: Remove RevString Oct 7, 2017
NEWS.md Outdated
@@ -243,6 +246,9 @@ This section lists changes that do not have deprecation warnings.
* All command line arguments passed via `-e`, `-E`, and `-L` will be executed in the order
given on the command line ([#23665]).

* `reverse(::AbstractString)` now unconditionally returns a `String`. Previously it
returned a `RepString`, which has been removed from Base ([#23612]).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RepString -> RevString

The `RevString` type for lazily reversed strings has been moved to the
LegacyStrings package. Fixes #22611.

Calling `reverse` on an `AbstractString` with no more specific method
now unconditionally returns a `String`.
Move the non-substring related functionality to strings/basic.jl.
@StefanKarpinski
Copy link
Member

Hey @ararslan, just so you know, I'm about to push a branch that deletes RevString; it was surprisingly involved and involved changing a generic reverseind, which in turn required adding ncodeunits and thisind. Sorry for the effort you put into this :(

StefanKarpinski added a commit that referenced this pull request Nov 22, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this pull request Nov 22, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
@ararslan ararslan closed this Nov 22, 2017
@ararslan ararslan deleted the aa/no-revstring branch November 22, 2017 19:46
StefanKarpinski added a commit that referenced this pull request Nov 27, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this pull request Nov 28, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this pull request Nov 29, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this pull request Dec 4, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this pull request Dec 4, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
evetion pushed a commit to evetion/julia that referenced this pull request Dec 12, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close JuliaLang#22611
Close JuliaLang#24613

See also: JuliaLang#10593 JuliaLang#23612 JuliaLang#24103
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This change will break code deprecation This change introduces or involves a deprecation excision Removal of code from Base or the repository strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants