Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement of the functions for handling string indexing #23765

Closed
bkamins opened this issue Sep 19, 2017 · 26 comments
Closed

Improvement of the functions for handling string indexing #23765

bkamins opened this issue Sep 19, 2017 · 26 comments
Labels
strings "Strings!"

Comments

@bkamins
Copy link
Member

bkamins commented Sep 19, 2017

Assuming that #22572 and #22511 go through tests and get accepted I have the following proposal.

My experience from making #22572 and #22511 go though CI show that I would expect many users to make errors in indexing strings using something like somestring[2:end-1]. The problem is that such a construct will pass typical tests as it is valid for ASCII characters but invalid in general.

Here are my thoughts how this risk could be minimized:

  • improve the presentation of the problem in https://docs.julialang.org/en/latest/manual/strings/#man-strings-1 (especially the part starting with You can perform arithmetic and other operations with end, just like a normal value: that actually suggests the wrong way to index strings - in my opinion - although it is valid); in particular describe nextind and prevind functions and the use of SubString type;
  • add additional argument count::Int to nextind and prevind that defaults to 1 and indicates the number of characters to move in nextind/prevind (such cases happen in the code);
  • add additional argument count::Int to chop that defaults to 1 giving number of characters to remove at the end of the string;
  • add a function lchop(s::AbstractString, count:Int) = SubString(s, nextind(s, 1, count), endof(s)) (I am not sure about a name) that is identical to chop but removes characters from the beginning of the string.

This list would cover the most common cases I have encountered. I do not make a PR here because #22572 and #22511 have to be merged first and maybe there would be already some comments regarding the proposal.

@nalimilan
Copy link
Member

Makes sense to me, feel free to file separate PRs (unless somebody disagrees). However I'm less convinced about lchop since it wouldn't be symmetric with chop. Deprecating chop in favor of rchop could make sense, but no language seems to do that except OCaml's ExtLib. Relevant Rosetta page: https://www.rosettacode.org/wiki/Substring/Top_and_tail (with a broken example for Julia, FWIW :-).

See also the radical approach at #9297 (which is unlikely to be implemented in 1.0, and maybe ever).

@bkamins
Copy link
Member Author

bkamins commented Sep 19, 2017

I have not been aware of #9297 and have written something similar independently for testing purposes :).
I believe we need lchop but I was not sure about the name either. Another solutions I was thinking about were:

  • adding third parameter tail::Bool=true;
  • stating that when count<0 we chop -count characters from the head;
  • make a definition chop(s::AbstractString, tail::Int=1, head::Int=0) allowing to specify number of characters to chop from the tail and from the head at the same time.

@JeffBezanson JeffBezanson added the strings "Strings!" label Sep 19, 2017
bkamins added a commit to bkamins/julia that referenced this issue Sep 20, 2017
@KristofferC We will have to live with such hacks until JuliaLang#23765 is implemented.
bkamins added a commit to bkamins/julia that referenced this issue Sep 20, 2017
@KristofferC We will have to live with such hacks until JuliaLang#23765 is implemented.
@bkamins
Copy link
Member Author

bkamins commented Sep 21, 2017

Having reviewed the code base for common cases of what portions of strings are extracted the following patterns appear (along with what I propose to implement):

  • get first nchar characters from a string:

    extend first function for AbstractString with additional argument nchar with default value equal to 1;

  • get last nchar characters from a string:

    extend last function for AbstractString with additional argument nchar with default value equal to 1;

  • chop last tail characters from a string or chop first head characters from a string:

    extend chop function to the following signature chop(s::AbstractString, head::Integer=0, tail::Integer=1), where head and tail give number of characters to chop from head and tail;

  • get byte index of a character nchar characters to the left from a given byte index:

    extend prevind function with nchar argument with default value equal to 1;

  • get byte index of a character nchar characters to the right from a given byte index:

    extend nextind function with nchar argument with default value equal to 1;

If this would be acceptable I can push an appropriate PR.

@nalimilan
Copy link
Member

Makes sense to me. first(x, n) would even be useful for any collection (DataArrays used to provide a head function for that inspired from R).

@bkamins
Copy link
Member Author

bkamins commented Sep 21, 2017

Now I see that there is a problem with the reuse of first and last as their return type is eltype(x). So for strings first("abc") is 'a' not "a". Maybe introduce head(x,n) and tail(x,n)?
However, I am not sure if they are not used in some standard package.

@nalimilan
Copy link
Member

Good catch, but maybe that's not an issue: the one-argument method has to return an eltype(x) object, but the two-argument method has to return a collection since the number of requested elements could be one or several.

@bkamins
Copy link
Member Author

bkamins commented Sep 21, 2017

OK

@bkamins
Copy link
Member Author

bkamins commented Sep 21, 2017

I have decided to partition this issue into several PRs to make it more manageable #23805 is the first (fundamental low level) part.

@StefanKarpinski
Copy link
Member

head(x, n) and tail(x, n) may be better names for first(x, n) and last(x, n) since otherwise there's possible confusion that first(x) and first(x, 1) don't return the same thing.

@nalimilan
Copy link
Member

@StefanKarpinski Is the confusion really problematic though? I'm not sure what's our policy regarding the difference between one- and two-argument methods of the same function. In some sense, it's similar to getindex returning either a collection or a scalar depending on the type of the index; but of course the difference here is that the return type would depend on the number of arguments rather than on their types.

Anyway, I'd be fine with head and tail too.

@StefanKarpinski
Copy link
Member

Might be fine and not confusing at all. Just mentioning it as a consideration.

@bkamins
Copy link
Member Author

bkamins commented Sep 21, 2017

I would stay with first and last.
After rethinking it head and tail would be new in base. People probably already have used such natural names in their code and this could lead to some problems (e.g. method conflict or piracy).

Last thing to decide is what those functions should return. I opt for:

  • first and last return String (as probably most of the string is discarded)
  • chop returns SubString (as probably most of the string is retained)

@TotalVerb
Copy link
Contributor

TotalVerb commented Sep 22, 2017

I would recommend join(Iterators.take(string, n)) to avoid introducing new verbs. last might make some sense, however.

(I guess there's nothing wrong with first either; it would be nice if join(take(...)) would share the same implementation though.)

@bkamins
Copy link
Member Author

bkamins commented Sep 22, 2017

@TotalVerb Agreed with new verbs. That is why I want to go with first and last.
As for performance join/take approach is way slower unfortunately and does not perform a check for end of string. Here is the benchmark:

Here goes the definition of first (I have not pushed it yet - I am waiting for some other PRs to merge):

function first(s::AbstractString, nchar::Integer)
    nchar == 1 && return s[1:1]
    s[1:nextind(s, 1, nchar-1)]
end

and the tests:

julia> x = randstring(1000)

julia> @benchmark join(Iterators.take(x, 1))
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  7
  --------------
  minimum time:     1.493 μs (0.00% GC)
  median time:      1.586 μs (0.00% GC)
  mean time:        1.705 μs (1.41% GC)
  maximum time:     247.402 μs (96.89% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark first(x, 1)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     42.454 ns (0.00% GC)
  median time:      44.320 ns (0.00% GC)
  mean time:        49.177 ns (5.29% GC)
  maximum time:     2.157 μs (95.13% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark join(Iterators.take(x, 2))
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  7
  --------------
  minimum time:     1.633 μs (0.00% GC)
  median time:      1.679 μs (0.00% GC)
  mean time:        1.786 μs (1.37% GC)
  maximum time:     252.394 μs (96.88% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark first(x, 2)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     49.801 ns (0.00% GC)
  median time:      58.257 ns (0.00% GC)
  mean time:        60.764 ns (3.88% GC)
  maximum time:     2.035 μs (93.74% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark join(Iterators.take(x, 100))
BenchmarkTools.Trial:
  memory estimate:  656 bytes
  allocs estimate:  12
  --------------
  minimum time:     8.864 μs (0.00% GC)
  median time:      11.663 μs (0.00% GC)
  mean time:        11.916 μs (0.00% GC)
  maximum time:     76.045 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark first(x, 100)
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     246.046 ns (0.00% GC)
  median time:      247.114 ns (0.00% GC)
  mean time:        262.899 ns (0.90% GC)
  maximum time:     1.539 μs (72.80% GC)
  --------------
  samples:          10000
  evals/sample:     438

@StefanKarpinski
Copy link
Member

Reading it now I think that first(s, 10) and last(s, 10) do read very nicely. We should probably have corresponding functionality for all collections that can support it.

@bkamins
Copy link
Member Author

bkamins commented Sep 22, 2017

Strings will need a separate implementation anyway (because of nextind/prevind). So in this thread I would leave it as is (unless it is really uncomplicated otherwise I would make a separate PR for this).

For general collections:

  • are there any collections except strings for which collection[start(collection):start(collection)+9] and collection[end-9:end] would not work?
  • is there a single abstract type for which it could be defined?

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Sep 22, 2017

Strings will need a separate implementation anyway

Sure, but the API can be generalized. Was not suggesting it should be done here.

@StefanKarpinski
Copy link
Member

are there any collections except strings for which collection[start(collection):start(collection)+9] and collection[end-9:end] would not work?

Indexing and iteration are not the same at all so that would absolutely not work for general collections. The state object (which need not be a valid index) can be passed to next but not necessarily to getindex.

@bkamins
Copy link
Member Author

bkamins commented Sep 23, 2017

Related to discussion in #23805 about thisind. This is what I would propose:

  1. Introduce a new name for this functionality and not overload prevind/nextind with nchar=0 with it as I believe it would be confusing (I have explained why in Improved nextind and prevind #23805)
  2. Decide if thisind is a proper name; for me a natural name is fixind as this is what we want to do (but thisind would be OK for me also if there is a general preference for this)
  3. The signature for fixind(str::AbstractString, i::Integer, forward::Bool=false), the additional argument forward would say if we want to go forward if we are at invalid byte index (if true) or backward (if false), by default we go backward
  4. The behavior would be:
  • if forward==true
    • throw ArgumentError if i<1 || i>endof(str)
    • otherwise return first valid index j such that j>=i
  • if forward==false
    • throw ArgumentError if i<1 || i>=next(str, endof(str))[2]
    • otherwise return first valid index j such that j<=i

@bkamins
Copy link
Member Author

bkamins commented Sep 26, 2017

Prompted by #23880 (and earlier fixes I have made) a pattern that is commonly used is s[i:j-1] and now it raises error (in the past it was acceptable and was not ambiguous).
Now it can be written as s[i, prevind(s,j)] or chop(s[i:j]) but the former is clumsy and the later is a bit inefficient.

If there would be a decision that handling of a pattern select characters from i to j, excluding j is important enough getindex and SubString could be extended with an additional parameter chop::Bool=false which would make the range exclusive if it is set to true.

@nalimilan
Copy link
Member

getindex generally doesn't take arguments other than the indices, so adding chop would be unusual.

IMHO these things would be better handled via a StringIndex type (#9297). At the time, @StefanKarpinski was afraid it would be too verbose, but the current codebase is more verbose than before since i-1 now uses prevind anyway. Maybe we could revisit that approach. It would be interesting to see what the nalimilan/stringindex branch would look like on current master and with more convenience features (like accepting 1 instead of start(s)).

@bkamins
Copy link
Member Author

bkamins commented Oct 2, 2017

#23960 implements first and last from this proposal.

@bkamins
Copy link
Member Author

bkamins commented Oct 13, 2017

Improved chop is implemented in #24126.

@bkamins
Copy link
Member Author

bkamins commented Oct 21, 2017

There is only one open point left in this issue - should thisind (or fixind) be reintroduced?
And if yes - what should be the functionality: is forward argument needed or only going to the beginning of the current character is enough.

@nalimilan I would discuss the issue of StringIndex in #9297 and close this issue when thisind point is decided.

@bkamins
Copy link
Member Author

bkamins commented Oct 30, 2017

In #24414 I propose how the implementation of thisind could look like (and I have decided to stick with the original name). If that PR gets accepted or it is decided that thisind is not needed this issue can be closed.

I think that we should encourage users to use only valid indices, but thisind can come handy e.g. when UTF-8 strings are read in chunks as it was indicated in discussion in #23805.

@bkamins
Copy link
Member Author

bkamins commented Nov 30, 2017

All done in this issue. Close?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

No branches or pull requests

5 participants