Improvement of the functions for handling string indexing #23765

bkamins · 2017-09-19T08:59:19Z

Assuming that #22572 and #22511 go through tests and get accepted I have the following proposal.

My experience from making #22572 and #22511 go though CI show that I would expect many users to make errors in indexing strings using something like somestring[2:end-1]. The problem is that such a construct will pass typical tests as it is valid for ASCII characters but invalid in general.

Here are my thoughts how this risk could be minimized:

improve the presentation of the problem in https://docs.julialang.org/en/latest/manual/strings/#man-strings-1 (especially the part starting with You can perform arithmetic and other operations with end, just like a normal value: that actually suggests the wrong way to index strings - in my opinion - although it is valid); in particular describe nextind and prevind functions and the use of SubString type;
add additional argument count::Int to nextind and prevind that defaults to 1 and indicates the number of characters to move in nextind/prevind (such cases happen in the code);
add additional argument count::Int to chop that defaults to 1 giving number of characters to remove at the end of the string;
add a function lchop(s::AbstractString, count:Int) = SubString(s, nextind(s, 1, count), endof(s)) (I am not sure about a name) that is identical to chop but removes characters from the beginning of the string.

This list would cover the most common cases I have encountered. I do not make a PR here because #22572 and #22511 have to be merged first and maybe there would be already some comments regarding the proposal.

The text was updated successfully, but these errors were encountered:

nalimilan · 2017-09-19T12:58:46Z

Makes sense to me, feel free to file separate PRs (unless somebody disagrees). However I'm less convinced about lchop since it wouldn't be symmetric with chop. Deprecating chop in favor of rchop could make sense, but no language seems to do that except OCaml's ExtLib. Relevant Rosetta page: https://www.rosettacode.org/wiki/Substring/Top_and_tail (with a broken example for Julia, FWIW :-).

See also the radical approach at #9297 (which is unlikely to be implemented in 1.0, and maybe ever).

bkamins · 2017-09-19T13:22:31Z

I have not been aware of #9297 and have written something similar independently for testing purposes :).
I believe we need lchop but I was not sure about the name either. Another solutions I was thinking about were:

adding third parameter tail::Bool=true;
stating that when count<0 we chop -count characters from the head;
make a definition chop(s::AbstractString, tail::Int=1, head::Int=0) allowing to specify number of characters to chop from the tail and from the head at the same time.

@KristofferC

@KristofferC We will have to live with such hacks until JuliaLang#23765 is implemented.

@KristofferC

@KristofferC We will have to live with such hacks until JuliaLang#23765 is implemented.

bkamins · 2017-09-21T07:18:26Z

Having reviewed the code base for common cases of what portions of strings are extracted the following patterns appear (along with what I propose to implement):

get first nchar characters from a string:

extend first function for AbstractString with additional argument nchar with default value equal to 1;
get last nchar characters from a string:

extend last function for AbstractString with additional argument nchar with default value equal to 1;
chop last tail characters from a string or chop first head characters from a string:

extend chop function to the following signature chop(s::AbstractString, head::Integer=0, tail::Integer=1), where head and tail give number of characters to chop from head and tail;
get byte index of a character nchar characters to the left from a given byte index:

extend prevind function with nchar argument with default value equal to 1;
get byte index of a character nchar characters to the right from a given byte index:

extend nextind function with nchar argument with default value equal to 1;

If this would be acceptable I can push an appropriate PR.

nalimilan · 2017-09-21T08:43:00Z

Makes sense to me. first(x, n) would even be useful for any collection (DataArrays used to provide a head function for that inspired from R).

bkamins · 2017-09-21T08:52:12Z

Now I see that there is a problem with the reuse of first and last as their return type is eltype(x). So for strings first("abc") is 'a' not "a". Maybe introduce head(x,n) and tail(x,n)?
However, I am not sure if they are not used in some standard package.

nalimilan · 2017-09-21T08:58:52Z

Good catch, but maybe that's not an issue: the one-argument method has to return an eltype(x) object, but the two-argument method has to return a collection since the number of requested elements could be one or several.

bkamins · 2017-09-21T09:16:45Z

OK

bkamins · 2017-09-21T11:11:36Z

I have decided to partition this issue into several PRs to make it more manageable #23805 is the first (fundamental low level) part.

StefanKarpinski · 2017-09-21T14:15:35Z

head(x, n) and tail(x, n) may be better names for first(x, n) and last(x, n) since otherwise there's possible confusion that first(x) and first(x, 1) don't return the same thing.

nalimilan · 2017-09-21T16:49:18Z

@StefanKarpinski Is the confusion really problematic though? I'm not sure what's our policy regarding the difference between one- and two-argument methods of the same function. In some sense, it's similar to getindex returning either a collection or a scalar depending on the type of the index; but of course the difference here is that the return type would depend on the number of arguments rather than on their types.

Anyway, I'd be fine with head and tail too.

StefanKarpinski · 2017-09-21T19:04:32Z

Might be fine and not confusing at all. Just mentioning it as a consideration.

bkamins · 2017-09-21T21:22:23Z

I would stay with first and last.
After rethinking it head and tail would be new in base. People probably already have used such natural names in their code and this could lead to some problems (e.g. method conflict or piracy).

Last thing to decide is what those functions should return. I opt for:

first and last return String (as probably most of the string is discarded)
chop returns SubString (as probably most of the string is retained)

TotalVerb · 2017-09-22T01:10:16Z

I would recommend join(Iterators.take(string, n)) to avoid introducing new verbs. last might make some sense, however.

(I guess there's nothing wrong with first either; it would be nice if join(take(...)) would share the same implementation though.)

bkamins · 2017-09-22T05:53:37Z

@TotalVerb Agreed with new verbs. That is why I want to go with first and last.
As for performance join/take approach is way slower unfortunately and does not perform a check for end of string. Here is the benchmark:

Here goes the definition of first (I have not pushed it yet - I am waiting for some other PRs to merge):

function first(s::AbstractString, nchar::Integer)
    nchar == 1 && return s[1:1]
    s[1:nextind(s, 1, nchar-1)]
end

and the tests:

julia> x = randstring(1000)

julia> @benchmark join(Iterators.take(x, 1))
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  7
  --------------
  minimum time:     1.493 μs (0.00% GC)
  median time:      1.586 μs (0.00% GC)
  mean time:        1.705 μs (1.41% GC)
  maximum time:     247.402 μs (96.89% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark first(x, 1)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     42.454 ns (0.00% GC)
  median time:      44.320 ns (0.00% GC)
  mean time:        49.177 ns (5.29% GC)
  maximum time:     2.157 μs (95.13% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark join(Iterators.take(x, 2))
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  7
  --------------
  minimum time:     1.633 μs (0.00% GC)
  median time:      1.679 μs (0.00% GC)
  mean time:        1.786 μs (1.37% GC)
  maximum time:     252.394 μs (96.88% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark first(x, 2)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     49.801 ns (0.00% GC)
  median time:      58.257 ns (0.00% GC)
  mean time:        60.764 ns (3.88% GC)
  maximum time:     2.035 μs (93.74% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark join(Iterators.take(x, 100))
BenchmarkTools.Trial:
  memory estimate:  656 bytes
  allocs estimate:  12
  --------------
  minimum time:     8.864 μs (0.00% GC)
  median time:      11.663 μs (0.00% GC)
  mean time:        11.916 μs (0.00% GC)
  maximum time:     76.045 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark first(x, 100)
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     246.046 ns (0.00% GC)
  median time:      247.114 ns (0.00% GC)
  mean time:        262.899 ns (0.90% GC)
  maximum time:     1.539 μs (72.80% GC)
  --------------
  samples:          10000
  evals/sample:     438

StefanKarpinski · 2017-09-22T14:52:34Z

Reading it now I think that first(s, 10) and last(s, 10) do read very nicely. We should probably have corresponding functionality for all collections that can support it.

bkamins · 2017-09-22T15:00:18Z

Strings will need a separate implementation anyway (because of nextind/prevind). So in this thread I would leave it as is (unless it is really uncomplicated otherwise I would make a separate PR for this).

For general collections:

are there any collections except strings for which collection[start(collection):start(collection)+9] and collection[end-9:end] would not work?
is there a single abstract type for which it could be defined?

StefanKarpinski · 2017-09-22T15:15:25Z

Strings will need a separate implementation anyway

Sure, but the API can be generalized. Was not suggesting it should be done here.

StefanKarpinski · 2017-09-22T15:17:17Z

are there any collections except strings for which collection[start(collection):start(collection)+9] and collection[end-9:end] would not work?

Indexing and iteration are not the same at all so that would absolutely not work for general collections. The state object (which need not be a valid index) can be passed to next but not necessarily to getindex.

bkamins · 2017-09-23T21:05:53Z

Related to discussion in #23805 about thisind. This is what I would propose:

Introduce a new name for this functionality and not overload prevind/nextind with nchar=0 with it as I believe it would be confusing (I have explained why in Improved nextind and prevind #23805)
Decide if thisind is a proper name; for me a natural name is fixind as this is what we want to do (but thisind would be OK for me also if there is a general preference for this)
The signature for fixind(str::AbstractString, i::Integer, forward::Bool=false), the additional argument forward would say if we want to go forward if we are at invalid byte index (if true) or backward (if false), by default we go backward
The behavior would be:

if forward==true
- throw ArgumentError if i<1 || i>endof(str)
- otherwise return first valid index j such that j>=i
if forward==false
- throw ArgumentError if i<1 || i>=next(str, endof(str))[2]
- otherwise return first valid index j such that j<=i

bkamins · 2017-09-26T21:49:22Z

Prompted by #23880 (and earlier fixes I have made) a pattern that is commonly used is s[i:j-1] and now it raises error (in the past it was acceptable and was not ambiguous).
Now it can be written as s[i, prevind(s,j)] or chop(s[i:j]) but the former is clumsy and the later is a bit inefficient.

If there would be a decision that handling of a pattern select characters from i to j, excluding j is important enough getindex and SubString could be extended with an additional parameter chop::Bool=false which would make the range exclusive if it is set to true.

nalimilan · 2017-09-27T07:45:44Z

getindex generally doesn't take arguments other than the indices, so adding chop would be unusual.

IMHO these things would be better handled via a StringIndex type (#9297). At the time, @StefanKarpinski was afraid it would be too verbose, but the current codebase is more verbose than before since i-1 now uses prevind anyway. Maybe we could revisit that approach. It would be interesting to see what the nalimilan/stringindex branch would look like on current master and with more convenience features (like accepting 1 instead of start(s)).

bkamins · 2017-10-02T08:11:29Z

#23960 implements first and last from this proposal.

bkamins · 2017-10-13T16:14:52Z

Improved chop is implemented in #24126.

bkamins · 2017-10-21T17:37:26Z

There is only one open point left in this issue - should thisind (or fixind) be reintroduced?
And if yes - what should be the functionality: is forward argument needed or only going to the beginning of the current character is enough.

@nalimilan I would discuss the issue of StringIndex in #9297 and close this issue when thisind point is decided.

bkamins · 2017-10-30T23:22:57Z

In #24414 I propose how the implementation of thisind could look like (and I have decided to stick with the original name). If that PR gets accepted or it is decided that thisind is not needed this issue can be closed.

I think that we should encourage users to use only valid indices, but thisind can come handy e.g. when UTF-8 strings are read in chunks as it was indicated in discussion in #23805.

bkamins · 2017-11-30T21:12:12Z

All done in this issue. Close?

bkamins mentioned this issue Sep 19, 2017

Make getindex for String check if indices are valid #22572

Merged

JeffBezanson added the strings "Strings!" label Sep 19, 2017

bkamins added a commit to bkamins/julia that referenced this issue Sep 20, 2017

Fixes JuliaLang#23797

3d63ec6

@KristofferC We will have to live with such hacks until JuliaLang#23765 is implemented.

bkamins added a commit to bkamins/julia that referenced this issue Sep 20, 2017

Fixes JuliaLang#23797

d22c9ce

@KristofferC We will have to live with such hacks until JuliaLang#23765 is implemented.

This was referenced Sep 20, 2017

Correct string indexing in LineEdit #23798

Merged

Clean up string indexing #23799

Merged

bkamins mentioned this issue Sep 21, 2017

Improved nextind and prevind #23805

Merged

bkamins mentioned this issue Sep 21, 2017

Fixes constructor of SubString from SubString #22511

Merged

bkamins mentioned this issue Oct 2, 2017

first and last with nchar #23960

Merged

bkamins mentioned this issue Oct 2, 2017

Adding number of elements to return for first and last #23969

Closed

bkamins mentioned this issue Oct 13, 2017

Add head and tail arguments to chop #24126

Merged

andreasnoack mentioned this issue Oct 25, 2017

Restrict indexing into strings to a special ByteIndex or StringIndex type #9297

Closed

bkamins mentioned this issue Oct 30, 2017

Implements thisind function #24414

Merged

nalimilan closed this as completed Dec 1, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvement of the functions for handling string indexing #23765

Improvement of the functions for handling string indexing #23765

bkamins commented Sep 19, 2017

nalimilan commented Sep 19, 2017

bkamins commented Sep 19, 2017

bkamins commented Sep 21, 2017

nalimilan commented Sep 21, 2017

bkamins commented Sep 21, 2017

nalimilan commented Sep 21, 2017

bkamins commented Sep 21, 2017

bkamins commented Sep 21, 2017

StefanKarpinski commented Sep 21, 2017

nalimilan commented Sep 21, 2017

StefanKarpinski commented Sep 21, 2017

bkamins commented Sep 21, 2017

TotalVerb commented Sep 22, 2017 •

edited

Loading

bkamins commented Sep 22, 2017

StefanKarpinski commented Sep 22, 2017

bkamins commented Sep 22, 2017

StefanKarpinski commented Sep 22, 2017 •

edited

Loading

StefanKarpinski commented Sep 22, 2017

bkamins commented Sep 23, 2017

bkamins commented Sep 26, 2017

nalimilan commented Sep 27, 2017

bkamins commented Oct 2, 2017

bkamins commented Oct 13, 2017

bkamins commented Oct 21, 2017

bkamins commented Oct 30, 2017

bkamins commented Nov 30, 2017

Improvement of the functions for handling string indexing #23765

Improvement of the functions for handling string indexing #23765

Comments

bkamins commented Sep 19, 2017

nalimilan commented Sep 19, 2017

bkamins commented Sep 19, 2017

bkamins commented Sep 21, 2017

nalimilan commented Sep 21, 2017

bkamins commented Sep 21, 2017

nalimilan commented Sep 21, 2017

bkamins commented Sep 21, 2017

bkamins commented Sep 21, 2017

StefanKarpinski commented Sep 21, 2017

nalimilan commented Sep 21, 2017

StefanKarpinski commented Sep 21, 2017

bkamins commented Sep 21, 2017

TotalVerb commented Sep 22, 2017 • edited Loading

bkamins commented Sep 22, 2017

StefanKarpinski commented Sep 22, 2017

bkamins commented Sep 22, 2017

StefanKarpinski commented Sep 22, 2017 • edited Loading

StefanKarpinski commented Sep 22, 2017

bkamins commented Sep 23, 2017

bkamins commented Sep 26, 2017

nalimilan commented Sep 27, 2017

bkamins commented Oct 2, 2017

bkamins commented Oct 13, 2017

bkamins commented Oct 21, 2017

bkamins commented Oct 30, 2017

bkamins commented Nov 30, 2017

TotalVerb commented Sep 22, 2017 •

edited

Loading

StefanKarpinski commented Sep 22, 2017 •

edited

Loading