Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved nextind and prevind #23805

Merged
merged 4 commits into from
Oct 1, 2017
Merged

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Sep 21, 2017

First part of implementation of #23765. Extends prevind and nextind with nchar parameter.
Additionally ensures prevind and nextind have consistent return value across different types of strings. In particular:

  • prevind always returns a result between start(str)-1 and endof(str)
  • nextind always returns a result between start(str) and endof(str)+1

@bkamins
Copy link
Member Author

bkamins commented Sep 21, 2017

Does anyone have an advice how to test DirectIndexString without loading LegacyStrings or how to allow use of LegacyStrings in tests?

@bkamins
Copy link
Member Author

bkamins commented Sep 21, 2017

I have added a custom DirectIndexString subtype to Base.Test similar to GenericString.

@bkamins
Copy link
Member Author

bkamins commented Sep 21, 2017

The other change is that maximum from nextind is not endof(str)+1 but next(str, endof(str))[2] as it is more consistent (it gives a next character index that would be returned if the string were longer).

@@ -236,14 +236,33 @@ end

## Generic indexing functions ##

prevind(s::DirectIndexString, i::Integer) = Int(i)-1
nextind(s::DirectIndexString, i::Integer) = Int(i)+1
function prevind(s::DirectIndexString, i::Integer, nchar::Integer=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition of DirectIndexString is that each codepoint takes one byte. So you can just do Int(i-nchar-1).

Get the previous valid string index before `i`.
Returns a value less than `1` at the beginning of the string.
Get the `nchar`-th valid string index before `i`.
Returns a `start(str)-1` at the beginning of the string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start(str) always returns 1, so no need to change that.

Returns a value less than `1` at the beginning of the string.
Get the `nchar`-th valid string index before `i`.
Returns a `start(str)-1` at the beginning of the string.
If `i>endof(str)` then `endof(s)` is considered a first valid string index.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"a first" -> "the first"

if i > e
return e
function prevind(str::AbstractString, i::Integer, nchar::Integer=1)
nchar > 0 || error("nchar must be greater than 0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ArgumentError.

function prevind(str::AbstractString, i::Integer, nchar::Integer=1)
nchar > 0 || error("nchar must be greater than 0")
j = Int(i)
j <= start(str) && return start(str)-1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark about start.

return j
end

while nchar > 0 && j >= start(str)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be a way to avoid checking j in both loops?

@nalimilan
Copy link
Member

I have added a custom DirectIndexString subtype to Base.Test similar to GenericString.

I wonder why we still have DirectIndexString since it isn't used at all. We should probably move it to LegacyStrings.

@ararslan ararslan added the domain:strings "Strings!" label Sep 21, 2017
@bkamins
Copy link
Member Author

bkamins commented Sep 21, 2017

@nalimilan Thank you for the comments. Considering them led me to redesign the implementation. In particular I left old nextind and prevind untouched. And only added additional methods with nchar argument.
This way the old functions behave as they did before (which is less consistent, but will not break anything).

For instance for DirectIndexString the implementation Int(i)-1 etc. is inconsistent with other implementations when index is out of string range, but for now I recommend to leave it as is.

Also, as the PR introduced some unexpected errors on CI I will let you know when I am confident it is finished and ready for review.

@nalimilan
Copy link
Member

I wouldn't use two different functions unless there's a clear performance gain for the one-argument case. Having different methods means they are less tested, and as you said they can be inconsistent. Also don't worry too much about DirectIndexString, which should probably be removed anyway.

More generally, if the handling of out of range indices is a problem for the complexity or performance of the methods, we could choose a different rule (and/or not document it). It probably doesn't make a difference in practice whether we return endof(s)+1 or next(s, endof(s))[2] as long as the value is out of bounds.: anyway any attempt to use the index will give a BoundsError.

@bkamins
Copy link
Member Author

bkamins commented Sep 21, 2017

It seems no it will go through CI so it should be good for a review.
Regarding your comment:

  • there is a clear performance gain (see benchmarks below) - we do not have to test for nchar and there is no need to set up a loop;
  • I also thought to merge them initially as you saw, but then strange errors in CI started to pop out;

Performance benchmark:

julia> function test(s, i)
           print("prevind($s, $i):\t")
           @btime prevind($s, $i)
           print("prevind($s, $i, 1):\t")
           @btime prevind($s, $i, 1)
           print("nextind($s, $i):\t")
           @btime nextind($s, $i)
           print("nextind($s, $i, 1):\t")
           @btime nextind($s, $i, 1)
           nothing
       end
test (generic function with 1 method)

julia> test("test", 0)
prevind(test, 0):         4.665 ns (0 allocations: 0 bytes)
prevind(test, 0, 1):      8.864 ns (0 allocations: 0 bytes)
nextind(test, 0):         4.198 ns (0 allocations: 0 bytes)
nextind(test, 0, 1):      7.931 ns (0 allocations: 0 bytes)

julia> test("test", 1)
prevind(test, 1):         4.665 ns (0 allocations: 0 bytes)
prevind(test, 1, 1):      8.864 ns (0 allocations: 0 bytes)
nextind(test, 1):         4.665 ns (0 allocations: 0 bytes)
nextind(test, 1, 1):      8.864 ns (0 allocations: 0 bytes)

julia> test("test", 10)
prevind(test, 10):        5.598 ns (0 allocations: 0 bytes)
prevind(test, 10, 1):     10.730 ns (0 allocations: 0 bytes)
nextind(test, 10):        4.198 ns (0 allocations: 0 bytes)
nextind(test, 10, 1):     7.931 ns (0 allocations: 0 bytes)

julia> test("∀∃∀", 0)
prevind(∀∃∀, 0):          4.665 ns (0 allocations: 0 bytes)
prevind(∀∃∀, 0, 1):       8.864 ns (0 allocations: 0 bytes)
nextind(∀∃∀, 0):          4.198 ns (0 allocations: 0 bytes)
nextind(∀∃∀, 0, 1):       7.931 ns (0 allocations: 0 bytes)

julia> test("∀∃∀", 1)
prevind(∀∃∀, 1):          4.665 ns (0 allocations: 0 bytes)
prevind(∀∃∀, 1, 1):       8.864 ns (0 allocations: 0 bytes)
nextind(∀∃∀, 1):          6.997 ns (0 allocations: 0 bytes)
nextind(∀∃∀, 1, 1):       10.263 ns (0 allocations: 0 bytes)

julia> test("∀∃∀", 12)
prevind(∀∃∀, 12):         6.998 ns (0 allocations: 0 bytes)
prevind(∀∃∀, 12, 1):      13.062 ns (0 allocations: 0 bytes)
nextind(∀∃∀, 12):         4.198 ns (0 allocations: 0 bytes)
nextind(∀∃∀, 12, 1):      7.931 ns (0 allocations: 0 bytes)

Additionally I have checked that for nchar>1 preformance wise it is better to use the proposed implementations than calling single step nextind/prevind in a loop nchar times.

@nalimilan
Copy link
Member

Thanks for running the benchmarks. Indeed that's a significant difference. I wouldn't have expected it to be so large ("setting up a loop" should be mostly equivalent to checking nchar > 0 twice, but maybe the compiler doesn't optimize for the nchar=1 case).

The CI errors you got still worry me a bit. Even if we keep the one-argument methods for performance, the two-argument ones should give the same result. So I think the CI should pass without the one-argument methods.

NEWS.md Outdated
@@ -224,6 +224,9 @@ This section lists changes that do not have deprecation warnings.
Library improvements
--------------------

* THe functions `nextind` and `prevind` now accept `nchar` argument that indicates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

"""
prevind(str::AbstractString, i::Integer)
prevind(str::AbstractString, i::Integer, nchar::Integer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use a single line with nchar=1. Then below you can simply say that the function goes nchar characters, implementation details do not matter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@test_throws ArgumentError nextind(str, 20, 0)
end

let str = GenericString("∀α>β:α+1>β")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this block be merged with the previous one using a for loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately no - nextind and prevind return something different in corner cases for String and AbstractString. Both are correct wrt the contract given in the docstring but give different results.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe have a common part and a separate part, so that it's easier to spot the differences?

j -= 1
end
end
j <= 0 && return j
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check wasn't present in the original code. Would there be a way to avoid it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check is needed to make sure that nextind(s,i,1)==nextind(s,i) contract always holds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but I wondered whether there would be a way to avoid it by adapting the code logic. Maybe not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see one (and this is cheap).

@@ -104,6 +104,27 @@ function prevind(s::String, i::Integer)
@inbounds while j > 0 && is_valid_continuation(codeunit(s,j))
j -= 1
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change. (Below you also have an empty line which isn't there in other functions.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@bkamins
Copy link
Member Author

bkamins commented Sep 22, 2017

@nalimilan Now CI would pass without one argument methods as I have made the contracts nextind(s,i,1)==nextind(s,i) and prevind(s,i,1)==prevind(s,i) to always be met.

The problem is that with i outside or at the boundary of string old prevind and nextind behave inconsistently between String, AbstractString and DirectIndexString and also depending on the value of i (if it is a boundary or outside the string range). In the earlier implementation I have made this consistent and this produced the CI errors (as it seems that some internal functions in base used the behavior of nextind/prevind that was not specified in docstring contract).

This (the use of undocumented properties of nextind/prevind) probably could be cleaned one day, but I would leave it as a separate task.


Get the previous valid string index before `i`.
Returns a value less than `1` at the beginning of the string.
If `nchar` argument is given the function goes back `nchar` charsacters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the nchar" and "characters".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

j -= 1
end
end
j <= 0 && return j
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but I wondered whether there would be a way to avoid it by adapting the code logic. Maybe not.

@test_throws ArgumentError nextind(str, 20, 0)
end

let str = GenericString("∀α>β:α+1>β")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe have a common part and a separate part, so that it's easier to spot the differences?

@@ -269,11 +285,32 @@ function prevind(s::AbstractString, i::Integer)
return 0 # out of range
end

function prevind(s::AbstractString, i::Integer, nchar::Integer)
nchar > 0 || throw(ArgumentError("nchar must be greater than 0"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not accept 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling of 0:

  • requires separate logic (additional if)
  • is unclear what should be the result if i is not a proper byte index (if one calls prevind one expects to get a proper byte index in return - this is an important invariant of those functions in my opinion)

for j = j+1:e
isvalid(s,j) && break
end
isvalid(s,j) || next(s,e)[2] # out of range
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why you need to call isvalid again here, since it has been called in the last iteration already.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Line 361 was not needed at all as if j==e after the loop we are sure that s[e] is valid (as this is the contract of endof)

@bkamins
Copy link
Member Author

bkamins commented Sep 22, 2017

In the tests I have collected the common part.

@@ -355,10 +355,14 @@ function nextind(s::AbstractString, i::Integer, nchar::Integer)
else
j > e && return j+1
j == e && return next(s,e)[2]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And isn't this one redundant, since the loop will be a no-op anyway?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you've removed the call to next below I guess this one needs to stay.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly

@@ -255,7 +255,7 @@ end

Get the previous valid string index before `i`.
Returns a value less than `1` at the beginning of the string.
If `nchar` argument is given the function goes back `nchar` charsacters.
If `nchar` argument is given the function goes back the `nchar` characters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my suggestion about "the" was about "If the nchar...", not about the second occurrence (but I'm not a native speaker).

@bkamins
Copy link
Member Author

bkamins commented Sep 22, 2017

The the location is corrected in the comment.

j -= 1
end
end
j < 1 && return j
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the one-argument function return 0 in that case?

BTW, the CI failures are due to whitespace in tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets triggered only if nchar>1 so it did not apply (for nchar=1 everything was OK) and previously I have made it consistent with String implementation. But you are right that it is more consistent to return 0. Fixed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that means this line isn't tested? It should definitely be, especially since that code is quite tricky (are there other cases like this?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I have changed the implementation to satisfy a contract that iterating nextind/prevind k times should give the same as nextind/prevind with nchar=k. It is also fully tested.

@StefanKarpinski
Copy link
Sponsor Member

That seems like a sensible contract to me... what else would it mean?

@StefanKarpinski
Copy link
Sponsor Member

To me the interesting corner cases are when you start out on a byte that's in the middle of a character – IIRC, there used to be a thisind function to get you to the start of the current character in that situation. We could now use prevind(s, i, 0) to get to the start of the current character – the character whose data the index points into – and nextind(s, i, 0) to get to the start of the next character – if you're already at the start of a character, return i, but if you're in the middle of a character, advance through the trailing bytes.

@bkamins
Copy link
Member Author

bkamins commented Sep 22, 2017

@StefanKarpinski but then:

  • if you are in the middle of the character:
    • prevind(s,i)==prevind(s,i,1)==prevind(s,i,0)
    • nextind(s,i)==nextind(s,i,1)==nextind(s,i,0)
  • if you are at the start of the character:
    • prevind(s,i)==prevind(s,i,1)!=prevind(s,i,0)
    • nextind(s,i)==nextind(s,i,1)!=nextind(s,i,0)

I would prefer a separate function for this. Why thisind was removed?

@bkamins
Copy link
Member Author

bkamins commented Sep 23, 2017

I have rebased the PR to fix merge conflicts.

@nalimilan
Copy link
Member

That seems like a sensible contract to me... what else would it mean?

We could always return 0 when the index would be out of bounds on the left, and endof(s)+1 or next(s, endof(s))[2] when it would be out of bounds on the right. Returning endof(s)+nchar is quite a weird convention as it does as if characters always used a single byte. Anyway users which would rely on these kinds of details probably don't do things correctly. I'd say the best convention is the one which is the easiest to implement efficiently (i.e. no additional checks for corner cases like this).

Regarding thisind, I'd rather avoid supporting officially indices which are in the middle of a codepoint. There's no way to obtain such an index, except by doing things the wrong way (like i-1). Such indices already throw an error when passed to getindex, and (hopefully soon) SubString will do the same (#22511). So maybe we could allow prevind(s, i, 0) at some point, but I'd rather not allow it unless we have a strong use case.

@nalimilan
Copy link
Member

CI failures seem legitimate.

@bkamins
Copy link
Member Author

bkamins commented Sep 23, 2017

I would focus this PR on non-breaking changes (as it is now).
When we have it merged (and also #22511 and #23765 finished) I can handle two breaking changes (as this will involve fixing other code - as we already know and I want to keep PR as small as possible to avoid merge conflicts):

  • make prevind/nextind always return 0 or next(s,endof(s))[2] when out of string range;
  • move DirectIndexString to LegacyStrings (and fix functionality insonsistency)

Regarding thisind I am indifferent. It would have its uses, but I agree with @nalimilan that we should not encourage invalid indexes in general (and we have isvalid already for testing validity). Also if it were to be reintroduced we would have to decide what to do when (I do not know what previous implementation did in such cases):

  • index is less than 1;
  • index is greater or equal than nextind(s, endof(s))[2];

shoud thisind throw an error then?

(CI should go through now)


# prevind and nextind tests

let
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the repeated change requests, but this file has just been ported to using testsets. Could you change the new tests to use @testset (and remove the comment above)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thank you for the commitment. Fixing is not a problem, but resolving conflicts when rebasing is a pain 😃.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've experienced this recently with my own PRs. There's too much activity in Julia at the moment -- which is actually a good sign!

@StefanKarpinski
Copy link
Sponsor Member

Use case: I have a very large file and I want to do parallel line-oriented processing of it. The. A natural approach is to pick evenly spread out indices and then synchronize to the nearest character and the search for the preceding new line and start processing there.

@bkamins
Copy link
Member Author

bkamins commented Sep 23, 2017

Let me propose to move the discussion on thisind to #23765 (I hope this PR can be closed with the functionality already implemented).

@bkamins
Copy link
Member Author

bkamins commented Sep 29, 2017

Any comments before it could be merged? (I would want to move forward with #23765 using this PR)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants