Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate AbstractString interface from iteration protocol #26133

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
55 changes: 55 additions & 0 deletions base/iterators.jl
Original file line number Diff line number Diff line change
Expand Up @@ -247,6 +247,60 @@ setindex!(v::Pairs, value, key) = (v.data[key] = value; v)
get(v::Pairs, key, default) = get(v.data, key, default)
get(f::Base.Callable, collection::Pairs, key) = get(f, v.data, key)

"""
Iterators.Next(values::A, idx::eltype(I)=firstindex(values), itr::I=eachindex(values)) where {A,I}

Returns a tuple of a value and the subsequent index. This iterator is useful for the
implementation of variably-length encoded arrays where decoding the element and
obtaining the offset or index of the next element generally involve the same computation.

A default implementation is provided that simply iterates over `eachindex` and uses
`getindex` to obtain the value corresponding to the index. It is allowed (and encouraged)
to overload iteration for a specific `Next{A}` in order to provide a more efficient
implementation that computes both in one step.

The index in the last tuple will generally be equivalent to `lastindex(values)+1`
though users should only rely on the fact that it is `> lastindex(values)` to allow
implemntations the flexibility to choose a different value.

The `idx` argument provides a means by which to resume this iterator from a given index.
The first value returned by the `Next` iterator should correspond to the element at `idx`.
Please note that if you override iteration for `Next{A}` and your iteration state is not
the next index, you will have to additionally overload `Next(data::A, idx, itr::I)` for
four `A`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "for four A" mean here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo


# Examples:

julia> first(Next(['a','b','c']))
('a', 2)

julia> first(Next(['a','b','c'], 3))
('c', 4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a somewhat confusing example. Would a more typical usage example be possible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what confuses you about it? The purpose of the second example was to show that the last tuple will have an out-of-bounds index. Would adding first(Next(['a','b','c'], 2)) before this one be better?

"""
struct Next{A, I}
data::A
itr::I
Next{A}(data::A, itr::I) where {A, I} = new{A, I}(data, itr)
end
Next(data, idx, itr) = Rest(Next{typeof(data)}(data, itr), idx)
Next(data, idx) = Next(data, idx, eachindex(data))
Next(data) = Next{typeof(data)}(data, eachindex(data))

start(lip::Next) = start(lip.itr)
done(lip::Next, state) = done(lip.itr, state)
function next(lip::Next, state)
nidx = ns = next(lip.itr, state)
# A bit awkward now, done for consistency with the new iteration protocol
done(lip.itr, ns) && (nidx = lastindex(lip.itr)+1)
(lip.data[ns], nidx), ns
end

length(lip::Next) = length(lip.itr)
eltype(::Type{Next{A, I}}) where {A, I} = Tuple{eltype(A), eltype(I)}

IteratorSize(::Type{<:Next{I}}) where {I} = IteratorSize(I)
IteratorEltype(::Type{<:Next{I}}) where {I} = IteratorEltype(I)

# zip

abstract type AbstractZipIterator end
Expand Down Expand Up @@ -1070,6 +1124,7 @@ end
function fixpoint_iter_type(itrT::Type, valT::Type, stateT::Type)
nextvalstate = Base._return_type(next, Tuple{itrT, stateT})
nextvalstate <: Tuple{Any, Any} || return Any
nextvalstate === Union{} && return Union{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly? Prevents things from erroring in the wrong place if you mess up how to do iteration.

nextvalstate = Tuple{
typejoin(valT, fieldtype(nextvalstate, 1)),
typejoin(stateT, fieldtype(nextvalstate, 2))}
Expand Down
4 changes: 3 additions & 1 deletion base/regex.jl
Original file line number Diff line number Diff line change
Expand Up @@ -228,7 +228,9 @@ ncodeunits(s::SubstitutionString) = ncodeunits(s.string)
codeunit(s::SubstitutionString) = codeunit(s.string)
codeunit(s::SubstitutionString, i::Integer) = codeunit(s.string, i)
isvalid(s::SubstitutionString, i::Integer) = isvalid(s.string, i)
next(s::SubstitutionString, i::Integer) = next(s.string, i)
start(s::StringNext{<:SubstitutionString}) = start(StringNext(s.data.string))
next(s::StringNext{<:SubstitutionString}, state) = next(StringNext(s.data.string), state)
done(s::StringNext{<:SubstitutionString}, state) = done(StringNext(s.data.string), state)

function show(io::IO, s::SubstitutionString)
print(io, "s")
Expand Down
65 changes: 4 additions & 61 deletions base/strings/basic.jl
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,9 @@ AbstractString

## required string functions ##

# N.B. iteration for StringNext{T} is a required part of the iteration protocol
include("strings/iteration.jl")

"""
ncodeunits(s::AbstractString) -> Int

Expand Down Expand Up @@ -121,52 +124,6 @@ Stacktrace:
@propagate_inbounds isvalid(s::AbstractString, i::Integer) = typeof(i) === Int ?
throw(MethodError(isvalid, (s, i))) : isvalid(s, Int(i))

"""
next(s::AbstractString, i::Integer) -> Tuple{Char, Int}

Return a tuple of the character in `s` at index `i` with the index of the start
of the following character in `s`. This is the key method that allows strings to
be iterated, yielding a sequences of characters. If `i` is out of bounds in `s`
then a bounds error is raised. The `next` function, as part of the iteration
protocoal may assume that `i` is the start of a character in `s`.

See also: [`getindex`](@ref), [`start`](@ref), [`done`](@ref),
[`checkbounds`](@ref)
"""
@propagate_inbounds next(s::AbstractString, i::Integer) = typeof(i) === Int ?
throw(MethodError(next, (s, i))) : next(s, Int(i))

## basic generic definitions ##

start(s::AbstractString) = 1
done(s::AbstractString, i::Integer) = i > ncodeunits(s)
eltype(::Type{<:AbstractString}) = Char
sizeof(s::AbstractString) = ncodeunits(s) * sizeof(codeunit(s))
firstindex(s::AbstractString) = 1
lastindex(s::AbstractString) = thisind(s, ncodeunits(s))

function getindex(s::AbstractString, i::Integer)
@boundscheck checkbounds(s, i)
@inbounds return isvalid(s, i) ? next(s, i)[1] : string_index_err(s, i)
end

getindex(s::AbstractString, i::Colon) = s
# TODO: handle other ranges with stride ±1 specially?
# TODO: add more @propagate_inbounds annotations?
getindex(s::AbstractString, v::AbstractVector{<:Integer}) =
sprint(io->(for i in v; write(io, s[i]) end), sizehint=length(v))
getindex(s::AbstractString, v::AbstractVector{Bool}) =
throw(ArgumentError("logical indexing not supported for strings"))

function get(s::AbstractString, i::Integer, default)
# TODO: use ternary once @inbounds is expression-like
if checkbounds(Bool, s, i)
@inbounds return s[i]
else
return default
end
end

## bounds checking ##

checkbounds(::Type{Bool}, s::AbstractString, i::Integer) =
Expand Down Expand Up @@ -379,6 +336,7 @@ julia> thisind("αβγdef", 10)

julia> thisind("αβγdef", 20)
20
```
"""
thisind(s::AbstractString, i::Integer) = thisind(s, Int(i))

Expand Down Expand Up @@ -470,21 +428,6 @@ function nextind(s::AbstractString, i::Int, n::Int)
return i + n
end

## string index iteration type ##

struct EachStringIndex{T<:AbstractString}
s::T
end
keys(s::AbstractString) = EachStringIndex(s)

length(e::EachStringIndex) = length(e.s)
first(::EachStringIndex) = 1
last(e::EachStringIndex) = lastindex(e.s)
start(e::EachStringIndex) = start(e.s)
next(e::EachStringIndex, state) = (state, nextind(e.s, state))
done(e::EachStringIndex, state) = done(e.s, state)
eltype(::Type{<:EachStringIndex}) = Int

"""
isascii(c::Union{Char,AbstractString}) -> Bool

Expand Down
120 changes: 120 additions & 0 deletions base/strings/iteration.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
# A specialized iterator for EachIndex of strings
struct EachStringIndex{T<:AbstractString}
s::T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat tangential, but while we're moving this type, let's spell the field out as .string instead of .s.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have nextindex spelled out instead of nextind too? 🚲🏚

end
keys(s::AbstractString) = EachStringIndex(s)

length(e::EachStringIndex) = length(e.s)
first(::EachStringIndex) = 1
last(e::EachStringIndex) = lastindex(e.s)
eltype(::Type{<:EachStringIndex}) = Int

# Iteration over StringNext
#
# Any new subtype of AbstractString, should override
#
# next(::StringNext{MyString}, state)
#
# to provide iteration over the string and its indices. All other iteration methods,
# including iteration over strings, iteration over pairs, indexing into string,
# iteration over indicies alone are derived from this method.

const StringNext{T<:AbstractString} = Iterators.Next{T, EachStringIndex{T}}
StringNext(x::T) where {T<:AbstractString} = Next(x)
StringNext(x::T, idx) where {T<:AbstractString} = Next(x, idx)
StringNext(x::T, idx, itr) where {T<:AbstractString} = Next(x, idx, itr)

start(sp::StringNext) = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be start(sp::StringNext) = firstindex(sp.data) ?
The doc says that string indexing starts at 1, but there are other places that use firstindex (same for first(::EachStringIndex) = 1 above).

The more places the 1 is not hard coded, the easier it'll be to eventually support AbstractString types with non standard indexing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the goal of this change is to separate the assumption currently baked into the string code that indices and itartion state of strings are the same thing. We want indices to always be 1 through ncodeunits(s) but allow iteration state to be anything at all (even non-integers). So firstindex(s) and start(s) should be totally unrelated: the former is always 1 but the latter can be anything at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start(s) here is just a default (custom string types can override it). It is also going away with the iteration protocol change. I do want to eventually define non-standard indexing for strings, but that's a natural extension of this and string should always support linear indicies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want indices to always be 1 through ncodeunits(s)

I've been trying out various string implementations as a way to understand what the abstract interface contract is. If firstindex(s) == 1 is part of the protocol, so be it.

Is it also to be assumed that isvalid(s, firstindex(s)) == true? If this wasn't required that would allow "padding" of the codeunit space in cases where that is more efficient.

I have a SplicedString implementation where the indexes are in the range 1:ncodeunits(s) (and isvalid(s, 1) == true), but where the index space is only sparsely populated.
It all works nicely for constant time indexing and iteration, but anyone who writes a i = i+1 loop is going to find it take a long time to run. Is this too heretical?

julia> s = LazyJSON.SplicedString("Foo", " ", "Bar")
"Foo Bar"

julia> [keys(s)...]
7-element Array{Int64,1}:
             1
             2
             3
 1099511627777
 2199023255553
 2199023255554
 2199023255555

julia> [(i >> 40, i & (2^40-1)) for i in keys(s)]
7-element Array{Tuple{Int64,Int64},1}:
 (0, 1)
 (0, 2)
 (0, 3)
 (1, 1)
 (2, 1)
 (2, 2)
 (2, 3)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

firstindex(s) == 1

I'd be fine relaxing this constraint, but I don't care too much.

isvalid(s, firstindex(s)) == true

This one I do care about.

Is this too heretical?

I mean, you can do various kinds of encoding here, this one included, but I'd rather go the direction where strings can have non-standard index kinds, but all still need to support linear indexing no matter how slow it is.

function done(s::StringNext, i)
if isa(i, Integer)
return i > ncodeunits(s.data)
else
throw(MethodError(done, (s, i)))
end
end
function next(s::StringNext, i)
if isa(i, Integer) && !isa(i, Int)
return next(s, Int(i))
else
throw(MethodError(next, (s, i)))
end
end

# Derive iteration over pairs from `StringNext`
const StringPairs{T<:AbstractString} = Iterators.Pairs{Int, Char, EachStringIndex{T}, T}
Copy link
Member

@StefanKarpinski StefanKarpinski Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to @Keno this is:

the kind of Pairs iterator you get when you call pairs on a string of type T

There should probably be a comment to that effect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be written something like const StringPairs{T<:AbstractString} = typeof(pairs(T()) (I had this problem creating a const type alias for view(Vector{UInt8}), the type is really convoluted, so I ended up with const nobytes = UInt8[]; const ByteView = typeof(view(nobytes))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think that will work since there is no actual T at the time of definition. The RHS needs to use parametric type syntax to work – despite the appearance, this is not a normal assignment, it's what used to be written as typealias.

StringPairs{T}(x::T) where {T<:AbstractString} = Iterators.Pairs(x, eachindex(x))
StringPairs(x::T) where {T<:AbstractString} = StringPairs{T}(x)

Iterators.pairs(s::AbstractString) = StringPairs(s)

start(e::StringPairs) = (firstindex(e.data), start(StringNext(e.data)))
done(e::StringPairs, (idx, state)) = done(StringNext(e.data), state)
function next(s::StringPairs, (idx, state))
((c, nidx), state) = next(StringNext(s.data), state)
Pair(idx, c), (nidx, state)
end

# Derive reverse pair iteration.
# N.B. String implementers may wish to override
#
# next(s::Iterators.Reverse{<:StringPairs}, idx)
#
# to provide efficient variable-length reverse decoding
Iterators.reverse(s::StringPairs) = Iterators.Reverse(s)

start(e::Iterators.Reverse{<:StringPairs}) = ncodeunits(e.itr.data)+1
done(e::Iterators.Reverse{<:StringPairs}, idx) = idx == firstindex(e.itr.data)
function next(s::Iterators.Reverse{<:StringPairs}, idx)
tidx = thisind(s.itr.data, idx-1)
(c, nidx) = first(Next(s.itr.data, tidx))
Pair(tidx, c), tidx
end

function prev(s::AbstractString, idx)
(i, c), _ = next(Iterators.Reverse(StringPairs(s)), idx)
(c, i)
end


# Derive iteration over strings from `StringNext`
start(s::AbstractString) = start(StringNext(s))
done(s::AbstractString, state) = done(StringNext(s), state)
function next(s::AbstractString, state)
((c, _), state) = next(StringNext(s), state)
(c, state)
end

eltype(::Type{<:AbstractString}) = Char
sizeof(s::AbstractString) = ncodeunits(s) * sizeof(codeunit(s))
firstindex(s::AbstractString) = 1
lastindex(s::AbstractString) = thisind(s, ncodeunits(s))

function getindex(s::AbstractString, i::Integer)
@boundscheck checkbounds(s, i)
@inbounds return isvalid(s, i) ? first(first(Next(s, i))) : string_index_err(s, i)
end

getindex(s::AbstractString, i::Colon) = s
# TODO: handle other ranges with stride ±1 specially?
# TODO: add more @propagate_inbounds annotations?
getindex(s::AbstractString, v::AbstractVector{<:Integer}) =
sprint(io->(for i in v; write(io, s[i]) end), sizehint=length(v))
getindex(s::AbstractString, v::AbstractVector{Bool}) =
throw(ArgumentError("logical indexing not supported for strings"))

function get(s::AbstractString, i::Integer, default)
# TODO: use ternary once @inbounds is expression-like
if checkbounds(Bool, s, i)
@inbounds return s[i]
else
return default
end
end

# Derive iteration over indices from `StringNext`
start(e::EachStringIndex) = start(StringPairs(e.s))
done(e::EachStringIndex, state) = done(StringPairs(e.s), state)
function next(e::EachStringIndex, state)
((idx, _), state) = next(StringPairs(e.s), state)
(idx, state)
end
7 changes: 4 additions & 3 deletions base/strings/string.jl
Original file line number Diff line number Diff line change
Expand Up @@ -167,10 +167,11 @@ is_valid_continuation(c) = c & 0xc0 == 0x80

## required core functionality ##

@propagate_inbounds function next(s::String, i::Int)
@propagate_inbounds function next(sp::StringNext{String}, i::Int)
s = sp.data
b = codeunit(s, i)
u = UInt32(b) << 24
between(b, 0x80, 0xf7) || return reinterpret(Char, u), i+1
between(b, 0x80, 0xf7) || return ((reinterpret(Char, u), i + 1), i+1)
return next_continued(s, i, u)
end

Expand All @@ -193,7 +194,7 @@ function next_continued(s::String, i::Int, u::UInt32)
b & 0xc0 == 0x80 || @goto ret
u |= UInt32(b); i += 1
@label ret
return reinterpret(Char, u), i
return (reinterpret(Char, u), i), i
end

@propagate_inbounds function getindex(s::String, i::Int)
Expand Down
7 changes: 4 additions & 3 deletions base/strings/substring.jl
Original file line number Diff line number Diff line change
Expand Up @@ -62,10 +62,11 @@ function codeunit(s::SubString, i::Integer)
@inbounds return codeunit(s.string, s.offset + i)
end

function next(s::SubString, i::Integer)
function next(sp::StringNext{<:SubString}, i::Int)
s = sp.data
@boundscheck checkbounds(s, i)
@inbounds c, i = next(s.string, s.offset + i)
return c, i - s.offset
@inbounds (c, idx), i = next(StringNext(s.string), s.offset + i)
return (c, idx - s.offset), i - s.offset
end

function getindex(s::SubString, i::Integer)
Expand Down
8 changes: 4 additions & 4 deletions base/strings/util.jl
Original file line number Diff line number Diff line change
Expand Up @@ -136,9 +136,9 @@ julia> lstrip(a)
function lstrip(s::AbstractString, chars::Chars=_default_delims)
e = lastindex(s)
for (i, c) in pairs(s)
!(c in chars) && return SubString(s, i, e)
!(c in chars) && return @inbounds SubString(s, i, e)
end
SubString(s, e+1, e)
@inbounds SubString(s, e+1, e)
end

"""
Expand All @@ -161,9 +161,9 @@ julia> rstrip(a)
"""
function rstrip(s::AbstractString, chars::Chars=_default_delims)
for (i, c) in Iterators.reverse(pairs(s))
c in chars || return SubString(s, 1, i)
c in chars || return @inbounds SubString(s, 1, i)
end
SubString(s, 1, 0)
@inbounds SubString(s, 1, 0)
end

"""
Expand Down
2 changes: 1 addition & 1 deletion base/sysimg.jl
Original file line number Diff line number Diff line change
Expand Up @@ -176,7 +176,7 @@ Array{T}(::Missing, d...) where {T} = fill!(Array{T}(uninitialized, d...), missi
include("abstractdict.jl")

include("iterators.jl")
using .Iterators: zip, enumerate
using .Iterators: zip, enumerate, Next
using .Iterators: Flatten, product # for generators

include("namedtuple.jl")
Expand Down
4 changes: 3 additions & 1 deletion stdlib/Test/src/Test.jl
Original file line number Diff line number Diff line change
Expand Up @@ -1427,7 +1427,9 @@ Base.ncodeunits(s::GenericString) = ncodeunits(s.string)
Base.codeunit(s::GenericString) = codeunit(s.string)
Base.codeunit(s::GenericString, i::Integer) = codeunit(s.string, i)
Base.isvalid(s::GenericString, i::Integer) = isvalid(s.string, i)
Base.next(s::GenericString, i::Integer) = next(s.string, i)
Base.start(s::Base.StringNext{GenericString}) = start(Base.StringNext(s.data.string))
Base.done(s::Base.StringNext{GenericString}, state) = done(Base.StringNext(s.data.string), state)
Base.next(s::Base.StringNext{GenericString}, state) = next(Base.StringNext(s.data.string), state)
Base.reverse(s::GenericString) = GenericString(reverse(s.string))
Base.reverse(s::SubString{GenericString}) =
GenericString(typeof(s.string)(reverse(String(s))))
Expand Down
Loading