[BREAKING] Make DataFrameColumns stop being an AbstractVector #2291

Merged
merged 13 commits into from
Jun 24, 2020
6 changes: 6 additions & 0 deletions docs/src/lib/functions.md
@@ -120,5 +120,11 @@ disallowmissing!
```@docs
eachcol
eachrow
values
pairs
findnext
findprev
findfirst
findlast
findall
```
9 changes: 6 additions & 3 deletions docs/src/lib/types.md
@@ -37,12 +37,15 @@ or when accessing a single row of a `DataFrame` or `SubDataFrame` via `getindex`

The `eachrow` function returns a value of the `DataFrameRows` type, which
serves as an iterator over rows of an `AbstractDataFrame`, returning `DataFrameRow` objects.
The `DataFrameRows` is a subtype of `AbstractVector` and supports its interface
with the exception that it is read only.

Similarly, the `eachcol` function returns a value of the `DataFrameColumns` type, which
serves as an iterator over columns of an `AbstractDataFrame`.
serves as an iterator over columns of an `AbstractDataFrame` that additionally supports
indexing, `getproperty`, `hasproperty`, `keys`, `values`, `pairs`,
`findfirst`, `findnext`, `findlast`, `findprev`, `findall`, `==`, and `isequal` functions.

The `DataFrameRows` and `DataFrameColumns` types are subtypes of `AbstractVector` and support its interface
with the exception that they are read only. Note that they are not exported and should not be constructed directly,
Note that `DataFrameRows` and `DataFrameColumns` are not exported and should not be constructed directly,
but using the `eachrow` and `eachcol` functions.
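
The following is a minimal usage sketch of the interface described above, not part of the PR itself. It assumes a DataFrames.jl version that includes this change; the data frame `df` and its column names are made up for illustration.

```julia
using DataFrames

# Illustrative data; the column names are arbitrary.
df = DataFrame(a = [1, 2], b = ["x", "y"], c = [3, 4])
cols = eachcol(df)

cols isa AbstractVector      # false once DataFrameColumns stops subtyping AbstractVector
length(cols)                 # number of columns
cols[1] === df.a             # integer indexing returns the column without copying
cols.b === df.b              # property access by column name
keys(cols)                   # column names as Symbols
values(cols)                 # the columns, collected into a vector

# Positions of the integer-typed columns.
int_cols = findall(col -> eltype(col) <: Integer, cols)

# Iterate over name => column pairs.
for (name, col) in pairs(cols)
    println(name, " => ", eltype(col))
end
```

Code that previously relied on `AbstractVector` behavior, such as calling `size` on the result of `eachcol`, would need to switch to `length` or collect the columns first.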

The `RepeatedVector` and `StackedVector` types are subtypes of `AbstractVector` and support its interface
91 changes: 75 additions & 16 deletions src/abstractdataframe/iteration.jl
@@ -30,9 +30,9 @@ Base.iterate(::AbstractDataFrame) =
Return a `DataFrameRows` that iterates a data frame row by row,
with each row represented as a `DataFrameRow`.

Because `DataFrameRow`s have an `eltype` of `Any`, use `copy(dfr::DataFrameRow)` to obtain
a named tuple, which supports iteration and property access like a `DataFrameRow`,
but also passes information on the `eltypes` of the columns of `df`.
Because `DataFrameRow`s have an `eltype` of `Any`, use `copy(dfr::DataFrameRow)` to obtain
a named tuple, which supports iteration and property access like a `DataFrameRow`,
but also passes information on the `eltypes` of the columns of `df`.

# Examples
```jldoctest
@@ -107,13 +107,13 @@ Base.propertynames(itr::DataFrameRows, private::Bool=false) = propertynames(pare
# Iteration by columns

"""
    DataFrameColumns{<:AbstractDataFrame} <: AbstractVector{AbstractVector}
    DataFrameColumns{<:AbstractDataFrame}

An `AbstractVector` that allows iteration over columns of an `AbstractDataFrame`.
A generator that allows iteration over columns of an `AbstractDataFrame`.
Indexing into `DataFrameColumns` objects using integer or symbol indices
returns the corresponding column (without copying).
"""
struct DataFrameColumns{T<:AbstractDataFrame} <: AbstractVector{AbstractVector}
struct DataFrameColumns{T<:AbstractDataFrame}
    df::T
end

@@ -125,7 +125,8 @@ Base.summary(io::IO, dfcs::DataFrameColumns) = print(io, summary(dfcs))

Return a `DataFrameColumns` that is an `AbstractVector`
that allows iterating an `AbstractDataFrame` column by column.
Additionally it is allowed to index `DataFrameColumns` using column names.
Additionally it is allowed to index `DataFrameColumns` using column names,
and convenience functions: `keys`, `values`, `pairs` are defined for it.

# Examples
```jldoctest
@@ -159,15 +160,17 @@ julia> sum.(eachcol(df))
"""
eachcol(df::AbstractDataFrame) = DataFrameColumns(df)

Base.size(itr::DataFrameColumns) = (size(parent(itr), 2),)
Base.IndexStyle(::Type{<:DataFrameColumns}) = Base.IndexLinear()

@inline function Base.getindex(itr::DataFrameColumns, j::Int)
    @boundscheck checkbounds(itr, j)
    @inbounds parent(itr)[!, j]
end

Base.getindex(itr::DataFrameColumns, j::Symbol) = parent(itr)[!, j]
Base.length(itr::DataFrameColumns) = size(parent(itr), 2)
Base.eltype(::Type{<:DataFrameColumns}) = AbstractVector
Base.iterate(itr::DataFrameColumns, i=1) =
    i <= length(itr) ? (itr[i], i + 1) : nothing
Base.getindex(itr::DataFrameColumns, idx::ColumnIndex) = parent(itr)[!, idx]
Base.getindex(itr::DataFrameColumns, idx::MultiColumnIndex) =
    eachcol(parent(itr)[!, idx])
Base.:(==)(itr1::DataFrameColumns, itr2::DataFrameColumns) =
    parent(itr1) == parent(itr2)
Base.isequal(itr1::DataFrameColumns, itr2::DataFrameColumns) =
    isequal(parent(itr1), parent(itr2))

# separate methods are needed due to dispatch ambiguity
Base.getproperty(itr::DataFrameColumns, col_ind::Symbol) =
@@ -190,6 +193,13 @@ Get a vector of column names of `dfc` as `Symbol`s.
"""
Base.keys(itr::DataFrameColumns) = propertynames(itr)

"""
    values(dfc::DataFrameColumns)

Get a vector of columns of `dfc`.
"""
Base.values(itr::DataFrameColumns) = collect(itr)

"""
    pairs(dfc::DataFrameColumns)

@@ -199,6 +209,55 @@ where `name` is the column name of the column `col`.
"""
Base.pairs(itr::DataFrameColumns) = Base.Iterators.Pairs(itr, keys(itr))

"""
    findnext(f::Function, itr::DataFrameColumns, i::Integer)

Find the next integer index after or including an integer `i` of an
element of `itr` for which `f` returns `true`, or `nothing` if not found.

"""
Base.findnext(f::Function, itr::DataFrameColumns, i::Integer) =
    findnext(f, values(itr), i)

"""
    findprev(f::Function, itr::DataFrameColumns, i::Integer)

Find the previous integer index before or including an integer `i` of an
element of `itr` for which `f` returns `true`, or `nothing` if not found.

"""
Base.findprev(f::Function, itr::DataFrameColumns, i::Integer) =
    findprev(f, values(itr), i)

"""
    findfirst(f::Function, itr::DataFrameColumns)

Return the integer index of the first element of `itr` for which `f` returns
`true`. Return `nothing` if there is no such element.

"""
Base.findfirst(f::Function, itr::DataFrameColumns) =
    findfirst(f, values(itr))

"""
    findlast(f::Function, itr::DataFrameColumns)

Return the integer index of the last element of `itr` for which `f` returns
`true`. Return `nothing` if there is no such element.

"""
Base.findlast(f::Function, itr::DataFrameColumns) =
    findlast(f, values(itr))

"""
    findall(f::Function, itr::DataFrameColumns)

Return a vector of the integer indices `i` of `itr` where `f(itr[i])` returns
`true`. If there are no such elements of `itr`, return an empty array.
"""
Base.findall(f::Function, itr::DataFrameColumns) =
    findall(f, values(itr))

Base.parent(itr::Union{DataFrameRows, DataFrameColumns}) = getfield(itr, :df)
Base.names(itr::Union{DataFrameRows, DataFrameColumns}) = names(parent(itr))
Base.names(itr::Union{DataFrameRows, DataFrameColumns}, cols) = names(parent(itr), cols)
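
To illustrate the pattern this file adopts, here is a self-contained sketch in plain Julia of a type that is not an `AbstractVector` but still supports iteration, indexing, and the `find*` functions by collecting its elements. `ColumnsView` and its `cols` field are hypothetical stand-ins for illustration only, not the package's implementation.

```julia
# Hypothetical stand-in for DataFrameColumns: wraps a NamedTuple of vectors.
# Note that it does NOT subtype AbstractVector.
struct ColumnsView{T<:NamedTuple}
    cols::T
end

# Iteration protocol: length, eltype, and iterate are defined explicitly,
# since nothing is inherited from AbstractVector.
Base.length(v::ColumnsView) = length(v.cols)
Base.eltype(::Type{<:ColumnsView}) = AbstractVector
Base.iterate(v::ColumnsView, i=1) = i <= length(v) ? (v[i], i + 1) : nothing

# Indexing by position or by name returns the stored vector without copying.
Base.getindex(v::ColumnsView, i::Int) = v.cols[i]
Base.getindex(v::ColumnsView, name::Symbol) = v.cols[name]

# Convenience functions mirroring the ones added in the diff.
Base.keys(v::ColumnsView) = collect(keys(v.cols))
Base.values(v::ColumnsView) = collect(v)

# find* functions delegate to the collected vector of columns.
Base.findfirst(f::Function, v::ColumnsView) = findfirst(f, values(v))
Base.findall(f::Function, v::ColumnsView) = findall(f, values(v))

v = ColumnsView((a = [1, 2], b = ["x", "y"], c = [3, 4]))
first_int = findfirst(col -> eltype(col) <: Integer, v)
all_int = findall(col -> eltype(col) <: Integer, v)
```

Delegating to `values(v)` allocates a new vector on every call. That seems a reasonable trade-off when the number of columns is small, which is presumably the typical case here.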
34 changes: 29 additions & 5 deletions test/iteration.jl
@@ -22,13 +22,16 @@ using Test, DataFrames
@test collect(pairs(row)) isa Vector{Pair{Symbol, Int}}
end

@test size(eachcol(df)) == (size(df, 2),)
@test parent(eachcol(df)) === df
@test names(eachcol(df)) == names(df)
@test IndexStyle(eachcol(df)) == IndexLinear()
@test Base.IndexStyle(eachcol(df)) == IndexLinear()
@test length(eachcol(df)) == size(df, 2)
@test eachcol(df)[1] == df[:, 1]
@test eachcol(df)[:A] == df[:, :A]
@test eachcol(df)[All()] == eachcol(df)
@test isequal(eachcol(df)[[1]], eachcol(df[!, [1]]))
@test eachcol(df).A == df[:, :A]
@test eachcol(df)["A"] == df[:, "A"]
@test eachcol(df)."A" == df[:, "A"]
@test collect(eachcol(df)) isa Vector{AbstractVector}
@test collect(eachcol(df)) == [[1, 2], [2, 3]]
@test eltype(eachcol(df)) == AbstractVector
@@ -90,7 +93,7 @@ end
@test eachrow(sdf) == eachrow(df[[3,1,4], [3,1,4]])
@test size(eachrow(sdf)) == (3,)
@test eachcol(sdf) == eachcol(df[[3,1,4], [3,1,4]])
@test size(eachcol(sdf)) == (3,)
@test length(eachcol(sdf)) == 3
end

@testset "parent mutation" begin
@@ -127,7 +130,7 @@ end
end
end

@testset "keys and pairs for eachcol" begin
@testset "keys, values and pairs for eachcol" begin
df = DataFrame([11:16 21:26 31:36 41:46])

cols = eachcol(df)
@@ -141,6 +144,27 @@ end
@test cols[i] === cols[n]
end
@test_throws ArgumentError cols[:non_existent]

@test values(cols) == collect(cols)
end

@testset "findfirst, findnext, findlast, findprev, findall" begin
    df = DataFrame(a=[1, 2, 1, 2], b=["1", "2", "1", "2"],
                   c=[1, 2, 1, 2], d=["1", "2", "1", "2"])

    rows = eachrow(df)
    @test findfirst(row -> row.a == 1, rows) == 1
    @test findnext(row -> row.a == 1, rows, 2) == 3
    @test findlast(row -> row.a == 1, rows) == 3
    @test findprev(row -> row.a == 1, rows, 2) == 1
    @test findall(row -> row.a == 1, rows) == [1, 3]

    cols = eachcol(df)
    @test findfirst(col -> eltype(col) <: Int, cols) == 1
    @test findnext(col -> eltype(col) <: Int, cols, 2) == 3
    @test findlast(col -> eltype(col) <: Int, cols) == 3
    @test findprev(col -> eltype(col) <: Int, cols, 2) == 1
    @test findall(col -> eltype(col) <: Int, cols) == [1, 3]
end

end # module