-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
allow push!/pushfirst!/append!/prepend! with multiple values #3372
Changes from 6 commits
6c70d9e
1f52a91
849967a
0deac64
caac620
da4f254
0ef997e
cd1c206
b107105
d7f8fff
58c988d
269cc55
c129cff
f213d3f
5f57ac8
a2c4873
d5c7979
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,10 @@ | ||
""" | ||
append!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal, | ||
promote::Bool=(cols in [:union, :subset])) | ||
append!(df::DataFrame, table; cols::Symbol=:setequal, | ||
append!(df::DataFrame, tables...; cols::Symbol=:setequal, | ||
promote::Bool=(cols in [:union, :subset])) | ||
|
||
Add the rows of `df2` to the end of `df`. If the second argument `table` is not | ||
an `AbstractDataFrame` then it is converted using `DataFrame(table, | ||
copycols=false)` before being appended. | ||
Add the rows of tables passed as `tables` to the end of `df`. If the table is not | ||
an `AbstractDataFrame` then it is converted using | ||
`DataFrame(table, copycols=false)` before being appended. | ||
|
||
The exact behavior of `append!` depends on the `cols` argument: | ||
* If `cols == :setequal` (this is the default) then `df2` must contain exactly | ||
|
@@ -78,18 +76,53 @@ julia> df1 | |
4 │ 4 4 | ||
5 │ 5 5 | ||
6 │ 6 6 | ||
|
||
julia> append!(df2, DataFrame(A=1), (; C=1:2), cols=:union) | ||
6×3 DataFrame | ||
Row │ A B C | ||
│ Float64? Int64? Int64? | ||
─────┼───────────────────────────── | ||
1 │ 4.0 4 missing | ||
2 │ 5.0 5 missing | ||
3 │ 6.0 6 missing | ||
4 │ 1.0 missing missing | ||
5 │ missing missing 1 | ||
6 │ missing missing 2 | ||
``` | ||
""" | ||
Base.append!(df1::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal, | ||
promote::Bool=(cols in [:union, :subset])) = | ||
_append_or_prepend!(df1, df2, cols=cols, promote=promote, atend=true) | ||
|
||
function Base.append!(df::DataFrame, table; cols::Symbol=:setequal, | ||
promote::Bool=(cols in [:union, :subset])) | ||
if table isa Dict && cols == :orderequal | ||
throw(ArgumentError("passing `Dict` as `table` when `cols` is equal to " * | ||
"`:orderequal` is not allowed as it is unordered")) | ||
end | ||
append!(df, DataFrame(table, copycols=false), cols=cols, promote=promote) | ||
end | ||
|
||
function Base.append!(df::DataFrame, @nospecialize tables...; | ||
cols::Symbol=:setequal, | ||
promote::Bool=(cols in [:union, :subset])) | ||
if !(cols in (:orderequal, :setequal, :intersect, :subset, :union)) | ||
throw(ArgumentError("`cols` keyword argument must be " * | ||
":orderequal, :setequal, :intersect, :subset or :union)")) | ||
end | ||
|
||
return foldl((df, table) -> append!(df, table, cols=cols, promote=promote), | ||
collect(Any, tables), init=df) | ||
end | ||
|
||
""" | ||
prepend!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal, | ||
promote::Bool=(cols in [:union, :subset])) | ||
prepend!(df::DataFrame, table; cols::Symbol=:setequal, | ||
prepend!(df::DataFrame, tables...; cols::Symbol=:setequal, | ||
promote::Bool=(cols in [:union, :subset])) | ||
|
||
Add the rows of tables passed as `tables` to the beginning of `df`. If the table is not | ||
an `AbstractDataFrame` then it is converted using | ||
`DataFrame(table, copycols=false)` before being appended. | ||
|
||
Add the rows of `df2` to the beginning of `df`. If the second argument `table` | ||
is not an `AbstractDataFrame` then it is converted using `DataFrame(table, | ||
copycols=false)` before being prepended. | ||
|
@@ -164,12 +197,45 @@ julia> df1 | |
4 │ 1 1 | ||
5 │ 2 2 | ||
6 │ 3 3 | ||
|
||
julia> prepend!(df2, DataFrame(A=1), (; C=1:2), cols=:union) | ||
6×3 DataFrame | ||
Row │ A B C | ||
│ Float64? Int64? Int64? | ||
─────┼───────────────────────────── | ||
1 │ 1.0 missing missing | ||
2 │ missing missing 1 | ||
3 │ missing missing 2 | ||
4 │ 4.0 4 missing | ||
5 │ 5.0 5 missing | ||
6 │ 6.0 6 missing | ||
``` | ||
""" | ||
Base.prepend!(df1::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal, | ||
promote::Bool=(cols in [:union, :subset])) = | ||
_append_or_prepend!(df1, df2, cols=cols, promote=promote, atend=false) | ||
|
||
function Base.prepend!(df::DataFrame, table; cols::Symbol=:setequal, | ||
promote::Bool=(cols in [:union, :subset])) | ||
if table isa Dict && cols == :orderequal | ||
throw(ArgumentError("passing `Dict` as `table` when `cols` is equal to " * | ||
"`:orderequal` is not allowed as it is unordered")) | ||
end | ||
prepend!(df, DataFrame(table, copycols=false), cols=cols, promote=promote) | ||
end | ||
|
||
function Base.prepend!(df::DataFrame, @nospecialize tables...; | ||
cols::Symbol=:setequal, | ||
promote::Bool=(cols in [:union, :subset])) | ||
if !(cols in (:orderequal, :setequal, :intersect, :subset, :union)) | ||
throw(ArgumentError("`cols` keyword argument must be " * | ||
":orderequal, :setequal, :intersect, :subset or :union)")) | ||
end | ||
|
||
return foldr((table, df) -> prepend!(df, table, cols=cols, promote=promote), | ||
collect(Any, tables), init=df) | ||
end | ||
|
||
function _append_or_prepend!(df1::DataFrame, df2::AbstractDataFrame; cols::Symbol, | ||
promote::Bool, atend::Bool) | ||
if !(cols in (:orderequal, :setequal, :intersect, :subset, :union)) | ||
|
@@ -355,6 +421,10 @@ following way: | |
added to `df` (using `missing` for existing rows) and a `missing` value is | ||
pushed to columns missing in `row` that are present in `df`. | ||
|
||
If `row` is not a `DataFrameRow`, `NamedTuple`, `AbstractDict`, or `Tables.AbstractRow` | ||
the value of the `cols` argument is ignored and it is only allowed to set it to | ||
`:setequal` or `:orderequal`. | ||
|
||
If `promote=true` and element type of a column present in `df` does not allow | ||
the type of a pushed argument then a new column with a promoted element type | ||
allowing it is freshly allocated and stored in `df`. If `promote=false` an error | ||
|
@@ -371,12 +441,16 @@ $METADATA_FIXED | |
""" | ||
|
||
""" | ||
push!(df::DataFrame, row::Union{Tuple, AbstractArray}; promote::Bool=false) | ||
push!(df::DataFrame, row::Union{Tuple, AbstractArray}; | ||
cols::Symbol=:setequal, promote::Bool=false) | ||
push!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict, | ||
Tables.AbstractRow}; | ||
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset])) | ||
push!(df::DataFrame, rows...; | ||
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset])) | ||
|
||
Add one row at the end of `df` in-place, taking the values from `row`. | ||
Several rows can be added by passing them as separate arguments from `rows`. | ||
|
||
$INSERTION_COMMON | ||
|
||
|
@@ -452,18 +526,38 @@ julia> push!(df, NamedTuple(), cols=:subset) | |
6 │ 11 12 missing | ||
7 │ 1.0 missing 1.0 | ||
8 │ missing missing missing | ||
|
||
julia> push!(DataFrame(a=1, b=2), (3, 4), (b=6, a=5)) | ||
3×2 DataFrame | ||
Row │ a b | ||
│ Int64 Int64 | ||
─────┼────────────── | ||
1 │ 1 2 | ||
2 │ 3 4 | ||
3 │ 5 6 | ||
``` | ||
""" | ||
Base.push!(df::DataFrame, row::Any; promote::Bool=false) = | ||
_row_inserter!(df, -1, row, Val{:push}(), promote) | ||
function Base.push!(df::DataFrame, row::Any; | ||
cols=:setequal, promote::Bool=false) | ||
if !(cols in (:setequal, :orderequal)) | ||
throw(ArgumentError("`cols` keyword argument must be " * | ||
":orderequal or :setequal")) | ||
end | ||
|
||
return _row_inserter!(df, -1, row, Val{:push}(), promote) | ||
end | ||
|
||
""" | ||
pushfirst!(df::DataFrame, row::Union{Tuple, AbstractArray}; promote::Bool=false) | ||
pushfirst!(df::DataFrame, row::Union{Tuple, AbstractArray}; | ||
cols::Symbol=:setequal, promote::Bool=false) | ||
pushfirst!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict, | ||
Tables.AbstractRow}; | ||
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset])) | ||
pushfirst!(df::DataFrame, rows...; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note that I have reverted this part of your update of the docs. The reason is that |
||
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset])) | ||
bkamins marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
Add one row at the beginning of `df` in-place, taking the values from `row`. | ||
bkamins marked this conversation as resolved.
Show resolved
Hide resolved
|
||
Several rows can be added by passing them as separate arguments from `rows`. | ||
|
||
$INSERTION_COMMON | ||
|
||
|
@@ -539,13 +633,30 @@ julia> pushfirst!(df, NamedTuple(), cols=:subset) | |
6 │ a 1 missing | ||
7 │ b 2 missing | ||
8 │ c 3 missing | ||
|
||
julia> pushfirst!(DataFrame(a=1, b=2), (3, 4), (b=6, a=5)) | ||
3×2 DataFrame | ||
Row │ a b | ||
│ Int64 Int64 | ||
─────┼────────────── | ||
1 │ 3 4 | ||
2 │ 5 6 | ||
3 │ 1 2 | ||
``` | ||
""" | ||
Base.pushfirst!(df::DataFrame, row::Any; promote::Bool=false) = | ||
_row_inserter!(df, -1, row, Val{:pushfirst}(), promote) | ||
function Base.pushfirst!(df::DataFrame, row::Any; | ||
cols=:setequal, promote::Bool=false) | ||
if !(cols in (:setequal, :orderequal)) | ||
throw(ArgumentError("`cols` keyword argument must be " * | ||
":orderequal or :setequal")) | ||
end | ||
|
||
return _row_inserter!(df, -1, row, Val{:pushfirst}(), promote) | ||
end | ||
|
||
""" | ||
insert!(df::DataFrame, index::Integer, row::Union{Tuple, AbstractArray}; promote::Bool=false) | ||
insert!(df::DataFrame, index::Integer, row::Union{Tuple, AbstractArray}; | ||
cols::Symbol=:setequal, promote::Bool=false) | ||
insert!(df::DataFrame, index::Integer, row::Union{DataFrameRow, NamedTuple, | ||
AbstractDict, Tables.AbstractRow}; | ||
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset])) | ||
|
@@ -629,7 +740,12 @@ julia> insert!(df, 3, NamedTuple(), cols=:subset) | |
8 │ 1.0 missing 1.0 | ||
``` | ||
""" | ||
function Base.insert!(df::DataFrame, index::Integer, row::Any; promote::Bool=false) | ||
function Base.insert!(df::DataFrame, index::Integer, row::Any; | ||
cols=:setequal, promote::Bool=false) | ||
if !(cols in (:setequal, :orderequal)) | ||
throw(ArgumentError("`cols` keyword argument must be " * | ||
":orderequal or :setequal")) | ||
end | ||
index isa Bool && throw(ArgumentError("invalid index: $index of type Bool")) | ||
1 <= index <= nrow(df)+1 || | ||
throw(ArgumentError("invalid index: $index for data frame with $(nrow(df)) rows")) | ||
|
@@ -986,3 +1102,26 @@ function _row_inserter!(df::DataFrame, loc::Integer, | |
_drop_all_nonnote_metadata!(df) | ||
return df | ||
end | ||
|
||
function Base.push!(df::DataFrame, @nospecialize rows...; | ||
cols::Symbol=:setequal, | ||
promote::Bool=(cols in [:union, :subset])) | ||
if !(cols in (:orderequal, :setequal, :intersect, :subset, :union)) | ||
throw(ArgumentError("`cols` keyword argument must be " * | ||
":orderequal, :setequal, :intersect, :subset or :union)")) | ||
end | ||
|
||
return foldl((df, row) -> push!(df, row, cols=cols, promote=promote), | ||
collect(Any, rows), init=df) | ||
end | ||
|
||
function Base.pushfirst!(df::DataFrame, @nospecialize rows...; | ||
cols::Symbol=:setequal, | ||
promote::Bool=(cols in [:union, :subset])) | ||
if !(cols in (:orderequal, :setequal, :intersect, :subset, :union)) | ||
throw(ArgumentError("`cols` keyword argument must be " * | ||
":orderequal, :setequal, :intersect, :subset or :union)")) | ||
end | ||
return foldr((row, df) -> pushfirst!(df, row, cols=cols, promote=promote), | ||
collect(Any, rows), init=df) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we allow
:orderequal
? I imagine this value is used only when you want to protect yourself from possible inversions of columns, and without names we cannot guarantee that. (Of course:setequal
is also a bit weird but we need to allow the default.)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that when we allow for
push!(df, (1,2), (a=1, b=2), cols=:orderequal)
. That is - we allow for mixing rows with names and without names, in which casecols=:orderequal
makes sense in some cases. That is why I allowed forcols
in the first place.If we wanted to disallow this (i.e. disallow mixing named and unnamed containers, which I would also be OK with) then I will redesign the proposal and disallow
cols
when unnamed containers are passed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's not too complex, it would be safer to do that, yes. It shouldn't be super common to add several rows of different types at the same time, and people can use two calls if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the code to disallow mixing.
Also I want to check with you the idea that we could allow to write:
instead of current:
as the latter gets problematic for a lot of passed arguments case:
This is a bit of overuse of
ByRow
(which was designed for other purpose), but I found it a natural name. What do you think? (if you think it is OK, then the same decision is withappend!
andprepend!
- do we also want to allow for theByRow
option instead of splatting multiple tables?)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan TBH. If you have a collection of rows, wouldn't it be more logical to use
append!
orprepend!
? Currently you can doappend!(DataFrame(), Tables.dictrowtable(repeat([(;a=1), (;b=2)], 10000)), cols=:union)
, right? It's not super easy to discover, but theByRow
trick isn't either.Maybe we could simplify this somehow? For example, would there be a way to make
append!(DataFrame(), repeat([(;a=1), (;b=2)], 10000), cols=:union)
automatically wrap the input vector in aTables.dictrowtable
if needed?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In short: we are looking for equivalent of
Tables.dictrowtable
that would be lazy and allocate less (and it is enough that it supportsTables.AbstractRow
interface, it does not have to be a dictionary.In detail: When you do:
it is very slow and allocates a lot. The reason is that
Tables.DictRowTable
has three fields::names
,:types
, and:values
. And:values
is eager and createsDict
for each entry.The lazy variant could have
:names
and:types
fields but:values
could just point lazily to the source table and later iterate its rows when needed. In particular if source table is columnar creating such an iterator should be super cheap (as we do not even need to iterate it most likely). If the source table has row-wise storage then we would need to iterate it once.This was the initial idea. In short, to reduce the cost of
Tables.dictrowtable
.Now regarding your proposal:
This is a valid way to implement it and maybe indeed better and sufficient (i.e. we do not have to be lazy as e.g
Tables.columns
can materialize the columns provided it is efficient).The point is that if we could pass
Tables.columns(x, cols=:union)
in the original code this is exactly what is needed.Then the
cols
kwarg ideally could have the following values:cols == :setequal
(this is the default) then rows must contain exactly the same columns (but possibly in a different order), order is defined by the first row.cols == :orderequal
then rows must contain the same columns in the same ordercols == :intersect
then rows may contain more columns than first row, but all column names that are present in first row must be present in all rows and only they are used to populate a new rows (this is the current behavior).cols == :subset
then the behavior is like for:intersect
but if some column is missing in rows then amissing
value is pushed.cols == :union
then column unioning is performedOf course we do not have to support all. I think natural options that must be supported are:
:intersect
(as this is the current behavior):union
(this is the most commonly requested extension):orderequal
(this is what probably people typically expect by default as they even do not know that the current behavior is:intersect
)The reason is that we then could call e.g.
DataFrame(Tables.columns(x, cols=:union))
and it would be very fast, while now the operation is super slow:compare it to:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the upper bound for what is achievable for the example data is:
This is still a bit inefficient (as we are adding data to the
df
row by row, trying to do union each time), but it is already much faster thanTables.dictrowtable
, so I expect that it is possible to get a sub-second performance.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I was thinking that
append!(..., cols=:union)
would callDataFrame(Tables.column(t, cols=:union))
. It seems enough to haveTables.columns
supportcols=:union
for that.cols=:orderequal
andcols=:setequal
would also make sense but I wouldn't use them inappend!
.OTC, adding
cols=:subset
doesn't sound like a good idea to me as it seems weird to me to take the first row as a reference. The fact that the current behavior does that isn't great IMO. At any rate calling itcols=:intersect
could be confusing as I would expect it to only retain columns that appear in all rows.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I realized that we already have
Tables.dictcolumntable
which is (as expected) much faster:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan - let us merge this PR (if you are OK with it). Then I will make a separate PR that instead of your proposed
DataFrame(Tables.column(t, cols=:union))
will callDataFrame(Tables.dictcolumntable(t))
whencols=:union
. This should be a good enough solution.