Performance of transform! on SubDataFrame #3070

Merged: 10 commits, Jun 14, 2022


@bkamins (Member) commented Jun 8, 2022

Fixes #3069.

What this PR addresses.

  1. The initially reported performance issue. After the PR:
julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(rand(100000, 100), :auto);

julia> subdf = view(df, df.x1 .>= 0.5, :);

julia> @btime transform!($subdf, :x1 => maximum => :x1);
  289.800 μs (128 allocations: 1.15 MiB)

julia> gdf = groupby(subdf, :x2);

julia> @btime transform!($gdf, :x1 => maximum => :x1);
  1.090 ms (300 allocations: 4.21 MiB)

and before the PR:

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(rand(100000, 100), :auto);

julia> subdf = view(df, df.x1 .>= 0.5, :);

julia> @btime transform!($subdf, :x1 => maximum => :x1);
  37.446 ms (1642 allocations: 114.68 MiB)

julia> gdf = groupby(subdf, :x2);

julia> @btime transform!($gdf, :x1 => maximum => :x1);
  35.303 ms (4853 allocations: 117.23 MiB)
  2. Aliasing issue in select[!] and transform[!] on GroupedDataFrame

After the PR:

julia> df = DataFrame(a=1:2, b=3:4);

julia> gdf = groupby(df, :a);

julia> res = select(gdf, :b, :a => :c, :b => :d, copycols=false);

julia> res.a === res.c
false

julia> res.b === res.d
false

and before the PR:

julia> df = DataFrame(a=1:2, b=3:4);

julia> gdf = groupby(df, :a);

julia> res = select(gdf, :b, :a => :c, :b => :d, copycols=false);

julia> res.a === res.c
true

julia> res.b === res.d
true
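
To illustrate why this kind of aliasing matters, here is a minimal sketch that builds an aliased DataFrame by hand with copycols=false (this construction is only an illustration, not code from the PR). When two columns are the same object, mutating one silently changes the other:

```julia
using DataFrames

# Illustration only: put the same vector into two columns without copying.
v = [1, 2]
df = DataFrame(a=v, c=v, copycols=false)

aliased = df.a === df.c   # both columns are the very same object

# Writing through one column is visible through the other.
df.a[1] = 100
println(aliased, " ", df.c[1])
```

This is the situation the "before the PR" output shows for res.a and res.c, and the one the "after the PR" output avoids.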

@nalimilan (Member) left a comment

These issues are tricky as usual...

Is there a performance impact for transform when the number of columns is large? In theory, couldn't the check be restricted to columns that have been used as inputs of transformations?

src/subdataframe/subdataframe.jl (outdated, resolved)
function _replace_columns!(sdf::SubDataFrame, newdf::DataFrame)
colsmatch = _names(sdf) == _names(newdf)

function _replace_columns!(sdf::SubDataFrame, newdf::DataFrame, wastransform::Bool)
Member

Maybe a name like keep_present would be easier to follow (i.e. say what it does rather than when it's set)? Could also make it a keyword argument, as it's hard to guess what it means when seeing just a Bool value.
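
The readability point can be sketched with two toy helpers. Both function names and the name-merging logic are hypothetical and exist only for illustration; they are not the `_replace_columns!` internals. The assumption here is that "keep present" means columns that already exist are retained:

```julia
# Hypothetical toy helpers, not DataFrames.jl code.

# Positional Bool: the call site gives no hint about what `true` means.
merge_names_positional(old, new, wastransform::Bool) =
    wastransform ? union(old, new) : copy(new)

# Keyword argument named after its effect: the call site documents itself.
merge_names_keyword(old, new; keep_present::Bool) =
    keep_present ? union(old, new) : copy(new)

old = [:a, :b]
new = [:b, :c]

r1 = merge_names_positional(old, new, true)         # what does `true` mean?
r2 = merge_names_keyword(old, new; keep_present=true)
r3 = merge_names_keyword(old, new; keep_present=false)
```

Both variants compute the same thing; only the second makes the intent visible where the function is called.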

Member Author

OK - I will change it.

src/groupeddataframe/splitapplycombine.jl: 4 outdated review threads (resolved)
@@ -1812,6 +1812,9 @@ end

# This is not exactly copy! as in general we allow axes to be different
function _replace_columns!(df::DataFrame, newdf::DataFrame)
# here we do not support wastransform argument like for SubDataFrame
# because by default in DataFrame case columns are not copied
# so we need to pass `:` to select! to handle this case correctly
Member

I don't get what select! call this refers to.

Member Author

_replace_columns! is used by both transform! and select!. For DataFrame objects, transform! falls back to select! by passing an extra : (as opposed to what we now do in the GroupedDataFrame case).

I might drop this comment if you feel it is not useful. It is directly related to the fact that select for DataFrame uses a less robust de-aliasing approach (checking the input-output mapping) than the IdDict-based approach I proposed for GroupedDataFrame.
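
The fallback described above can be observed from the public API. The sketch below uses made-up data and an arbitrary transformation, and only checks that, for a plain DataFrame, transform! gives the same result as select! with a leading `:`:

```julia
using DataFrames

df1 = DataFrame(a=1:3, b=4:6)
df2 = copy(df1)

double(x) = 2 .* x

# transform! keeps all existing columns and adds :c ...
transform!(df1, :a => double => :c)

# ... which matches select! when `:` (all columns) is passed first.
select!(df2, :, :a => double => :c)

same = df1 == df2
```

This only demonstrates the user-visible equivalence; it says nothing about how `_replace_columns!` is called internally.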

Member Author

I have simplified the comment

@bkamins (Member Author) commented Jun 8, 2022

> Is there a performance impact for transform when the number of columns is large?

It is negligible. For 10,000 columns it is around 500 μs:

julia> df = DataFrame(rand(100, 10000), :auto);

julia> using BenchmarkTools

julia> @benchmark _dealias_dataframe($df)
BenchmarkTools.Trial: 9883 samples with 1 evaluation.
 Range (min … max):  432.800 μs …   4.738 ms  ┊ GC (min … max): 0.00% … 82.65%
 Time  (median):     485.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   503.228 μs ± 112.816 μs  ┊ GC (mean ± σ):  1.19% ±  4.68%

   ▁▅ ▁▄▃█▇▁
  ▂██▇██████▇▅▄▄▄▅▄▄▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  433 μs           Histogram: frequency by time          853 μs <

 Memory estimate: 344.42 KiB, allocs estimate: 11.

> In theory, couldn't the check be restricted to columns that have been used as inputs of transformations?

It could, and this is what we do for DataFrame in select and transform. However, I reviewed that code before proposing this approach and decided that it is overly complex. Actually, I wish I had had the IdDict idea earlier, as it is much more robust than checking the input-output mapping. Still, for now I decided to leave the old DataFrame code as it is, since processing a DataFrame fast is more important and the "core" of the operation is easier there.
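
The general idea of identity-based de-aliasing can be sketched as follows. The helper name `dealias_columns` and its shape are hypothetical; this is not the `_dealias_dataframe` function benchmarked above, only an illustration of how an IdDict can detect columns that are the same object regardless of how they were produced:

```julia
# Hypothetical helper illustrating the IdDict idea; not the PR's implementation.
function dealias_columns(cols::AbstractVector{<:AbstractVector})
    seen = IdDict{Any,Nothing}()          # keys are compared with ===
    result = Vector{AbstractVector}(undef, length(cols))
    for (i, col) in pairs(cols)
        if haskey(seen, col)
            result[i] = copy(col)         # repeated object: store a fresh copy
        else
            seen[col] = nothing           # first occurrence: keep as is
            result[i] = col
        end
    end
    return result
end

a = [1, 2]
b = [3, 4]
cols = dealias_columns([a, b, a, b])

first_kept   = cols[1] === a              # first occurrence is not copied
repeat_fresh = cols[3] !== a && cols[3] == a
```

Because the check looks only at the resulting columns, it does not need to know which inputs each transformation used, which is what makes it simpler than tracking the input-output mapping.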

Resolved review threads:
test/select.jl: 4 threads (1 outdated)
src/groupeddataframe/splitapplycombine.jl: 1 thread (outdated)
src/abstractdataframe/selection.jl: 1 thread (outdated)
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit 0ce9b0f into main Jun 14, 2022
@bkamins bkamins deleted the bk/trainsform!_performance branch June 14, 2022 08:29
@bkamins (Member Author) commented Jun 14, 2022

Thank you!

Development

Successfully merging this pull request may close these issues.

Performance issue of transform! on a SubDataFrame
2 participants