Performance of transform! on SubDataFrame #3070

bkamins · 2022-06-08T09:20:44Z

This PR addresses two issues.

The first is the initially reported performance issue (#3069). After the PR:

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(rand(100000, 100), :auto);

julia> subdf = view(df, df.x1 .>= 0.5, :);

julia> @btime transform!($subdf, :x1 => maximum => :x1);
  289.800 μs (128 allocations: 1.15 MiB)

julia> gdf = groupby(subdf, :x2);

julia> @btime transform!($gdf, :x1 => maximum => :x1);
  1.090 ms (300 allocations: 4.21 MiB)

and before the PR:

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(rand(100000, 100), :auto);

julia> subdf = view(df, df.x1 .>= 0.5, :);

julia> @btime transform!($subdf, :x1 => maximum => :x1);
  37.446 ms (1642 allocations: 114.68 MiB)

julia> gdf = groupby(subdf, :x2);

julia> @btime transform!($gdf, :x1 => maximum => :x1);
  35.303 ms (4853 allocations: 117.23 MiB)

The second is an aliasing issue in select[!] and transform[!] on GroupedDataFrame.

After the PR:

julia> df = DataFrame(a=1:2, b=3:4);

julia> gdf = groupby(df, :a);

julia> res = select(gdf, :b, :a => :c, :b => :d, copycols=false);

julia> res.a === res.c
false

julia> res.b === res.d
false

and before the PR:

julia> df = DataFrame(a=1:2, b=3:4);

julia> gdf = groupby(df, :a);

julia> res = select(gdf, :b, :a => :c, :b => :d, copycols=false);

julia> res.a === res.c
true

julia> res.b === res.d
true
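To illustrate why the aliasing matters, here is a minimal sketch that is not taken from the PR. It builds an aliased data frame by hand with the DataFrame constructor and copycols=false; the variable names are illustrative. When two columns are the same vector, a write through one name shows up under the other:

```julia
using DataFrames

v = [1, 2]
# copycols=false stores `v` itself, so both columns are the same object
aliased = DataFrame(a=v, c=v, copycols=false)
@assert aliased.a === aliased.c

aliased.a[1] = 100           # a write through column :a ...
@assert aliased.c[1] == 100  # ... is visible through column :c

# storing an explicit copy breaks the link between the two columns
dealiased = DataFrame(a=v, c=copy(v), copycols=false)
dealiased.a[1] = -1
@assert dealiased.c[1] == 100
```

This is the kind of surprise the res.a === res.c check above guards against.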

nalimilan

These issues are tricky as usual...

Is there a performance impact for transform when the number of columns is large? In theory, couldn't the check be restricted to columns that have been used as inputs of transformations?

nalimilan · 2022-06-08T13:13:53Z

src/subdataframe/subdataframe.jl

-function _replace_columns!(sdf::SubDataFrame, newdf::DataFrame)
-    colsmatch = _names(sdf) == _names(newdf)
-
+function _replace_columns!(sdf::SubDataFrame, newdf::DataFrame, wastransform::Bool)


Maybe a name like keep_present would be easier to follow (i.e. say what it does rather than when it's set)? Could also make it a keyword argument, as it's hard to guess what it means when seeing just a Bool value.

bkamins · 2022-06-08

OK, I will change it.
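The readability point can be sketched as follows. Everything here is a hypothetical stand-in: replace_entries! is not the real _replace_columns!, the Dict arguments stand in for the data frames, and the meaning given to keep_present (keep entries of the destination that the source does not provide) is an assumption made for illustration only.

```julia
# Hypothetical stand-in: `dest` and `src` map column names to columns.
function replace_entries!(dest::Dict{Symbol,Vector{Int}},
                          src::Dict{Symbol,Vector{Int}};
                          keep_present::Bool)
    if !keep_present
        # drop the entries of `dest` that `src` does not provide
        filter!(p -> haskey(src, p.first), dest)
    end
    for (name, col) in src
        dest[name] = col  # add new entries and overwrite existing ones
    end
    return dest
end

src = Dict(:b => [30, 40], :c => [5, 6])

kept = replace_entries!(Dict(:a => [1, 2], :b => [3, 4]), src; keep_present=true)
dropped = replace_entries!(Dict(:a => [1, 2], :b => [3, 4]), src; keep_present=false)
```

At the call site, keep_present=true documents itself, whereas a bare positional true would not.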

nalimilan · 2022-06-08T13:31:58Z

src/dataframe/dataframe.jl

@@ -1812,6 +1812,9 @@ end

 # This is not exactly copy! as in general we allow axes to be different
 function _replace_columns!(df::DataFrame, newdf::DataFrame)
+    # here we do not support wastransform argument like for SubDataFrame
+    # because by default in DataFrame case columns are not copied
+    # so we need to pass `:` to select! to handle this case correctly


nalimilan · 2022-06-08

I don't get what select! call this refers to.

bkamins · 2022-06-08

_replace_columns! is used by both transform! and select!. For DataFrame objects, transform! falls back to select! by passing an extra : selector (as opposed to what we now do in the GroupedDataFrame case).

I can drop this comment if you feel it is not useful. It is related to the fact that select for DataFrame uses a less robust de-aliasing approach (checking the input-output mapping) than the IdDict-based approach I proposed for GroupedDataFrame.

bkamins · 2022-06-08

I have simplified the comment.
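The relationship between transform and select on a plain DataFrame can be checked with the public API. This is a small sketch using the non-mutating variants so that the input stays untouched; the column names and the doubling function are illustrative:

```julia
using DataFrames

df = DataFrame(a=1:2, b=3:4)
double(x) = 2 .* x

# transform keeps all existing columns and adds :c ...
t = transform(df, :a => double => :c)
# ... which matches select with an extra `:` (all columns) selector in front
s = select(df, :, :a => double => :c)

@assert t == s
@assert names(t) == ["a", "b", "c"]
```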

bkamins · 2022-06-08T19:08:20Z

> Is there a performance impact for transform when the number of columns is large?

It is negligible: for 10,000 columns it is around 500 μs.

julia> df = DataFrame(rand(100, 10000), :auto);

julia> using BenchmarkTools

julia> @benchmark _dealias_dataframe($df)
BenchmarkTools.Trial: 9883 samples with 1 evaluation.
 Range (min … max):  432.800 μs …   4.738 ms  ┊ GC (min … max): 0.00% … 82.65%
 Time  (median):     485.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   503.228 μs ± 112.816 μs  ┊ GC (mean ± σ):  1.19% ±  4.68%

   ▁▅ ▁▄▃█▇▁
  ▂██▇██████▇▅▄▄▄▅▄▄▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  433 μs           Histogram: frequency by time          853 μs <

 Memory estimate: 344.42 KiB, allocs estimate: 11.

> In theory, couldn't the check be restricted to columns that have been used as inputs of transformations?

It could, and this is what we do for DataFrame in select and transform. But I reviewed the code before proposing this approach and decided that it would be overly complex here. Actually, I wish I had had this idea (i.e. using IdDict) earlier, as it is much more robust than checking the input-output mapping. Still, for now I decided to leave the DataFrame code as it is, since processing a DataFrame fast is more important there because the "core" of the operation is easier.
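The IdDict idea can be sketched in plain Julia. This is not the package's _dealias_dataframe. The helper name dealias_columns is hypothetical, and the sketch only detects columns that are the very same object (===). It does not, for example, detect two distinct views of one parent array.

```julia
# Hypothetical helper: return the columns with every repeated object
# replaced by a copy, so that no two entries are the same vector.
function dealias_columns(cols::AbstractVector)
    seen = IdDict{Any,Nothing}()  # keys are compared by identity (===)
    result = Vector{AbstractVector}(undef, length(cols))
    for (i, col) in enumerate(cols)
        if haskey(seen, col)
            result[i] = copy(col)  # this object was already stored earlier
        else
            seen[col] = nothing
            result[i] = col        # first occurrence is kept as-is
        end
    end
    return result
end

v = [1, 2, 3]
cols = dealias_columns([v, v, [4, 5, 6]])
@assert cols[1] === v
@assert cols[2] !== v && cols[2] == v
```

Each column costs one dictionary lookup no matter which transformations produced it, which is why no input-output bookkeeping is needed.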

bkamins · 2022-06-14T08:29:23Z

Thank you!

bkamins added 2 commits June 8, 2022 10:03

change implementation of transform!

dbf7cdb

Performance issue of transform! on a SubDataFrame

8b13582

bkamins added bug performance grouping labels Jun 8, 2022

bkamins added this to the 1.4 milestone Jun 8, 2022

bkamins requested a review from nalimilan June 8, 2022 09:20

bkamins mentioned this pull request Jun 8, 2022

Performance issue of transform! on a SubDataFrame #3069

Closed

fix comment

2565045

nalimilan reviewed Jun 8, 2022

bkamins commented Jun 8, 2022

bkamins and others added 3 commits June 8, 2022 21:17

Apply suggestions from code review

c663bf9

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

apply suggestions after code review

d1d1509

Merge branch 'main' into bk/trainsform!_performance

5c860e0

bkamins commented Jun 12, 2022

Update src/abstractdataframe/selection.jl

fdd4e9c

bkamins commented Jun 12, 2022

Update src/abstractdataframe/selection.jl

a2e53c1

bkamins mentioned this pull request Jun 12, 2022

Metadata on data frame and column level #3055

Merged

bkamins commented Jun 13, 2022

bkamins commented Jun 13, 2022

Apply suggestions from code review

f2b09a2

nalimilan approved these changes Jun 13, 2022

Apply suggestions from code review

c439996

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins merged commit 0ce9b0f into main Jun 14, 2022

bkamins deleted the bk/trainsform!_performance branch June 14, 2022 08:29
