allow passing empty sets of columns to ByRow and filter #2476

bkamins · 2020-10-09T11:59:08Z

No description provided.

bkamins · 2020-10-09T12:00:27Z

bkamins · 2020-10-09T12:01:59Z

This is a change before a final GroupedDataFrame rules rewrite for returning multiple columns that is relatively orthogonal so I split it to a separate PR to make reviews simpler (this PR should be relatively easy to review).

test/grouping.jl

NEWS.md

test/grouping.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

…formance

…empty_byrow

bkamins · 2020-10-11T08:44:39Z

@nalimilan - last two commits are attempts to improve code and compilation latency (hopefully not sacrificing performance for large tables)

@pdeffebach - some benchmarking would be welcome (following the discussion on Slack)

src/abstractdataframe/selection.jl

bkamins · 2020-10-11T16:39:31Z

This is what I have now. I think it is good enough for now. I will merge this PR once the tests pass (I have extracted one common case to a separate function and added @nospecialize where tests indicated that it is beneficial; further optimizations are left for the future).

After this PR:

julia> using DataFrames

julia> df = DataFrame(a=1);

julia> @time begin
           @time select(df, :a => identity)
           @time select(df, :a => identity)
           @time select(df, :a => x -> 1)
           @time select(df, :a => x -> 1)
           @time select(df, :a => identity => :c)
           @time select(df, :a => identity => :c)
           @time select(df, :a => (x -> 1) => :c)
           @time select(df, :a => (x -> 1) => :c)
           @time select(df, [:a] => identity)
           @time select(df, [:a] => identity)
           @time select(df, [:a] => x -> 1)
           @time select(df, [:a] => x -> 1)
           @time select(df, [:a] => (x -> [[1]]) => [:c])
           @time select(df, [:a] => (x -> [[1]]) => [:c])
       end;
  1.242834 seconds (4.81 M allocations: 251.046 MiB, 7.05% gc time)
  0.000143 seconds (84 allocations: 5.000 KiB)
  0.099327 seconds (215.07 k allocations: 11.366 MiB)
  0.096288 seconds (215.05 k allocations: 11.376 MiB)
  0.063307 seconds (160.21 k allocations: 8.505 MiB)
  0.000091 seconds (79 allocations: 4.875 KiB)
  0.111864 seconds (201.95 k allocations: 10.764 MiB, 12.66% gc time)
  0.098550 seconds (201.95 k allocations: 10.762 MiB)
  0.310056 seconds (899.44 k allocations: 45.872 MiB, 3.69% gc time)
  0.000123 seconds (95 allocations: 5.641 KiB)
  0.105463 seconds (291.17 k allocations: 15.455 MiB)
  0.120932 seconds (291.16 k allocations: 15.453 MiB, 8.62% gc time)
  0.364569 seconds (1.50 M allocations: 76.191 MiB, 2.57% gc time)
  0.100328 seconds (224.71 k allocations: 12.368 MiB, 11.53% gc time)
  2.762119 seconds (9.06 M allocations: 472.373 MiB, 5.24% gc time)

Before this PR:

julia> using DataFrames

julia> df = DataFrame(a=1);

julia> @time begin
           @time select(df, :a => identity)
           @time select(df, :a => identity)
           @time select(df, :a => x -> 1)
           @time select(df, :a => x -> 1)
           @time select(df, :a => identity => :c)
           @time select(df, :a => identity => :c)
           @time select(df, :a => (x -> 1) => :c)
           @time select(df, :a => (x -> 1) => :c)
           @time select(df, [:a] => identity)
           @time select(df, [:a] => identity)
           @time select(df, [:a] => x -> 1)
           @time select(df, [:a] => x -> 1)
           @time select(df, [:a] => (x -> [[1]]) => [:c])
           @time select(df, [:a] => (x -> [[1]]) => [:c])
       end;
  1.365161 seconds (5.46 M allocations: 284.203 MiB, 6.30% gc time)
  0.000120 seconds (82 allocations: 5.000 KiB)
  0.117532 seconds (236.30 k allocations: 12.589 MiB)
  0.137019 seconds (236.28 k allocations: 12.601 MiB, 14.71% gc time)
  0.091981 seconds (184.44 k allocations: 9.884 MiB)
  0.000107 seconds (73 allocations: 4.703 KiB)
  0.120085 seconds (224.36 k allocations: 12.062 MiB)
  0.119065 seconds (224.36 k allocations: 12.060 MiB)
  0.347304 seconds (803.06 k allocations: 40.986 MiB, 8.86% gc time)
  0.000157 seconds (87 allocations: 5.484 KiB)
  0.218455 seconds (314.21 k allocations: 16.766 MiB, 35.85% gc time)
  0.127758 seconds (314.20 k allocations: 16.769 MiB)
  0.470066 seconds (1.51 M allocations: 76.974 MiB, 14.84% gc time)
  0.110272 seconds (247.74 k allocations: 13.680 MiB)
  3.274517 seconds (9.81 M allocations: 511.783 MiB, 8.70% gc time)

release 0.21 (does not support last two operations):

julia> using DataFrames

julia> df = DataFrame(a=1);

julia> @time begin
           @time select(df, :a => identity)
           @time select(df, :a => identity)
           @time select(df, :a => x -> 1)
           @time select(df, :a => x -> 1)
           @time select(df, :a => identity => :c)
           @time select(df, :a => identity => :c)
           @time select(df, :a => (x -> 1) => :c)
           @time select(df, :a => (x -> 1) => :c)
           @time select(df, [:a] => identity)
           @time select(df, [:a] => identity)
           @time select(df, [:a] => x -> 1)
           @time select(df, [:a] => x -> 1)
           #@time select(df, [:a] => (x -> [[1]]) => [:c])
           #@time select(df, [:a] => (x -> [[1]]) => [:c])
       end;
  0.802099 seconds (3.12 M allocations: 160.918 MiB, 6.74% gc time)
  0.000086 seconds (48 allocations: 3.453 KiB)
  0.186024 seconds (342.27 k allocations: 18.242 MiB, 7.50% gc time)
  0.161856 seconds (340.47 k allocations: 18.134 MiB)
  0.099809 seconds (214.92 k allocations: 11.570 MiB)
  0.000075 seconds (46 allocations: 3.344 KiB)
  0.187048 seconds (323.27 k allocations: 17.314 MiB, 6.39% gc time)
  0.164101 seconds (323.26 k allocations: 17.309 MiB)
  0.381791 seconds (954.31 k allocations: 49.240 MiB, 2.82% gc time)
  0.000108 seconds (64 allocations: 4.297 KiB)
  0.190896 seconds (417.56 k allocations: 22.295 MiB, 5.30% gc time)
  0.173000 seconds (417.53 k allocations: 22.289 MiB)
  2.399654 seconds (6.52 M allocations: 340.514 MiB, 4.20% gc time)

bkamins · 2020-10-12T08:54:34Z

Thank you!

bkamins added 2 commits October 9, 2020 13:58

allow passing empty sets of columns to ByRow and filter

cc8562b

fix NEWS.md

170a247

bkamins linked an issue Oct 9, 2020 that may be closed by this pull request

ByRow without columns #2457

Closed

bkamins requested a review from nalimilan October 9, 2020 12:00

bkamins commented Oct 9, 2020

View reviewed changes

test/grouping.jl Show resolved Hide resolved

bkamins added feature grouping non-breaking The proposed change is not breaking labels Oct 9, 2020

bkamins added this to the 1.0 milestone Oct 9, 2020

nalimilan approved these changes Oct 10, 2020

View reviewed changes

NEWS.md Outdated Show resolved Hide resolved

test/grouping.jl Show resolved Hide resolved

bkamins and others added 4 commits October 11, 2020 10:07

Update NEWS.md

1ec2aad

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

extract out transformation into a function to potentially improve per…

ece1eef

…formance

Merge remote-tracking branch 'origin/filter_empty_byrow' into filter_…

d40da5c

…empty_byrow

attempts to reduce compilation latency

2bdb3de

bkamins mentioned this pull request Oct 11, 2020

Transform performance #2479

Closed

nalimilan reviewed Oct 11, 2020

View reviewed changes

src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved

add one method for common case and @nospecialize selectively

d872942

bkamins merged commit b11fe97 into JuliaData:master Oct 12, 2020

bkamins deleted the filter_empty_byrow branch October 12, 2020 08:54

JuliaRegistrator mentioned this pull request Nov 15, 2020

New version: DataFrames v0.22.0 JuliaRegistries/General#24650

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

allow passing empty sets of columns to ByRow and filter #2476

allow passing empty sets of columns to ByRow and filter #2476

bkamins commented Oct 9, 2020

bkamins commented Oct 9, 2020

bkamins commented Oct 9, 2020

bkamins commented Oct 11, 2020

bkamins commented Oct 11, 2020

bkamins commented Oct 12, 2020

allow passing empty sets of columns to ByRow and filter #2476

allow passing empty sets of columns to ByRow and filter #2476

Conversation

bkamins commented Oct 9, 2020

bkamins commented Oct 9, 2020

bkamins commented Oct 9, 2020

bkamins commented Oct 11, 2020

bkamins commented Oct 11, 2020

bkamins commented Oct 12, 2020