Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow passing empty sets of columns to ByRow and filter #2476

Merged
merged 7 commits into from
Oct 12, 2020

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Oct 9, 2020

No description provided.

@bkamins
Copy link
Member Author

bkamins commented Oct 9, 2020

Fixes #2457

@bkamins bkamins linked an issue Oct 9, 2020 that may be closed by this pull request
@bkamins bkamins requested a review from nalimilan October 9, 2020 12:00
@bkamins
Copy link
Member Author

bkamins commented Oct 9, 2020

This is a change before a final GroupedDataFrame rules rewrite for returning multiple columns that is relatively orthogonal so I split it to a separate PR to make reviews simpler (this PR should be relatively easy to review).

@bkamins bkamins added feature grouping non-breaking The proposed change is not breaking labels Oct 9, 2020
@bkamins bkamins added this to the 1.0 milestone Oct 9, 2020
NEWS.md Outdated Show resolved Hide resolved
test/grouping.jl Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Oct 11, 2020

@nalimilan - last two commits are attempts to improve code and compilation latency (hopefully not sacrificing performance for large tables)

@pdeffebach - some benchmarking would be welcome (following the discussion on Slack)

@bkamins bkamins mentioned this pull request Oct 11, 2020
@bkamins
Copy link
Member Author

bkamins commented Oct 11, 2020

This is what I have now. I think it is good enough for now. I will merge this PR once the tests pass (I have extracted one common case to a separate function and added @nospecialize where tests indicated that it is beneficial; further optimizations are left for the future).

After this PR:

julia> using DataFrames

julia> df = DataFrame(a=1);

julia> @time begin
           @time select(df, :a => identity)
           @time select(df, :a => identity)
           @time select(df, :a => x -> 1)
           @time select(df, :a => x -> 1)
           @time select(df, :a => identity => :c)
           @time select(df, :a => identity => :c)
           @time select(df, :a => (x -> 1) => :c)
           @time select(df, :a => (x -> 1) => :c)
           @time select(df, [:a] => identity)
           @time select(df, [:a] => identity)
           @time select(df, [:a] => x -> 1)
           @time select(df, [:a] => x -> 1)
           @time select(df, [:a] => (x -> [[1]]) => [:c])
           @time select(df, [:a] => (x -> [[1]]) => [:c])
       end;
  1.242834 seconds (4.81 M allocations: 251.046 MiB, 7.05% gc time)
  0.000143 seconds (84 allocations: 5.000 KiB)
  0.099327 seconds (215.07 k allocations: 11.366 MiB)
  0.096288 seconds (215.05 k allocations: 11.376 MiB)
  0.063307 seconds (160.21 k allocations: 8.505 MiB)
  0.000091 seconds (79 allocations: 4.875 KiB)
  0.111864 seconds (201.95 k allocations: 10.764 MiB, 12.66% gc time)
  0.098550 seconds (201.95 k allocations: 10.762 MiB)
  0.310056 seconds (899.44 k allocations: 45.872 MiB, 3.69% gc time)
  0.000123 seconds (95 allocations: 5.641 KiB)
  0.105463 seconds (291.17 k allocations: 15.455 MiB)
  0.120932 seconds (291.16 k allocations: 15.453 MiB, 8.62% gc time)
  0.364569 seconds (1.50 M allocations: 76.191 MiB, 2.57% gc time)
  0.100328 seconds (224.71 k allocations: 12.368 MiB, 11.53% gc time)
  2.762119 seconds (9.06 M allocations: 472.373 MiB, 5.24% gc time)

Before this PR:

julia> using DataFrames

julia> df = DataFrame(a=1);

julia> @time begin
           @time select(df, :a => identity)
           @time select(df, :a => identity)
           @time select(df, :a => x -> 1)
           @time select(df, :a => x -> 1)
           @time select(df, :a => identity => :c)
           @time select(df, :a => identity => :c)
           @time select(df, :a => (x -> 1) => :c)
           @time select(df, :a => (x -> 1) => :c)
           @time select(df, [:a] => identity)
           @time select(df, [:a] => identity)
           @time select(df, [:a] => x -> 1)
           @time select(df, [:a] => x -> 1)
           @time select(df, [:a] => (x -> [[1]]) => [:c])
           @time select(df, [:a] => (x -> [[1]]) => [:c])
       end;
  1.365161 seconds (5.46 M allocations: 284.203 MiB, 6.30% gc time)
  0.000120 seconds (82 allocations: 5.000 KiB)
  0.117532 seconds (236.30 k allocations: 12.589 MiB)
  0.137019 seconds (236.28 k allocations: 12.601 MiB, 14.71% gc time)
  0.091981 seconds (184.44 k allocations: 9.884 MiB)
  0.000107 seconds (73 allocations: 4.703 KiB)
  0.120085 seconds (224.36 k allocations: 12.062 MiB)
  0.119065 seconds (224.36 k allocations: 12.060 MiB)
  0.347304 seconds (803.06 k allocations: 40.986 MiB, 8.86% gc time)
  0.000157 seconds (87 allocations: 5.484 KiB)
  0.218455 seconds (314.21 k allocations: 16.766 MiB, 35.85% gc time)
  0.127758 seconds (314.20 k allocations: 16.769 MiB)
  0.470066 seconds (1.51 M allocations: 76.974 MiB, 14.84% gc time)
  0.110272 seconds (247.74 k allocations: 13.680 MiB)
  3.274517 seconds (9.81 M allocations: 511.783 MiB, 8.70% gc time)

release 0.21 (does not support last two operations):

julia> using DataFrames

julia> df = DataFrame(a=1);

julia> @time begin
           @time select(df, :a => identity)
           @time select(df, :a => identity)
           @time select(df, :a => x -> 1)
           @time select(df, :a => x -> 1)
           @time select(df, :a => identity => :c)
           @time select(df, :a => identity => :c)
           @time select(df, :a => (x -> 1) => :c)
           @time select(df, :a => (x -> 1) => :c)
           @time select(df, [:a] => identity)
           @time select(df, [:a] => identity)
           @time select(df, [:a] => x -> 1)
           @time select(df, [:a] => x -> 1)
           #@time select(df, [:a] => (x -> [[1]]) => [:c])
           #@time select(df, [:a] => (x -> [[1]]) => [:c])
       end;
  0.802099 seconds (3.12 M allocations: 160.918 MiB, 6.74% gc time)
  0.000086 seconds (48 allocations: 3.453 KiB)
  0.186024 seconds (342.27 k allocations: 18.242 MiB, 7.50% gc time)
  0.161856 seconds (340.47 k allocations: 18.134 MiB)
  0.099809 seconds (214.92 k allocations: 11.570 MiB)
  0.000075 seconds (46 allocations: 3.344 KiB)
  0.187048 seconds (323.27 k allocations: 17.314 MiB, 6.39% gc time)
  0.164101 seconds (323.26 k allocations: 17.309 MiB)
  0.381791 seconds (954.31 k allocations: 49.240 MiB, 2.82% gc time)
  0.000108 seconds (64 allocations: 4.297 KiB)
  0.190896 seconds (417.56 k allocations: 22.295 MiB, 5.30% gc time)
  0.173000 seconds (417.53 k allocations: 22.289 MiB)
  2.399654 seconds (6.52 M allocations: 340.514 MiB, 4.20% gc time)

@bkamins bkamins merged commit b11fe97 into JuliaData:master Oct 12, 2020
@bkamins bkamins deleted the filter_empty_byrow branch October 12, 2020 08:54
@bkamins
Copy link
Member Author

bkamins commented Oct 12, 2020

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature grouping non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ByRow without columns
2 participants