allow function in allowduplicates in unstack #2998

bkamins · 2022-01-31T10:40:50Z

Follow up to #2995
Replaces #1181

What I would like to discuss is whether allowduplicates is still a good name for this keyword argument. Maybe we should introduce a new keyword argument (a single one) and deprecate allowduplicates (with a long-term deprecation, i.e. we do not need to remove it any time soon).

nalimilan · 2022-02-01T12:37:54Z

Yeah I have to admit the name is a bit weird for passing a function. So you'd call the argument duplicates instead?

bkamins · 2022-02-01T13:32:55Z

> So you'd call the argument duplicates instead?

It is hard to say what is best. Let us first decide if we want the API the way I proposed (i.e. fill is not allowed when function is passed) or we want to use fill when value combination has not been encountered (and not call the function then). This choice might affect the judgement what is the best name to use for this argument.

bkamins · 2022-02-04T16:43:50Z

I have pushed the branch that uses a combine(groupby(...)) combo for aggregation, and fill is now always respected.
Now we are faster when doing aggregation than when doing things the legacy way in certain scenarios:

julia> df = DataFrame(rowid=rand(1:10, 10^8), colid=rand(1:10, 10^8), value=rand(10^8));

julia>  @time unstack(df, :rowid, :colid, :value, allowduplicates=true);
  1.109325 seconds (508 allocations: 1.490 GiB, 20.97% gc time)

julia>  @time unstack(df, :rowid, :colid, :value, allowduplicates=last);
  0.277465 seconds (1.07 k allocations: 763.002 MiB, 18.70% gc time)

julia> df = DataFrame(rowid=string.(rand(1:10, 10^8)), colid=string.(rand(1:10, 10^8)), value=rand(10^8));

julia>  @time unstack(df, :rowid, :colid, :value, allowduplicates=true);
 10.708357 seconds (1.46 k allocations: 4.980 GiB, 72.24% gc time)

julia>  @time unstack(df, :rowid, :colid, :value, allowduplicates=last);
  6.203964 seconds (2.05 k allocations: 2.490 GiB, 63.83% gc time)

Probably things can be further optimized but I think it is already OK.
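The aggregate-first approach described in this comment can be sketched in plain Julia. This is only an illustration under stated assumptions, not the DataFrames.jl implementation: `unstack_sketch` is a hypothetical helper that groups values by their (row, column) combination, calls the function once per combination that occurs, and uses `fill` for combinations that never occur.

```julia
# Hypothetical helper, NOT the DataFrames.jl implementation: it only
# illustrates "group by (row, col), apply f, then fill unseen cells".
function unstack_sketch(f, rows, cols, vals; fill=missing)
    groups = Dict{Tuple{eltype(rows),eltype(cols)},Vector{eltype(vals)}}()
    for (r, c, v) in zip(rows, cols, vals)
        # Collect every value that shares the same (row, col) combination.
        push!(get!(() -> eltype(vals)[], groups, (r, c)), v)
    end
    rowkeys = sort!(unique(rows))
    colkeys = sort!(unique(cols))
    # f is called only for combinations that occur; the others get `fill`.
    table = [haskey(groups, (r, c)) ? f(groups[(r, c)]) : fill
             for r in rowkeys, c in colkeys]
    return rowkeys, colkeys, table
end

rows = [1, 1, 1, 2]
cols = ["a", "a", "b", "a"]
vals = [10.0, 20.0, 30.0, 40.0]

# `last` keeps the final duplicate; the (2, "b") cell was never seen.
rowkeys, colkeys, table = unstack_sketch(last, rows, cols, vals; fill=0.0)
```

Here the duplicated (1, "a") combination is resolved by `last`, and the unseen (2, "b") combination receives the `fill` value without the function being called.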

The only decision is about the name of the argument. Do we deprecate allowduplicates in favor of duplicates? (I have left this unchanged for now as I want to first finalize the algorithm)

nalimilan · 2022-02-06T21:20:29Z

This approach is fast when the number of duplicates is large, but it's hard to know whether that's the case in general. Sometimes you might have only a few duplicates. I guess there's no way to be efficient all the time, except by allowing users to choose the algorithm. Maybe not a big deal.

> The only decision is about the name of the argument. Do we deprecate allowduplicates in favor of duplicates? (I have left this unchanged for now as I want to first finalize the algorithm)

Yeah that's probably better.

oxinabox · 2022-02-07T19:47:09Z

I would lean towards names that reference reduce (or fold I guess)
Because that's the operation isn't it?
reducer, maybe combiner?

bkamins · 2022-02-07T21:11:57Z

Most of the time it will be a reducer, but if you pass e.g. identity as the function, then you will get a vector in each cell.
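A minimal illustration of that distinction, assuming (as stated later in this thread) that the function receives the vector of all values for one combination. The data is made up for the example:

```julia
# Values that share one (row, column) combination; illustrative data only.
cellvalues = [1.5, 2.5, 4.0]

reduced = sum(cellvalues)      # a reducer collapses the group to a scalar
kept = identity(cellvalues)    # identity returns the whole vector unchanged

@show reduced kept
```

With a reducer the cell holds a single number, while with identity the cell holds the full vector of duplicates.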

adkabo · 2022-02-08T02:32:11Z

@bkamins

> Most of the time it will be reducer, but if you e.g. pass identity as a function then you will get a vector in each cell.

That sounds similar to combine, right? So combine or combiner would make sense to me.

bkamins · 2022-02-08T08:27:37Z

Yes - internally we call combine.

For now I have proposed to call the keyword argument operation and keep the allowduplicates keyword argument that is ignored when operation is passed.

sprmnt21 · 2022-02-08T16:39:45Z

> Yes - internally we call combine.
>
> For now I have proposed to call the keyword argument operation and keep the allowduplicates keyword argument that is ignored when operation is passed.

how about aggregateby or aggregatewith?

bkamins · 2022-02-08T16:48:25Z

The issue is that operation does not have to be aggregation. We allow any operation. But maybe indeed something like aggregatewith is better as most of the time it will be aggregation indeed. @nalimilan - any opinion if my current operation is OK, or you would change it?

nalimilan · 2022-02-09T08:54:32Z

operation is quite broad. FWIW, the tidyverse uses values_fn (consistent with the series of values_* arguments). Maybe it would make sense to use the same pattern, as we also have a value argument, and the "operation" is applied to values before storing them in that column?

bkamins · 2022-02-09T09:00:06Z

So you propose to use values_fn name? I considered this, but I did not like it very much - it mentally did not fit the way we typically name things in Julia. It could rather be valuesfunction or valuesoperation or valuestransform?

sprmnt21 · 2022-02-09T09:58:31Z

I would also put in competition valuesmap

bkamins · 2022-02-09T10:06:51Z

map signals elementwise processing and we most often (but not always) will do reduction.

nalimilan · 2022-02-09T13:34:40Z

I'm not sure, I was just thinking out loud. I'm trying to find a similar case in the existing API, but it turns out most of the time we don't use keyword arguments for functions. Maybe just transform would be enough.

bkamins · 2022-02-09T16:54:51Z

> I'm trying to find a similar case in the existing API, but it turns out most of the time we don't use keyword arguments for functions.

This is also what I have checked. And for positional arguments f or op are used which is not nice.

If we feel transform is clear enough I would be OK with this. I proposed operation to avoid user confusion with the transform function that we define (users might think they are related). The question is whether we want to be clearer with e.g. valuetransform?

adkabo · 2022-02-09T18:19:41Z

> Yes - internally we call combine.

I think transform gives the wrong idea then, and I return to combine/combiner. AlgebraOfGraphics has renamer and sorter.

bkamins · 2022-02-09T18:38:19Z

I am ok with combiner. @nalimilan - would you also accept it?

bkamins · 2022-02-11T07:38:03Z

bump (as otherwise we will forget what we discussed).

The question is if we accept the combiner as the name of the kwarg allowing passing the function (and the rest of the design proposed in this PR).

Thank you!

nalimilan · 2022-02-11T13:05:48Z

I was going to say that combiner is OK but then I read the docstring again and I noticed we speak a lot about "combinations" when describing this argument (and others), and yet these "combinations" have nothing to do with the "combiner" (i.e. it doesn't combine values from different combinations). So maybe we should find another term to avoid the confusion. Maybe valuetransform as proposed by @bkamins is better.

bkamins · 2022-02-11T14:09:00Z

If no other comment is made on the best choice in a few days, I will switch the implementation to use valuetransform.

sprmnt21 · 2022-02-11T14:35:51Z

valuetransform or valuestransform ?

bkamins · 2022-02-11T15:13:58Z

Currently the argument for values is called value, so I thought of valuetransform. We could use valuestransform. Then the question is whether we should rename the value argument of unstack to values.

bkamins · 2022-02-15T07:45:31Z

@nalimilan - I think we need to close the discussion and make a decision (naming is always super hard unfortunately).

I think valuestransform and changing value positional argument to values is OK (dplyr uses plural form). If we agree on this I will update the PR.

nalimilan · 2022-02-15T20:36:30Z

The plural sounds indeed better given that the function will get passed all values for a given combination. Regarding the positional argument, it matters less, but note that we also use the singular for colkey to differentiate it from rowkeys which allows multiple columns.

bkamins · 2022-02-15T21:08:26Z

I am aware of colkey vs rowkeys, but here I think that "key" part makes the difference. I will switch it to plural in values then.

bkamins · 2022-02-16T08:40:01Z

The PR is updated.

bkamins · 2022-02-17T11:11:59Z

Thank you!

adkabo · 2022-09-23T05:02:29Z

> I was going to say that combiner is OK but then I read the docstring again and I noticed we speak a lot about "combinations" when describing this argument (and others), and yet these "combinations" have nothing to do with the "combiner" (i.e. it doesn't combine values from different combinations). So maybe we should find another term to avoid the confusion. Maybe valuetransform as proposed by @bkamins is better.

That is true but those uses of "combinations" are all informal and not part of the API. I think it is more important to keep the formal usage of transform and combine in the actual DataFrames.jl API consistent. In this case the function behaves according to the combine API and not transform.

bkamins · 2022-09-23T07:31:07Z

@adkabo - but what is your proposal for a name of this keyword argument then?

Do you propose valuescombine?
Or maybe we should use yet something else like valuesaggregate?

CC @nalimilan

adkabo · 2022-09-23T16:11:50Z

IIUC it is exactly a combine, right? In this case I would say valuescombine or combiner or combination. combiner is my favorite of these.

nalimilan · 2022-09-23T16:17:43Z

Well combine does lots of different things (like transform). Maybe valuesfunction, which was mentioned above (as it's really a function which gets called on values)? Having values in the name (like dplyr) makes it more explicit IMO.

bkamins · 2022-09-23T21:15:52Z

Yes - I think values prefix is needed. Another option is valuesaggregate?

jariji · 2022-09-26T19:29:33Z

I think any of these proposals is better than valuestransform.

bkamins · 2022-09-26T19:34:43Z

I opened #3184 to keep track of it.

bkamins added 2 commits January 31, 2022 11:36

allow function in allowduplicates

2b43214

update metadata

493ee1f

bkamins requested a review from nalimilan January 31, 2022 10:40

bkamins added feature reshaping labels Jan 31, 2022

bkamins added this to the 1.4 milestone Jan 31, 2022

bkamins mentioned this pull request Jan 31, 2022

Added pivot() funtion (for updated pull request #1175) #1181

Closed

nalimilan reviewed Feb 1, 2022

Update src/abstractdataframe/reshape.jl

111745e

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins added 2 commits February 4, 2022 17:37

use combine(groupby) combination

c40b4b7

Merge branch 'bk/unstack_duplicates' of https://github.com/JuliaData/…

df6186b

…DataFrames.jl into bk/unstack_duplicates

nalimilan reviewed Feb 6, 2022

bkamins and others added 2 commits February 7, 2022 22:21

Apply suggestions from code review

809bb28

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

add operation kwarg

9b8abf4

nalimilan reviewed Feb 9, 2022

update to valuestransform

6463d2d

bkamins commented Feb 16, 2022

Update NEWS.md

4efd227

nalimilan reviewed Feb 16, 2022

bkamins and others added 2 commits February 17, 2022 09:33

Apply suggestions from code review

5896919

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

add more tests

7198220

nalimilan approved these changes Feb 17, 2022

bkamins merged commit 8999e16 into main Feb 17, 2022

bkamins deleted the bk/unstack_duplicates branch February 17, 2022 11:11

bkamins mentioned this pull request Sep 26, 2022

change valuestransform in unstack #3184

Closed
