Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow function in allowduplicates in unstack #2998

Merged
merged 11 commits into from
Feb 17, 2022
Merged

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jan 31, 2022

Follow up to #2995
Replaces #1181

What I would discuss if allowduplicates is a good name for this keyword argument now. Maybe we should introduce a new keyword argument (a single one) and deprecate allowduplicates (in a long term deprecation fashion i.e. we do need to remove it any time soon)

@nalimilan
Copy link
Member

Yeah I have to admit the name is a bit weird for passing a function. So you'd call the argument duplicates instead?

src/abstractdataframe/reshape.jl Outdated Show resolved Hide resolved
src/abstractdataframe/reshape.jl Outdated Show resolved Hide resolved
src/abstractdataframe/reshape.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Feb 1, 2022

So you'd call the argument duplicates instead?

It is hard to say what is best. Let us first decide if we want the API the way I proposed (i.e. fill is not allowed when function is passed) or we want to use fill when value combination has not been encountered (and not call the function then). This choice might affect the judgement what is the best name to use for this argument.

@bkamins
Copy link
Member Author

bkamins commented Feb 4, 2022

I have pushed the branch using combine(groupby)) combo for aggregates + fill is always respected.
Now we are faster when doing aggregation than when doing things the legacy way in certain scenarios:

julia> df = DataFrame(rowid=rand(1:10, 10^8), colid=rand(1:10, 10^8), value=rand(10^8));

julia>  @time unstack(df, :rowid, :colid, :value, allowduplicates=true);
  1.109325 seconds (508 allocations: 1.490 GiB, 20.97% gc time)

julia>  @time unstack(df, :rowid, :colid, :value, allowduplicates=last);
  0.277465 seconds (1.07 k allocations: 763.002 MiB, 18.70% gc time)

julia> df = DataFrame(rowid=string.(rand(1:10, 10^8)), colid=string.(rand(1:10, 10^8)), value=rand(10^8));

julia>  @time unstack(df, :rowid, :colid, :value, allowduplicates=true);
 10.708357 seconds (1.46 k allocations: 4.980 GiB, 72.24% gc time)

julia>  @time unstack(df, :rowid, :colid, :value, allowduplicates=last);
  6.203964 seconds (2.05 k allocations: 2.490 GiB, 63.83% gc time)

Probably things can be further optimized but I think it is already OK.

The only decision is about the name of the argument. Do we deprecate allowduplicates in favor of duplicates? (I have left this unchanged for now as I want to first finalize the algorithm)

@nalimilan
Copy link
Member

This approach is fast when the number of duplicates is large, but it's hard to know whether that's the case in general. Sometimes you might have only a few duplicates. I guess there's no way to be efficient all the time, except by allowing users to choose the algorithm. Maybe not a big deal.

The only decision is about the name of the argument. Do we deprecate allowduplicates in favor of duplicates? (I have left this unchanged for now as I want to first finalize the algorithm)

Yeah that's probably better.

src/abstractdataframe/reshape.jl Show resolved Hide resolved
src/abstractdataframe/reshape.jl Outdated Show resolved Hide resolved
src/abstractdataframe/reshape.jl Outdated Show resolved Hide resolved
@oxinabox
Copy link
Contributor

oxinabox commented Feb 7, 2022

I would lean towards names that reference reduce (or fold I guess)
Because that's the operation isn't it?
reducer, maybe combiner?

@bkamins
Copy link
Member Author

bkamins commented Feb 7, 2022

Most of the time it will be reducer, but if you e.g. pass identity as a function then you will get a vector in each cell.

bkamins and others added 2 commits February 7, 2022 22:21
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@adkabo
Copy link

adkabo commented Feb 8, 2022

@bkamins

Most of the time it will be reducer, but if you e.g. pass identity as a function then you will get a vector in each cell.

That sounds similar to combine, right? So combine or combiner would make sense to me.

@bkamins
Copy link
Member Author

bkamins commented Feb 8, 2022

Yes - internally we call combine.

For now I have proposed to call the keyword argument operation and keep the allowduplicates keyword argument that is ignored when operation is passed.

@sprmnt21
Copy link

sprmnt21 commented Feb 8, 2022

Yes - internally we call combine.

For now I have proposed to call the keyword argument operation and keep the allowduplicates keyword argument that is ignored when operation is passed.

how about aggregateby or aggregatewith?

@bkamins
Copy link
Member Author

bkamins commented Feb 8, 2022

The issue is that operation does not have to be aggregation. We allow any operation. But maybe indeed something like aggregatewith is better as most of the time it will be aggregation indeed. @nalimilan - any opinion if my current operation is OK, or you would change it?

@nalimilan
Copy link
Member

operation is quite broad. FWIW, the tidyverse uses values_fn (consistent with the series of values_* arguments). Maybe it would make sense to use the same pattern, as we also have a value argument, and the "operation" is applied to values before storing them in that column?

@bkamins
Copy link
Member Author

bkamins commented Feb 9, 2022

So you propose to use values_fn name? I considered this, but I did not like it very much - it mentally did not fit the way we typically name things in Julia. It could rather be valuesfunction or valuesoperation or valuestransform?

@sprmnt21
Copy link

sprmnt21 commented Feb 9, 2022

I would also put in competition valuesmap

@bkamins
Copy link
Member Author

bkamins commented Feb 9, 2022

map signals elementwise processing and we most often (but not always) will do reduction.

@nalimilan
Copy link
Member

I'm not sure, I was just thinking out loud. I'm trying to find a similar case in the existing API, but it turns out most of the time we don't use keyword arguments for functions. Maybe just transform would be enough.

@bkamins
Copy link
Member Author

bkamins commented Feb 9, 2022

I'm trying to find a similar case in the existing API, but it turns out most of the time we don't use keyword arguments for functions.

This is also what I have checked. And for positional arguments f or op are used which is not nice.

If we feel transform is clear enough I would be OK with this. I proposed operation to avoid user ambiguity with transform function that we define (users might think they are related). The question is if we want be clearer with e.g. valuetransform?

@adkabo
Copy link

adkabo commented Feb 9, 2022

Yes - internally we call combine.

I think transform gives the wrong idea then, and I return to combine/combiner. AlgebraOfGraphics has renamer and sorter.

@bkamins
Copy link
Member Author

bkamins commented Feb 9, 2022

I am ok with combiner. @nalimilan - would you also accept it?

@bkamins
Copy link
Member Author

bkamins commented Feb 11, 2022

bump (as otherwise we will forget what we discussed).

The question is if we accept the combiner as the name of the kwarg allowing passing the function (and the rest of the design proposed in this PR).

Thank you!

@nalimilan
Copy link
Member

I was going to say that combiner is OK but then I read the docstring again and I noticed we speak a lot about "combinations" when describing this argument (and others), and yet these "combinations" have nothing to do with the "combiner" (i.e. it doesn't combine values from different combinations). So maybe we should find another term to avoid the confusion. Maybe valuetransform as proposed by @bkamins is better.

@bkamins
Copy link
Member Author

bkamins commented Feb 11, 2022

Unless no other comment is made on the best choice in a few days I will switch the implementation to use valuetransform.

@sprmnt21
Copy link

valuetransform or valuestransform ?

@bkamins
Copy link
Member Author

bkamins commented Feb 11, 2022

currently the argument for values is called value so I thought valuetransform. We could use valuestransform. Then the question is if we should rename value to values argument to unstack?

@bkamins
Copy link
Member Author

bkamins commented Feb 15, 2022

@nalimilan - I think we need to close the discussion and make a decision (naming is always super hard unfortunately).

I think valuestransform and changing value positional argument to values is OK (dplyr uses plural form). If we agree on this I will update the PR.

@nalimilan
Copy link
Member

The plural sounds indeed better given that the function will get passed all values for a given combination. Regarding the positional argument, it matters less, but note that we also use the singular for colkey to differentiate it from rowkeys which allows multiple columns.

@bkamins
Copy link
Member Author

bkamins commented Feb 15, 2022

I am aware of colkey vs rowkeys, but here I think that "key" part makes the difference. I will switch it to plural in values then.

@bkamins
Copy link
Member Author

bkamins commented Feb 16, 2022

The PR is updated.

NEWS.md Outdated Show resolved Hide resolved
src/abstractdataframe/reshape.jl Outdated Show resolved Hide resolved
test/reshape.jl Outdated Show resolved Hide resolved
test/reshape.jl Outdated Show resolved Hide resolved
bkamins and others added 2 commits February 17, 2022 09:33
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit 8999e16 into main Feb 17, 2022
@bkamins bkamins deleted the bk/unstack_duplicates branch February 17, 2022 11:11
@bkamins
Copy link
Member Author

bkamins commented Feb 17, 2022

Thank you!

@adkabo
Copy link

adkabo commented Sep 23, 2022

I was going to say that combiner is OK but then I read the docstring again and I noticed we speak a lot about "combinations" when describing this argument (and others), and yet these "combinations" have nothing to do with the "combiner" (i.e. it doesn't combine values from different combinations). So maybe we should find another term to avoid the confusion. Maybe valuetransform as proposed by @bkamins is better.

That is true but those uses of "combinations" are all informal and not part of the API. I think it is more important to keep the formal usage of transform and combine in the actual DataFrames.jl API consistent. In this case the function behaves according to the combine API and not transform.

@bkamins
Copy link
Member Author

bkamins commented Sep 23, 2022

@adkabo - but what is your proposal for a name of this keyword argument then?

Do you propose valuescombine?
Or maybe we should use yet something else like valuesaggregate?

CC @nalimilan

@adkabo
Copy link

adkabo commented Sep 23, 2022

IIUC it is exactly a combine, right? In this case I would say valuescombine or combiner or combination. combiner is my favorite of these.

@nalimilan
Copy link
Member

Well combine does lots of different things (like transform). Maybe valuesfunction, which was mentioned above (as it's really a function which gets called on values)? Having values in the name (like dplyr) makes it more explicit IMO.

@bkamins
Copy link
Member Author

bkamins commented Sep 23, 2022

Yes - I think values prefix is needed. Another option is valuesaggregate?

@jariji
Copy link
Contributor

jariji commented Sep 26, 2022

I think any of these proposals is better than valuestransform.

@bkamins
Copy link
Member Author

bkamins commented Sep 26, 2022

I opened #3184 to keep track of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants