allow no rowkey in unstack #2995

bkamins · 2022-01-24T18:55:31Z

It turns out that when I have re-written the unstack implementation to improve its performance we get handling of no row-key for free. So I have added it.

The question is if we want to add another option for allowduplicates like allowduplicates=:vector, in which case we would store a list of duplicate values (a la dplyr in #2994 example). This would be a new feature (and probably go to 1.4 release as what I make here is a patch)

fredcallaway · 2022-01-24T19:23:56Z

For duplicates, I think the dplyr default of making lists (rather than throwing an error) is a bit questionable. I like the idea of having to explicitly provide a method for handling that case. Ideally, however, it could also accept an aggregation function. By that logic, allowduplicates=collect or even duplicate_fun=collect might be more natural than allowduplicates=:vector...at least if it can be done as performantly (which maybe it can't).

bkamins · 2022-01-24T20:03:24Z

For duplicates, I think the dplyr default of making lists (rather than throwing an error) is a bit questionable

Agreed. A policy in DataFrames.jl is to never produce a warning (like dplyr does). The reason is that in production code they would get lost unseen.

at least if it can be done as performantly (which maybe it can't).

API design is orthogonal to performance. Let us make a list of things we want. I understand it is:

do not allow duplicates (allowduplicates=false)
allow duplicates and keep last (allowduplicates=true)
store a vector
store some aggregate of a vector

The question is now about API. we could either:

add more options to allowduplicates
add a new kwarg duplicate_fun which is used only if allowduplicates=true, by default duplicate_fun would be last (so that we are consistent with the current functionality); then if you passed identity you would get a vector, and if you passed whatever else it would get passed a vector (e.g. passing mean would give you mean etc.)

The problem we have is that we cannot remove allowduplicates keyword argument as we are post 1.0 release so we have to keep allowduplicates=true and allowduplicates=false with the other values at their defaults work the way they do now.

In general I think adding these features gets us super close to resolving long standing #1181.

CC @nalimilan

bkamins · 2022-01-24T21:33:43Z

Actually maybe allowduplicates=identity and similar are OK and would not be confusing?

fredcallaway · 2022-01-24T22:49:58Z

Apologies, I wasn't aware of the existing allowduplicates behavior. Given that, I think option 1 makes sense.
So the docs could read

allowduplicates: if false (the default) then an error will be thrown if any
combination of rowkeys and colkey contains duplicate entries; if a function,
then it will be applied to a vector of duplicate entries (a singleton vector
for cases without duplicates) for each combination; if true then the last encountered
value will be retained (equivalent to passing last).

Note there's the question of how to handle non-duplicates. I think these should be
passed to the function as a vector as well, for consistency.

bkamins · 2022-01-25T07:24:35Z

I think these should be passed to the function as a vector as well, for consistency.

Agreed. In DataFrames.jl, unless there is a strong reason, we try to produce type stable columns.

bkamins · 2022-01-28T20:48:10Z

@nalimilan - any opinion on this PR?

nalimilan

Sorry for the delay!

I'd say that allowduplicates=identity and so on would be OK. That would be more convenient than having to do allowduplicates=true, duplicate_fun=identity.

src/abstractdataframe/reshape.jl

test/reshape.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2022-01-31T09:51:02Z

Thank you!

allow no rowkey

73d4acd

bkamins added the feature label Jan 24, 2022

bkamins added this to the patch milestone Jan 24, 2022

nalimilan approved these changes Jan 31, 2022

View reviewed changes

src/abstractdataframe/reshape.jl Outdated Show resolved Hide resolved

test/reshape.jl Outdated Show resolved Hide resolved

test/reshape.jl Outdated Show resolved Hide resolved

Apply suggestions from code review

05c3381

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins merged commit 7691d9e into main Jan 31, 2022

bkamins deleted the bk/unstack_no_key branch January 31, 2022 09:50

bkamins mentioned this pull request Jan 31, 2022

allow function in allowduplicates in unstack #2998

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

allow no rowkey in unstack #2995

allow no rowkey in unstack #2995

bkamins commented Jan 24, 2022

fredcallaway commented Jan 24, 2022

bkamins commented Jan 24, 2022

bkamins commented Jan 24, 2022

fredcallaway commented Jan 24, 2022

bkamins commented Jan 25, 2022

bkamins commented Jan 28, 2022

nalimilan left a comment

bkamins commented Jan 31, 2022

allow no rowkey in unstack #2995

allow no rowkey in unstack #2995

Conversation

bkamins commented Jan 24, 2022

fredcallaway commented Jan 24, 2022

bkamins commented Jan 24, 2022

bkamins commented Jan 24, 2022

fredcallaway commented Jan 24, 2022

bkamins commented Jan 25, 2022

bkamins commented Jan 28, 2022

nalimilan left a comment

Choose a reason for hiding this comment

bkamins commented Jan 31, 2022