Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow no rowkey in unstack #2995

Merged
merged 2 commits into from
Jan 31, 2022
Merged

allow no rowkey in unstack #2995

merged 2 commits into from
Jan 31, 2022

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jan 24, 2022

Fixes #2994

It turns out that when I have re-written the unstack implementation to improve its performance we get handling of no row-key for free. So I have added it.

The question is if we want to add another option for allowduplicates like allowduplicates=:vector, in which case we would store a list of duplicate values (a la dplyr in #2994 example). This would be a new feature (and probably go to 1.4 release as what I make here is a patch)

@bkamins bkamins added this to the patch milestone Jan 24, 2022
@fredcallaway
Copy link

For duplicates, I think the dplyr default of making lists (rather than throwing an error) is a bit questionable. I like the idea of having to explicitly provide a method for handling that case. Ideally, however, it could also accept an aggregation function. By that logic, allowduplicates=collect or even duplicate_fun=collect might be more natural than allowduplicates=:vector...at least if it can be done as performantly (which maybe it can't).

@bkamins
Copy link
Member Author

bkamins commented Jan 24, 2022

For duplicates, I think the dplyr default of making lists (rather than throwing an error) is a bit questionable

Agreed. A policy in DataFrames.jl is to never produce a warning (like dplyr does). The reason is that in production code they would get lost unseen.

at least if it can be done as performantly (which maybe it can't).

API design is orthogonal to performance. Let us make a list of things we want. I understand it is:

  • do not allow duplicates (allowduplicates=false)
  • allow duplicates and keep last (allowduplicates=true)
  • store a vector
  • store some aggregate of a vector

The question is now about API. we could either:

  1. add more options to allowduplicates
  2. add a new kwarg duplicate_fun which is used only if allowduplicates=true, by default duplicate_fun would be last (so that we are consistent with the current functionality); then if you passed identity you would get a vector, and if you passed whatever else it would get passed a vector (e.g. passing mean would give you mean etc.)

The problem we have is that we cannot remove allowduplicates keyword argument as we are post 1.0 release so we have to keep allowduplicates=true and allowduplicates=false with the other values at their defaults work the way they do now.

In general I think adding these features gets us super close to resolving long standing #1181.

CC @nalimilan

@bkamins
Copy link
Member Author

bkamins commented Jan 24, 2022

Actually maybe allowduplicates=identity and similar are OK and would not be confusing?

@fredcallaway
Copy link

Apologies, I wasn't aware of the existing allowduplicates behavior. Given that, I think option 1 makes sense.
So the docs could read

allowduplicates: if false (the default) then an error will be thrown if any
combination of rowkeys and colkey contains duplicate entries; if a function,
then it will be applied to a vector of duplicate entries (a singleton vector
for cases without duplicates) for each combination; if true then the last encountered
value will be retained (equivalent to passing last).

Note there's the question of how to handle non-duplicates. I think these should be
passed to the function as a vector as well, for consistency.

@bkamins
Copy link
Member Author

bkamins commented Jan 25, 2022

I think these should be passed to the function as a vector as well, for consistency.

Agreed. In DataFrames.jl, unless there is a strong reason, we try to produce type stable columns.

@bkamins
Copy link
Member Author

bkamins commented Jan 28, 2022

@nalimilan - any opinion on this PR?

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay!

I'd say that allowduplicates=identity and so on would be OK. That would be more convenient than having to do allowduplicates=true, duplicate_fun=identity.

src/abstractdataframe/reshape.jl Outdated Show resolved Hide resolved
test/reshape.jl Outdated Show resolved Hide resolved
test/reshape.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit 7691d9e into main Jan 31, 2022
@bkamins bkamins deleted the bk/unstack_no_key branch January 31, 2022 09:50
@bkamins
Copy link
Member Author

bkamins commented Jan 31, 2022

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unstack fails without an id column
3 participants