Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add predicate support for names and more tests #2417

Merged
merged 12 commits into from
Jan 31, 2021

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Sep 6, 2020

@nalimilan I added a predicate for names + added more tests (and keep them in one place so that it easier to make sure all is tested). In particular methods for DataFrameRow were missing.

One thing I was considering if the predicate should not take String => Column pair to allow for more flexibility, but decided that passing String is enough as I think the more complex cases can be relatively easily handled by pairs(eachcol(df)) if neeeded and they are probably much rarer than the case of eg. startswith.

@bkamins bkamins added feature non-breaking The proposed change is not breaking labels Sep 7, 2020
@bkamins bkamins added this to the 1.0 milestone Sep 7, 2020
@bkamins
Copy link
Member Author

bkamins commented Sep 7, 2020

@nalimilan - I have felt that there is some corner case lurking, and it is :, which is a Function 😄. Fixed now.

@nalimilan
Copy link
Member

One thing I was considering if the predicate should not take String => Column pair to allow for more flexibility, but decided that passing String is enough as I think the more complex cases can be relatively easily handled by pairs(eachcol(df)) if neeeded and they are probably much rarer than the case of eg. startswith.

Yes, that occurred to me too. On the one hand, since this is the names function, it makes sense that it would operate on names and not columns in general. On the other hand, if we allow passing a type we already go beyond pure names.

Something that isn't ideal in dplyr is that you can do across(starts_with("sepal"), mean) but you have to write across(where(is.numeric), mean). The where doesn't sound like a good term to mean "refer to the contents of columns and not just their names". So I hope we can find something better. We already do since we allow passing types, which is probably the most common case (apart from conditions on names).

It's great that you can write names(df, startswith("a")), which is very concise and easy to understand. If we passed (name, column) as an argument; having to write names(df, (n, col) -> startswith(col, "a")) wouldn't be as nice. But then what can we recommend to select e.g. all columns without missing values? names(df)[.!any.(ismissing, eachcol(df))] is relatively short but quite tricky. Though even if we passed (name, column) as an argument, names(df, (n, col) -> !any(ismissing, col)) wouldn't be much better. So maybe the PR is OK as it is...

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/dataframerow/dataframerow.jl Outdated Show resolved Hide resolved
src/dataframerow/dataframerow.jl Outdated Show resolved Hide resolved
@nalimilan nalimilan mentioned this pull request Sep 7, 2020
test/dataframe.jl Show resolved Hide resolved
test/dataframe.jl Outdated Show resolved Hide resolved
test/dataframe.jl Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Sep 8, 2020

names(df)[.!any.(ismissing, eachcol(df))]

names(df, Not(findall(any.(ismissing, eachcol(df)))) is longer but I would say natural and relatively easy to understand.

@bkamins
Copy link
Member Author

bkamins commented Sep 8, 2020

How about allowing regexes while you're at it? :-)

What do you mean by this? We already allow regexes now (this is on 0.21):

julia> using DataFrames

julia> df = DataFrame(x=1,y=2)
names(df, 1×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │

julia> names(df, r"x")
1-element Array{String,1}:
 "x"

@bkamins
Copy link
Member Author

bkamins commented Sep 8, 2020

CI passes - it just a time out of the worker.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I totally forgot this.

src/other/utils.jl Outdated Show resolved Hide resolved
src/other/utils.jl Outdated Show resolved Hide resolved
return ct === Missing
else
if unionmissing
ct === Missing && return t === Union{} || Missing <: t
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Member Author

@bkamins bkamins Nov 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is good we have left some time to sleep over this PR.

The core question is (do not look at the implementation as it is just a consequence). If eltype of the column is Missing and user passes unionmissing equal to true what should happen then? The implementation was doing special casing of this case. Now I think if CT === Missing and unionmissing===true then we should always return true. Do you agree? (if yes - the above implementation will change)

EDIT: Now that I have read my comment above I am again not so sure what is best ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simplest and most obvious approach would be to check ct <: Union{t, Missing} when unionmissing=true. Otherwise things get too complex IMHO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only problem is that if someone writes names(df, Float64) and has a column with eltype equal to Missing then one would get that column although it does not allow Float64 values at all.

@bkamins
Copy link
Member Author

bkamins commented Nov 1, 2020

Sorry I totally forgot this.

There is no rush with this PR. It is not scheduled for 0.22 as I am still not sure what is the best design here.

bkamins and others added 3 commits November 1, 2020 12:46
@bkamins
Copy link
Member Author

bkamins commented Nov 3, 2020

@nalimilan - I was thinking about this PR and unionmissing option. Maybe we should not introduce it and ask what is required now - to give an explicit signature. This is not that much more writing and in the future it is possible it will be only adding ?. On the other hand adding unionmissing might be too much magic - this is not the way we try to do things (e.g. the discussion on where shows that in an even more obvious case we have doubts about being flexible).

@nalimilan
Copy link
Member

Yes I think we should decide on a general policy which would also cover where (and maybe other cases).

@bkamins bkamins mentioned this pull request Nov 4, 2020
@bkamins
Copy link
Member Author

bkamins commented Nov 4, 2020

I have thought of it and commented in #2496. I think here we do not need this kwarg and in where dropmissing=false should be the default.

I think it is better to keep DataFrames.jl close to Julia Base behavior, as otherwise it will lead to the confusion.

@bkamins
Copy link
Member Author

bkamins commented Dec 8, 2020

Given we have released 0.22 and now will be non-breaking unionmissing proposal is out of discussion as it would be breaking (the decision slipped through - my bad).
So in this PR what remains is the "predicate" part and "cleanup" part.

Cleanup of the functionality should for sure be done. Adding predicate also makes sense. Then the only question is to finally settle does the predicate take only column (values) or a pair name => column. I think passing column only is enough.

@nalimilan
Copy link
Member

Given we have released 0.22 and now will be non-breaking unionmissing proposal is out of discussion as it would be breaking (the decision slipped through - my bad).

If we really think the behavior should be changed I think it would be OK to have some limited breakage in 1.0, especially since it's a very recent feature. But it's hard to decide what's best.

So in this PR what remains is the "predicate" part and "cleanup" part.

Cleanup of the functionality should for sure be done. Adding predicate also makes sense. Then the only question is to finally settle does the predicate take only column (values) or a pair name => column. I think passing column only is enough.

Do you mean passing column or passing name? :-D

@bkamins
Copy link
Member Author

bkamins commented Dec 10, 2020

But it's hard to decide what's best.

Exactly - that is why I would leave it out. I think what we have now is also OK, and it will be possibly even more convenient in the future if Union{Missing, T} syntax would be made easier to write.

Do you mean passing column or passing name? :-D

OK to be clear we potentially could have three behaviors:

  1. passing name
  2. passing column (vector)
  3. passing "name => column" pair

I propose to use the option #2 (passing column). I think that this is most often what users would want.

In particular option #1 is easy enough to achieve using filter!(predictate, names(df)). Option #3 which can be done made with eachcol and pairs.

@nalimilan
Copy link
Member

OK, so you changed your mind since you filed this PR. :-)

I'm really torn as both options are useful, and both have less user-friendly alternatives. But I tend to think conditions on names are more common than conditions on columns. For example, in dplyr's help, where is documented only for where(is.numeric) and where(is.factor), and we cover both using the method taking a type.

Also, a criterion to help making our decision is that it could make sense to have names(df, cols) be consistent with Cols(cols) (and possibly even with df[!, cols] if we want to make it more flexible later). It would be nice to be able to do e.g. select(df, Cols(startswith("x")) => +). Of course this is already possible with r"^x" but startswith is probably nicer for many users -- in particular those who would appreciate the most to have a simple syntax.

@bkamins
Copy link
Member Author

bkamins commented Dec 10, 2020

😄 - good to have you here. Indeed I have changed my mind. Which means - as you say that probably both options would be nice to have and both have "less convenient" alternatives.

I understand you propose the predicate to work on column names? Maybe the solution would be the following:

  • names(df, predicate) and in the future other options would take column name
  • names(df, Cols(predicate)) would pass columns to a predicate (I propose Cols as it avoids introducing a new name and it seems to have a natural meaning)

What do you think?

Ah: the problem is that then Cols(predicate) => fun would be ambiguous in transform so it is not good. We need a new name.

@bkamins
Copy link
Member Author

bkamins commented Jan 26, 2021

@nalimilan - I propose to make a decision on this PR. My current thinking, to keep the API simple, that we just add:

names(df::AbstractDataFrame, predicate::Function) = filter(predicate, _names(df))

to keep names be column name focused and this is probably a more common case.

This would finalize the API of names and I would avoid adding more complexity to it (it is already quite large).

@nalimilan
Copy link
Member

Sounds good. That would indeed keep things relatively simple.

@bkamins bkamins changed the base branch from master to main January 26, 2021 13:02
@bkamins
Copy link
Member Author

bkamins commented Jan 26, 2021

@nalimilan - this should be good to have a look at

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/dataframerow/dataframerow.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit cad735d into JuliaData:main Jan 31, 2021
@bkamins bkamins deleted the names_predicate branch January 31, 2021 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants