add predicate support for names and more tests #2417

bkamins · 2020-09-06T20:48:40Z

@nalimilan I added a predicate for names + added more tests (and keep them in one place so that it easier to make sure all is tested). In particular methods for DataFrameRow were missing.

One thing I was considering if the predicate should not take String => Column pair to allow for more flexibility, but decided that passing String is enough as I think the more complex cases can be relatively easily handled by pairs(eachcol(df)) if neeeded and they are probably much rarer than the case of eg. startswith.

bkamins · 2020-09-07T07:53:46Z

@nalimilan - I have felt that there is some corner case lurking, and it is :, which is a Function 😄. Fixed now.

nalimilan · 2020-09-07T10:19:37Z

One thing I was considering if the predicate should not take String => Column pair to allow for more flexibility, but decided that passing String is enough as I think the more complex cases can be relatively easily handled by pairs(eachcol(df)) if neeeded and they are probably much rarer than the case of eg. startswith.

Yes, that occurred to me too. On the one hand, since this is the names function, it makes sense that it would operate on names and not columns in general. On the other hand, if we allow passing a type we already go beyond pure names.

Something that isn't ideal in dplyr is that you can do across(starts_with("sepal"), mean) but you have to write across(where(is.numeric), mean). The where doesn't sound like a good term to mean "refer to the contents of columns and not just their names". So I hope we can find something better. We already do since we allow passing types, which is probably the most common case (apart from conditions on names).

It's great that you can write names(df, startswith("a")), which is very concise and easy to understand. If we passed (name, column) as an argument; having to write names(df, (n, col) -> startswith(col, "a")) wouldn't be as nice. But then what can we recommend to select e.g. all columns without missing values? names(df)[.!any.(ismissing, eachcol(df))] is relatively short but quite tricky. Though even if we passed (name, column) as an argument, names(df, (n, col) -> !any(ismissing, col)) wouldn't be much better. So maybe the PR is OK as it is...

src/abstractdataframe/abstractdataframe.jl

src/dataframerow/dataframerow.jl

test/dataframe.jl

bkamins · 2020-09-08T14:59:51Z

names(df)[.!any.(ismissing, eachcol(df))]

names(df, Not(findall(any.(ismissing, eachcol(df)))) is longer but I would say natural and relatively easy to understand.

bkamins · 2020-09-08T15:12:40Z

How about allowing regexes while you're at it? :-)

What do you mean by this? We already allow regexes now (this is on 0.21):

julia> using DataFrames

julia> df = DataFrame(x=1,y=2)
names(df, 1×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │

julia> names(df, r"x")
1-element Array{String,1}:
 "x"

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

…icate

bkamins · 2020-09-08T19:52:07Z

CI passes - it just a time out of the worker.

nalimilan

Sorry I totally forgot this.

src/other/utils.jl

nalimilan · 2020-09-08T20:43:28Z

src/other/utils.jl

+        return ct === Missing
+    else
+        if unionmissing
+            ct === Missing && return t === Union{} || Missing <: t


Why is this needed?

It is good we have left some time to sleep over this PR.

The core question is (do not look at the implementation as it is just a consequence). If eltype of the column is Missing and user passes unionmissing equal to true what should happen then? The implementation was doing special casing of this case. Now I think if CT === Missing and unionmissing===true then we should always return true. Do you agree? (if yes - the above implementation will change)

EDIT: Now that I have read my comment above I am again not so sure what is best ...

The simplest and most obvious approach would be to check ct <: Union{t, Missing} when unionmissing=true. Otherwise things get too complex IMHO.

The only problem is that if someone writes names(df, Float64) and has a column with eltype equal to Missing then one would get that column although it does not allow Float64 values at all.

bkamins · 2020-11-01T11:45:11Z

Sorry I totally forgot this.

There is no rush with this PR. It is not scheduled for 0.22 as I am still not sure what is the best design here.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2020-11-03T22:23:35Z

@nalimilan - I was thinking about this PR and unionmissing option. Maybe we should not introduce it and ask what is required now - to give an explicit signature. This is not that much more writing and in the future it is possible it will be only adding ?. On the other hand adding unionmissing might be too much magic - this is not the way we try to do things (e.g. the discussion on where shows that in an even more obvious case we have doubts about being flexible).

nalimilan · 2020-11-04T09:07:02Z

Yes I think we should decide on a general policy which would also cover where (and maybe other cases).

bkamins · 2020-11-04T10:35:16Z

I have thought of it and commented in #2496. I think here we do not need this kwarg and in where dropmissing=false should be the default.

I think it is better to keep DataFrames.jl close to Julia Base behavior, as otherwise it will lead to the confusion.

bkamins · 2020-12-08T17:17:31Z

Given we have released 0.22 and now will be non-breaking unionmissing proposal is out of discussion as it would be breaking (the decision slipped through - my bad).
So in this PR what remains is the "predicate" part and "cleanup" part.

Cleanup of the functionality should for sure be done. Adding predicate also makes sense. Then the only question is to finally settle does the predicate take only column (values) or a pair name => column. I think passing column only is enough.

nalimilan · 2020-12-10T11:24:10Z

Given we have released 0.22 and now will be non-breaking unionmissing proposal is out of discussion as it would be breaking (the decision slipped through - my bad).

If we really think the behavior should be changed I think it would be OK to have some limited breakage in 1.0, especially since it's a very recent feature. But it's hard to decide what's best.

So in this PR what remains is the "predicate" part and "cleanup" part.

Cleanup of the functionality should for sure be done. Adding predicate also makes sense. Then the only question is to finally settle does the predicate take only column (values) or a pair name => column. I think passing column only is enough.

Do you mean passing column or passing name? :-D

bkamins · 2020-12-10T12:19:33Z

But it's hard to decide what's best.

Exactly - that is why I would leave it out. I think what we have now is also OK, and it will be possibly even more convenient in the future if Union{Missing, T} syntax would be made easier to write.

Do you mean passing column or passing name? :-D

OK to be clear we potentially could have three behaviors:

passing name
passing column (vector)
passing "name => column" pair

I propose to use the option #2 (passing column). I think that this is most often what users would want.

In particular option #1 is easy enough to achieve using filter!(predictate, names(df)). Option #3 which can be done made with eachcol and pairs.

nalimilan · 2020-12-10T12:55:11Z

OK, so you changed your mind since you filed this PR. :-)

I'm really torn as both options are useful, and both have less user-friendly alternatives. But I tend to think conditions on names are more common than conditions on columns. For example, in dplyr's help, where is documented only for where(is.numeric) and where(is.factor), and we cover both using the method taking a type.

Also, a criterion to help making our decision is that it could make sense to have names(df, cols) be consistent with Cols(cols) (and possibly even with df[!, cols] if we want to make it more flexible later). It would be nice to be able to do e.g. select(df, Cols(startswith("x")) => +). Of course this is already possible with r"^x" but startswith is probably nicer for many users -- in particular those who would appreciate the most to have a simple syntax.

bkamins · 2020-12-10T19:46:15Z

😄 - good to have you here. Indeed I have changed my mind. Which means - as you say that probably both options would be nice to have and both have "less convenient" alternatives.

I understand you propose the predicate to work on column names? Maybe the solution would be the following:

names(df, predicate) and in the future other options would take column name
names(df, Cols(predicate)) would pass columns to a predicate (I propose Cols as it avoids introducing a new name and it seems to have a natural meaning)

What do you think?

Ah: the problem is that then Cols(predicate) => fun would be ambiguous in transform so it is not good. We need a new name.

bkamins · 2021-01-26T08:42:18Z

@nalimilan - I propose to make a decision on this PR. My current thinking, to keep the API simple, that we just add:

names(df::AbstractDataFrame, predicate::Function) = filter(predicate, _names(df))

to keep names be column name focused and this is probably a more common case.

This would finalize the API of names and I would avoid adding more complexity to it (it is already quite large).

nalimilan · 2021-01-26T09:20:05Z

Sounds good. That would indeed keep things relatively simple.

bkamins · 2021-01-26T15:26:44Z

@nalimilan - this should be good to have a look at

src/abstractdataframe/abstractdataframe.jl

src/dataframerow/dataframerow.jl

src/abstractdataframe/abstractdataframe.jl

src/dataframerow/dataframerow.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

add predicate support and more tests

b487431

bkamins added feature non-breaking The proposed change is not breaking labels Sep 7, 2020

bkamins added this to the 1.0 milestone Sep 7, 2020

fix Colon

218ee21

nalimilan reviewed Sep 7, 2020

View reviewed changes

nalimilan mentioned this pull request Sep 7, 2020

Add pandas comparison #2378

Merged

tk3369 reviewed Sep 7, 2020

View reviewed changes

test/dataframe.jl Show resolved Hide resolved

test/dataframe.jl Outdated Show resolved Hide resolved

test/dataframe.jl Show resolved Hide resolved

bkamins and others added 3 commits September 8, 2020 17:22

Apply suggestions from code review

5a02d3e

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

add unionmissing kwarg with default to true

0c2d610

Merge remote-tracking branch 'origin/names_predicate' into names_pred…

09464a3

…icate

nalimilan reviewed Nov 1, 2020

View reviewed changes

bkamins and others added 3 commits November 1, 2020 12:46

Apply suggestions from code review

fc66601

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

Merge branch 'master' into names_predicate

5a15791

fix case

ae7ea96

bkamins mentioned this pull request Nov 1, 2020

general principles of data manipulation for dicussion #2509

Closed

bkamins mentioned this pull request Nov 4, 2020

Add subset #2496

Merged

bkamins mentioned this pull request Nov 9, 2020

add filtercols and filtercols! #2508

Open

bkamins added 2 commits January 26, 2021 13:02

simplify names design

9a10dab

Merge branch 'main' into names_predicate

8c36ec0

bkamins changed the base branch from master to main January 26, 2021 13:02

add NEWS.md entry

7064c8b

nalimilan reviewed Jan 31, 2021

View reviewed changes

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved

src/dataframerow/dataframerow.jl Outdated Show resolved Hide resolved

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved

bkamins commented Jan 31, 2021

View reviewed changes

src/dataframerow/dataframerow.jl Outdated Show resolved Hide resolved

Apply suggestions from code review

c78eb3d

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

nalimilan approved these changes Jan 31, 2021

View reviewed changes

bkamins merged commit cad735d into JuliaData:main Jan 31, 2021

bkamins deleted the names_predicate branch January 31, 2021 20:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add predicate support for names and more tests #2417

add predicate support for names and more tests #2417

bkamins commented Sep 6, 2020

bkamins commented Sep 7, 2020

nalimilan commented Sep 7, 2020

bkamins commented Sep 8, 2020

bkamins commented Sep 8, 2020

bkamins commented Sep 8, 2020

nalimilan left a comment

nalimilan Sep 8, 2020

bkamins Nov 1, 2020 •

edited

Loading

nalimilan Nov 4, 2020

bkamins Nov 4, 2020

bkamins commented Nov 1, 2020

bkamins commented Nov 3, 2020

nalimilan commented Nov 4, 2020

bkamins commented Nov 4, 2020

bkamins commented Dec 8, 2020

nalimilan commented Dec 10, 2020

bkamins commented Dec 10, 2020

nalimilan commented Dec 10, 2020

bkamins commented Dec 10, 2020 •

edited

Loading

bkamins commented Jan 26, 2021

nalimilan commented Jan 26, 2021

bkamins commented Jan 26, 2021

add predicate support for names and more tests #2417

add predicate support for names and more tests #2417

Conversation

bkamins commented Sep 6, 2020

bkamins commented Sep 7, 2020

nalimilan commented Sep 7, 2020

bkamins commented Sep 8, 2020

bkamins commented Sep 8, 2020

bkamins commented Sep 8, 2020

nalimilan left a comment

Choose a reason for hiding this comment

nalimilan Sep 8, 2020

Choose a reason for hiding this comment

bkamins Nov 1, 2020 • edited Loading

Choose a reason for hiding this comment

nalimilan Nov 4, 2020

Choose a reason for hiding this comment

bkamins Nov 4, 2020

Choose a reason for hiding this comment

bkamins commented Nov 1, 2020

bkamins commented Nov 3, 2020

nalimilan commented Nov 4, 2020

bkamins commented Nov 4, 2020

bkamins commented Dec 8, 2020

nalimilan commented Dec 10, 2020

bkamins commented Dec 10, 2020

nalimilan commented Dec 10, 2020

bkamins commented Dec 10, 2020 • edited Loading

bkamins commented Jan 26, 2021

nalimilan commented Jan 26, 2021

bkamins commented Jan 26, 2021

bkamins Nov 1, 2020 •

edited

Loading

bkamins commented Dec 10, 2020 •

edited

Loading