-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add predicate support for names and more tests #2417
Conversation
@nalimilan - I have felt that there is some corner case lurking, and it is |
Yes, that occurred to me too. On the one hand, since this is the Something that isn't ideal in dplyr is that you can do It's great that you can write |
|
What do you mean by this? We already allow regexes now (this is on 0.21):
|
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
CI passes - it just a time out of the worker. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I totally forgot this.
src/other/utils.jl
Outdated
return ct === Missing | ||
else | ||
if unionmissing | ||
ct === Missing && return t === Union{} || Missing <: t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is good we have left some time to sleep over this PR.
The core question is (do not look at the implementation as it is just a consequence). If eltype
of the column is Missing
and user passes unionmissing
equal to true
what should happen then? The implementation was doing special casing of this case. Now I think if CT === Missing
and unionmissing===true
then we should always return true
. Do you agree? (if yes - the above implementation will change)
EDIT: Now that I have read my comment above I am again not so sure what is best ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simplest and most obvious approach would be to check ct <: Union{t, Missing}
when unionmissing=true
. Otherwise things get too complex IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only problem is that if someone writes names(df, Float64)
and has a column with eltype
equal to Missing
then one would get that column although it does not allow Float64
values at all.
There is no rush with this PR. It is not scheduled for 0.22 as I am still not sure what is the best design here. |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan - I was thinking about this PR and |
Yes I think we should decide on a general policy which would also cover |
I have thought of it and commented in #2496. I think here we do not need this kwarg and in I think it is better to keep DataFrames.jl close to Julia Base behavior, as otherwise it will lead to the confusion. |
Given we have released 0.22 and now will be non-breaking Cleanup of the functionality should for sure be done. Adding predicate also makes sense. Then the only question is to finally settle does the predicate take only column (values) or a pair |
If we really think the behavior should be changed I think it would be OK to have some limited breakage in 1.0, especially since it's a very recent feature. But it's hard to decide what's best.
Do you mean passing column or passing name? :-D |
Exactly - that is why I would leave it out. I think what we have now is also OK, and it will be possibly even more convenient in the future if
OK to be clear we potentially could have three behaviors:
I propose to use the option #2 (passing column). I think that this is most often what users would want. In particular option #1 is easy enough to achieve using |
OK, so you changed your mind since you filed this PR. :-) I'm really torn as both options are useful, and both have less user-friendly alternatives. But I tend to think conditions on names are more common than conditions on columns. For example, in dplyr's help, Also, a criterion to help making our decision is that it could make sense to have |
😄 - good to have you here. Indeed I have changed my mind. Which means - as you say that probably both options would be nice to have and both have "less convenient" alternatives. I understand you propose the predicate to work on column names? Maybe the solution would be the following:
What do you think? Ah: the problem is that then |
@nalimilan - I propose to make a decision on this PR. My current thinking, to keep the API simple, that we just add:
to keep This would finalize the API of |
Sounds good. That would indeed keep things relatively simple. |
@nalimilan - this should be good to have a look at |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan I added a predicate for
names
+ added more tests (and keep them in one place so that it easier to make sure all is tested). In particular methods forDataFrameRow
were missing.One thing I was considering if the predicate should not take
String => Column
pair to allow for more flexibility, but decided that passingString
is enough as I think the more complex cases can be relatively easily handled bypairs(eachcol(df))
if neeeded and they are probably much rarer than the case of eg.startswith
.