Make CategoricalValue <: AbstractString #77
Also implement all methods of the AbstractString interface. Unfortunately, CategoricalValue{T} <: AbstractString even when !(T <: AbstractString), but in practice it should not be a big problem.
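The subtyping caveat in the description can be sketched in plain Julia. `DemoCatValue` below is a hypothetical stand-in, not the actual CategoricalArrays definition; it only illustrates that a supertype declared on a parametric struct holds for every type parameter.

```julia
# Hypothetical stand-in for the wrapper type discussed in this PR;
# this is NOT the real CategoricalArrays definition.
struct DemoCatValue{T} <: AbstractString
    level::T
end

# The supertype is fixed in the struct declaration, so it holds for every T,
# including parameters that are not strings themselves.
string_wrapper_is_string = DemoCatValue{String} <: AbstractString
int_wrapper_is_string = DemoCatValue{Int} <: AbstractString
int_is_string = Int <: AbstractString

println(string_wrapper_is_string)  # true
println(int_wrapper_is_string)     # true
println(int_is_string)             # false
```

Julia has no way to write "subtype of `AbstractString` only when `T <: AbstractString`" in a single struct declaration, which is the limitation the description refers to.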
I can see why this would be practically useful but IMO conceptually it's pretty weird. |
I'm not sure what "conceptually" exactly means, can you elaborate? :-) In my experience, string operations can be useful even for clearly categorical data. For example, European regional codes are formed of the country code plus a digit (e.g. FR1), and it's often useful to extract the first two letters to get the country. If somebody reads a CSV file with regional codes and wants to create a country variable, it's nice that |
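As a minimal sketch of the regional-code use case, here is the extraction on a plain `Vector{String}`. The `country` helper and the sample codes are made up for illustration; whether the same call works unchanged on categorical values depends on this PR making them behave as strings.

```julia
# Sample regional codes: a two-letter country code followed by a digit.
regions = ["FR1", "FR2", "DE1", "ES3"]

# Hypothetical helper: keep the first two characters (codes are ASCII here,
# so indexing by 1:2 is safe).
country(code::AbstractString) = code[1:2]

# Apply the helper to every element.
countries = country.(regions)

println(countries)
```

Because `country` only requires an `AbstractString` argument, any type that implements that interface could in principle be passed to it without conversion.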
It's not a problem per se; by "conceptually weird" I mean that the concept of every categorical value being an |
OK. In theory, yes, we could have integer categories, and indeed we still support them. Even though In practice, integer categorical variables seem to be very rare, since using a custom ordering of levels for integer codes would be really weird, and there are generally no storage gains to be had by using a |
Yep. This is good as far as I'm concerned. |
Turns out there's a practical issue, not just a conceptual one, at JuliaData/DataFrames.jl#1237. Since The best solution would of course be to have the ability to declare that only |
In my opinion |
Yeah, the answer is: you can't. Though in practice it works fine for many functions, that's why I'm not sure we should forbid non-string types. |
So I have the following question:
I would say that this is a rare case. |
Honestly I don't see a serious use case except strings, but others might disagree (@araslan?). |
The only serious non-string use case I can think of is using CategoricalArrays to sort and order values in DataFrames e.g. unstack. As long as this doesn't break existing uses like that, I think the practical benefits described so far of |
Perhaps declaring a variable to be categorical would be more efficient for group-by operations on Perhaps @dmbates could provide some sagely wisdom here? |
Also related to an earlier comment by @nalimilan:
For modeling purposes probably it would be nice if you could have After thinking about it given the current possibilities of subtyping I would not make The other option would be (I do not know what would be the performance but I understand that small unions are now fast?), names are examples:
with constructor for |
As @bkamins indicates, it is not uncommon for the levels of ordered categorical data to be numbers. The levels for unordered categorical data are just labels and could be restricted to |
I guess we could offer another mechanism to mark that an integer variable should be considered as categorical rather than continuous in models. If you have a
@bkamins That's the approach I had taken, but it actually doesn't work because packages are free to create new methods. So even if I add The solution to use Finally, the case of |
@nalimilan Would the following be an acceptable solution?
The documentation should then be clear about that difference and about when each should be used, because the corner case is that |
That hybrid approach is interesting, but it sounds too complex to me. To make it reasonably efficient, we would have to convert each level to a string and keep a copy to avoid allocating a new one on each access. The lesson we have learned from the experience with |
Good approach :).
All should be relatively simple to handle (in particular they all have a property that there is a unique mapping from their value to their string representation (edit here about the mapping to be more precise) so there will be no problem with uniqueness of values) and we do not have to cover every possibility from the start (e.g. initially the user would be instructed to convert to Additionally (this is covered in the above, but current |
So what exactly is the alternative to e.g. We cannot do |
Would a I agree it's annoying that the string interface isn't based on traits, but I don't think it's realistic to expect it to be changed by 1.0. There are no operations that require |
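For readers unfamiliar with the idea, a trait-based string interface could look roughly like the sketch below. Every name here (`TextTrait`, `IsText`, `NotText`, `texttrait`, `Wrapped`, `prefix`) is hypothetical; nothing like this exists in Base or in CategoricalArrays, and the sketch only shows the dispatch pattern, not a proposal.

```julia
# Hypothetical trait types; none of these exist in Base or CategoricalArrays.
abstract type TextTrait end
struct IsText <: TextTrait end
struct NotText <: TextTrait end

# Default: a type is not text-like unless it opts in.
texttrait(::Type) = NotText()
texttrait(::Type{<:AbstractString}) = IsText()

# Hypothetical wrapper that does NOT subtype AbstractString.
struct Wrapped{T}
    level::T
end

# The wrapper is text-like exactly when its level type is.
texttrait(::Type{Wrapped{T}}) where {T} = texttrait(T)

# A function that dispatches on the trait instead of on the type hierarchy.
prefix(x, n::Integer) = prefix(texttrait(typeof(x)), x, n)
prefix(::IsText, x::AbstractString, n) = x[1:n]
prefix(::IsText, x::Wrapped, n) = x.level[1:n]
prefix(::NotText, x, n) = throw(ArgumentError("$(typeof(x)) is not text-like"))

println(prefix(Wrapped("FR1"), 2))
println(prefix("DE1", 2))
```

With this pattern `Wrapped{Int}` is rejected while `Wrapped{String}` is accepted, which is the distinction that plain subtyping cannot express. The cost is that existing functions written against `AbstractString` would not pick up the trait automatically.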
I'm using CategoricalArray{Float64} to define a timepoint of an experiment.
IIUC, by default Another aspect of "transparent" string operations is nullable data.
This is easy to fix, but the last example (
This is a little bit OT (is there a ticket # for CSV.jl? I couldn't find it myself), but actually Hadley's readr made a change in the opposite direction (from base R's |
💯 Let's avoid assuming things about users' data as much as possible |
@alyst if you put Switching to a general view: in my opinion I think that In the above by clearly unique I mean that |
I'm not sure why you find it natural. You won't save a lot of space by using a Also, I'm not suggesting converting your data to
My point is that for reals the possibility to use a custom ordering for levels makes no sense, yet it's the main reason to use a
julia> x = CategoricalVector{Int}(2);
julia> x[1] = 100;
julia> x[2] = 0;
julia> levels(x)
2-element Array{Int64,1}:
100
0
By default That said, I wouldn't be opposed to a PR introducing The choice of creating CategoricalArrays by default is a relatively independent issue, which has basically been done for performance. Discussion happened at JuliaData/DataFrames.jl#895 (and linked issues); please add your arguments there. That's a difficult decision, but performance gains are so large for most cases that it's hard to resist. I precisely decided to make The question of the treatment of missing values is quite separate too. The point is precisely that you don't need to call |
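The REPL session above shows levels listed in the order they were first assigned (100, then 0) rather than in numeric order. The sketch below mimics that distinction with Base functions only; it is not how CategoricalArrays tracks levels internally, and the variable names are illustrative.

```julia
# Values in the order they were assigned.
data = [100, 0, 100, 0]

# Order of first appearance, analogous to the levels shown in the session.
appearance_order = unique(data)

# Numeric order, which is what a user of integer data would likely expect.
sorted_order = sort(appearance_order)

println(appearance_order)  # [100, 0]
println(sorted_order)      # [0, 100]
```

For numeric data the two orders disagree as soon as values arrive unsorted, which is why a custom level ordering is rarely meaningful there.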
there's also NaN that is not equal to itself. So right now
I would say,
+1 for explicit conversion, -1 to any alternative categorical data types.
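The NaN remark above refers to standard floating-point behaviour, which can be checked in Base Julia. The snippet only uses `==`, `isequal`, and `in`; how CategoricalArrays compares levels internally is not shown in this thread, so no claim is made about it here.

```julia
# IEEE comparison: NaN is not equal to itself under ==.
eq_result = NaN == NaN

# isequal treats NaN as equal to NaN (this is what Dict and Set rely on).
isequal_result = isequal(NaN, NaN)

# `in` uses ==, so a NaN is never "found" in a collection this way.
in_result = NaN in [NaN]

println(eq_result)       # false
println(isequal_result)  # true
println(in_result)       # false
```

Any level lookup based on `==` would therefore fail to match a NaN level, while one based on `isequal` would match it.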
That's what the founding fathers of Julia would probably call "passive aggressive" ;) "We don't let you do X because you can misuse it". I think the general principle should be
In addition to |
But the same applies to any other type as long as implicit level addition is possible.
That's the point. I don't see how
I'm not arguing for that, because I don't want |
Actually I think we could make things safer by calling As I see it, How about my suggestion to experiment with |
That's a nice way to force the user to re-sort the levels.
So your idea is to:
That's more or less what I have in mind, yes. But as you spotted the lack of multiple inheritance makes it less clean in terms of type hierarchy than it could be. Though we don't really need These changes should actually be quite easy to do since |
I like |
Since we now have #198 deprecates all |