Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CategoricalArray creation #333

Open
bkamins opened this issue Mar 23, 2021 · 8 comments
Open

CategoricalArray creation #333

bkamins opened this issue Mar 23, 2021 · 8 comments

Comments

@bkamins
Copy link
Member

bkamins commented Mar 23, 2021

This is problematic:

julia> x = [categorical(["a"], ordered=true), categorical(["b"], ordered=true)]
2-element Vector{CategoricalVector{String, UInt32, String, CategoricalValue{String, UInt32}, Union{}}}:
 ["a"]
 ["b"]

julia> [first(v) for v in x]
ERROR: cannot add new level b since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

julia> first.(x)
ERROR: cannot add new level b since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

julia> map(first, x)
ERROR: cannot add new level b since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.
@nalimilan
Copy link
Member

Damn. Issues will never stop popping up...

There are only two solutions to this:

  1. Stop overloading similar(::Array, ::Type{<:CategoricalValue}). Incidentally that should help with convert method for Any is not ideal #177 (comment).
  2. Allow adding new levels to ordered pools.

The second one would require fewer changes, but in the case of describe in DataFrames, it doesn't sound appropriate to merge levels between columns. What's annoying with the first one is that we can't know when a comprehension is used whether a single "variable" is being created or not. For example, [ismissing(x[i]) ? y[i] : x[i] for i in eachindex(x, y)] should most likely create a single CategoricalVector merging x and y levels (which are probably equal or compatible). Returning a Vector{<:CategoricalValue} would be a complete failure in that case. Yet I don't know how to distinguish this case from what describe does. And ideally the appropriate type would be used even when the code hasn't been written with CategoricalArray in mind, as it's not always possible to know in advance the kind of array that will be used (as describe shows).

@bkamins
Copy link
Member Author

bkamins commented Mar 23, 2021

Note that currently [ismissing(x[i]) ? y[i] : x[i] for i in eachindex(x, y)]

is super upredictable:

julia> x1 = categorical([missing, 1], ordered=true)
2-element CategoricalArray{Union{Missing, Int64},1,UInt32}:
 missing
 1

julia> x2 = categorical([1, missing], ordered=true)
2-element CategoricalArray{Union{Missing, Int64},1,UInt32}:
 1
 missing

julia> y = categorical([1, 2], ordered=true)
2-element CategoricalArray{Int64,1,UInt32}:
 1
 2

julia> [ismissing(x1[i]) ? y[i] : x1[i] for i in eachindex(x1, y)]
2-element CategoricalArray{Int64,1,UInt32}:
 1
 1

julia> [ismissing(x2[i]) ? y[i] : x2[i] for i in eachindex(x2, y)]
ERROR: cannot add new level 2 since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

I think we have to go for option 2., as I assume you do not like option 1. very much (and require users to explicitly opt-in for a vector to be categorical after transformation).

@nalimilan
Copy link
Member

That failure is indeed not super justified, as the ordering is perfectly well defined. I think we already have all the logic in place to check whether the result can be ordered or not.

IIRC when we added this error, we considered an alternative which was to add a separate property indicating that levels should be "locked", which could be applied both to ordered and unordered arrays. That would still throw an error in the original example... but at least you'd have opted in for trouble. :-)

I think we have to go for option 2., as I assume you do not like option 1. very much (and require users to explicitly opt-in for a vector to be categorical after transformation).

Well opt-in wouldn't be too bad if at least it could be efficient. But if things like [ismissing(x[i]) ? y[i] : x[i] for i in eachindex(x, y)] return a Vector{<:CategoricalValue}, there's nothing users can do to opt-in for performance: they can only fix the type after the fact, and that may not even be possible if the comprehension is e.g. inside a package function.

@bkamins
Copy link
Member Author

bkamins commented Mar 24, 2021

I think we should ensure that generic code like [ismissing(x[i]) ? y[i] : x[i] for i in eachindex(x, y)] always works. This means that I would opt for allowing extending of ordered categorical values as the user might not have control what kind of input is passed to a generic function.

This in particular means that I think we should allow merging even two ordered CategoricalArrays that have conflicting levels. I think what should be done is that if two CategoricalArrays that have non-subset levels are merged then the resulting CategoricalArray should be unordered.

@nalimilan
Copy link
Member

@alyst IIRC you advocated in favor of throwing an error when trying to implicitly add levels to an ordered CategoricalArray. Do you actually use this in practice? Do you think adding a separate property that you would set to "lock" levels instead would be OK?

@bkamins
Copy link
Member Author

bkamins commented Mar 25, 2021

I can say that even I advocated it in the past, but we need to make CategoricalArray work in generic code always. Regarding lock - I would be OK to add it, but by default this lock should be turned off I think.

@alyst
Copy link
Contributor

alyst commented Mar 25, 2021

My issue (#84, wow, quite some time has passed) was about being able to declare the levels at the categorical array creation time.
Actually, I was not aware that it was implemented -- that's great! I will use it.
I think locked flag is a very nice idea: ctor/categorical(..., locked=levels !== nothing) looks quite natural and convenient.

@nalimilan
Copy link
Member

OK, thanks. So I guess we can stop throwing an error, and just mark the pool as unordered if levels don't have compatible orders (the code to check this is already in place). For reference, the error was added at #125 (Cc: @gustafsson but I'm not sure he's still working on Julia). Adding a lock or protect property can be done later if somebody cares enough to implement it. :-p

See also #141 and #199.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants