Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling CategoricalValue types? #117

Closed
tlnagy opened this issue Jan 11, 2018 · 11 comments
Closed

Handling CategoricalValue types? #117

tlnagy opened this issue Jan 11, 2018 · 11 comments

Comments

@tlnagy
Copy link

tlnagy commented Jan 11, 2018

I'm in the process of updating Gadfly's code to use the new DataFrames v0.11+ infrastructure and I'm getting hung up on the eltype of CategoricalArrays. It looks accessing an element is always wrapped? I believe this is different from PooledDataArrays?

CategoricalArrays

julia> a = CategoricalArray([RGBA{Float32}(1.0, 1.0, 1.0, 1.0)])
1-element CategoricalArrays.CategoricalArray{ColorTypes.RGBA{Float32},1,UInt32}:
 RGBA{Float32}(1.0f0,1.0f0,1.0f0,1.0f0)

julia> eltype(a)
CategoricalArrays.CategoricalValue{ColorTypes.RGBA{Float32},UInt32}

DataArrays v0.6.2

julia> a = PooledDataArray([RGBA{Float32}(1.0, 1.0, 1.0, 1.0)])
1-element DataArrays.PooledDataArray{ColorTypes.RGBA{Float32},UInt32,1}:
 RGBA{Float32}(1.0,1.0,1.0,1.0)

julia> eltype(a)
ColorTypes.RGBA{Float32}

Unfortunately this means that general functions like something(color::Color) won't work with CategoricalArrays any more.

tlnagy added a commit to tlnagy/Gadfly.jl that referenced this issue Jan 11, 2018
i haven't been able to fix everything due to slight inconsistencies
between PooledDataArrays and CategoricalArrays behavior. See
JuliaData/CategoricalArrays.jl#117 for
details. Working towards a fix for GiovineItalia#1065
@alyst
Copy link
Contributor

alyst commented Jan 11, 2018

It looks accessing an element is always wrapped?

Yes, indexing of arr:CategoricalArray always returns an element val of categorical value type (or missing).
You can use leveltype(arr) or leveltype(val) to get the type of values it wraps.
get(val) returns the unwrapped value.

Unfortunately this means that general functions like something(color::Color) won't work with CategoricalArrays any more.

Do you have some specific examples that are broken when updating to CategoricalArrays?

@tlnagy
Copy link
Author

tlnagy commented Jan 11, 2018

Yes, initializing using a CategoricalArray here leads to breakage on this line:

https://github.com/tlnagy/Gadfly.jl/blob/75edec7b1d9be861b07aec7a7aa2575239c228f2/src/geom/point.jl#L44-L46

when attempting to call the following functions:

https://github.com/GiovineItalia/Gadfly.jl/blob/daf9032a047b2f8514fb8fd5c79258e0457f2240/src/theme.jl#L6-L10

This worked fine when we used PooledDataArrays and broke when I replaced the PooledDataArray constructor with the CategoricalArray constructor. This is pretty common throughout Gadfly's code, because before everything was always unwrapped.

@nalimilan
Copy link
Member

Indeed that's expected. Why does Gadfly use CategoricalArray in these places? Is is to save space or to use a custom ordering of levels?

@alyst
Copy link
Contributor

alyst commented Jan 12, 2018

@tlnagy I suppose the minimal change would be to use something like convert(AbstractColor, color) in such cases. If color is a categorical value, convert() would automatically unwrap it, if it's already of color type, convert() will be just the identity operator.

@tlnagy
Copy link
Author

tlnagy commented Jan 12, 2018

Indeed that's expected. Why does Gadfly use CategoricalArray in these places? Is is to save space or to use a custom ordering of levels?

TBH, I'm not sure. It seems like PooledDataArray has been used for this since at least 2013 (GiovineItalia/Gadfly.jl@ed51ef6). Not sure if @dcjones remembers why.

@nalimilan
Copy link
Member

What would happen if you stopped using it? How many bytes uses a single entry in these arrays?

@dcjones
Copy link

dcjones commented Jan 12, 2018

I think the use of PooledDataArray in Gadfly was less about saving space and more about ordering levels. At some point it may have also been used to distinguish categorical from continuous data, but I'm not sure if that's still the case.

@nalimilan
Copy link
Member

If that's for ordering, then indeed categorical arrays are still needed. Then the code should just be adapted to accept CategoricalValue objects, possibly unwrapping them if needed.

@matthieugomez
Copy link

matthieugomez commented Jan 30, 2018

Why do we need this wrapping? It makes is so much harder to user for end users. An operation such as v[1] ≈ 0.1 does not work anymore

@nalimilan
Copy link
Member

It's needed so that < and similar operations follow the custom ordering of the levels. But really you shouldn't have to use with categorical values, the package is intended mainly at categorical data, i.e. not numeric data.

@matthieugomez
Copy link

Ok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants