Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Levels, setindex!, copyto! and vcat #258

Open
nalimilan opened this issue Apr 8, 2020 · 7 comments
Open

Levels, setindex!, copyto! and vcat #258

nalimilan opened this issue Apr 8, 2020 · 7 comments

Comments

@nalimilan
Copy link
Member

nalimilan commented Apr 8, 2020

The behavior of setindex!, copyto! and vcat regarding levels is tricky to get right as we have conflicting goals (below "level" also implies "orderedness"):

  1. copyto!(similar(x), x) should have the same levels as x (including unused levels)
  2. x2 = similar(x); foreach(i -> x[i] = x2[i], eachindex(x, x2)) should be equivalent to copyto!
  3. vcat(x, y) should be equivalent to z = similar(x, length(x)+length(y)); copyto!(z, x); copyto!(z, length(x)+1, y, 1, length(y))
  4. setindex!(x::CategoricalArray v::CategoricalValue, ...) should only affect the assigned value and not add other levels

Currently we ensure 1, 2* and 3, but not 4: setindex merges source CategoricalValue levels with destination levels. This may be surprising or inconvenient, for example if you merge a variable with current occupation with another variable containing last occupation for the unemployed which also has extra levels like "Never had a job". Another behavior which may be weird is that setindex! may insert new levels in the middle and not only at the end, which forces recomputing all reference codes and invalidates existing CategoricalValue objects.

If we wanted to ensure property 4 (setindex!), we would have to either:

  • Drop 2: calling setindex! on each entry would not copy unused levels, but copyto! would. This shouldn't be too problematic in practice, though it introduces an inconsistency. Note that for performance, we would probably also have to add levels at the end rather than trying to merge orders, to avoid recoding all references each time a new value is encountered. This would have the major drawback that levels would be in their order of appearance in the data.
  • Drop 3: copyto! would copy levels when the destination array has no levels, but otherwise only add levels for values that appear in the data. This would have the advantage that setindex! could do the same, so that we ensure property 2.
  • Drop 1: copyto! would drop unused levels. In that case, vcat should do the same to ensure 3, which sounds even more problematic.

(* Since #253 copyto! also set levels when copying zero elements, which differs from calling setindex! zero times. So there's already an inconsistency.)

@bkamins
Copy link
Member

bkamins commented Apr 8, 2020

I had mixed opinions (and rewritten this comments several times, in particular I considered the rule setindex! adds only its level if x already has levels, but adds all levels if x has no levels).

But in the end I concluded Drop 2 should be recommended. The reasoning would be:

  1. if we perform "whole array" operation we treat that "levels" is a part of array specification
  2. if we perform "elementwise" operation we treat the element just as a value (forgetting that it is categorical) - so in a sense - CategoricalValue property does not carry over when working with it.

But frankly - I am not sure what is the best default.

@nalimilan
Copy link
Member Author

Yeah, that's really a difficult choice. The main problem with Drop 2 is what I have put in italics:

This would have the major drawback that levels would be in their order of appearance in the data.

I hadn't realized this at first, but this means that even things like [x for x in X] or [x == "a" ? y : x for (x, y) in zip(X, Y)] will have to lose the ordering, even if X and Y categorical arrays have the same levels. This is because since we dropped the index vs. levels distinction (#253), inserting levels (other than at the end) requires recoding all reference codes. Doing that everytime a new value is encountered would be really bad, so the only choice is to add levels at the end. That's really not great.

@bkamins
Copy link
Member

bkamins commented Apr 8, 2020

I know. But the other option is to keep 4. Can you remind me why it is bad?

Maybe we should go back and and accept 4. Then probably we should deprecate get and replace it by getindex so you can simply write v[] to unwrap the stored value. And user could choose one or the other. It is also more natural - i.e. to think of CategoricalValue as a 0-dimensional container.

The only drawback of get (currently) or getindex (in target semantics) is that get(missing) and getindex(missing) are both an error. What would you think of defining getindex(missing) to return missing? This would have to be in Base though.

@nalimilan
Copy link
Member Author

What do you mean by "keep 4"?

@bkamins
Copy link
Member

bkamins commented Apr 8, 2020

Ah - sorry. "keep 4" means - do not change how we handle property 4 (i.e. to keep "setindex! merges source CategoricalValue levels with destination levels").

What I mean is to make it more convenient to "unwrap" CategoricalValue so that the user can easily choose:

  1. merge levels: do not unwrap, eg. x[i] = v
  2. do not merge levels (add only one): unwrap, eg. x[i] = v[] (if we defined getindex for it, and ideally defined it also for missing, probably in Missings.jl for now)

@nalimilan
Copy link
Member Author

OK. Makes sense, but getindex(missing) will probably not be accepted in Base, and in Missings it would be type piracy. If we wanted a function which propagates missing, we should probably use our own, like unwrap. That doesn't prevent us from also supporting getindex though.

@bkamins
Copy link
Member

bkamins commented Apr 9, 2020

This was my fear. Let us then stick to unwrap not to duplicate functionality. And unwrap will work on missing also (and get will be deprecated).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants