Drop distinction between index and levels (#253)

References now always point to the vector returned by `levels`. This simplifies a lot of code, especially for packages that use CategoricalArrays. The downside is that all references need to be recoded when existing levels are removed or reordered.

To ensure `CategoricalValue` objects always remain valid, `CategoricalPool` is now semi-mutable: only adding new levels is possible. `CategoricalArray`s are now mutable and replace their pool with a new one when levels are removed or reordered, e.g. in `levels!` or `setindex!(A::CategoricalArray, v::CategoricalValue, ...)`. This should not be a problem for performance as changing levels should not be frequent. On the other hand, adding levels keeps the same pool, which makes creating a `CategoricalArray` from another array type relatively fast, though references have to be recoded at the end when sorting levels. And the (very frequent) operations which use levels in their order should be faster than before as they can use refs directly.

Note that replacing the pool makes it impossible to compare new `CategoricalValue` objects with old ones with `<` and `>`. This should not be too problematic in practice.

Finally, replace deprecation message with an error when assignment would add new levels to an ordered array, and make `copy` and `copyto!` merge levels even when copying zero elements (this differs from what the `AbstractArray` fallback would do but makes more sense).
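
The recoding cost described in this message can be illustrated with a toy model in plain Julia. This is only a sketch under simplifying assumptions, not the package's internal code: `recode_refs` is a hypothetical helper, levels are a plain `Vector{String}`, and `0` stands for a missing value as in the `refs` field.

```julia
# Toy model (hypothetical helper, not part of CategoricalArrays):
# `refs` holds positions into `oldlevels`; 0 stands for a missing value.
function recode_refs(refs::Vector{Int}, oldlevels::Vector{String}, newlevels::Vector{String})
    # For each old level, find its position in `newlevels`
    # (`indexin` yields `nothing` for levels that were removed).
    newpos = indexin(oldlevels, newlevels)
    return map(refs) do r
        r == 0 && return 0   # missing stays missing
        p = newpos[r]
        p === nothing && error("level $(oldlevels[r]) is still in use")
        return p
    end
end

oldlevels = ["Middle", "Old", "Young"]
newlevels = ["Young", "Middle", "Old"]
refs = [1, 2, 1, 3, 0]

# Every element has to be visited, which is why reordering is O(n) in this model.
newrefs = recode_refs(refs, oldlevels, newlevels)

# Each non-missing element still designates the same level after recoding.
@assert all(newlevels[n] == oldlevels[r] for (n, r) in zip(newrefs, refs) if r != 0)
```

In this model, appending a level to the end of the vector leaves every existing position valid, which mirrors why adding levels can keep the same pool.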
JuliaData · Apr 8, 2020 · dcc24cf
1 parent 81867f6
Showing 27 changed files with 987 additions and 1,134 deletions.
diff --git a/docs/src/implementation.md b/docs/src/implementation.md
@@ -2,25 +2,12 @@
 
 `CategoricalArray` is made of the two fields:
 
-- `refs`: an integer array that stores the position of the category level in the `index` field of `CategoricalPool` for each `CategoricalArray` element; `0` denotes a missing value (for `CategoricalArray{Union{T, Missing}}` only).
+- `refs`: an integer array that stores the position of the category level in the `levels` field of `CategoricalPool` for each `CategoricalArray` element; `0` denotes a missing value (for `CategoricalArray{Union{T, Missing}}` only).
 - `pool`: the `CategoricalPool` object that maintains the levels of the array.
 
-!!! warning
+The `CategoricalPool{V,R,C}` type keeps track of the levels of type `V` and associates them with an integer reference code of type `R` (for internal use). It offers methods to add new levels, and efficiently get the integer index corresponding to a level and vice-versa. Whether the values of `CategoricalArray` are ordered or not is defined by an `ordered` field of the pool. Finally, `CategoricalPool{V,R,C}` keeps a `valindex` vector of value objects of type `C == CategoricalValue{V, R}`, so that `getindex` can return the existing object instead of allocating a new one.
 
-    Integer codes in the `x.refs` field *cannot* be used to index into the vector returned
-    by `levels(x)`. These codes refer to the position in the *index*, which can be accessed
-    using `CategoricalArrays.index(x.pool)`. That is,
-    `CategoricalArrays.index(x.pool)[x.refs] == x` always holds, but
-    `levels(x.pool)[x.refs] == x` is *not* correct in general. To obtain the position in
-    `levels(x)` of entries in `x`, use `CategoricalArrays.order(x.pool)[x.refs]`.
-
-    The reason for this subtlety is that it allows changing the order of levels without
-    having to reset all the underlying integer codes. This is especially useful for the
-    `CategoricalArray(::AbstractArray)` constructor, which needs to assign new codes as
-    new levels are encountered, potentially conflicting with the default ordering of
-    levels (based on `sort`).
-
-The `CategoricalPool{V,R,C}` type keeps track of the levels of type `V` and associates them with an integer reference code of type `R` (for internal use). It offers methods to set the levels, change their order while preserving the references, and efficiently get the integer index corresponding to a level and vice-versa. Whether the values of `CategoricalArray` are ordered or not is defined by an `ordered` field of the pool. Finally, `CategoricalPool{V,R,C}` keeps a `valindex` vector of value objects of type `C == CategoricalValue{V, R}`, so that `getindex` can return the existing object instead of allocating a new one.
+Do note that `CategoricalPool` levels are semi-mutable: it is only allowed to add new levels, but never to remove or reorder existing ones. This ensures existing `CategoricalValue` objects remain valid and always point to the same level as when they were created. Therefore, `CategoricalArray`s create a new pool each time some of their levels are removed or reordered. This happens when calling `levels!`, but also when assigning a `CategoricalValue` via `setindex!`, `push!`, `append!`, `copy!` or `copyto!` (as new levels may be added to the front to preserve relative order of both source and destination levels). Doing so requires updating all reference codes to point to the new pool, and makes it impossible to compare existing ordered `CategoricalValue` objects with values from the array using `<` and `>`.
 
 The type parameters of `CategoricalArray{T, N, R <: Integer, V, C, U}` are a bit complex:
  - `T` is the type of array elements without `CategoricalValue` wrappers; if `T >: Missing`, then the array supports missing values.
@@ -32,6 +19,4 @@ The type parameters of `CategoricalArray{T, N, R <: Integer, V, C, U}` are a bit
 
 Only `T`, `N` and `R` could be specified upon construction. The last three parameters are chosen automatically, but are needed for the definition of the type. In particular, `U` allows expressing that `CategoricalArray{T, N}` inherits from `AbstractArray{Union{C, U}, N}` (which is equivalent to `AbstractArray{C, N}` for arrays which do not support missing values, and to `AbstractArray{Union{C, Missing}, N}` for those which support them).
 
-The `CategoricalPool` type is designed to limit the need to go over all elements of the vector, either for reading or for writing. This is why unused levels are not dropped automatically (this would force checking all elements on every modification or keeping a counts table), but only when `droplevels!` is called. `levels` is a (very fast) O(1) operation since it merely returns the (ordered) vector of levels without accessing the data at all.
-
-Another useful feature is that integer indices referring to levels are preserved when adding or reordering levels: the order of levels exposed to the user by the `levels` function does not necessarily match these internal indices, which are stored in the `index` field of the pool. This means a reordering of the levels is also an O(1) operation. On the other hand, deleting levels may change the indices and therefore requires iterating over all elements in the array to update the references.
+The `CategoricalPool` type is designed to limit the need to go over all elements of the vector, either for reading or for writing. This is why unused levels are not dropped automatically (this would force checking all elements on every modification or keeping a counts table), but only when `droplevels!` is called. `levels` is a (very fast) O(1) operation since it merely returns the (ordered) vector of levels without accessing the data at all.
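
The relationship between `refs` and `levels` described in `docs/src/implementation.md` after this change can be checked with a short sketch. It assumes a CategoricalArrays version that includes this commit and an array without missing values (a `0` reference cannot be used as an index). Accessing the `refs` field directly is for illustration only, since it is an internal detail.

```julia
using CategoricalArrays

x = categorical(["b", "a", "b", "c"])

# Reference codes index directly into the vector returned by `levels`.
@assert levels(x)[x.refs] == x

# Reordering the levels recodes the references, so the same relationship
# is expected to hold afterwards.
levels!(x, ["c", "b", "a"])
@assert levels(x)[x.refs] == x
```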
diff --git a/docs/src/using.md b/docs/src/using.md
@@ -193,7 +193,116 @@ julia> levels!(y, ["Young", "Middle"]; allow_missing=true)
 
 ```
 
-## Working with categorical arrays
+## Combining levels
+
+Some operations imply combining levels of two categorical arrays: this is the case when concatenating arrays (`vcat`, `hcat` and `cat`) and when assigning a `CategoricalValue` from another categorical array.
+
+For example, imagine we have two sets of observations, one with only the younger part of the population and one with the older part:
+```jldoctest using
+julia> x = categorical(["Middle", "Old", "Middle"], ordered=true);
+
+julia> y = categorical(["Young", "Middle", "Middle"], ordered=true);
+
+julia> levels!(y, ["Young", "Middle"]);
+```
+
+If we concatenate the two sets, the levels of the resulting categorical vector are chosen so that the relative orders of levels in `x` and `y` are preserved, if possible. In that case, comparisons with `<` and `>` are still valid, and the resulting vector is marked as ordered:
+```jldoctest
+julia> xy = vcat(x, y)
+6-element CategoricalArray{String,1,UInt32}:
+ "Middle"
+ "Old"   
+ "Middle"
+ "Young" 
+ "Middle"
+ "Middle"
+
+julia> levels(xy)
+3-element Array{String,1}:
+ "Young" 
+ "Middle"
+ "Old"   
+
+julia> isordered(xy)
+true
+```
+
+Likewise, assigning a `CategoricalValue` from `y` to an entry in `x` expands the levels of `x`, *adding a new level to the front to respect the ordering of levels in both vectors*. The new level is added even if the assigned value belongs to another level which is already present in `x`. Note that adding new levels requires marking `x` as unordered:
+```jldoctest
+julia> x[1] = y[1]
+ERROR: cannot add new level Young since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.
+Stacktrace:
+[...]
+
+julia> ordered!(x, false);
+
+julia> levels(x)
+2-element Array{String,1}:
+ "Middle"
+ "Old"   
+
+julia> x[1] = y[1]
+CategoricalValue{String,UInt32} "Young" (1/2)
+
+julia> levels(x)
+3-element Array{String,1}:
+ "Young" 
+ "Middle"
+ "Old"   
+```
+
+In cases where levels with incompatible orderings are combined, the ordering of the first array wins and the resulting array is marked as unordered:
+```jldoctest using
+julia> a = categorical(["a", "b", "c"], ordered=true);
+
+julia> b = categorical(["a", "b", "c"], ordered=true);
+
+julia> ab = vcat(a, b)
+6-element CategoricalArray{String,1,UInt32}:
+ "a"
+ "b"
+ "c"
+ "a"
+ "b"
+ "c"
+
+julia> levels(ab)
+3-element Array{String,1}:
+ "a"
+ "b"
+ "c"
+
+julia> isordered(ab)
+true
+
+julia> levels!(b, ["c", "b", "a"])
+3-element CategoricalArray{String,1,UInt32}:
+ "a"
+ "b"
+ "c"
+
+julia> ab2 = vcat(a, b)
+6-element CategoricalArray{String,1,UInt32}:
+ "a"
+ "b"
+ "c"
+ "a"
+ "b"
+ "c"
+
+julia> levels(ab2)
+3-element Array{String,1}:
+ "a"
+ "b"
+ "c"
+
+julia> isordered(ab2)
+false
+```
+
+Do note that in some cases the two sets of levels may have compatible orderings, but it is not possible to determine in what order levels should appear in the merged set. This is the case for example with `["a", "b", "d"]` and `["c", "d", "e"]`: there is no way to detect that `"c"` should be inserted exactly after `"b"` (lexicographic ordering is not relevant here). In such cases, the resulting array is marked as unordered. This situation can only happen when working with data subsets selected based on non-contiguous subsets of levels.
+
+## Exported functions
 
 `categorical(A)` - Construct a categorical array with values from `A`
 

diff --git a/src/CategoricalArrays.jl b/src/CategoricalArrays.jl
@@ -15,8 +15,6 @@ module CategoricalArrays
 
     include("typedefs.jl")
 
-    include("buildfields.jl")
-
     include("pool.jl")
     include("value.jl")