Drop distinction between index and levels #253

Merged (4 commits), Apr 8, 2020
23 changes: 4 additions & 19 deletions docs/src/implementation.md
@@ -2,25 +2,12 @@

A `CategoricalArray` is made of two fields:

- `refs`: an integer array that stores the position of the category level in the `index` field of `CategoricalPool` for each `CategoricalArray` element; `0` denotes a missing value (for `CategoricalArray{Union{T, Missing}}` only).
- `refs`: an integer array that stores the position of the category level in the `levels` field of `CategoricalPool` for each `CategoricalArray` element; `0` denotes a missing value (for `CategoricalArray{Union{T, Missing}}` only).
Review comment (Member):

might be good to clarify if 0 can denote missing only for Union{T, Missing} CategoricalArrays; like, if it's just a regular CategoricalVector{String}, would 0 point to the first level? Or is 0 exclusively for missing always.

Reply (Member Author):

For other arrays it corresponds to an undefined entry. But can we leave unrelated improvements to another PR? This one is already messy enough.

- `pool`: the `CategoricalPool` object that maintains the levels of the array.

!!! warning
The `CategoricalPool{V,R,C}` type keeps track of the levels of type `V` and associates them with an integer reference code of type `R` (for internal use). It offers methods to add new levels, and efficiently get the integer index corresponding to a level and vice-versa. Whether the values of `CategoricalArray` are ordered or not is defined by an `ordered` field of the pool. Finally, `CategoricalPool{V,R,C}` keeps a `valindex` vector of value objects of type `C == CategoricalValue{V, R}`, so that `getindex` can return the existing object instead of allocating a new one.
nalimilan marked this conversation as resolved.

Integer codes in the `x.refs` field *cannot* be used to index into the vector returned
by `levels(x)`. These codes refer to the position in the *index*, which can be accessed
using `CategoricalArrays.index(x.pool)`. That is,
`CategoricalArrays.index(x.pool)[x.refs] == x` always holds, but
`levels(x.pool)[x.refs] == x` is *not* correct in general. To obtain the position in
`levels(x)` of entries in `x`, use `CategoricalArrays.order(x.pool)[x.refs]`.

The reason for this subtlety is that it allows changing the order of levels without
having to reset all the underlying integer codes. This is especially useful for the
`CategoricalArray(::AbstractArray)` constructor, which needs to assign new codes as
new levels are encountered, potentially conflicting with the default ordering of
levels (based on `sort`).

The `CategoricalPool{V,R,C}` type keeps track of the levels of type `V` and associates them with an integer reference code of type `R` (for internal use). It offers methods to set the levels, change their order while preserving the references, and efficiently get the integer index corresponding to a level and vice-versa. Whether the values of `CategoricalArray` are ordered or not is defined by an `ordered` field of the pool. Finally, `CategoricalPool{V,R,C}` keeps a `valindex` vector of value objects of type `C == CategoricalValue{V, R}`, so that `getindex` can return the existing object instead of allocating a new one.
Note that `CategoricalPool` levels are semi-mutable: new levels can be added, but existing ones can never be removed or reordered. This ensures existing `CategoricalValue` objects remain valid and always point to the same level as when they were created. Therefore, `CategoricalArray`s create a new pool each time some of their levels are removed or reordered. This happens when calling `levels!`, but also when assigning a `CategoricalValue` via `setindex!`, `push!`, `append!`, `copy!` or `copyto!` (as new levels may be added to the front to preserve the relative order of both source and destination levels). Doing so requires updating all reference codes to point to the new pool, and makes it impossible to compare existing ordered `CategoricalValue` objects with values from the array using `<` and `>`.

The type parameters of `CategoricalArray{T, N, R <: Integer, V, C, U}` are a bit complex:
- `T` is the type of array elements without `CategoricalValue` wrappers; if `T >: Missing`, then the array supports missing values.
@@ -32,6 +19,4 @@ The type parameters of `CategoricalArray{T, N, R <: Integer, V, C, U}` are a bit

Only `T`, `N` and `R` can be specified upon construction. The last three parameters are chosen automatically, but are needed for the definition of the type. In particular, `U` allows expressing that `CategoricalArray{T, N}` inherits from `AbstractArray{Union{C, U}, N}` (which is equivalent to `AbstractArray{C, N}` for arrays which do not support missing values, and to `AbstractArray{Union{C, Missing}, N}` for those which support them).

The `CategoricalPool` type is designed to limit the need to go over all elements of the vector, either for reading or for writing. This is why unused levels are not dropped automatically (this would force checking all elements on every modification or keeping a counts table), but only when `droplevels!` is called. `levels` is a (very fast) O(1) operation since it merely returns the (ordered) vector of levels without accessing the data at all.

Another useful feature is that integer indices referring to levels are preserved when adding or reordering levels: the order of levels exposed to the user by the `levels` function does not necessarily match these internal indices, which are stored in the `index` field of the pool. This means a reordering of the levels is also an O(1) operation. On the other hand, deleting levels may change the indices and therefore requires iterating over all elements in the array to update the references.
The `CategoricalPool` type is designed to limit the need to go over all elements of the vector, either for reading or for writing. This is why unused levels are not dropped automatically (this would force checking all elements on every modification or keeping a counts table), but only when `droplevels!` is called. `levels` is a (very fast) O(1) operation since it merely returns the (ordered) vector of levels without accessing the data at all.
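
To make the `refs`/`levels` layout described above concrete, here is a minimal sketch of a toy model. The names `ToyPool`, `ToyCategorical` and `decode` are hypothetical and exist only for illustration; this is not the package's actual implementation, and it assumes the design in which reference codes index directly into the `levels` vector, with `0` standing for a missing value.

```julia
# Toy model of the refs/levels layout; NOT the CategoricalArrays implementation.
struct ToyPool
    levels::Vector{String}   # position in this vector is the reference code
end

struct ToyCategorical
    refs::Vector{UInt32}     # one integer code per element; 0 stands for missing
    pool::ToyPool
end

# Decode one reference code: 0 maps to missing, any other code indexes the levels.
decode(pool::ToyPool, ref::Integer) = ref == 0 ? missing : pool.levels[ref]

# Decode the whole array by looking up every code in the pool.
decode(x::ToyCategorical) = [decode(x.pool, r) for r in x.refs]

pool = ToyPool(["Young", "Middle", "Old"])
x = ToyCategorical(UInt32[2, 3, 0, 1], pool)
decoded = decode(x)
```

In this sketch, reading the levels never touches `refs`, which mirrors why `levels` can be an O(1) operation.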
111 changes: 110 additions & 1 deletion docs/src/using.md
@@ -193,7 +193,116 @@ julia> levels!(y, ["Young", "Middle"]; allow_missing=true)

```

## Working with categorical arrays
## Combining levels

Some operations imply combining levels of two categorical arrays: this is the case when concatenating arrays (`vcat`, `hcat` and `cat`) and when assigning a `CategoricalValue` from another categorical array.

For example, imagine we have two sets of observations, one with only the younger part of the population and one with the older part:
```jldoctest using
julia> x = categorical(["Middle", "Old", "Middle"], ordered=true);

julia> y = categorical(["Young", "Middle", "Middle"], ordered=true);

julia> levels!(y, ["Young", "Middle"]);
```

If we concatenate the two sets, the levels of the resulting categorical vector are chosen so that the relative orders of levels in `x` and `y` are preserved, if possible. In that case, comparisons with `<` and `>` are still valid, and the resulting vector is marked as ordered:
```jldoctest
julia> xy = vcat(x, y)
6-element CategoricalArray{String,1,UInt32}:
"Middle"
"Old"
"Middle"
"Young"
"Middle"
"Middle"

julia> levels(xy)
3-element Array{String,1}:
"Young"
"Middle"
"Old"

julia> isordered(xy)
true
```

Likewise, assigning a `CategoricalValue` from `y` to an entry in `x` expands the levels of `x`, *adding a new level to the front to respect the ordering of levels in both vectors*. The new level is added even if the assigned value belongs to another level which is already present in `x`. Note that adding new levels requires marking `x` as unordered:
```jldoctest
julia> x[1] = y[1]
ERROR: cannot add new level Young since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.
Stacktrace:
[...]

julia> ordered!(x, false);

julia> levels(x)
2-element Array{String,1}:
"Middle"
"Old"

julia> x[1] = y[1]
CategoricalValue{String,UInt32} "Old" (3/3)

julia> levels(x)
3-element Array{String,1}:
"Young"
"Middle"
"Old"
```

In cases where levels with incompatible orderings are combined, the ordering of the first array wins and the resulting array is marked as unordered:
```jldoctest using
julia> a = categorical(["a", "b", "c"], ordered=true);

julia> b = categorical(["a", "b", "c"], ordered=true);

julia> ab = vcat(a, b)
6-element CategoricalArray{String,1,UInt32}:
"a"
"b"
"c"
"a"
"b"
"c"

julia> levels(ab)
3-element Array{String,1}:
"a"
"b"
"c"

julia> isordered(ab)
true

julia> levels!(b, ["c", "b", "a"])
3-element CategoricalArray{String,1,UInt32}:
"a"
"b"
"c"

julia> ab2 = vcat(a, b)
6-element CategoricalArray{String,1,UInt32}:
"a"
"b"
"c"
"a"
"b"
"c"

julia> levels(ab2)
3-element Array{String,1}:
"a"
"b"
"c"

julia> isordered(ab2)
false
```

Note that in some cases the two sets of levels may have compatible orderings, but it is not possible to determine in what order levels should appear in the merged set. This is the case for example with `["a", "b", "d"]` and `["c", "d", "e"]`: there is no way to detect that `"c"` should be inserted exactly after `"b"` (lexicographic ordering is not relevant here). In such cases, the resulting array is marked as unordered. This situation can only happen when working with data subsets selected based on non-contiguous subsets of levels.
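
The notion of compatible orderings used above can be sketched in plain Julia. The helper `compatible_order` below is hypothetical (it is not a CategoricalArrays function, and the package's actual merging logic may differ); it only checks whether the levels shared by two vectors appear in the same relative order.

```julia
# Hypothetical helper, not part of CategoricalArrays: check whether the levels
# shared by `a` and `b` appear in the same relative order in both vectors.
function compatible_order(a::Vector{String}, b::Vector{String})
    shared_in_a = filter(x -> x in b, a)   # shared levels, in the order of `a`
    shared_in_b = filter(x -> x in a, b)   # shared levels, in the order of `b`
    return shared_in_a == shared_in_b
end

# The only shared level is "Middle", so the orderings are compatible.
same = compatible_order(["Middle", "Old"], ["Young", "Middle"])

# The shared levels appear in opposite orders, so the orderings conflict.
reversed = compatible_order(["a", "b", "c"], ["c", "b", "a"])

# Compatible, yet nothing says where "c" belongs relative to "a" and "b".
ambiguous = compatible_order(["a", "b", "d"], ["c", "d", "e"])
```

The last call illustrates the caveat in the paragraph above: passing this check is not enough to decide a unique merged order.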

## Exported functions

`categorical(A)` - Construct a categorical array with values from `A`

2 changes: 0 additions & 2 deletions src/CategoricalArrays.jl
@@ -15,8 +15,6 @@ module CategoricalArrays

include("typedefs.jl")

include("buildfields.jl")

include("pool.jl")
include("value.jl")
