Commit
1 parent
a4b9929
commit d634683
Showing 10 changed files with 299 additions and 277 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,3 @@ | ||
.travis.yml | ||
docs/build/ | ||
docs/site/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
using Documenter, CategoricalArrays | ||
|
||
makedocs( | ||
modules = [CategoricalArrays], | ||
format = :html, | ||
sitename = "CategoricalArrays", | ||
pages = Any[ | ||
"Overview" => "overview.md", | ||
"Using CategoricalArrays" => "using.md", | ||
"Implementation details" => "implementation.md", | ||
"Index" => "functionindex.md" | ||
] | ||
) | ||
|
||
deploydocs( | ||
repo = "github.com/JuliaData/CategoricalArrays.jl.git", | ||
target = "build", | ||
julia = "0.5", | ||
osname = "linux", | ||
deps = nothing, | ||
make = nothing | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Index | ||
|
||
```@index | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
# Implementation details | ||
|
||
`CategoricalArray` and `NullableCategoricalArray` share a common implementation for the most part, with the main differences being their element types. They are based on the `CategoricalPool` type, which keeps track of the levels and associates them with an integer reference (for internal use). They offer methods to set levels, change their order while preserving the references, and efficiently get the integer index corresponding to a level and vice-versa. They are also parameterized on the type used to store the references, so that small pools can use as little memory as possible. Finally, they keep a vector of value objects (`CategoricalValue`), so that `getindex` can return the existing object instead of allocating a new one. | ||
|
||
Array types are made of two fields: | ||
|
||
- `refs`: an integer vector giving the index of the level in the pool for each element. For `NullableCategoricalArray`, `0` indicates a missing value. | ||
- `pool`: the `CategoricalPool` object keeping the levels of the array. | ||
|
||
Whether an array (and its values) is ordered is stored as a property of the pool. | ||
|
||
`CategoricalPool` is designed to limit the need to go over all elements of the vector, either for reading or for writing. This is why unused levels are not dropped automatically (this would force checking all elements on every modification or keeping a counts table), but only when `droplevels!` is called. `levels` is a (very fast) O(1) operation since it merely returns the (ordered) vector of levels without accessing the data at all. | ||
|
||
Another useful property is that integer indices referring to levels are preserved when adding or reordering levels: the order of levels exposed to the user by the `levels` function does not necessarily match these internal indices, which are stored in the `index` field of the pool. This means a reordering of the levels is also an O(1) operation. On the other hand, deleting levels may change the indices and therefore requires iterating over all elements in the array to update the references. |
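
The reference-preserving design described above can be illustrated with a toy model. This is a minimal sketch in plain Julia 1.x syntax, not the package's actual `CategoricalPool`: the names `ToyPool`, `ToyArray`, `toylevels`, `toylevels!`, `toyvalue` and `toydroplevels!` are hypothetical, and the real pool keeps additional fields. It only shows why reordering levels can leave `refs` untouched while dropping levels has to rewrite them.

```julia
# Hypothetical toy types for illustration only; not the CategoricalArrays API.
mutable struct ToyPool
    index::Vector{String}   # index[ref] is the level for reference `ref`
    levels::Vector{String}  # the same levels in user-visible order
end

struct ToyArray
    refs::Vector{UInt8}     # a small integer type keeps memory low for small pools
    pool::ToyPool
end

function ToyArray(values::Vector{String})
    index = unique(values)  # references follow order of first appearance
    refs = UInt8[findfirst(isequal(v), index) for v in values]
    return ToyArray(refs, ToyPool(index, sort(index)))
end

# Returns the stored vector without touching `refs`.
toylevels(a::ToyArray) = a.pool.levels

# Look up the level of element `i` through its integer reference.
toyvalue(a::ToyArray, i::Integer) = a.pool.index[a.refs[i]]

# Reordering replaces only the user-visible order; `index` and `refs` stay as they are.
function toylevels!(a::ToyArray, newlevels::Vector{String})
    sort(newlevels) == sort(a.pool.index) ||
        throw(ArgumentError("new levels must be a permutation of the existing ones"))
    a.pool.levels = newlevels
    return a
end

# Dropping unused levels shifts references, so every element must be visited.
function toydroplevels!(a::ToyArray)
    used = sort(unique(a.refs))                  # scans all elements
    remap = zeros(UInt8, length(a.pool.index))   # old reference => new reference
    for (new, old) in enumerate(used)
        remap[old] = new
    end
    for i in eachindex(a.refs)
        a.refs[i] = remap[a.refs[i]]
    end
    kept = a.pool.index[used]
    a.pool.levels = filter(l -> l in kept, a.pool.levels)
    a.pool.index = kept
    return a
end

a = ToyArray(["b", "a", "b"])
refs_before = copy(a.refs)
toylevels!(a, ["b", "a"])                 # reorder: only pool.levels changes
reorder_kept_refs = a.refs == refs_before

a.refs .= 2                               # every element now refers to "a"; "b" is unused
toydroplevels!(a)                         # "a" moves from reference 2 to reference 1
```

In this sketch the reorder step never reads `refs`, whereas the drop step both scans and rewrites them, which mirrors the cost difference the text describes.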
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# Overview | ||
|
||
This package provides a replacement for [DataArrays.jl](https://github.com/JuliaStats/DataArrays.jl)'s `PooledDataArray` type. | ||
|
||
It offers better performance by getting rid of type instability thanks to the `Nullable` type, which is used to represent missing data. It is also based on a simpler design by only supporting categorical data, which allows offering more specialized features (like ordering of categories). See the [IndirectArrays.jl](https://github.com/JuliaArrays/IndirectArrays.jl) package for a simpler array type storing data with a small number of values. | ||
|
||
The package provides two array types designed to hold categorical data efficiently and conveniently: | ||
|
||
- `CategoricalArray` can hold both unordered and ordered categorical data | ||
|
||
- `NullableCategoricalArray` supports the same features as the first type and additionally accepts missing data | ||
|
||
These arrays behave just like standard Julia `Array`s, but they return special types when indexed: | ||
|
||
- `CategoricalArray` returns a `CategoricalValue` object | ||
|
||
- `NullableCategoricalArray` returns a `Nullable{CategoricalValue}` object | ||
|
||
`CategoricalValue` objects are simple wrappers around the actual categorical levels which allow for very efficient extraction and equality tests. Indeed, the main feature of categorical array types is that they store a pool of the levels which can appear in the variable. These levels are stored in a specific order: for unordered arrays, this order is only used for pretty printing (e.g. in cross tables or plots); for ordered arrays, it also allows comparing values using the `<` and `>` operators: the comparison is then based on the ordering of levels stored in the array. Whether an array is ordered can be defined either on construction via the `ordered` argument, or at any time via the `ordered!` function. | ||
|
||
Use the `levels` function to access the levels of a categorical array, and the `levels!` function to set and order them. Levels are automatically created when setting an element to a previously unused level. On the other hand, they are never removed without manual intervention: use the `droplevels!` function for this. |
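
The workflow in the overview can be sketched as follows. This is a hedged example that uses only the functions named in the text (`levels`, `levels!`, `droplevels!`, `ordered!`). It assumes that `CategoricalArray` can be constructed from a plain vector of strings and that the CategoricalArrays package is installed; exact behavior may differ between package versions.

```julia
using CategoricalArrays

# Assumption: construction from a plain vector of strings is supported.
x = CategoricalArray(["low", "high", "low"])

# Assigning a previously unused level creates it automatically.
x[3] = "medium"

# Overwriting that element leaves "medium" in the pool as an unused level.
x[3] = "low"
still_present = "medium" in levels(x)

# Unused levels are removed only on request.
droplevels!(x)

# Set the level order explicitly, then mark the array as ordered
# so that `<` and `>` compare values by that order.
levels!(x, ["low", "high"])
ordered!(x, true)
low_before_high = x[1] < x[2]
```

The array is left unordered until the levels are final, since the sketch relies on automatic level creation before any ordering is imposed.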