-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop distinction between index and levels #253
Conversation
References now always point to the vector returned by `levels`. This simplifies a lot of code, especially for packages that use CategoricalArrays. The downside is that all references need to be recoded when existing levels are removed or reordered. To ensure `CategoricalValue` objects always remain valid, `CategoricalPool` is now semi-mutable: only adding new levels is possible. `CategoricalArray`s are now mutable and replace their pool with a new one when levels are removed or reordered, e.g. in `levels!` or `setindex!(A::CategoricalArray, v::CategoricalValue, ...)`. This should not be a problem for performance as changing levels should not be frequent. On the other hand, adding levels keeps the same pool, which makes creating a `CategoricalArray` from another array type relatively fast, though references have to be recoded at the end when sorting levels. And the (very frequent) operations which use levels in their order should be faster than before as they can use refs directly. Note that replacing the pool makes it impossible to compare new `CategoricalValue` objects with old ones with `<` and `>`. This should not be too problematic in practice. Finally, replace deprecation message with an error when assignment would add new levels to ordered array, and make `copy` and `copyto!` merge levels even when copying zero elements (this differs from what the `AbstractArray` fallback would do but makes more sense).
Thank you for working on this.
This is indeed important IMO, it should make e.g. split-apploy-combine faster (have you tried this)?
Actually I think it is OK, as it is safer (it is better to error than to produce some unexpected wrong result).
So how now one could add new levels to an ordered array? Change it first to unordered, add levels, and make ordered or there is some simpler way? Also - a related question (I have not dug through the code to check) - what is the cost of switching between ordered and unordered? |
I haven't tried yet. It should make the computation of group indices faster (basically a no-op). I had noted in the code that the performance gain was negligible (we could already have applied that optimization when levels and index are equal), but that was at a time when we always computed row indices, which is quite slow. Now that we can avoid that step, computing group indices more efficiently could make a difference. Though note that if there are empty levels we still need to recompute indices.
Well you can still call Switching between ordered and unordered just sets a |
@inline function setindex!(A::CategoricalArray, v::Any, I::Real...) | ||
@boundscheck checkbounds(A, I...) | ||
# TODO: use a global table to cache subset relations for all pairs of pools |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have also a setindex!
for the case of multiple selections like dst[idx_set] = src
? As it seems currently it would be expensive if src
is a categorical array (we would have to check the pool at every iteration).
This can be left for later as it is performance related only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. We could only cover simple cases (with a single set of indices), and then we could probably just call copyto!
. I wonder whether that optimization should live in Base, as it already has a special method for Array
that calls unsafe_copyto!
. If unsafe_copyto!
fell back to copyto!
, all arrays would automatically benefit from any optimized implementation of copyto!
.
Anyway that's for future work I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite follow these two comments, but I'd be slightly wary of using a global table as mentioned in the comment due to multi-threaded scenarios. Not sure if that's the kind of global table you have in mind, but just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well this idea/need isn't new, it just appears in more places. I'm not sure what would be the best solution for multithreading, but if possible I guess a lock wouldn't be worse than the current code which has to check all levels of both pools on each assignment. Mutating pools during a multithreaded operation should be quite rare in practice, so it would be too bad if it affected performance for common cases. Any ideas? Maybe worth discussing this in a separate issue.
invindex::Dict{T, R} # map from category levels to their reference codes | ||
order::Vector{R} # 1-to-1 map from `index` to `level` (position of i-th category in `levels`) | ||
levels::Vector{T} # category levels ordered by externally specified order | ||
valindex::Vector{V} # "category value" objects 1-to-1 matching `index` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of valindex
? Is the benefit just that it makes the CategoricalValue objects once instead of repeatedly on-demand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I wanted to try removing it and see whether it makes a difference for performance. It might actually be faster in some cases as the compiler may be able to detect that it only needs to use the reference code and not load the pointer to the pool at all. But that's for another PR.
@@ -2,25 +2,12 @@ | |||
|
|||
`CategoricalArray` is made of the two fields: | |||
|
|||
- `refs`: an integer array that stores the position of the category level in the `index` field of `CategoricalPool` for each `CategoricalArray` element; `0` denotes a missing value (for `CategoricalArray{Union{T, Missing}}` only). | |||
- `refs`: an integer array that stores the position of the category level in the `levels` field of `CategoricalPool` for each `CategoricalArray` element; `0` denotes a missing value (for `CategoricalArray{Union{T, Missing}}` only). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be good to clarify if 0 can denote missing
only for Union{T, Missing}
CategoricalArrays; like, if it's just a regular CategoricalVector{String}
, would 0 point to the first level? Or is 0 exclusively for missing
always.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For other arrays it corresponds to an undefined entry. But can we leave unrelated improvements to another PR? This one is already messy enough.
docs/src/implementation.md
Outdated
levels (based on `sort`). | ||
|
||
The `CategoricalPool{V,R,C}` type keeps track of the levels of type `V` and associates them with an integer reference code of type `R` (for internal use). It offers methods to set the levels, change their order while preserving the references, and efficiently get the integer index corresponding to a level and vice-versa. Whether the values of `CategoricalArray` are ordered or not is defined by an `ordered` field of the pool. Finally, `CategoricalPool{V,R,C}` keeps a `valindex` vector of value objects of type `C == CategoricalValue{V, R}`, so that `getindex` can return the existing object instead of allocating a new one. | ||
Do note that `CategoricalPool` levels are semi-mutable: it is only allowed to add new levels, but never to remove or reorder existing ones. This ensures existing `CategoricalValue` objects remain valid and always point to the same level as when they were created. Therefore, `CategoricalArray`s create a new pool each time some of their levels are removed or reordered. This happens when calling `levels!`, but also when assigning a `CategoricalValue` via `setindex!`, `push!`, `append!`, `copy!` or `copyto!`. This requires updating all reference codes to point to the new pool, and makes it impossible to compare existing ordered `CategoricalValue` objects with values from the array using `<` and `>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens when calling
levels!
, but also when assigning aCategoricalValue
viasetindex!
,push!
,append!
,copy!
orcopyto!
.
So this isn't entirely clear to me: you say that a new pool is created if some of the levels are removed or reordered, but it doesn't seem obvious to me why doing setindex!
or push!
would factor in; they're certainly not removing levels, and it doesn't seem like they'd reorder anything either. If I assign a new CategoricalValue we haven't seen before, I would assume we'd tack on new levels to the existing pool, so no re-ordering is necessary.
I imagine I'm just missing something obvious here though, so it'd be good to maybe clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it's a bit tricky because we merge levels in a way that preserves the relative orders of levels in both pools if possible at all. For example, if you have ["b", "c", "d"]
in one pool and ["a", "b", "c"]in the other, the result will be
["a", "b", "c", "d"]whatever which array is assigned to. That allows keeping the destination array as ordered if it was before, without making
<` change its behavior for existing values.
That's probably a corner case, but but I figured it's strictly better to do something meaningful in this case rather than just concatenating levels. This also allows vcat
not to change levels depending on the order in which arrays are passed (when possible).
I'll try improving docs. We don't really reorder levels, they refs are just shifted due to adding new levels before them.
Yeah, we'll have to review what this means for CSV.jl. In the latest big redesign of CSV.jl, we actually don't have a formal "detection" phase anymore; if types aren't specified by the user, then we just look at the first row, set the type to whatever we find there, then continue parsing the rest of the file. But we also are managing "pooled" columns in a package-agnostic way; i.e. we just track a Is the plan to make these changes a major release in CategoricalArrays? 1.0? |
That would work too (just need to add a constructor, which would be useful anyway). Do you have a rough timeline for this? I planned to release 0.8 (I'd prefer testing the API for a while before tagging 1.0 once we know there aren't major issues), and have the next DataFrames release use that. Ideally other packages would follow relatively quickly, otherwise the new DataFrames version won't be usable with CSV. |
My current thinking on timeline is:
In my mind, that provides a good scenario where most rigid users could just update the patch release and get as much of the new functionality as possible; update to 0.6 to see deprecations and what they should change, and 1.0 once they're ready to change over. |
OK, but do you have an idea about how long it will take? :-) |
I expect to start working on new CSV features in about 2 weeks, then do a week per bullet point above? So like, 5-6 weeks to complete all three? Of course, I'm happy to coordinate with other releases, CategoricalArrays/DataFrames. I'm not in any particular rush; that timeline is more if it solely depended on my availability. What are the timelines here/DataFrames? And what would make the most sense coordinating? |
DataFrames.jl is not likely to have a release in that timeline so CategoricalArrays.jl will be more relevant to coordinate with. |
if pool(B) !== pool(A) && pool(B) ⊈ pool(A) | ||
merge_pools!(A, B) | ||
end | ||
# TODO: optimize recoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is is safe to use use @inbounds
in line 713 (and worth it? - if not then better leave as is).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Safe yes, but as the TODO indicates if we want something fast we should do as in other places and compute a mapping between source and destination refs. Adding @inbounds
would probably not help as we do a dict lookup for each entry (!).
Maybe calling copyto!
would be enough here.
It doesn't make a lot of sense to make a breaking release of CategoricalArrays if neither DataFrames nor CSV support it. So I guess we can just wait until releases are ready for those, there's no rush anyway. I've pushed commits addressing review comments and adding some docs. |
Thank you. So maybe also remove |
It hasn't been deprecated in a release yet. :-D |
So I understand it should be deprecated now - right? I thought it was, because I get deprecation warnings to wrap categorical values containing in |
It's been deprecated on master for a while, but not in a release. |
I've added a few more tests to fix coverage regressions. That made me spot a place where the fast path should have been used and wasn't. |
newlevs = copy(levels(a)) | ||
ordered = isordered(a) | ||
else | ||
nl, ordered = mergelevels(isordered(a), a.levels, b.levels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether this would better be
nl, ordered = mergelevels(isordered(a), a.levels, b.levels) | |
nl, ordered = mergelevels(isordered(a) | isordered(b), a.levels, b.levels) |
It could be a bit surprising that an array becomes ordered after setting a value. OTOH it's a bit weird too that we set the levels, but not orderedness, which introduces a small inconsistency with vcat
. Opinions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my thinking is the following. We have two use-cases for merge_pools
(maybe we need an argument to differentiate between them):
- asymmetric - you merge
b
intoa
creating a new pool that is to replacea
(setting a value in an array) - symmetric - you merge
a
andb
to create something new (like invcat
)
In these two scenarios I think we can have a different rules.
Now as for the rules - I am not sure what is best.
For symmetric (vcat
) case the rule I think should be "treat unordered categorical vectors as if they were not categorical (but just plain vectors)", what I mean is that it should matter (i.e. the result should be the same) if I merge ["a", "b", "c"]
and categorical(["a", "b", "c"])
with some other categorical array that is ordered.
For asymmetric case I think the right rule is to keep the status of the destination array (i.e. the one we write to).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually merge_pools
is only used for the asymmetric case. The symmetric case only happens with vcat
, which uses its own logic.
For symmetric (
vcat
) case the rule I think should be "treat unordered categorical vectors as if they were not categorical (but just plain vectors)", what I mean is that it should matter (i.e. the result should be the same) if I merge["a", "b", "c"]
andcategorical(["a", "b", "c"])
with some other categorical array that is ordered.
Well that's already the case for the example you give since levels are the same, but that's quite different when arrays have unused levels (like we do with setindex!
). Are you suggesting to change that?
For asymmetric case I think the right rule is to keep the status of the destination array (i.e. the one we write to).
OK, so basically what we do now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting to change that?
No (I did not consider unused levels when writing this - good point; I concentrated on orderness).
I understand that we propagate unused levels now - right? I think it is OK, as then the levels in the new pool are a superset for levels in pools from source arrays.
Thank you! |
References now always point to the vector returned by
levels
. This simplifies a lot of code, especially for packages that use CategoricalArrays. The downside is that all references need to be recoded when existing levels are removed or reordered.To ensure
CategoricalValue
objects always remain valid,CategoricalPool
is now semi-mutable: only adding new levels is possible.CategoricalArray
s are now mutable and replace their pool with a new one when levels are removed or reordered, e.g. inlevels!
orsetindex!(A::CategoricalArray, v::CategoricalValue, ...)
. This should not be a problem for performance as changing levels should not be frequent. On the other hand, adding levels keeps the same pool, which makes creating aCategoricalArray
from another array type relatively fast, though references have to be recoded at the end when sorting levels. And the (very frequent) operations which use levels in their order should be faster than before as they can use refs directly.Note that replacing the pool makes it impossible to compare new
CategoricalValue
objects with old ones with<
and>
. This should not be too problematic in practice.Finally, replace deprecation message with an error when assignment would add new levels
to ordered array, and make
copy
andcopyto!
merge levels even when copying zeroelements (this differs from what the
AbstractArray
fallback would do but makes more sense, see JuliaData/DataFrames.jl#2104).This is a large PR and these things are really tricky, so I wouldn't mind a review @bkamins. Good luck! :-)
@quinnj As we discussed some time ago, this implies that references are all recoded when calling
levels!
at the of the CSV parsing, to set levels in lexicographic order. That shouldn't be too bad for performance, but it could make sense to sort levels before parsing at the end of the detection phase: if all levels have been encountered during that phase, no reordering nor recoding would be needed at all. (Some small adjustments to CSV will be needed anyway to accomodate the fact thatCategoricalPool
cannot be mutated like before, but nothing major.)