Drop distinction between index and levels #253

nalimilan · 2020-03-01T17:00:06Z

References now always point to the vector returned by levels. This simplifies a lot of code, especially for packages that use CategoricalArrays. The downside is that all references need to be recoded when existing levels are removed or reordered.

To ensure CategoricalValue objects always remain valid, CategoricalPool is now semi-mutable: only adding new levels is possible. CategoricalArrays are now mutable and replace their pool with a new one when levels are removed or reordered, e.g. in levels! or setindex!(A::CategoricalArray, v::CategoricalValue, ...). This should not be a problem for performance as changing levels should not be frequent. On the other hand, adding levels keeps the same pool, which makes creating a CategoricalArray from another array type relatively fast, though references have to be recoded at the end when sorting levels. And the (very frequent) operations which use levels in their order should be faster than before as they can use refs directly.

Note that replacing the pool makes it impossible to compare new CategoricalValue objects with old ones with < and >. This should not be too problematic in practice.

Finally, replace deprecation message with an error when assignment would add new levels
to ordered array, and make copy and copyto! merge levels even when copying zero
elements (this differs from what the AbstractArray fallback would do but makes more sense, see JuliaData/DataFrames.jl#2104).

This is a large PR and these things are really tricky, so I wouldn't mind a review @bkamins. Good luck! :-)

@quinnj As we discussed some time ago, this implies that references are all recoded when calling levels! at the of the CSV parsing, to set levels in lexicographic order. That shouldn't be too bad for performance, but it could make sense to sort levels before parsing at the end of the detection phase: if all levels have been encountered during that phase, no reordering nor recoding would be needed at all. (Some small adjustments to CSV will be needed anyway to accomodate the fact that CategoricalPool cannot be mutated like before, but nothing major.)

References now always point to the vector returned by `levels`. This simplifies a lot of code, especially for packages that use CategoricalArrays. The downside is that all references need to be recoded when existing levels are removed or reordered. To ensure `CategoricalValue` objects always remain valid, `CategoricalPool` is now semi-mutable: only adding new levels is possible. `CategoricalArray`s are now mutable and replace their pool with a new one when levels are removed or reordered, e.g. in `levels!` or `setindex!(A::CategoricalArray, v::CategoricalValue, ...)`. This should not be a problem for performance as changing levels should not be frequent. On the other hand, adding levels keeps the same pool, which makes creating a `CategoricalArray` from another array type relatively fast, though references have to be recoded at the end when sorting levels. And the (very frequent) operations which use levels in their order should be faster than before as they can use refs directly. Note that replacing the pool makes it impossible to compare new `CategoricalValue` objects with old ones with `<` and `>`. This should not be too problematic in practice. Finally, replace deprecation message with an error when assignment would add new levels to ordered array, and make `copy` and `copyto!` merge levels even when copying zero elements (this differs from what the `AbstractArray` fallback would do but makes more sense).

src/array.jl

bkamins · 2020-03-01T21:49:34Z

Thank you for working on this.
I will look at the details later (it gets late today), but let me just comment on the design assumptions.

And the (very frequent) operations which use levels in their order should be faster than before as they can use refs directly.

This is indeed important IMO, it should make e.g. split-apploy-combine faster (have you tried this)?

Note that replacing the pool makes it impossible to compare new CategoricalValue objects with old ones with < and >. This should not be too problematic in practice.

Actually I think it is OK, as it is safer (it is better to error than to produce some unexpected wrong result).

Finally, replace deprecation message with an error when assignment would add new levels
to ordered array.

So how now one could add new levels to an ordered array? Change it first to unordered, add levels, and make ordered or there is some simpler way? Also - a related question (I have not dug through the code to check) - what is the cost of switching between ordered and unordered?

nalimilan · 2020-03-02T08:47:53Z

This is indeed important IMO, it should make e.g. split-apploy-combine faster (have you tried this)?

I haven't tried yet. It should make the computation of group indices faster (basically a no-op). I had noted in the code that the performance gain was negligible (we could already have applied that optimization when levels and index are equal), but that was at a time when we always computed row indices, which is quite slow. Now that we can avoid that step, computing group indices more efficiently could make a difference. Though note that if there are empty levels we still need to recompute indices.

So how now one could add new levels to an ordered array? Change it first to unordered, add levels, and make ordered or there is some simpler way? Also - a related question (I have not dug through the code to check) - what is the cost of switching between ordered and unordered?

Well you can still call levels! and it will replace the pool if necessary. Basically, this PR doesn't change anything if you only use exported functions. We could almost make it a minor release, but we know many packages rely on unexported functions which were kind of part of the public API and are now deprecated (without breakage in most cases).

Switching between ordered and unordered just sets a Bool field, so you could even do that in a loop for each element if you wanted to. :-)

bkamins · 2020-03-02T14:54:02Z

src/array.jl

 @inline function setindex!(A::CategoricalArray, v::Any, I::Real...)
    @boundscheck checkbounds(A, I...)
+    # TODO: use a global table to cache subset relations for all pairs of pools


Should we have also a setindex! for the case of multiple selections like dst[idx_set] = src? As it seems currently it would be expensive if src is a categorical array (we would have to check the pool at every iteration).

This can be left for later as it is performance related only.

Right. We could only cover simple cases (with a single set of indices), and then we could probably just call copyto!. I wonder whether that optimization should live in Base, as it already has a special method for Array that calls unsafe_copyto!. If unsafe_copyto! fell back to copyto!, all arrays would automatically benefit from any optimized implementation of copyto!.

Anyway that's for future work I think.

I don't quite follow these two comments, but I'd be slightly wary of using a global table as mentioned in the comment due to multi-threaded scenarios. Not sure if that's the kind of global table you have in mind, but just a thought.

Well this idea/need isn't new, it just appears in more places. I'm not sure what would be the best solution for multithreading, but if possible I guess a lock wouldn't be worse than the current code which has to check all levels of both pools on each assignment. Mutating pools during a multithreaded operation should be quite rare in practice, so it would be too bad if it affected performance for common cases. Any ideas? Maybe worth discussing this in a separate issue.

src/typedefs.jl

quinnj · 2020-03-04T04:52:46Z

src/typedefs.jl

    invindex::Dict{T, R}    # map from category levels to their reference codes
-    order::Vector{R}        # 1-to-1 map from `index` to `level` (position of i-th category in `levels`)
-    levels::Vector{T}       # category levels ordered by externally specified order
    valindex::Vector{V}     # "category value" objects 1-to-1 matching `index`


What's the purpose of valindex? Is the benefit just that it makes the CategoricalValue objects once instead of repeatedly on-demand?

Yeah. I wanted to try removing it and see whether it makes a difference for performance. It might actually be faster in some cases as the compiler may be able to detect that it only needs to use the reference code and not load the pointer to the pool at all. But that's for another PR.

quinnj · 2020-03-04T04:55:01Z

docs/src/implementation.md

@@ -2,25 +2,12 @@

 `CategoricalArray` is made of the two fields:

- `refs`: an integer array that stores the position of the category level in the `index` field of `CategoricalPool` for each `CategoricalArray` element; `0` denotes a missing value (for `CategoricalArray{Union{T, Missing}}` only).
+- `refs`: an integer array that stores the position of the category level in the `levels` field of `CategoricalPool` for each `CategoricalArray` element; `0` denotes a missing value (for `CategoricalArray{Union{T, Missing}}` only).


might be good to clarify if 0 can denote missing only for Union{T, Missing} CategoricalArrays; like, if it's just a regular CategoricalVector{String}, would 0 point to the first level? Or is 0 exclusively for missing always.

For other arrays it corresponds to an undefined entry. But can we leave unrelated improvements to another PR? This one is already messy enough.

docs/src/implementation.md

quinnj · 2020-03-04T05:00:51Z

docs/src/implementation.md

-    levels (based on `sort`).
-
-The `CategoricalPool{V,R,C}` type keeps track of the levels of type `V` and associates them with an integer reference code of type `R` (for internal use). It offers methods to set the levels, change their order while preserving the references, and efficiently get the integer index corresponding to a level and vice-versa. Whether the values of `CategoricalArray` are ordered or not is defined by an `ordered` field of the pool. Finally, `CategoricalPool{V,R,C}` keeps a `valindex` vector of value objects of type `C == CategoricalValue{V, R}`, so that `getindex` can return the existing object instead of allocating a new one.
+Do note that `CategoricalPool` levels are semi-mutable: it is only allowed to add new levels, but never to remove or reorder existing ones. This ensures existing `CategoricalValue` objects remain valid and always point to the same level as when they were created. Therefore, `CategoricalArray`s create a new pool each time some of their levels are removed or reordered. This happens when calling `levels!`, but also when assigning a `CategoricalValue` via `setindex!`, `push!`, `append!`, `copy!` or `copyto!`. This requires updating all reference codes to point to the new pool, and makes it impossible to compare existing ordered `CategoricalValue` objects with values from the array using `<` and `>`.


This happens when calling levels!, but also when assigning a CategoricalValue via setindex!, push!, append!, copy! or copyto!.

So this isn't entirely clear to me: you say that a new pool is created if some of the levels are removed or reordered, but it doesn't seem obvious to me why doing setindex! or push! would factor in; they're certainly not removing levels, and it doesn't seem like they'd reorder anything either. If I assign a new CategoricalValue we haven't seen before, I would assume we'd tack on new levels to the existing pool, so no re-ordering is necessary.

I imagine I'm just missing something obvious here though, so it'd be good to maybe clarify?

Well it's a bit tricky because we merge levels in a way that preserves the relative orders of levels in both pools if possible at all. For example, if you have ["b", "c", "d"] in one pool and ["a", "b", "c"]in the other, the result will be["a", "b", "c", "d"]whatever which array is assigned to. That allows keeping the destination array as ordered if it was before, without making<` change its behavior for existing values.

That's probably a corner case, but but I figured it's strictly better to do something meaningful in this case rather than just concatenating levels. This also allows vcat not to change levels depending on the order in which arrays are passed (when possible).

I'll try improving docs. We don't really reorder levels, they refs are just shifted due to adding new levels before them.

quinnj · 2020-03-04T05:10:54Z

but it could make sense to sort levels before parsing at the end of the detection phase: if all levels have been encountered during that phase, no reordering nor recoding would be needed at all.

Yeah, we'll have to review what this means for CSV.jl. In the latest big redesign of CSV.jl, we actually don't have a formal "detection" phase anymore; if types aren't specified by the user, then we just look at the first row, set the type to whatever we find there, then continue parsing the rest of the file. But we also are managing "pooled" columns in a package-agnostic way; i.e. we just track a Dict{String, UInt64} while parsing, and only materialize a PooledArray/CategoricalArray in copy(col::CSV.Column{PooledString}). But my plan is to drop CategoricalArrays altogether and rely on a probably-not-implemented-yet constructor like fromrefarray(A) that would construct a CategoricalArray by getting the refs/levels of the input array that implemented the DataAPI definitions.

Is the plan to make these changes a major release in CategoricalArrays? 1.0?

nalimilan · 2020-03-04T09:09:30Z

But my plan is to drop CategoricalArrays altogether and rely on a probably-not-implemented-yet constructor like fromrefarray(A) that would construct a CategoricalArray by getting the refs/levels of the input array that implemented the DataAPI definitions.

That would work too (just need to add a constructor, which would be useful anyway). Do you have a rough timeline for this?

I planned to release 0.8 (I'd prefer testing the API for a while before tagging 1.0 once we know there aren't major issues), and have the next DataFrames release use that. Ideally other packages would follow relatively quickly, otherwise the new DataFrames version won't be usable with CSV.

quinnj · 2020-03-04T16:52:48Z

My current thinking on timeline is:

do a bunch of non-breaking work I've been planning to do for a while on CSV.jl, probably just release it all w/ a patch release
add in a bunch of deprecations CSV.read, catg keyword, etc. and do a 0.6 release
break stuff and release 1.0

In my mind, that provides a good scenario where most rigid users could just update the patch release and get as much of the new functionality as possible; update to 0.6 to see deprecations and what they should change, and 1.0 once they're ready to change over.

nalimilan · 2020-03-04T17:54:40Z

OK, but do you have an idea about how long it will take? :-)

quinnj · 2020-03-05T04:34:57Z

I expect to start working on new CSV features in about 2 weeks, then do a week per bullet point above? So like, 5-6 weeks to complete all three? Of course, I'm happy to coordinate with other releases, CategoricalArrays/DataFrames. I'm not in any particular rush; that timeline is more if it solely depended on my availability.

What are the timelines here/DataFrames? And what would make the most sense coordinating?

bkamins · 2020-03-05T13:52:10Z

DataFrames.jl is not likely to have a release in that timeline so CategoricalArrays.jl will be more relevant to coordinate with.

src/array.jl

bkamins · 2020-03-06T17:25:15Z

src/array.jl

+    if pool(B) !== pool(A) && pool(B) ⊈ pool(A)
+        merge_pools!(A, B)
+    end
+    # TODO: optimize recoding


Is is safe to use use @inbounds in line 713 (and worth it? - if not then better leave as is).

Safe yes, but as the TODO indicates if we want something fast we should do as in other places and compute a mapping between source and destination refs. Adding @inbounds would probably not help as we do a dict lookup for each entry (!).

Maybe calling copyto! would be enough here.

src/pool.jl

src/recode.jl

nalimilan · 2020-03-07T21:08:20Z

It doesn't make a lot of sense to make a breaking release of CategoricalArrays if neither DataFrames nor CSV support it. So I guess we can just wait until releases are ready for those, there's no rush anyway.

I've pushed commits addressing review comments and adding some docs.

bkamins · 2020-03-07T23:14:39Z

Thank you. So maybe also remove CategoricalString (probably in another PR) to fully switch to CategoricalVale{String} everywhere? I do not remember how long it has been deprecated though.

nalimilan · 2020-03-08T10:22:19Z

Thank you. So maybe also remove CategoricalString (probably in another PR) to fully switch to CategoricalVale{String} everywhere? I do not remember how long it has been deprecated though.

It hasn't been deprecated in a release yet. :-D

bkamins · 2020-03-08T10:32:57Z

So I understand it should be deprecated now - right?

I thought it was, because I get deprecation warnings to wrap categorical values containing in String for processing as string. But I understand now that this is a separate deprecation.

nalimilan · 2020-03-08T10:35:50Z

It's been deprecated on master for a while, but not in a release.

nalimilan · 2020-03-08T11:50:39Z

I've added a few more tests to fix coverage regressions. That made me spot a place where the fast path should have been used and wasn't.

nalimilan · 2020-03-08T11:50:43Z

src/pool.jl

+        newlevs = copy(levels(a))
+        ordered = isordered(a)
+    else
+        nl, ordered = mergelevels(isordered(a), a.levels, b.levels)


I wonder whether this would better be

Suggested change

nl, ordered = mergelevels(isordered(a), a.levels, b.levels)

nl, ordered = mergelevels(isordered(a) | isordered(b), a.levels, b.levels)

It could be a bit surprising that an array becomes ordered after setting a value. OTOH it's a bit weird too that we set the levels, but not orderedness, which introduces a small inconsistency with vcat. Opinions?

So my thinking is the following. We have two use-cases for merge_pools (maybe we need an argument to differentiate between them):

asymmetric - you merge b into a creating a new pool that is to replace a (setting a value in an array)

symmetric - you merge a and b to create something new (like in vcat)

In these two scenarios I think we can have a different rules.

Now as for the rules - I am not sure what is best.

For symmetric (vcat) case the rule I think should be "treat unordered categorical vectors as if they were not categorical (but just plain vectors)", what I mean is that it should matter (i.e. the result should be the same) if I merge ["a", "b", "c"] and categorical(["a", "b", "c"]) with some other categorical array that is ordered.

For asymmetric case I think the right rule is to keep the status of the destination array (i.e. the one we write to).

Actually merge_pools is only used for the asymmetric case. The symmetric case only happens with vcat, which uses its own logic.

For symmetric (vcat) case the rule I think should be "treat unordered categorical vectors as if they were not categorical (but just plain vectors)", what I mean is that it should matter (i.e. the result should be the same) if I merge ["a", "b", "c"] and categorical(["a", "b", "c"]) with some other categorical array that is ordered.

Well that's already the case for the example you give since levels are the same, but that's quite different when arrays have unused levels (like we do with setindex!). Are you suggesting to change that?

For asymmetric case I think the right rule is to keep the status of the destination array (i.e. the one we write to).

OK, so basically what we do now.

Are you suggesting to change that?

No (I did not consider unused levels when writing this - good point; I concentrated on orderness).

I understand that we propagate unused levels now - right? I think it is OK, as then the levels in the new pool are a superset for levels in pools from source arrays.

bkamins · 2020-04-08T14:16:39Z

Thank you!

nalimilan requested a review from bkamins March 1, 2020 17:00

bkamins reviewed Mar 1, 2020

View reviewed changes

src/array.jl Show resolved Hide resolved

bkamins reviewed Mar 2, 2020

View reviewed changes

src/typedefs.jl Show resolved Hide resolved

quinnj reviewed Mar 4, 2020

View reviewed changes

docs/src/implementation.md Show resolved Hide resolved

quinnj reviewed Mar 4, 2020

View reviewed changes