Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop distinction between index and levels #253

Merged
merged 4 commits into from
Apr 8, 2020
Merged

Drop distinction between index and levels #253

merged 4 commits into from
Apr 8, 2020

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Mar 1, 2020

References now always point to the vector returned by levels. This simplifies a lot of code, especially for packages that use CategoricalArrays. The downside is that all references need to be recoded when existing levels are removed or reordered.

To ensure CategoricalValue objects always remain valid, CategoricalPool is now semi-mutable: only adding new levels is possible. CategoricalArrays are now mutable and replace their pool with a new one when levels are removed or reordered, e.g. in levels! or setindex!(A::CategoricalArray, v::CategoricalValue, ...). This should not be a problem for performance as changing levels should not be frequent. On the other hand, adding levels keeps the same pool, which makes creating a CategoricalArray from another array type relatively fast, though references have to be recoded at the end when sorting levels. And the (very frequent) operations which use levels in their order should be faster than before as they can use refs directly.

Note that replacing the pool makes it impossible to compare new CategoricalValue objects with old ones with < and >. This should not be too problematic in practice.

Finally, replace deprecation message with an error when assignment would add new levels
to ordered array, and make copy and copyto! merge levels even when copying zero
elements (this differs from what the AbstractArray fallback would do but makes more sense, see JuliaData/DataFrames.jl#2104).

This is a large PR and these things are really tricky, so I wouldn't mind a review @bkamins. Good luck! :-)

@quinnj As we discussed some time ago, this implies that references are all recoded when calling levels! at the of the CSV parsing, to set levels in lexicographic order. That shouldn't be too bad for performance, but it could make sense to sort levels before parsing at the end of the detection phase: if all levels have been encountered during that phase, no reordering nor recoding would be needed at all. (Some small adjustments to CSV will be needed anyway to accomodate the fact that CategoricalPool cannot be mutated like before, but nothing major.)

References now always point to the vector returned by `levels`. This simplifies
a lot of code, especially for packages that use CategoricalArrays. The downside
is that all references need to be recoded when existing levels are removed or reordered.

To ensure `CategoricalValue` objects always remain valid, `CategoricalPool` is now
semi-mutable: only adding new levels is possible. `CategoricalArray`s are now mutable
and replace their pool with a new one when levels are removed or reordered, e.g. in
`levels!` or `setindex!(A::CategoricalArray, v::CategoricalValue, ...)`. This should
not be a problem for performance as changing levels should not be frequent. On the other
hand, adding levels keeps the same pool, which makes creating a `CategoricalArray`
from another array type relatively fast, though references have to be recoded at
the end when sorting levels. And the (very frequent) operations which use levels
in their order should be faster than before as they can use refs directly.

Note that replacing the pool makes it impossible to compare new `CategoricalValue`
objects with old ones with `<` and `>`. This should not be too problematic in practice.

Finally, replace deprecation message with an error when assignment would add new levels
to ordered array, and make `copy` and `copyto!` merge levels even when copying zero
elements (this differs from what the `AbstractArray` fallback would do but makes more sense).
@nalimilan nalimilan requested a review from bkamins March 1, 2020 17:00
src/array.jl Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Mar 1, 2020

Thank you for working on this.
I will look at the details later (it gets late today), but let me just comment on the design assumptions.

And the (very frequent) operations which use levels in their order should be faster than before as they can use refs directly.

This is indeed important IMO, it should make e.g. split-apploy-combine faster (have you tried this)?

Note that replacing the pool makes it impossible to compare new CategoricalValue objects with old ones with < and >. This should not be too problematic in practice.

Actually I think it is OK, as it is safer (it is better to error than to produce some unexpected wrong result).

Finally, replace deprecation message with an error when assignment would add new levels
to ordered array.

So how now one could add new levels to an ordered array? Change it first to unordered, add levels, and make ordered or there is some simpler way? Also - a related question (I have not dug through the code to check) - what is the cost of switching between ordered and unordered?

@nalimilan
Copy link
Member Author

This is indeed important IMO, it should make e.g. split-apploy-combine faster (have you tried this)?

I haven't tried yet. It should make the computation of group indices faster (basically a no-op). I had noted in the code that the performance gain was negligible (we could already have applied that optimization when levels and index are equal), but that was at a time when we always computed row indices, which is quite slow. Now that we can avoid that step, computing group indices more efficiently could make a difference. Though note that if there are empty levels we still need to recompute indices.

So how now one could add new levels to an ordered array? Change it first to unordered, add levels, and make ordered or there is some simpler way? Also - a related question (I have not dug through the code to check) - what is the cost of switching between ordered and unordered?

Well you can still call levels! and it will replace the pool if necessary. Basically, this PR doesn't change anything if you only use exported functions. We could almost make it a minor release, but we know many packages rely on unexported functions which were kind of part of the public API and are now deprecated (without breakage in most cases).

Switching between ordered and unordered just sets a Bool field, so you could even do that in a loop for each element if you wanted to. :-)

@inline function setindex!(A::CategoricalArray, v::Any, I::Real...)
@boundscheck checkbounds(A, I...)
# TODO: use a global table to cache subset relations for all pairs of pools
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have also a setindex! for the case of multiple selections like dst[idx_set] = src? As it seems currently it would be expensive if src is a categorical array (we would have to check the pool at every iteration).

This can be left for later as it is performance related only.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. We could only cover simple cases (with a single set of indices), and then we could probably just call copyto!. I wonder whether that optimization should live in Base, as it already has a special method for Array that calls unsafe_copyto!. If unsafe_copyto! fell back to copyto!, all arrays would automatically benefit from any optimized implementation of copyto!.

Anyway that's for future work I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite follow these two comments, but I'd be slightly wary of using a global table as mentioned in the comment due to multi-threaded scenarios. Not sure if that's the kind of global table you have in mind, but just a thought.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well this idea/need isn't new, it just appears in more places. I'm not sure what would be the best solution for multithreading, but if possible I guess a lock wouldn't be worse than the current code which has to check all levels of both pools on each assignment. Mutating pools during a multithreaded operation should be quite rare in practice, so it would be too bad if it affected performance for common cases. Any ideas? Maybe worth discussing this in a separate issue.

invindex::Dict{T, R} # map from category levels to their reference codes
order::Vector{R} # 1-to-1 map from `index` to `level` (position of i-th category in `levels`)
levels::Vector{T} # category levels ordered by externally specified order
valindex::Vector{V} # "category value" objects 1-to-1 matching `index`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of valindex? Is the benefit just that it makes the CategoricalValue objects once instead of repeatedly on-demand?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I wanted to try removing it and see whether it makes a difference for performance. It might actually be faster in some cases as the compiler may be able to detect that it only needs to use the reference code and not load the pointer to the pool at all. But that's for another PR.

@@ -2,25 +2,12 @@

`CategoricalArray` is made of the two fields:

- `refs`: an integer array that stores the position of the category level in the `index` field of `CategoricalPool` for each `CategoricalArray` element; `0` denotes a missing value (for `CategoricalArray{Union{T, Missing}}` only).
- `refs`: an integer array that stores the position of the category level in the `levels` field of `CategoricalPool` for each `CategoricalArray` element; `0` denotes a missing value (for `CategoricalArray{Union{T, Missing}}` only).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be good to clarify if 0 can denote missing only for Union{T, Missing} CategoricalArrays; like, if it's just a regular CategoricalVector{String}, would 0 point to the first level? Or is 0 exclusively for missing always.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other arrays it corresponds to an undefined entry. But can we leave unrelated improvements to another PR? This one is already messy enough.

levels (based on `sort`).

The `CategoricalPool{V,R,C}` type keeps track of the levels of type `V` and associates them with an integer reference code of type `R` (for internal use). It offers methods to set the levels, change their order while preserving the references, and efficiently get the integer index corresponding to a level and vice-versa. Whether the values of `CategoricalArray` are ordered or not is defined by an `ordered` field of the pool. Finally, `CategoricalPool{V,R,C}` keeps a `valindex` vector of value objects of type `C == CategoricalValue{V, R}`, so that `getindex` can return the existing object instead of allocating a new one.
Do note that `CategoricalPool` levels are semi-mutable: it is only allowed to add new levels, but never to remove or reorder existing ones. This ensures existing `CategoricalValue` objects remain valid and always point to the same level as when they were created. Therefore, `CategoricalArray`s create a new pool each time some of their levels are removed or reordered. This happens when calling `levels!`, but also when assigning a `CategoricalValue` via `setindex!`, `push!`, `append!`, `copy!` or `copyto!`. This requires updating all reference codes to point to the new pool, and makes it impossible to compare existing ordered `CategoricalValue` objects with values from the array using `<` and `>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens when calling levels!, but also when assigning a CategoricalValue via setindex!, push!, append!, copy! or copyto!.

So this isn't entirely clear to me: you say that a new pool is created if some of the levels are removed or reordered, but it doesn't seem obvious to me why doing setindex! or push! would factor in; they're certainly not removing levels, and it doesn't seem like they'd reorder anything either. If I assign a new CategoricalValue we haven't seen before, I would assume we'd tack on new levels to the existing pool, so no re-ordering is necessary.

I imagine I'm just missing something obvious here though, so it'd be good to maybe clarify?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it's a bit tricky because we merge levels in a way that preserves the relative orders of levels in both pools if possible at all. For example, if you have ["b", "c", "d"] in one pool and ["a", "b", "c"]in the other, the result will be["a", "b", "c", "d"]whatever which array is assigned to. That allows keeping the destination array as ordered if it was before, without making<` change its behavior for existing values.

That's probably a corner case, but but I figured it's strictly better to do something meaningful in this case rather than just concatenating levels. This also allows vcat not to change levels depending on the order in which arrays are passed (when possible).

I'll try improving docs. We don't really reorder levels, they refs are just shifted due to adding new levels before them.

@quinnj
Copy link
Member

quinnj commented Mar 4, 2020

but it could make sense to sort levels before parsing at the end of the detection phase: if all levels have been encountered during that phase, no reordering nor recoding would be needed at all.

Yeah, we'll have to review what this means for CSV.jl. In the latest big redesign of CSV.jl, we actually don't have a formal "detection" phase anymore; if types aren't specified by the user, then we just look at the first row, set the type to whatever we find there, then continue parsing the rest of the file. But we also are managing "pooled" columns in a package-agnostic way; i.e. we just track a Dict{String, UInt64} while parsing, and only materialize a PooledArray/CategoricalArray in copy(col::CSV.Column{PooledString}). But my plan is to drop CategoricalArrays altogether and rely on a probably-not-implemented-yet constructor like fromrefarray(A) that would construct a CategoricalArray by getting the refs/levels of the input array that implemented the DataAPI definitions.

Is the plan to make these changes a major release in CategoricalArrays? 1.0?

@nalimilan
Copy link
Member Author

But my plan is to drop CategoricalArrays altogether and rely on a probably-not-implemented-yet constructor like fromrefarray(A) that would construct a CategoricalArray by getting the refs/levels of the input array that implemented the DataAPI definitions.

That would work too (just need to add a constructor, which would be useful anyway). Do you have a rough timeline for this?

I planned to release 0.8 (I'd prefer testing the API for a while before tagging 1.0 once we know there aren't major issues), and have the next DataFrames release use that. Ideally other packages would follow relatively quickly, otherwise the new DataFrames version won't be usable with CSV.

@quinnj
Copy link
Member

quinnj commented Mar 4, 2020

My current thinking on timeline is:

  • do a bunch of non-breaking work I've been planning to do for a while on CSV.jl, probably just release it all w/ a patch release
  • add in a bunch of deprecations CSV.read, catg keyword, etc. and do a 0.6 release
  • break stuff and release 1.0

In my mind, that provides a good scenario where most rigid users could just update the patch release and get as much of the new functionality as possible; update to 0.6 to see deprecations and what they should change, and 1.0 once they're ready to change over.

@nalimilan
Copy link
Member Author

OK, but do you have an idea about how long it will take? :-)

@quinnj
Copy link
Member

quinnj commented Mar 5, 2020

I expect to start working on new CSV features in about 2 weeks, then do a week per bullet point above? So like, 5-6 weeks to complete all three? Of course, I'm happy to coordinate with other releases, CategoricalArrays/DataFrames. I'm not in any particular rush; that timeline is more if it solely depended on my availability.

What are the timelines here/DataFrames? And what would make the most sense coordinating?

@bkamins
Copy link
Member

bkamins commented Mar 5, 2020

DataFrames.jl is not likely to have a release in that timeline so CategoricalArrays.jl will be more relevant to coordinate with.

src/array.jl Outdated Show resolved Hide resolved
src/array.jl Outdated Show resolved Hide resolved
src/array.jl Outdated Show resolved Hide resolved
if pool(B) !== pool(A) && pool(B) ⊈ pool(A)
merge_pools!(A, B)
end
# TODO: optimize recoding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is is safe to use use @inbounds in line 713 (and worth it? - if not then better leave as is).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Safe yes, but as the TODO indicates if we want something fast we should do as in other places and compute a mapping between source and destination refs. Adding @inbounds would probably not help as we do a dict lookup for each entry (!).

Maybe calling copyto! would be enough here.

src/pool.jl Outdated Show resolved Hide resolved
src/recode.jl Show resolved Hide resolved
@nalimilan
Copy link
Member Author

It doesn't make a lot of sense to make a breaking release of CategoricalArrays if neither DataFrames nor CSV support it. So I guess we can just wait until releases are ready for those, there's no rush anyway.

I've pushed commits addressing review comments and adding some docs.

@bkamins
Copy link
Member

bkamins commented Mar 7, 2020

Thank you. So maybe also remove CategoricalString (probably in another PR) to fully switch to CategoricalVale{String} everywhere? I do not remember how long it has been deprecated though.

@nalimilan
Copy link
Member Author

Thank you. So maybe also remove CategoricalString (probably in another PR) to fully switch to CategoricalVale{String} everywhere? I do not remember how long it has been deprecated though.

It hasn't been deprecated in a release yet. :-D

@bkamins
Copy link
Member

bkamins commented Mar 8, 2020

So I understand it should be deprecated now - right?

I thought it was, because I get deprecation warnings to wrap categorical values containing in String for processing as string. But I understand now that this is a separate deprecation.

@nalimilan
Copy link
Member Author

It's been deprecated on master for a while, but not in a release.

@nalimilan
Copy link
Member Author

I've added a few more tests to fix coverage regressions. That made me spot a place where the fast path should have been used and wasn't.

newlevs = copy(levels(a))
ordered = isordered(a)
else
nl, ordered = mergelevels(isordered(a), a.levels, b.levels)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether this would better be

Suggested change
nl, ordered = mergelevels(isordered(a), a.levels, b.levels)
nl, ordered = mergelevels(isordered(a) | isordered(b), a.levels, b.levels)

It could be a bit surprising that an array becomes ordered after setting a value. OTOH it's a bit weird too that we set the levels, but not orderedness, which introduces a small inconsistency with vcat. Opinions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my thinking is the following. We have two use-cases for merge_pools (maybe we need an argument to differentiate between them):

  • asymmetric - you merge b into a creating a new pool that is to replace a (setting a value in an array)
  • symmetric - you merge a and b to create something new (like in vcat)

In these two scenarios I think we can have a different rules.

Now as for the rules - I am not sure what is best.

For symmetric (vcat) case the rule I think should be "treat unordered categorical vectors as if they were not categorical (but just plain vectors)", what I mean is that it should matter (i.e. the result should be the same) if I merge ["a", "b", "c"] and categorical(["a", "b", "c"]) with some other categorical array that is ordered.

For asymmetric case I think the right rule is to keep the status of the destination array (i.e. the one we write to).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually merge_pools is only used for the asymmetric case. The symmetric case only happens with vcat, which uses its own logic.

For symmetric (vcat) case the rule I think should be "treat unordered categorical vectors as if they were not categorical (but just plain vectors)", what I mean is that it should matter (i.e. the result should be the same) if I merge ["a", "b", "c"] and categorical(["a", "b", "c"]) with some other categorical array that is ordered.

Well that's already the case for the example you give since levels are the same, but that's quite different when arrays have unused levels (like we do with setindex!). Are you suggesting to change that?

For asymmetric case I think the right rule is to keep the status of the destination array (i.e. the one we write to).

OK, so basically what we do now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting to change that?

No (I did not consider unused levels when writing this - good point; I concentrated on orderness).

I understand that we propagate unused levels now - right? I think it is OK, as then the levels in the new pool are a superset for levels in pools from source arrays.

@nalimilan nalimilan merged commit dcc24cf into master Apr 8, 2020
@nalimilan nalimilan deleted the nl/order branch April 8, 2020 13:34
@bkamins
Copy link
Member

bkamins commented Apr 8, 2020

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants