WIP: Better copyto #172

bkamins · 2018-11-26T10:27:46Z

PR following the discussion in #170.

What does it change:

append! uses resize! and copyto!, which makes it faster, but when it fails it does not clean up
copy! allows src to be non-categorical; dest is an array or a view; in this implementation levels are retained, and if dest is ordered copy! might fail
copyto!:
1. if dest is ordered, do not allow adding new levels
2. if dest is not ordered, only add the levels that are present in the copied range of src (not all levels in src); this ensures that a categorical src and a non-categorical src holding the same values give the same result

EDIT
The general idea of the design is that if x is categorical and y is a non-categorical then:

copyto!(x,a,y,b,n)

should be exactly the same as

copyto!(x,a,categorical(y),b,n)
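
The equivalence above can be illustrated with a toy model. This is a minimal sketch in plain Julia, not the CategoricalArrays implementation: `ToyCat`, `toycat`, `vals`, and `toycopyto!` are hypothetical names, levels are restricted to `Int`, and missing values are ignored.

```julia
# Toy categorical vector: refs[i] is an index into levels.
# Hypothetical type for illustration only; not the CategoricalArrays API.
struct ToyCat
    levels::Vector{Int}
    refs::Vector{UInt32}
    ordered::Bool
end

# Build a ToyCat whose levels are the sorted unique values of v.
function toycat(v::Vector{Int}; ordered::Bool=false)
    lv = sort(unique(v))
    return ToyCat(lv, UInt32[findfirst(isequal(x), lv) for x in v], ordered)
end

# Values held by a ToyCat or by a plain vector.
vals(x::ToyCat) = x.levels[x.refs]
vals(x::Vector{Int}) = x

# Copy n values from src (starting at sstart) into dest (starting at dstart),
# adding only the levels that occur in the copied range.
function toycopyto!(dest::ToyCat, dstart::Int, src, sstart::Int, n::Int)
    checkbounds(dest.refs, dstart:dstart+n-1)
    chunk = vals(src)[sstart:sstart+n-1]
    newlevels = [l for l in unique(chunk) if !(l in dest.levels)]
    if dest.ordered && !isempty(newlevels)
        error("cannot add levels $newlevels to an ordered destination")
    end
    append!(dest.levels, newlevels)
    for (i, v) in enumerate(chunk)
        dest.refs[dstart+i-1] = findfirst(isequal(v), dest.levels)
    end
    return dest
end

y = [7, 5, 9]
x1 = toycopyto!(toycat([1, 2, 3]), 1, y, 1, 2)          # plain source
x2 = toycopyto!(toycat([1, 2, 3]), 1, toycat(y), 1, 2)  # categorical source
```

In this sketch both calls leave the destination with the same values and the same levels, and level 9, which lies outside the copied range, is not added in either case.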

This reverts commit 5528a9a.

bkamins · 2018-11-26T11:27:40Z

@nalimilan this is WIP, so let me work on it a bit (I have to dig through all the tests as they require significant changes because if ordered is true we change a lot here).

The implementation I proposed for copyto! now seems to follow the rules we have discussed and is faster than old copyto! (at least on some limited benchmark tests I had time to perform).

bkamins · 2018-11-26T12:55:45Z

I am going through the tests (hopefully I will update all today and push relevant changes) and here are two additional comments:

copy! should indeed have a different implementation: it should copy levels, and it should be implemented entirely separately (I will add it), since Future.copy!([1,2,3],[5,6]) == [5,6] and we do not resize catarrays with copy! right now;
the copyto! behavior that I propose is inconsistent with vcat in terms of the ordering of levels (we could change the vcat implementation to use copyto! internally to make it consistent); I think this should be discussed again. Also, the usage of copyto! in DataFrames.jl assumes that level ordering is retained, so if we adopt what copyto! now proposes, the pattern to populate a bigger vector using copyto! should be:
1. create an empty categorical vector
2. populate its levels as needed
3. populate the values after levels are populated

nalimilan · 2018-11-26T13:31:26Z

Thanks! That inconsistency between copyto! and vcat is a bit annoying. I'd say the essential property that should hold for copyto! is that it's equivalent to calling setindex! repeatedly. Currently, setindex!(::CategoricalArray, ::Any, ::CatValue) doesn't respect the ordering of levels (i.e. it adds a new level at the end if needed), which means we need copyto! to do the same.

But maybe we could change that? It would make sense for setindex! to insert the level in the position in which it would have in the result of mergelevels, i.e. in the levels of vcat. In general, the principle I've put a lot of effort into implementing is that if the two vectors have compatible orderings of levels, we try very hard to merge the levels into a common order (i.e. one in which the relative order of levels is preserved). I think that's right since the order of appearance is really an arbitrary criterion which shouldn't play a role for categorical arrays.
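
To make the idea of merging levels into a common order concrete, here is a minimal sketch in plain Julia. It is not the package's mergelevels and may return different results from it; `merge_levels_sketch` is a hypothetical name. Each level of `b` that is missing from `a` is inserted right after the most recently seen shared level, and when the two orderings conflict the result is just one of several possible choices.

```julia
# Merge the levels of b into a copy of a, keeping the relative order of both
# inputs whenever they are compatible. Hypothetical helper, not mergelevels.
function merge_levels_sketch(a::Vector{T}, b::Vector{T}) where {T}
    res = copy(a)
    pos = 0                      # new levels are inserted right after res[pos]
    for l in b
        i = findfirst(isequal(l), res)
        if i === nothing
            insert!(res, pos + 1, l)
            pos += 1
        else
            pos = max(pos, i)    # never move the insertion point backwards
        end
    end
    return res
end

merge_levels_sketch([1, 2, 3], [1, 4, 2])   # [1, 4, 2, 3]
merge_levels_sketch([1, 2], [3])            # [3, 1, 2]
merge_levels_sketch([3], [1, 2])            # [1, 2, 3]
```

The last two calls show that, at least for this simple strategy, the merged order depends on the order of the arguments.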

nalimilan · 2018-11-26T13:04:55Z

src/array.jl

+                end
+            else
+                if seen[s]
+                    drefs[dstart+i] = remap[s]


Couldn't you just use remap[s] == -1 as a sentinel?

This is what I did first, but allocating seen should be cheap, and the problem is that remap, in order to be fast, should have the same eltype as drefs, and they are unsigned.

At least this is what I thought. Now I see that we can use remap[s] == 0 as a sentinel, as we use 0 for missing, which is handled in another way anyway. I will change this.
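
A minimal sketch of the zero-sentinel idea, in plain Julia. The function name `remap_refs!` and its signature are hypothetical and the PR's actual loop differs; the only assumptions taken from the discussion are that refs are unsigned and that ref 0 is reserved for missing.

```julia
# Translate source refs into destination refs, filling remap lazily.
# remap[s] == 0 means "source level s has not been mapped yet". Ref 0 is
# reserved for missing and is handled before the lookup, so no separate
# Bool mask (seen) is needed.
function remap_refs!(drefs::Vector{UInt32}, srefs::Vector{UInt32},
                     slevels::Vector{Int}, dlevels::Vector{Int})
    remap = zeros(UInt32, length(slevels))   # same eltype as drefs
    for (i, s) in enumerate(srefs)
        if s == 0                  # missing: copy the zero ref as-is
            drefs[i] = 0
            continue
        end
        if remap[s] == 0           # first occurrence of this source level
            j = findfirst(isequal(slevels[s]), dlevels)
            if j === nothing       # level unknown to dest: add it at the end
                push!(dlevels, slevels[s])
                j = length(dlevels)
            end
            remap[s] = j
        end
        drefs[i] = remap[s]
    end
    return drefs
end

dlevels = [20, 30]
drefs = remap_refs!(zeros(UInt32, 4), UInt32[3, 0, 1, 3], [10, 20, 30], dlevels)
```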

nalimilan · 2018-11-26T13:11:03Z

src/array.jl

+                end
+            end
+        end
+    else


Are you sure we need to duplicate all of this code for performance? Branch prediction is usually quite efficient, so even if you have an if inside the loop I suspect it won't make a big difference. And often LLVM will do the hoisting automatically.

I would have to do the isordered check in the innermost part of the loop. I can check whether the compiler can optimize this out.

Since there's already a branch I don't think it will matter, even if the compiler doesn't hoist it.

But we have to do the isordered test in every step of the loop. If it is not optimized out, it will be much slower. I will check.

nalimilan · 2018-11-26T13:13:25Z

src/array.jl

 @static if VERSION >= v"0.7.0-DEV.3208"
    using Future
+    Future.copy!(dest::CatArrOrSub, src::AbstractArray) =
+        copyto!(dest, 1, src, 1, length(src))
    Future.copy!(dest::CatArrOrSub, src::CatArrOrSub) =


This method isn't needed now AFAICT.

What do you want to happen if we want to copy! a normal array to a categorical array? Throw an error?

nalimilan · 2018-11-26T13:15:40Z

src/array.jl

-    for i = 1:len2
-        A[len + i] = B[i]
-    end
+    # As in Base, A will be left modified if it is not possible to copy B to A


I wonder why Base doesn't call resize! in case of failure. That sounds trivial, undoing the operation is not as hard as in copyto!. But better stick to what Base does, and maybe file an issue in Julia.

Maybe the reason is that you would have to do try-catch which will slow down the function? I will open an issue with Base.
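
For reference, undoing the resize on failure could look like the following minimal sketch in plain Julia. The name `append_with_rollback!` is hypothetical; this is neither what Base does nor what this PR implements, and it says nothing about the cost of the try-catch.

```julia
# Grow A, copy B into the new tail, and shrink A back if the copy fails.
function append_with_rollback!(A::Vector, B::AbstractVector)
    len = length(A)
    resize!(A, len + length(B))
    try
        copyto!(A, len + 1, B, 1, length(B))
    catch
        resize!(A, len)   # undo the growth so A is left as it was
        rethrow()
    end
    return A
end

A = [1, 2]
append_with_rollback!(A, [3, 4])        # A is now [1, 2, 3, 4]
failed = false
try
    append_with_rollback!(A, Any[5, "x"])   # "x" cannot be converted to Int
catch
    failed = true
end
```

After the failed call, `A` still has the four elements it had before.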

nalimilan · 2018-11-26T13:18:18Z

test/13_arraycommon.jl

+        x = categorical([1:255;1:255], true)
+        y = categorical([1, 1000], true)
+        @test copyto!(x, 1, y, 1, 1)[1] == y[1]
+        @test_throws ErrorException copyto!(x, 1, y, 1, 1)[2]


Better check the length? Actually, better check the whole contents of the result (same for the line above).

nalimilan · 2018-11-26T13:19:15Z

test/13_arraycommon.jl

+        @test_throws ErrorException copyto!(x, 1, y, 1, 1)[2]
+
+        x = categorical([1,2,3])
+        y = categorical([5,6,7])


Better use unsorted values to really test that the order is preserved.

nalimilan · 2018-11-26T13:19:44Z

test/13_arraycommon.jl

+        copyto!(x,y)
+        @test levels(x) == [1,2,3,5,6,7]
+
+        @test copy!(x, [1,1,1]) == [1,1,1]


Why not also check levels? Also with different unsorted values? Same below.

bkamins · 2018-11-26T14:03:36Z

I've put a lot of effort into implementing is that if the two vectors have compatible orderings of levels, we try very hard to merge the levels into a common order

This is what I see in the code. And that is why I am asking, because the issue is complex (in general - working on those things taught me that things that I initially see as simple are not).

So we have two options:

1. dynamically add levels as needed (what I currently implemented in copyto!)
2. always merge levels

I can implement both of them with equal speed in copyto! (of course option 2 with setindex! will be slower)

Option 2 is very tempting conceptually. The problem with it is that what we merge might have conflicting orderings (but I guess we should treat it as a corner case and just go with your implementation)?

nalimilan · 2018-11-26T14:46:56Z

If you have the time and motivation, it would be interesting to investigate whether it would be possible to have setindex! preserve ordering using mergelevels. When orderings conflict, the fallback is to add levels at the end.

bkamins · 2018-11-26T14:57:27Z

This is what I intended. I really like to have things clean.
However, given the rising complexity of this PR, what I will do is handle append! separately, as this is a simple change that will stay anyway.

bkamins · 2018-11-26T23:04:29Z

@nalimilan - sorry to bother you again. I have started working on the order-preserving setindex!, which led me to analyze mergelevels, because I have to reimplement it as a level-updating version (one that does not create a completely new set of levels).

Can you please explain to me the logic behind the following output?

julia> CategoricalArrays.mergelevels(false, [1,2,3], [4,2,5,1])
([5, 4, 1, 2, 3], false)

I do not understand why you put 4 after 5 in the result.

nalimilan · 2018-11-27T08:52:04Z

No idea. I admit I would have expected [1, 2, 3, 4, 5] (due to the order of levels, not to integer sorting). I guess this has to do with the particular solution I found to insert levels while preserving order when possible: AFAIK it works when it's possible to preserve order, but when that's not possible it might not return an optimal choice. Feel free to suggest a better solution if you can find one; I must say I don't remember all the details now.

bkamins · 2018-11-27T09:11:23Z

I will design such an algorithm, but this is a longer process to make it efficient and consistent for: setindex!, vcat and copyto!. But this will be mildly breaking.

Therefore, because of the complexity we face, I will close this PR and close #170 and open issues/PRs that are split into smaller chunks of changed functionality.

bkamins · 2018-12-02T19:02:07Z

I have been thinking about this issue for some time. The first conclusion is that we need a list of the methods we should handle. This is what I have collected:

setindex!
copyto!
copy!
vcat (we need to implement a proper method here for CatValue and CatString as they are missing)
append!
push!
hcat (we still have to implement a custom method here as it is missing)
hvcat (we still have to implement a custom method here as it is missing)
pushfirst! (we still have to implement a custom method here as it is missing)
insert! (we still have to implement a custom method here as it is missing)

The key issue is that in general it is impossible to implement mergelevels in a consistent way. What I mean by this is that the result must depend on the order of arguments and this is unavoidable.

The general logic for all methods should be:

check if the operation allows adding new levels
if it does then add levels
if it does not then throw an error if the level would have to be added
perform an operation on a modified target (we know that it allows adding levels)

There are then multiple corner cases but the main is:

when these methods add levels to a vector, should they add all levels at once or only the levels present in what is added?

Currently with mergelevels we add all levels, not only those present. I think it is important to have one guiding principle here that is commonly respected by all the functions, as otherwise users will have a hard time learning the API.

The difference between the approaches manifests itself e.g. when doing setindex!, append!, or copyto! to an ordered catvector. First of all, they should all be consistent. Now, if we have an "add all levels" rule, then those methods should fail if what we add has some levels that are not supported by the catvector we write to, even if the value itself would be accepted. If we have an "add needed levels" rule, then such an operation will work as long as what we want to add is a level that is accepted.
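
The difference between the two rules can be sketched in plain Julia as follows. This only models the rules as described in this comment; `required_levels` and `write_allowed` are hypothetical helpers, and the destination is assumed to be ordered, so it cannot gain levels.

```julia
# Levels the destination must already contain under each rule.
#   rule = :all    -> every level of the source, used or not
#   rule = :needed -> only levels that actually occur in the source values
required_levels(src_levels, src_values, rule::Symbol) =
    rule === :all ? src_levels : filter(l -> l in src_values, src_levels)

# For an ordered destination the write is allowed only if every required
# level is already present.
write_allowed(dest_levels, src_levels, src_values; rule::Symbol) =
    all(l -> l in dest_levels, required_levels(src_levels, src_values, rule))

dest_levels = [1, 2, 3]
src_levels  = [1, 2, 1000]   # 1000 is a level of the source but is unused
src_values  = [1, 2, 1]

write_allowed(dest_levels, src_levels, src_values; rule=:all)      # false
write_allowed(dest_levels, src_levels, src_values; rule=:needed)   # true
```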

The thing is that if we do not have common guiding principles then e.g. vcat(x, y) and append!(x, y) for some data might both work but produce a different result and I think that vcat(x,y) and append!(x,y) should pass equality of results and should pass equality of levels.

To understand what I mean have a look at this example (this is one of many possible examples):

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> y = categorical([4], ordered=true)
1-element CategoricalArray{Int64,1,UInt32}:
 4

julia> levels!(y, [1,4,2])
1-element CategoricalArray{Int64,1,UInt32}:
 4

julia> levels(hcat(x[1], y))
2-element Array{CategoricalValue{Int64,UInt32},1}:
 1
 4

julia> levels(hcat(x[1:1], y))
2-element Array{Int64,1}:
 1
 4

julia> levels(vcat(x[1], y))
2-element Array{CategoricalValue{Int64,UInt32},1}:
 1
 4

julia> levels(vcat(x[1:1], y))
4-element Array{Int64,1}:
 1
 4
 2
 3

julia> levels(append!(x,y))
4-element Array{Int64,1}:
 1
 2
 3
 4

So in short - I think we first need to decide on a principle we want to follow and then reimplement all the methods from the ground up to follow it. Unfortunately it seems to be a large task, but - if we want to provide a solid base for the future - I would try to do it.

nalimilan · 2018-12-15T14:21:04Z

Unfortunately I don't think we can fix cat at this point, since we would introduce ambiguities with other array types. See JuliaLang/julia#20815. So all the situations where you end up with an Array{CategoricalValue} should not exist -- we can't handle this kind of array very well anyway and they don't make a lot of sense.

Regarding whether we should (try to) add all levels or only those actually used, I think we should do the former, because it's not always possible to efficiently check which levels are used (you basically need to make an additional pass over the data). That's one of the reasons why we keep unused levels in the first place (the other being that levels are supposed to have a meaning by themselves, independent from the data).

bkamins · 2018-12-15T14:26:36Z

Then I would leave mergelevels you have implemented and just say that:

when there is a conflict between level orderings in merged levels the order is undefined;
that the resulting order depends on the order of arguments (even without a conflict - as this is the case).

I will go back to this when I am more or less finished with DataFrames.jl stuff (setindex! and broadcasting) as this is probably less pressing.

bkamins · 2018-12-19T20:07:18Z

TODO (from Slack):
also consider this case:

using DataFrames
v=DataFrame(:a=>categorical([nothing,"a"]))
append!(v,v)

which currently throws an error

nalimilan · 2019-09-25T07:38:25Z

@bkamins Do you know what is still relevant in this PR after we merge #211 and given what we have decided at #99?

bkamins · 2019-09-25T08:01:18Z

I will test it after #211 is merged and tagged.

bkamins · 2019-09-26T09:58:07Z

There is still a problem that:

using DataFrames
v=DataFrame(:a=>categorical([nothing,"a"]))
append!(v,v)

throws an error.

Also before we close this please confirm that this is intended:

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> push!(x, 10)
ERROR: cannot add new level 10 since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

julia> append!(x, [10])
ERROR: MethodError: no method matching append!

julia> append!(x, categorical([10]))
4-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3
 10

nalimilan · 2019-09-26T10:11:05Z

There is still a problem that:
using DataFrames
v=DataFrame(:a=>categorical([nothing,"a"]))
append!(v,v)
throws an error.

I don't think CategoricalArrays handles Union eltypes very well. Probably worth filing a separate issue.

Also before we close this please confirm that this is intended:

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> push!(x, 10)
ERROR: cannot add new level 10 since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

julia> append!(x, [10])
ERROR: MethodError: no method matching append!

julia> append!(x, categorical([10]))
4-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3
 10

The push! error is intended (we can discuss that, though). The append! error is probably an oversight worth filing another issue (I wish Base would define a fallback based on resize! and setindex!, but currently it doesn't).

Nothing to change in copyto! itself?

bkamins · 2019-09-26T11:59:15Z

If this:

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> y = categorical([1,2,1000], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 1000

julia> y[3] = 1
1

julia> y
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 1

julia> copyto!(x, y)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 1

julia> levels(x)
4-element Array{Int64,1}:
    1
    2
    3
 1000

is intended then I think we are OK and this can be closed.

Can I leave opening of the other specific issues you have indicated to you as I feel you understand better what is the core problem with them?

bkamins · 2019-09-26T12:00:46Z

Just for reference, continuing the last example, we have:

julia> x = categorical([1,2,3], ordered=true)
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> z = [x; y]
6-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3
 1
 2
 1

julia> levels(z)
4-element Array{Int64,1}:
    1
    2
    3
 1000

so we are consistent.

nalimilan · 2019-09-26T19:54:56Z

is intended then I think we are OK and this can be closed.

Yes that's intended. At least that's what we agreed upon AFAICT. Good to know we're consistent now!

Can I leave opening of the other specific issues you have indicated to you as I feel you understand better what is the core problem with them?

Actually I've realized it's covered by #170.

bkamins · 2019-09-26T20:42:17Z

OK - so this can be closed then. Right?

nalimilan · 2019-09-27T08:10:18Z

I guess.

I've filed #213 to fix the append! problem. There are probably many other issues like that, though.

bkamins added 3 commits November 22, 2018 18:33

copyto! and append! reference implementation

5528a9a

Revert "copyto! and append! reference implementation"

5b5f565

This reverts commit 5528a9a.

improve copyto!, copy!, and append!

e6a51fa

bkamins force-pushed the better_copyto branch from 4cb7fef to ff5d841 Compare November 26, 2018 10:38

cleanup code

ff08d35

bkamins force-pushed the better_copyto branch from ff5d841 to ff08d35 Compare November 26, 2018 10:42

fix reference

c0f3f61

bkamins changed the title ~~Better copyto~~ WIP: Better copyto Nov 26, 2018

improve performance for non-missings

b4e8089

bkamins force-pushed the better_copyto branch from 720b24d to b4e8089 Compare November 26, 2018 11:18

fix typo and remove ambiguity

35d103f

nalimilan reviewed Nov 26, 2018

nalimilan and others added 2 commits November 26, 2018 14:43

Update src/array.jl

5c5e571

Co-Authored-By: bkamins <bkamins@sgh.waw.pl>

Update src/array.jl

89ed2e4

Co-Authored-By: bkamins <bkamins@sgh.waw.pl>

nalimilan mentioned this pull request May 17, 2019

Implement generic append! method for AbstractArray source #192

Draft

nalimilan mentioned this pull request Jun 12, 2019

Broadcasting methods over CategoricalString has changed #199

Closed

bkamins mentioned this pull request Jun 19, 2019

Efficiently copy levels ordering from CategoricalValue in setindex!() #99

Open

nalimilan mentioned this pull request Sep 27, 2019

Fix convert with Union{T, Nothing} #213

Merged

nalimilan closed this Sep 27, 2019
