Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vcat changes type of PooledArray columns #3061

Open
MrHenning opened this issue Jun 1, 2022 · 9 comments
Open

vcat changes type of PooledArray columns #3061

MrHenning opened this issue Jun 1, 2022 · 9 comments
Labels
Milestone

Comments

@MrHenning
Copy link

First, thank you all for this absolutely beautiful Julia Package 🙂

I am not sure if this is intended, but using vcat on PooledArray dataframe columns changes their type:

typeof(
	vcat(
		DataFrame(col=PooledArray(rand(1:2, 10))), 
		DataFrame(col=PooledArray(rand(1:2, 10)))
	).col
)
>>> Vector{Int64} (alias for Array{Int64, 1})

whereas it works with CategoricalArrays:

typeof(
	vcat(
		DataFrame(col=categorical(rand(1:2, 10))), 
		DataFrame(col=categorical(rand(1:2, 10)))
	).col
)
>>> CategoricalVector{Int64, UInt32, Int64, CategoricalValue{Int64, UInt32}, Union{}} (alias for CategoricalArray{Int64, 1, UInt32, Int64, CategoricalValue{Int64, UInt32}, Union{}})

Using vcat on Vectors keeps the PooledArray type:

typeof(
	vcat(
		PooledArray(rand(1:2, 10)), 
		PooledArray(rand(1:2, 10))
	)
)
>>> PooledVector{Int64, UInt8, Vector{UInt8}} (alias for PooledArray{Int64, UInt8, 1, Array{UInt8, 1}})
@bkamins bkamins added this to the 1.4 milestone Jun 1, 2022
@bkamins
Copy link
Member

bkamins commented Jun 1, 2022

This is kind-of intended, but maybe we can change it. Currently we generate the target column type through Tables.allocatecolumn, but we could instead use the default vcat mechanism here. This would be mildly breaking.

Now the point is that for CategoricalVector we have to keep the type as otherwise the column would loose its "categorical" characteristics. For PooledVector it is different as Vector works just as well as PooledVector. The only question is about pool size (so if it is worth to keep the result pooled or not - but maybe it is).

@nalimilan - what do you think?

@nalimilan
Copy link
Member

Yeah it sounds better to use vcat if possible. But AFAICT it's not completely trivial to do as we also support various options for when a column is missing from one of the inputs. This is another case where something like JuliaLang/julia#20815 could be useful.

@bkamins
Copy link
Member

bkamins commented Jun 1, 2022

@nalimilan to make this change we would have to materialize missing columns before vcatting them.

As with current code switching to default vcat would result in:

julia> vcat([1,2,3], Iterators.repeated(missing, 5))
4-element Vector{Any}:
 1
 2
 3
  Base.Iterators.Take{Base.Iterators.Repeated{Missing}}(Base.Iterators.Repeated{Missing}(missing), 5)

The reason is that vcat treats iterators (like Iterators.Repeated) as scalars not as iterables.
(we would need to instead materialize Iterators.Repeated into something that would match type of other columns that are already present - as you noted - this is tricky)

@nalimilan
Copy link
Member

Yeah that's not great. But maybe that small performance loss (which does not affect the externally visible behavior) is better than losing the PooledArray type of columns?

We could also use a FillArray but that would add a dependency; or implement an equivalent internally, which adds more code.

@bkamins
Copy link
Member

bkamins commented Jun 1, 2022

Yes - I agree. I just noted the required change. The major question is if keeping PooledArray is beneficial. The point is that most likely dropping back to PooledArray is slower:

julia> x = PooledArray(rand(1:100, 10^6));

julia> y = PooledArray(rand(1:100, 10^6));

julia> @time vcat(x, y);
  0.106547 seconds (39 allocations: 2.876 MiB)

julia> using DataFrames

julia> dx = DataFrame(a=x);

julia> dy = DataFrame(a=y);

julia> @time z = vcat(dx, dy);
  0.003815 seconds (72 allocations: 15.263 MiB)

julia> @time PooledArray(z.a);
  0.016989 seconds (21 allocations: 7.636 MiB)

so maybe actually it is better to return Vector and if the user wants PooledArray then it can be re-pooled? (as - as you can see - it is cheaper that vcat in PooledArrays.jl right now)

@nalimilan
Copy link
Member

Looks like we should improve vcat for PooledArrays? In theory it could be much faster than the fallback, in particular when pools are equal (or one is a subset of the other).

@bkamins
Copy link
Member

bkamins commented Jun 1, 2022

Looks like we should improve vcat for PooledArrays?

Yes (a rabbit hole as usual)

@bkamins bkamins modified the milestones: 1.4, 1.5 Jun 7, 2022
@bkamins
Copy link
Member

bkamins commented Jun 7, 2022

I am moving it to for a decision in 1.5 release since the decision is not one-sided.

@MrHenning - do you see significant benefits of keeping PooledArray under vcat?

@MrHenning
Copy link
Author

@MrHenning - do you see significant benefits of keeping PooledArray under vcat?

I often have a many rows in my DataFrames, so PooledArrays are my obvious go-to.
(I think I chose PooledArrays over CategoricalArrays because they play nicer with Arrow.jl, but I might be mistaken...)
That being said, right now I just re-convert the columns to PooledArrays after the vcat, which only adds a few lines of code.

So the benefit would be to have vcat(::DataFrame, ::DataFrame) behave like vcat(::Vector, ::Vector) (in that it keeps the Type), and reducing the lines of code you'd need to write.
On the other hand, it does seem to involve quite a bit of work to achieve this, so reverting to CategoricalArrays on the user side is probably the easier solution 🙂

In any case, thanks for your efforts!!! I'll still be happy if things stay as they are right now 🙂

@bkamins bkamins modified the milestones: 1.5, 1.6 Feb 5, 2023
@bkamins bkamins modified the milestones: 1.6, 1.7 Jul 10, 2023
@bkamins bkamins modified the milestones: 1.7, 1.x Sep 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants