Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Adjust to eltype changes in DataArrays where the function now returns Union{T,NAtype} #1209

Closed
wants to merge 1 commit into from

Conversation

andreasnoack
Copy link
Member

These are the adjustments required for DataFrames's tests to pass after the changes in JuliaStats/DataArrays.jl#280

@@ -62,7 +62,7 @@ type ModelFrame
contrasts::Dict{Symbol, ContrastsMatrix}
end

@compat const AbstractFloatMatrix{T<:AbstractFloat} = AbstractMatrix{T}
@compat const AbstractFloatMatrix{T<:Union{AbstractFloat,NAtype}} = AbstractMatrix{T}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the alias is not completely correct with this change. Maybe you could just remove that variable and use ModelMatrix{S<:Union{AbstractFloat,NAtype}, T<:AbstractMatrix{S}} below now that it's supported?

@quinnj
Copy link
Member

quinnj commented Sep 7, 2017

No longer relevant?

@nalimilan
Copy link
Member

Only relevant if somebody wants to make another (last?) release based on DataArrays.

@ararslan
Copy link
Member

ararslan commented Sep 7, 2017

Probably not worth it at this point. The DataArrays tag that introduces the eltype change still hasn't been merged, as it's expected to be breaking and folks wanted to know the extent of the breakage first. If we're moving away from DataArrays entirely, our last DataArrays-based release should be with DataArrays as it is now, perhaps upper bounded on 0.7.0. That would also make this PR obsolete.

@ararslan
Copy link
Member

ararslan commented Sep 7, 2017

I'm not sure whether the changes introduced in DataArrays 0.7.0 will be user-facing in DataFrames after this PR, but if so, it's likely not worth the churn to then change it again as we adopt Nulls.

@nalimilan
Copy link
Member

We could still imagine replacing NA with null in DataArrays, and then tag a release of both the new DataArrays and DataFrames based on it, at the same time as a release based on master. That way people who need the performance of DataArray on Julia 0.6 could have it, while starting to port some of their code to Nulls.

Of course that's also a lot of work for an unclear gain, so I guess that depends on somebody being directly interested in it (maybe @andreasnoack is).

@ararslan
Copy link
Member

ararslan commented Sep 7, 2017

I believe @iamed2 was particularly interested in Union element types in DataArrays (i.e. Andreas' "stop lying" PR), so perhaps it would be good to hear from Eric as well from a user's perspective.

@iamed2
Copy link

iamed2 commented Sep 7, 2017

This isn't important if the ecosystem can stabilize around Nulls soon. And we only care about DataArrays insofar as they're the array type for DataFrames. If they stop being that, they're irrelevant to us.

@ararslan
Copy link
Member

ararslan commented Sep 7, 2017

Great, thanks for the input. I'm not sure it's worth churning DataArrays if we aren't going to use them here anymore. But the upside of no longer using them here is that we can update DataArrays independently of DataFrames, so we can move to Nulls there later if we want. (Assuming DataArrays will still be relevant once things settle on Nulls, which I imagine it won't be.)

@andreasnoack
Copy link
Member Author

Fine with me but has anybody made any benchmarks comparing operations on Array{Union{T,Null},N} to DataArray{T,N}? Just curious.

@andreasnoack
Copy link
Member Author

andreasnoack commented Sep 12, 2017

Just two small examples for now. (I'm testing Union{Null, T} on 0.7 and DataArrays on 0.6 since DataStructures appears to be broken on master)

sum

# Union{Float64,Null}
julia> x = Union{Float64,Null}[rand() for i in 1:10^7];

julia> @time sum(x);
  0.065898 seconds (65.54 k allocations: 1.000 MiB)

# DataArrays
julia> y = @data rand(10^7);

julia> @time sum(x);
  0.007327 seconds (9 allocations: 320 bytes)

broadcast

# Union{Float64,Null}
julia> x = Union{Float64,Null}[rand() for i in 1:10^7];

julia> @time broadcast(+, x, 1);
  0.447139 seconds (20.00 M allocations: 381.479 MiB, 35.30% gc time)

julia> x[end] = Null()
null

julia> @time broadcast(+, x, 1);
  1.849494 seconds (40.00 M allocations: 762.941 MiB, 27.76% gc time)

# DataArrays
julia> y = @data rand(10^7);

julia> @time broadcast(+, y, 1);
  0.057529 seconds (61 allocations: 78.681 MiB, 20.38% gc time)

julia> y[end] = NA
NA

julia> @time broadcast(+, y, 1);
  0.054045 seconds (61 allocations: 78.681 MiB, 19.61% gc time)

@ararslan
Copy link
Member

Brutal

@quinnj
Copy link
Member

quinnj commented Sep 12, 2017

Having a brief glance, I imagine most of the cost is JuliaLang/julia#23338, i.e. nothing getting inlined anywhere so the dynamic dispatch is killing performance.

@rofinn
Copy link
Member

rofinn commented Sep 12, 2017

@quinnj It seems to be allocating a lot of memory too though.

julia> using BenchmarkTools

julia> using Nulls

julia> using DataArrays

julia> x = Union{Float64,Null}[rand() for i in 1:10^7];

julia> y = @data rand(10^7);

julia> @benchmark sum($x)
BenchmarkTools.Trial:
  memory estimate:  763.94 MiB
  allocs estimate:  50065533
  --------------
  minimum time:     494.707 ms (5.28% GC)
  median time:      505.274 ms (5.27% GC)
  mean time:        506.101 ms (5.28% GC)
  maximum time:     526.756 ms (5.40% GC)
  --------------
  samples:          10
  evals/sample:     1

julia> @benchmark sum($y)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  2
  --------------
  minimum time:     5.421 ms (0.00% GC)
  median time:      5.887 ms (0.00% GC)
  mean time:        6.083 ms (0.00% GC)
  maximum time:     13.873 ms (0.00% GC)
  --------------
  samples:          819
  evals/sample:     1

Couldn't we just repurpose DataArrays to operate on Union{T, Null} instead?

@quinnj
Copy link
Member

quinnj commented Sep 12, 2017

Oh yes, it definitely will on 0.6; isbits Union arrays having their elements stored inline is 0.7 only.

@nalimilan
Copy link
Member

Union{Null, T} performance is 10x better on Julia 0.7 compared with 0.6 for sum, and 2x better for broadcast without nulls (no clear difference when there's a null value). So the difference with DataArray is already less of an issue there, and could probably be reduced even more. But it would still be problematic to release the new Nulls-based DataFrames on Julia 0.6 given the poor performance.

@andreasnoack
Copy link
Member Author

Just to avoid confusion, my timings of Union{Null, T} were on 0.7 (from yesterday). Only the DataArrays timings were made on 0.6.

@nalimilan
Copy link
Member

Ah, OK. So that would be even worse on 0.6. :-/

@nalimilan
Copy link
Member

A possible intermediate solution would be to port DataArrays to Nulls (really, this just implies deprecating NA in favor of Null; PooledDataArray could also be deprecated in favor of CategoricalArray), and check that they are supported in DataFrames master, at least for most operations. Since DataFrames is now column type agnostic, it should already work, except that when a new column is created (concatenation, merge) an Array{Union{Null, T}} is created. We could add some temporary promotion rules so that DataArray wins. Then people could keep using DataArrays if they care about performance.

@nalimilan
Copy link
Member

See JuliaStats/DataArrays.jl#288 for a DataArrays port to Nulls.

@nalimilan
Copy link
Member

For reference, things have of course improved a lot in 1.0, but we've still regressed a lot. We should probably file issues to track progress for these specific cases.

On Julia 0.6:

julia> x = DataArray(Union{Float64,Missing}[rand() for i in 1:10^7]);

julia> @btime sum(x);
  5.518 ms (2 allocations: 128 bytes)

julia> @btime broadcast(+, x, 1);
  53.678 ms (70 allocations: 78.68 MiB)

julia> x[end] = missing
missing

julia> @btime broadcast(+, x, 1);
  54.128 ms (70 allocations: 78.68 MiB)

julia> x[1] = missing
missing

julia> @btime broadcast(+, x, 1);
  52.828 ms (70 allocations: 78.68 MiB)

On Julia 1.0:

julia> x = Union{Float64,Missing}[rand() for i in 1:10^7];

julia> @btime sum(x);
  31.034 ms (1 allocation: 16 bytes)

julia> @btime broadcast(+, x, 1);
  127.155 ms (8 allocations: 76.29 MiB)

julia> x[end] = missing
missing

julia> @btime broadcast(+, x, 1);
  485.791 ms (19999498 allocations: 467.29 MiB)

julia> x[1] = missing
missing

julia> @btime broadcast(+, x, 1);
  734.313 ms (9 allocations: 85.83 MiB)

Closing anyway since the solution won't come from this PR.

@nalimilan nalimilan closed this Sep 22, 2018
@nalimilan nalimilan deleted the anj/eltype branch September 22, 2018 10:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants