RFC: Adjust to eltype changes in DataArrays where the function now returns Union{T,NAtype} #1209
src/statsmodels/formula.jl
@@ -62,7 +62,7 @@ type ModelFrame
     contrasts::Dict{Symbol, ContrastsMatrix}
 end

-@compat const AbstractFloatMatrix{T<:AbstractFloat} = AbstractMatrix{T}
+@compat const AbstractFloatMatrix{T<:Union{AbstractFloat,NAtype}} = AbstractMatrix{T}
The name of the alias is not completely correct with this change. Maybe you could just remove that variable and use ModelMatrix{S<:Union{AbstractFloat,NAtype}, T<:AbstractMatrix{S}} below now that it's supported?
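For readers following the suggestion, a minimal sketch of the two-parameter form is given here. It is an illustration, not the DataFrames definition: `Missing` stands in for `NAtype` so the code runs without DataArrays, and the name `ModelMatrixSketch` and its fields `m` and `assign` are assumptions.

```julia
# Assumption: `Missing` stands in for DataArrays' `NAtype` so this runs on current Julia.
# `ModelMatrixSketch` and its field names are hypothetical, not the DataFrames definition.
struct ModelMatrixSketch{S<:Union{AbstractFloat,Missing}, T<:AbstractMatrix{S}}
    m::T
    assign::Vector{Int}
end

# Type parameters are spelled out explicitly to keep the sketch unambiguous.
dense = ModelMatrixSketch{Float64,Matrix{Float64}}(rand(3, 2), [1, 2])

FM = Union{Float64,Missing}
with_missing = ModelMatrixSketch{FM,Matrix{FM}}(FM[1.0 missing; 2.0 3.0], [1, 2])

# An integer element type violates the bound on S, so the type itself is rejected.
int_rejected = try
    ModelMatrixSketch{Int,Matrix{Int}}
    false
catch err
    err isa TypeError
end
```

Constraining the element type `S` directly makes the separate `AbstractFloatMatrix` alias unnecessary, which is the point of the review comment.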
Union{T,NAtype} aka DataArrays.Data{T}
dd9a716 to 0cfce99
No longer relevant?
Only relevant if somebody wants to make another (last?) release based on DataArrays.
Probably not worth it at this point. The DataArrays tag that introduces the
I'm not sure whether the changes introduced in DataArrays 0.7.0 will be user-facing in DataFrames after this PR, but if so, it's likely not worth the churn to then change it again as we adopt Nulls.
We could still imagine replacing Of course that's also a lot of work for an unclear gain, so I guess that depends on somebody being directly interested in it (maybe @andreasnoack is).
I believe @iamed2 was particularly interested in
This isn't important if the ecosystem can stabilize around Nulls soon. And we only care about DataArrays insofar as they're the array type for DataFrames. If they stop being that, they're irrelevant to us.
Great, thanks for the input. I'm not sure it's worth churning DataArrays if we aren't going to use them here anymore. But the upside of no longer using them here is that we can update DataArrays independently of DataFrames, so we can move to Nulls there later if we want. (Assuming DataArrays will still be relevant once things settle on Nulls, which I imagine it won't be.)
Fine with me but has anybody made any benchmarks comparing operations on
Just two small examples for now. (I'm testing
Brutal
Having a brief glance, I imagine most of the cost is JuliaLang/julia#23338, i.e. nothing getting inlined anywhere so the dynamic dispatch is killing performance.
@quinnj It seems to be allocating a lot of memory too though.
julia> using BenchmarkTools
julia> using Nulls
julia> using DataArrays
julia> x = Union{Float64,Null}[rand() for i in 1:10^7];
julia> y = @data rand(10^7);
julia> @benchmark sum($x)
BenchmarkTools.Trial:
memory estimate: 763.94 MiB
allocs estimate: 50065533
--------------
minimum time: 494.707 ms (5.28% GC)
median time: 505.274 ms (5.27% GC)
mean time: 506.101 ms (5.28% GC)
maximum time: 526.756 ms (5.40% GC)
--------------
samples: 10
evals/sample: 1
julia> @benchmark sum($y)
BenchmarkTools.Trial:
memory estimate: 112 bytes
allocs estimate: 2
--------------
minimum time: 5.421 ms (0.00% GC)
median time: 5.887 ms (0.00% GC)
mean time: 6.083 ms (0.00% GC)
maximum time: 13.873 ms (0.00% GC)
--------------
samples: 819
evals/sample: 1
Couldn't we just repurpose DataArrays to operate on
Oh yes, it definitely will on 0.6; isbits Union arrays having their elements stored inline is 0.7 only.
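As a rough illustration of the inline-storage point, the sketch below checks on a current Julia release whether a small union element type qualifies. It assumes `Missing` in place of `Null`, and it relies on `Base.isbitsunion`, an internal, unexported helper that may change between releases.

```julia
# Assumption: `Base.isbitsunion` is an internal, unexported helper and may change between releases.
U = Union{Float64,Missing}

inline_small_union = Base.isbitsunion(U)                      # both members are isbits types
inline_with_string = Base.isbitsunion(Union{String,Missing})  # String is not an isbits type

x = U[rand() for _ in 1:10]
x[end] = missing

println("stored inline (Float64/Missing): ", inline_small_union)
println("stored inline (String/Missing):  ", inline_with_string)
println("sum skipping missing values:     ", sum(skipmissing(x)))
```

Arrays whose element type does not qualify hold references to individually boxed values, which is consistent with the heavy allocation reported for 0.6 in this thread.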
Just to avoid confusion, my timings of
Ah, OK. So that would be even worse on 0.6. :-/
A possible intermediate solution would be to port DataArrays to
See JuliaStats/DataArrays.jl#288 for a DataArrays port to Nulls.
For reference, things have of course improved a lot in 1.0, but we've still regressed a lot. We should probably file issues to track progress for these specific cases.
On Julia 0.6:
julia> x = DataArray(Union{Float64,Missing}[rand() for i in 1:10^7]);
julia> @btime sum(x);
5.518 ms (2 allocations: 128 bytes)
julia> @btime broadcast(+, x, 1);
53.678 ms (70 allocations: 78.68 MiB)
julia> x[end] = missing
missing
julia> @btime broadcast(+, x, 1);
54.128 ms (70 allocations: 78.68 MiB)
julia> x[1] = missing
missing
julia> @btime broadcast(+, x, 1);
52.828 ms (70 allocations: 78.68 MiB)
On Julia 1.0:
julia> x = Union{Float64,Missing}[rand() for i in 1:10^7];
julia> @btime sum(x);
31.034 ms (1 allocation: 16 bytes)
julia> @btime broadcast(+, x, 1);
127.155 ms (8 allocations: 76.29 MiB)
julia> x[end] = missing
missing
julia> @btime broadcast(+, x, 1);
485.791 ms (19999498 allocations: 467.29 MiB)
julia> x[1] = missing
missing
julia> @btime broadcast(+, x, 1);
734.313 ms (9 allocations: 85.83 MiB)
Closing anyway since the solution won't come from this PR.
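To re-run a comparison like the one above without extra packages, a minimal sketch follows. It assumes a current Julia release with `Missing`, uses Base's `@elapsed` instead of BenchmarkTools (so timings are single, noisy measurements), and uses a smaller array than the thread; the variable names are illustrative.

```julia
# Sketch only: single @elapsed measurements, not a statistically sound benchmark.
n = 10^6
plain = rand(n)
withunion = Union{Float64,Missing}[v for v in plain]

# Warm up so compilation time is not included in the measurements.
sum(plain); sum(withunion); broadcast(+, withunion, 1)

t_plain = @elapsed sum(plain)
t_union = @elapsed sum(withunion)
t_bcast = @elapsed broadcast(+, withunion, 1)

# Repeat the broadcast with a missing value at the end, as in the thread.
withunion[end] = missing
broadcast(+, withunion, 1)  # warm up the mixed-element path
t_bcast_missing = @elapsed broadcast(+, withunion, 1)

println("sum(Vector{Float64}):                ", t_plain, " s")
println("sum(Vector{Union{Float64,Missing}}): ", t_union, " s")
println("broadcast(+, x, 1), no missing:      ", t_bcast, " s")
println("broadcast(+, x, 1), trailing missing: ", t_bcast_missing, " s")
```

For numbers comparable to those quoted in the thread, `@btime` or `@benchmark` from BenchmarkTools with interpolated arguments (`$x`) remains the better tool.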
These are the adjustments required for DataFrames's tests to pass after the changes in JuliaStats/DataArrays.jl#280