std() and var() do not work on array of arrays while mean() does #23884
Comments
These functions have been written for |
I understand that mean is much simpler. However, when summation and division are possible on the elements of the array to "mean" over, the variance should also work without any trouble, as it only uses summation and multiplication (which is possible when division is possible). In general, all such "over an array" statistical functions should work in the same way, and should work whenever the described operations are possible on the elements. |
I think you are underestimating the challenges in writing generic code. It might be possible to get this working though. It depends on what you expect to get when computing |
This is certainly true.
std([[2,4,6],[4,6,8]]) = [std([2,4]), std([4,6]), std([6,8])], so basically I would expect a point-wise calculation at every position of the inner elements (arrays in this case, but they could of course be matrices as well), "averaging" over all outer array positions. It sounds difficult, but I hope the idea is clear. |
Or in formulas and Julia notation:
x = [x1, x2, ..., xN]
x1 = [2, 4, 6]
x2 = [4, 6, 8]
...
xN = ...
mean(x) = 1 / N * (x1 .+ x2 .+ ... .+ xN)
var(x) = mean([x1.^2, x2.^2, ..., xN.^2]) .- mean(x).^2 # uncorrected; the unbiased version needs a factor N / (N - 1)
std(x) = sqrt.(var(x)) |
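The point-wise definition above can be checked on the two-vector example from this thread. The following is only a minimal sketch, assuming a current Julia (1.x) where `sum` over a generator of arrays adds them element-wise; the variable names (`m`, `v_uncorrected`, `v_corrected`, `s`) are made up for illustration and are not part of any library.

```julia
# Example data from the thread, as floats
x = [[2.0, 4.0, 6.0], [4.0, 6.0, 8.0]]
N = length(x)

# Element-wise mean over the outer array
m = sum(x) / N

# Uncorrected variance: mean of the squares minus the square of the mean
v_uncorrected = sum(xi .^ 2 for xi in x) / N .- m .^ 2

# Unbiased (corrected) variance: rescale by N / (N - 1)
v_corrected = v_uncorrected .* (N / (N - 1))

# Standard deviation is the element-wise square root
s = sqrt.(v_corrected)
```

Under these assumptions each entry of `s` should agree with the scalar result for the corresponding position, for example `std([2, 4])` for the first one.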
…ction from some signatures as well as using broadcasting in std. Fixes #23884
I have written a first "workaround" (it works well but of course does not handle any errors). Maybe this can help in developing generic code with this functionality. It does not recognize
The code:
import Base.std
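function std(v::AbstractArray{V,1};corrected=true) where {V<:AbstractArray{T,N} where {T,N}}
    # element-wise square root of the element-wise variance
    return sqrt.(var(v;corrected=corrected))
end
import Base.var
function var(v::AbstractArray{V,1};corrected=true) where {V<:AbstractArray{T,N} where {T,N}}
    m = mean(v)  # element-wise mean, computed once instead of once per element
    # sum the element-wise squared deviations over the outer array
    return sum([abs2.(i - m) for i in v]) / (length(v) - Int(corrected))
end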
## arrays for which the var/std should be calculated
x1_test = [1,3,5]
x2_test = [6,7,8]
x3_test = [3,4,7]
x4_test = [-1,0,1]
## but now ordered in an array of arrays/matrices
x = [[1,6,3,-1],[3,7,4,0],[5,8,7,1]]
M = [[1 6; 3 -1], [3 7; 4 0], [5 8; 7 1]]
## just some output
println("reference: \n")
println("\ntrue uncorrected variances")
println(var(x1_test;corrected = false))
println(var(x2_test;corrected = false))
println(var(x3_test;corrected = false))
println(var(x4_test;corrected = false))
println("\ntrue corrected variances")
println(var(x1_test;corrected = true))
println(var(x2_test;corrected = true))
println(var(x3_test;corrected = true))
println(var(x4_test;corrected = true))
println("\ntrue uncorrected standard deviations")
println(std(x1_test;corrected = false))
println(std(x2_test;corrected = false))
println(std(x3_test;corrected = false))
println(std(x4_test;corrected = false))
println("\ntrue corrected standard deviations")
println(std(x1_test;corrected = true))
println(std(x2_test;corrected = true))
println(std(x3_test;corrected = true))
println(std(x4_test;corrected = true))
println("\nnew function: \n")
println("\nuncorrected variances")
println(var(x;corrected=false))
println("\ncorrected variances")
println(var(x;corrected=true))
println("\nuncorrected standard deviations")
println(std(x;corrected=false))
println("\ncorrected standard deviations")
println(std(x;corrected=true))
println("\nnew function with array of matrices")
println("\nuncorrected variances")
println(var(M;corrected=false))
println("\ncorrected variances")
println(var(M;corrected=true))
println("\nuncorrected standard deviations")
println(std(M;corrected=false))
println("\ncorrected standard deviations")
println(std(M;corrected=true))
and the output:
reference:
true uncorrected variances
2.6666666666666665
0.6666666666666666
2.888888888888889
0.6666666666666666
true corrected variances
4.0
1.0
4.333333333333333
1.0
true uncorrected standard deviations
1.632993161855452
0.816496580927726
1.699673171197595
0.816496580927726
true corrected standard deviations
2.0
1.0
2.0816659994661326
1.0
new function:
uncorrected variances
[2.66667, 0.666667, 2.88889, 0.666667]
corrected variances
[4.0, 1.0, 4.33333, 1.0]
uncorrected standard deviations
[1.63299, 0.816497, 1.69967, 0.816497]
corrected standard deviations
[2.0, 1.0, 2.08167, 1.0]
new function with array of matrices
uncorrected variances
[2.66667 0.666667; 2.88889 0.666667]
corrected variances
[4.0 1.0; 4.33333 1.0]
uncorrected standard deviations
[1.63299 0.816497; 1.69967 0.816497]
corrected standard deviations
[2.0 1.0; 2.08167 1.0] |
See #23897 which I opened just a minute before you posted this. |
Nice ;) Well, here, just for completeness, is my wrapper for DataFrames.
function std(v::DataArrays.DataArray{V,1};corrected=true) where {V<:AbstractArray{T,N} where {T,N}}
return std(Array(v);corrected=corrected)
end
function var(v::DataArrays.DataArray{V,1};corrected=true) where {V<:AbstractArray{T,N} where {T,N}}
return var(Array(v);corrected=corrected)
end |
Thank you for the quick work. I hope that this commit will be merged soon ;) |
* Make var and std work for Vector{Vector{T}} by removing Number restriction from some signatures as well as using broadcasting in std. Fixes #23884
* Make cov work for Vector{Vector}
Hello, I wonder why I cannot use std() or var() on an array of arrays while I can do so for mean().
Consider this simple example:
The version information is
I find it strange that one function gives the intended outcome while the others end up with an error. Is this the desired behavior? Have I missed something?
I would say that exactly the same interpretation as for the mean() function should be used for std() and var(). I am aware of the possibility of a multidimensional array and a direction argument to std(), but this does not explain why it works out so nicely for mean().
I came across this while using DataFrames with arrays as elements of the DataArray, but the problem seems to be much more general.
Maybe there are other ways to do this?
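One other way, sketched here under the assumption of a current Julia (1.x) with the `Statistics` standard library rather than the version this issue was filed against, is to stack the inner vectors into a matrix and reduce along one dimension. The data is the `x` from the workaround test above; `X`, `v`, and `s` are illustrative names only.

```julia
using Statistics

# Array of arrays, as in the workaround's test data
x = [[1, 6, 3, -1], [3, 7, 4, 0], [5, 8, 7, 1]]

# Stack the inner vectors as columns: a 4×3 matrix
X = reduce(hcat, x)

# Reduce across the columns (dims=2), i.e. over the outer array positions
v = vec(var(X; dims=2))                   # corrected variances
s = vec(std(X; dims=2, corrected=false))  # uncorrected standard deviations
```

This only applies when all inner arrays have the same length; for inner matrices they would first have to be flattened or concatenated along a new dimension.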