std() and var() do not work on array of arrays while mean() does #23884

Closed
stakaz opened this issue Sep 26, 2017 · 9 comments
Labels
maths Mathematical functions

Comments

@stakaz

stakaz commented Sep 26, 2017

Hello, I wonder why I cannot use std() or var() on an array of arrays while I can do so for mean().

Consider this simple example:

julia> x = [[2,4,6],[4,6,8]]
2-element Array{Array{Int64,1},1}:
 [2, 4, 6]
 [4, 6, 8]

julia> mean(x)
3-element Array{Float64,1}:
 3.0
 5.0
 7.0

julia> std(x)
ERROR: MethodError: no method matching zero(::Type{Array{Int64,1}})
Closest candidates are:
  zero(::Type{Base.LibGit2.GitHash}) at libgit2/oid.jl:106
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuildItem}) at pkg/resolve/versionweight.jl:82
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuild}) at pkg/resolve/versionweight.jl:124
  ...
Stacktrace:
 [1] #var#533(::Bool, ::Void, ::Function, ::Array{Array{Int64,1},1}) at ./statistics.jl:184
 [2] (::Base.#kw##var)(::Array{Any,1}, ::Base.#var, ::Array{Array{Int64,1},1}) at ./<missing>:0
 [3] std(::Array{Array{Int64,1},1}) at ./statistics.jl:244
 [4] macro expansion at ./REPL.jl:97 [inlined]
 [5] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73

The version information is

Julia Version 0.6.0
Commit 903644385b* (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Pentium(R) CPU  N3530  @ 2.16GHz
  WORD_SIZE: 64
  BLAS: libblas
  LAPACK: liblapack
  LIBM: libm
  LLVM: libLLVM-3.9.1 (ORCJIT, silvermont)

I find it strange that one function gives the intended result while the other ends with an error. Is this the desired behavior? Have I missed something?

I would say that std() and var() should interpret an array of arrays exactly as mean() does. I am aware that std() can take a multidimensional array and a dimension argument, but this does not explain why it works out so nicely for mean().

I came across this while using DataFrames with arrays as elements of the DataArray, but the problem seems to be much more general.

Maybe there are other ways to do this?

@andreasnoack
Member

> I find it strange that one function gives the intended result while the other ends with an error.

These functions have been written for Number types. mean is a much simpler function than var and therefore works by chance. It might not be too complicated to get var working though.
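
The MethodError in the report appears consistent with this: the stack trace shows var asking for zero of the element type, and an array type on its own does not say how long the zero array should be. A minimal sketch of that distinction, using only Base (the variable names are illustrative, not taken from the Julia source):

```julia
# zero of a value works: the length is known from the instance.
v = [2, 4, 6]
z = zero(v)                      # a 3-element vector of zeros

# zero of the type alone has no method: Vector{Int} carries no length.
type_zero_fails = try
    zero(Vector{Int})
    false
catch err
    err isa MethodError
end
```

Code that starts an accumulator from zero(T) therefore only works for element types such as Number, while code that starts from an actual element can also handle arrays.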

@stakaz
Author

stakaz commented Sep 26, 2017

I understand that mean is much simpler. However, when summation and division are possible on the elements of the array to "mean" over, the variance should also work without any trouble, as it only uses summation and multiplication (which is possible when division is possible).

In general, all such "over an array" statistical functions should behave the same way and should work whenever the described operations are possible on the elements.

@andreasnoack
Member

> the variance should also work without any troubles

I think you are underestimating the challenges in writing generic code. It might be possible to get this working though. It depends on what you expect to get when computing std([[2,4,6],[4,6,8]]). What do you expect to get?

@stakaz
Author

stakaz commented Sep 26, 2017

> I think you are underestimating the challenges in writing generic code.

This is certainly true.

> What do you expect to get?

std([[2,4,6],[4,6,8]]) = [std([2,4]), std([4,6]), std([6,8])]

So basically I would expect a pointwise calculation at every position of the inner elements (arrays in this case, but they could of course be matrices as well), "averaging" over all outer array positions. It sounds difficult, but I hope the idea is clear.
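
The expected result can be spelled out with functions that already work on numbers. This is only a sketch, assuming Julia 1.x, where mean, std and var live in the Statistics standard library (on 0.6 they are in Base and the using line is not needed); pointwise_std, as_matrix and matrix_std are illustrative names.

```julia
using Statistics  # Julia 1.x; on 0.6 these functions are in Base

x = [[2, 4, 6], [4, 6, 8]]

# For each inner position i, collect the i-th entry of every outer element
# and take the ordinary scalar std of that slice.
pointwise_std = [std([xi[i] for xi in x]) for i in eachindex(first(x))]

# The same result via a matrix whose columns are the inner vectors.
as_matrix = reduce(hcat, x)               # 3×2 matrix
matrix_std = vec(std(as_matrix; dims=2))  # one std per row
```

Both forms give one value per inner position, which is the shape mean(x) already returns.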

@stakaz
Author

stakaz commented Sep 26, 2017

Or in formulas and julia notation:

x = [x1, x2, ..., xN]
x1 = [2, 4, 6]
x2 = [4, 6, 8]
...
xN = ...

mean(x) = 1 / N * (x1 .+ x2 .+ ... .+ xN)
var(x) = 1 / N * (x1.^2 .+ x2.^2 .+ ... .+ xN.^2) .- mean(x).^2 # uncorrected; multiply by N / (N - 1) for the unbiased version
std(x) = sqrt.(var(x))
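
Read as elementwise operations over the outer index, the mean-of-squares form of these formulas can be sketched as follows. It assumes Julia 1.x with the Statistics standard library; the names mean_sq, var_uncorrected, var_corrected and std_corrected are illustrative.

```julia
using Statistics

x = [[2, 4, 6], [4, 6, 8]]
N = length(x)

m = mean(x)                                       # elementwise mean
mean_sq = mean([xi .^ 2 for xi in x])             # elementwise mean of squares
var_uncorrected = mean_sq .- m .^ 2               # divides by N
var_corrected = var_uncorrected .* (N / (N - 1))  # Bessel's correction
std_corrected = sqrt.(var_corrected)
```

This textbook form can lose precision when the mean is large relative to the spread; subtracting the mean before squaring is usually the safer choice.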

andreasnoack added a commit that referenced this issue Sep 27, 2017
…ction

from some signatures as well as using broadcasting in std. Fixes #23884
@stakaz
Author

stakaz commented Sep 27, 2017

I have written a first "workaround" (it works well, but of course it does no error handling). Maybe this can help in developing generic code with this functionality.

It does not recognize DataArrays.DataArray as an AbstractArray; I don't know why. So for now a wrapper must be used as well, where Array(dataframe[:somecolumn]) is passed instead of dataframe[:somecolumn] directly.

The code:

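# Julia 0.6: std, var and mean live in Base, so the new methods extend Base.
import Base: std, var

# Elementwise standard deviation of a vector whose elements are arrays.
function std(v::AbstractArray{V,1};corrected=true) where {V<:AbstractArray{T,N} where {T,N}}
	return sqrt.(var(v;corrected=corrected))
end

# Elementwise variance: sum the squared deviations from the elementwise mean
# over the outer vector, then divide by N (or N - 1 when corrected).
function var(v::AbstractArray{V,1};corrected=true) where {V<:AbstractArray{T,N} where {T,N}}
	m = mean(v)  # compute the mean once instead of once per element
	return sum(xi -> abs2.(xi .- m), v) / (length(v) - Int(corrected))
end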

## arrays for which the var/std should be calculated
x1_test = [1,3,5]
x2_test = [6,7,8]
x3_test = [3,4,7]
x4_test = [-1,0,1]

## but now ordered in an array of arrays/matrices
x = [[1,6,3,-1],[3,7,4,0],[5,8,7,1]]
M = [[1 6; 3 -1], [3 7; 4 0], [5 8; 7 1]]

## just some output

println("reference: \n")
println("\ntrue uncorrected variances")
println(var(x1_test;corrected = false))
println(var(x2_test;corrected = false))
println(var(x3_test;corrected = false))
println(var(x4_test;corrected = false))

println("\ntrue corrected variances")
println(var(x1_test;corrected = true))
println(var(x2_test;corrected = true))
println(var(x3_test;corrected = true))
println(var(x4_test;corrected = true))

println("\ntrue uncorrected standard deviations")
println(std(x1_test;corrected = false))
println(std(x2_test;corrected = false))
println(std(x3_test;corrected = false))
println(std(x4_test;corrected = false))

println("\ntrue corrected standard deviations")
println(std(x1_test;corrected = true))
println(std(x2_test;corrected = true))
println(std(x3_test;corrected = true))
println(std(x4_test;corrected = true))

println("\nnew function: \n")
println("\nuncorrected variances")
println(var(x;corrected=false))
println("\ncorrected variances")
println(var(x;corrected=true))
println("\nuncorrected standard deviations")
println(std(x;corrected=false))
println("\ncorrected standard deviations")
println(std(x;corrected=true))

println("\nnew function with array of matrices")
println("\nuncorrected variances")
println(var(M;corrected=false))
println("\ncorrected variances")
println(var(M;corrected=true))
println("\nuncorrected standard deviations")
println(std(M;corrected=false))
println("\ncorrected standard deviations")
println(std(M;corrected=true))

and the output:

reference: 


true uncorrected variances
2.6666666666666665
0.6666666666666666
2.888888888888889
0.6666666666666666

true corrected variances
4.0
1.0
4.333333333333333
1.0

true uncorrected standard deviations
1.632993161855452
0.816496580927726
1.699673171197595
0.816496580927726

true corrected standard deviations
2.0
1.0
2.0816659994661326
1.0

new function: 


uncorrected variances
[2.66667, 0.666667, 2.88889, 0.666667]

corrected variances
[4.0, 1.0, 4.33333, 1.0]

uncorrected standard deviations
[1.63299, 0.816497, 1.69967, 0.816497]

corrected standard deviations
[2.0, 1.0, 2.08167, 1.0]

new function with array of matrices

uncorrected variances
[2.66667 0.666667; 2.88889 0.666667]

corrected variances
[4.0 1.0; 4.33333 1.0]

uncorrected standard deviations
[1.63299 0.816497; 1.69967 0.816497]

corrected standard deviations
[2.0 1.0; 2.08167 1.0]

@andreasnoack
Member

See #23897 which I opened just a minute before you posted this.

@stakaz
Author

stakaz commented Sep 27, 2017

Nice ;) Well, just for completeness, here is my wrapper for DataFrames.

function std(v::DataArrays.DataArray{V,1};corrected=true) where {V<:AbstractArray{T,N} where {T,N}}
	return std(Array(v);corrected=corrected)
end

function var(v::DataArrays.DataArray{V,1};corrected=true) where {V<:AbstractArray{T,N} where {T,N}}
	return var(Array(v);corrected=corrected)
end

@stakaz
Author

stakaz commented Sep 27, 2017

Thank you for the quick work. I hope that this commit will be merged soon ;)

@kshyatt kshyatt added the maths Mathematical functions label Sep 29, 2017
andreasnoack added a commit that referenced this issue Oct 2, 2017
* Make var and std work for Vector{Vector{T}} by removing Number restriction
from some signatures as well as using broadcasting in std. Fixes #23884

* Make cov work for Vector{Vector}
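
With that change in place, the example from the original report should go through directly. A short sketch, assuming a Julia 1.x release whose Statistics standard library carries this behavior; the cross-check against the matrix form is an addition for illustration.

```julia
using Statistics

x = [[2, 4, 6], [4, 6, 8]]

v = var(x)   # elementwise variance across the outer vector
s = std(x)   # elementwise standard deviation

# Cross-check against the matrix formulation.
A = reduce(hcat, x)
matches = v ≈ vec(var(A; dims=2)) && s ≈ vec(std(A; dims=2))
```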