Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of NaN in ranks and rank correlations #659

Merged
merged 7 commits into from
Mar 7, 2021
Merged

Conversation

nalimilan
Copy link
Member

Throw an error when NaN is encountered by rank functions, and make corspearman return NaN, instead of silently sorting them at the end. This is consistent with what corkendall and cor do.

Fixes #657.

Throw an error when NaN is encountered by rank functions, and make
`corspearman` return `NaN`, instead of silently sorting them at the end.
This is consistent with what `corkendall` and `cor` do.
@nalimilan
Copy link
Member Author

The last commit fixes corkendall(matrix, vector) so that it returns a matrix like cor and corspearman.

Cc: @piever and @PGS62 who have worked on this code

@alyst
Copy link
Member

alyst commented Feb 15, 2021

I'm not sure we need to throw an exception in the ranking code as a default option:

  • from the user perspective -- if the data [s]he got from somewhere contains NaNs, but [s]he just wanted to quickly sort it and display -- what [s]he is supposed to do? One solution would be to replace NaNs with some non-throwing placeholders, but I'm not sure it would improve user code.
  • the sorting behaviour is defined by the comparison function, and < (the default one) doesn't throw in Julia. IIUC, the current code should always put NaNs in the end, and that's fine for most use cases.
  • if by= is not identity, the current implementation of NaNs handling would not work.

Suggestions:

  • maybe we keep ranking as is, checking for NaNs just in the correlation code would be fine
  • or we introduce the nans= kwargs with the options:
    • :last, the default one -- keep them sorted in the end as it is now
    • :throw -- instead of <, use the function that would throw if NaN is encountered. That gets a bit more complicated if the user has provided lt=, in which case we call underlying sort() with some NaN-checking wrapper lt=nan_check(lt).
    • :missing (?) -- treat NaNs as missing value and return missing rank. But that is not type-stable, unfortunately.

@PGS62
Copy link
Contributor

PGS62 commented Feb 15, 2021

In the (matrix,vector) case, it sounds sensible to match the size of the return between corkendall, cor and corspearman.

But I'm pretty sure that my recent rewrite of corkendall (#647) did not make a change to the behaviour in this case. If that's correct then surely @nalimilan 's change is potentially breaking. corkendall(mymatrix,myvector) would previously have returned a 2-dimension array with 1 column, but now returns a two dimensional array with 1 row.

Perhaps making a breaking change is a price worth paying for the consistency? Or perhaps I'm wrong to think I preserved the behaviour of corkendall(matrix,vector) in my re-write?

@piever
Copy link
Contributor

piever commented Feb 15, 2021

I'd tend to agree with @alyst about the ranking function. I think it is a bit excessive to throw on NaN, and it could lead to some inconsistencies. For example, I imagine with this PR ordinalrank([NaN, 0.1]) would throw but ordinalrank([[NaN], [0.1]]) would work.

It would seem more natural to me to throw on the correlation functions with NaNs, without special-casing ranking to throw on NaNs. Maybe one could mention in the docs that to exclude them from ranking one should replace NaN with missing.

@nalimilan
Copy link
Member Author

OK. Actually this PR was partly prompted by https://github.com/JuliaLang/Statistics.jl/pull/72, where NaN weren't properly detected by quantile as they were supposed to, which could lead to subtle bugs. So I thought it would be consistent to also reject them in ranking functions.

I really don't like when NaNs can go unnoticed, as it could give absurd results. I don't think we have many functions that transform NaNs with normal values silently -- in general an error is thrown or they propagate. We're very picky about not skipping missing by default so IMO it would be weird to do that for NaN, which can indicate a problem more often than missing. Can you think about a real use case where it would be useful to rank NaNs at the end? @alyst's example is about sorting, but we have sort and sortperm! for that.

So I think that we shouldn't accept them silently by default. To avoid breaking code we could just print a warning for now, with an argument to silence it. Do you think that would be acceptable? That sounds consistent with what we do elsewhere.

Regarding the implementation, indeed it would be better to support by properly. Need to think about how to achieve this.

@PGS62 Yes, changing corkendall as in this PR would be breaking. Depending on whether we have other changes that justify a new breaking release or not, we could leave this change out for now. I'll move it to a separate PR anyway, that will be simpler (it's just that it makes it impossible to test corspearman and corkendall at the same time).

@bkamins
Copy link
Contributor

bkamins commented Feb 16, 2021

I have checked what R does:

> rank(c(1, NaN, NaN, 1), na.last=F)
[1] 3.5 1.0 2.0 3.5
> rank(c(1, NaN, NaN, 1), na.last=T)
[1] 1.5 3.0 4.0 1.5
> rank(c(1, NaN, NaN, 1), na.last="keep")
[1] 1.5  NA  NA 1.5

So the conclusions are:

  • different options are allowed
  • NaN is treated the same as NA
  • there is some inconsistency about tie breaking (they are always considered different)

For us I would say that the lesson is:

  • we should have a kwarg allowing for different behaviors (this will in particular prompt users to think what they want)
  • we should carefully decide how we want to treat NaN (if they should be considered equal or not)

Then the question is what should be the default. I would error on default unless it degrades performance significantly. The reason is:

  • if it is super problematic we can change this behavior in a future in a non breaking way
  • in my opinion if someone has NaN in data and wants to calculate the statistics it is due to an error and not deliberate

@alyst
Copy link
Member

alyst commented Feb 16, 2021

@bkamins Thanks for checking out R behaviour!

I still would suggest not throwing from ranking functions by default, unless there's a real NaN-related bug, which could only be properly addressed from the ranking code, and not in the calling context. It's just my (very subjective) observation that, while Julia equivalents are generally stricter than tidyverse flavour of R, it doesn't necessarily improve the code writing efficiency for me (but this is going OT, I just provide it here to explain my motivation). So I'm a bit concerned with adding more throw-by-default cases.

@bkamins
Copy link
Contributor

bkamins commented Feb 16, 2021

Yes - this is a typical tension between "interactive work friendly" vs "production code friendly" behavior. Julia in general follows the "production code friendly" approach, and R the opposite. However, maybe this is not a right approach in this case.

I would not die for throwing an error by default (especially that this is breaking). If we have a kwarg users will be prompted about the potential problem with NaN they might hit.

@nalimilan
Copy link
Member Author

I've reverted controversial changes to concentrate on corspearman for now so that we can discuss other issues separately. Technically, adding a new error is breaking, but given that it will only throw when the code was incorrect maybe it's OK not to tag a new breaking release?

src/rankcorr.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Contributor

bkamins commented Feb 26, 2021

OK - so I understand we produce NaN if any NaN is encountered. This looks OK.

src/rankcorr.jl Outdated Show resolved Hide resolved
@nalimilan nalimilan merged commit 870712d into master Mar 7, 2021
@nalimilan nalimilan deleted the nl/nanrank branch March 7, 2021 16:59
@PGS62
Copy link
Contributor

PGS62 commented Apr 13, 2021

I think the handling of NaN in functions cor, corspearman and corkendall is not quite correct in the cor(RealMatrix) case, and is in fact inconsistent with the cor(RealMatrix,RealMatrix) case. When called with a single matrix argument, the functions always return 1.0 on the diagonal even when NaN is (in my opinion) the correct result:

julia> X = fill(NaN,10,3);

julia> StatsBase.cor(X)
3×3 Matrix{Float64}:
   1.0  NaN    NaN
 NaN      1.0  NaN
 NaN    NaN      1.0

julia> StatsBase.cor(X,X)
3×3 Matrix{Float64}:
 NaN  NaN  NaN
 NaN  NaN  NaN
 NaN  NaN  NaN

In the case of corkendall, the fix would be:

function corkendall(X::RealMatrix)
    n = size(X, 2)
    C = Matrix{Float64}(I, n, n)
    for j = 1:n                        #THREE NEW LINES
        C[j,j] = any(isnan,X[:,j]) ? NaN : 1.0
    end
    for j = 2:n
        permx = sortperm(X[:,j])
        for i = 1:j - 1
            C[j,i] = corkendall!(X[:,j], X[:,i], permx)
            C[i,j] = C[j,i]
        end
    end
    return C
end

but I have not looked at what would need to change in cor or corspearman.

@nalimilan
Copy link
Member Author

Always using 1 for the diagonal is indeed no completely correct, but it has the advantage of saving one computation for each column. Typically, diagonal entries are not very useful, and if there's a NaN in one column then other entries in the matrix will be NaN, so its presence is unlikely to go unnoticed.

I guess we could use tricks to make this efficient, e.g. by checking whether other entries in the same column of the correlation matrix are NaN. But cor is defined in Statistics, not in StatsBase, so we can't change it until Julia 2.0.

@PGS62
Copy link
Contributor

PGS62 commented Apr 13, 2021

Fair point that the diagonal entries are not often useful. But an alternative solution is, once we detect a NaN in a column of X, to squirt NaNs into an entire row and column of C and avoid calling corkendall! for those elements at all, hence reduce computation, as well as being more correct.

Something like this:

function corkendall(X::RealMatrix)
    n = size(X, 2)
    C = Matrix{Float64}(I, n, n)
    for j = 1:n
        if any(isnan,X[:,j])
            C[:,j] .= NaN
            C[j,:] .= NaN
        elseif j > 1
            permx = sortperm(X[:,j])
            for i = 1:j - 1
                C[j,i] = corkendall!(X[:,j], X[:,i], permx)
                C[i,j] = C[j,i]
            end
        end
    end
    return C
end

Having inconsistency between corkendall(X) and corkendall(X,X) just feels a bit "disappointing" somehow!

@bkamins
Copy link
Contributor

bkamins commented Apr 13, 2021

corkendall(X) and corkendall(X,X) do not have to be consistent on a diagonal on a philosophical level (but maybe it would be good to have them consistent for practical reason).

The philosophical perspective is that in corkendall(X) we know objectively that the diagonal is the same vector with itself (no matter what is stored inside it). It is a stronger information than in corkendall(X, X) in which case at most we can check that they store the same values.

See e.g. the difference in:

julia> cor([1 1; 1 1])
2×2 Matrix{Float64}:
   1.0  NaN
 NaN      1.0

julia> cor([1 1; 1 1], [1 1; 1 1])
2×2 Matrix{Float64}:
 NaN  NaN
 NaN  NaN

@PGS62
Copy link
Contributor

PGS62 commented Apr 13, 2021

That's all correct, in cor(X) we do indeed know that the on-diagonal elements are the correlation of some vector (a column of X) with itself. But the correlation of a vector with itself may be 1.0 or it may be NaN. It will be NaN if there is a NaN present in the vector or else if all elements of the vector are identical. As currently coded, cor (which is not in StatsBase) and corkendall and corspearman (which are) all hard-wire the on-diagonal terms to 1.0. Surely that's just wrong.

On the other hand, if we (Julia) get cor([1 1;1 1]) wrong (as I think we do) then we are in company. I just checked and R does the same thing:

> x
     [,1] [,2]
[1,]    1    1
[2,]    1    1
> cor(x)
     [,1] [,2]
[1,]    1   NA
[2,]   NA    1
Warning message:
In cor(x) : the standard deviation is zero

Confession: I was hoping that R would "agree with me" but it didn't!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

corspearman doesn't deal with NaNs
5 participants