Improve handling of NaN in ranks and rank correlations #659

nalimilan · 2021-02-14T17:55:48Z

Throw an error when NaN is encountered by rank functions, and make corspearman return NaN, instead of silently sorting them at the end. This is consistent with what corkendall and cor do.

Fixes #657.

Throw an error when NaN is encountered by rank functions, and make `corspearman` return `NaN`, instead of silently sorting them at the end. This is consistent with what `corkendall` and `cor` do.

nalimilan · 2021-02-15T09:51:36Z

The last commit fixes corkendall(matrix, vector) so that it returns a matrix like cor and corspearman.

Cc: @piever and @PGS62 who have worked on this code

alyst · 2021-02-15T15:30:14Z

I'm not sure we need to throw an exception in the ranking code as a default option:

from the user perspective -- if the data [s]he got from somewhere contains NaNs, but [s]he just wanted to quickly sort it and display -- what [s]he is supposed to do? One solution would be to replace NaNs with some non-throwing placeholders, but I'm not sure it would improve user code.
the sorting behaviour is defined by the comparison function, and < (the default one) doesn't throw in Julia. IIUC, the current code should always put NaNs in the end, and that's fine for most use cases.
if by= is not identity, the current implementation of NaNs handling would not work.

Suggestions:

maybe we keep ranking as is, checking for NaNs just in the correlation code would be fine
or we introduce the nans= kwargs with the options:
- :last, the default one -- keep them sorted in the end as it is now
- :throw -- instead of <, use the function that would throw if NaN is encountered. That gets a bit more complicated if the user has provided lt=, in which case we call underlying sort() with some NaN-checking wrapper lt=nan_check(lt).
- :missing (?) -- treat NaNs as missing value and return missing rank. But that is not type-stable, unfortunately.

PGS62 · 2021-02-15T16:21:57Z

In the (matrix,vector) case, it sounds sensible to match the size of the return between corkendall, cor and corspearman.

But I'm pretty sure that my recent rewrite of corkendall (#647) did not make a change to the behaviour in this case. If that's correct then surely @nalimilan 's change is potentially breaking. corkendall(mymatrix,myvector) would previously have returned a 2-dimension array with 1 column, but now returns a two dimensional array with 1 row.

Perhaps making a breaking change is a price worth paying for the consistency? Or perhaps I'm wrong to think I preserved the behaviour of corkendall(matrix,vector) in my re-write?

piever · 2021-02-15T17:01:49Z

I'd tend to agree with @alyst about the ranking function. I think it is a bit excessive to throw on NaN, and it could lead to some inconsistencies. For example, I imagine with this PR ordinalrank([NaN, 0.1]) would throw but ordinalrank([[NaN], [0.1]]) would work.

It would seem more natural to me to throw on the correlation functions with NaNs, without special-casing ranking to throw on NaNs. Maybe one could mention in the docs that to exclude them from ranking one should replace NaN with missing.

nalimilan · 2021-02-15T17:42:50Z

OK. Actually this PR was partly prompted by https://github.com/JuliaLang/Statistics.jl/pull/72, where NaN weren't properly detected by quantile as they were supposed to, which could lead to subtle bugs. So I thought it would be consistent to also reject them in ranking functions.

I really don't like when NaNs can go unnoticed, as it could give absurd results. I don't think we have many functions that transform NaNs with normal values silently -- in general an error is thrown or they propagate. We're very picky about not skipping missing by default so IMO it would be weird to do that for NaN, which can indicate a problem more often than missing. Can you think about a real use case where it would be useful to rank NaNs at the end? @alyst's example is about sorting, but we have sort and sortperm! for that.

So I think that we shouldn't accept them silently by default. To avoid breaking code we could just print a warning for now, with an argument to silence it. Do you think that would be acceptable? That sounds consistent with what we do elsewhere.

Regarding the implementation, indeed it would be better to support by properly. Need to think about how to achieve this.

@PGS62 Yes, changing corkendall as in this PR would be breaking. Depending on whether we have other changes that justify a new breaking release or not, we could leave this change out for now. I'll move it to a separate PR anyway, that will be simpler (it's just that it makes it impossible to test corspearman and corkendall at the same time).

bkamins · 2021-02-16T09:32:54Z

I have checked what R does:

> rank(c(1, NaN, NaN, 1), na.last=F)
[1] 3.5 1.0 2.0 3.5
> rank(c(1, NaN, NaN, 1), na.last=T)
[1] 1.5 3.0 4.0 1.5
> rank(c(1, NaN, NaN, 1), na.last="keep")
[1] 1.5  NA  NA 1.5

So the conclusions are:

different options are allowed
NaN is treated the same as NA
there is some inconsistency about tie breaking (they are always considered different)

For us I would say that the lesson is:

we should have a kwarg allowing for different behaviors (this will in particular prompt users to think what they want)
we should carefully decide how we want to treat NaN (if they should be considered equal or not)

Then the question is what should be the default. I would error on default unless it degrades performance significantly. The reason is:

if it is super problematic we can change this behavior in a future in a non breaking way
in my opinion if someone has NaN in data and wants to calculate the statistics it is due to an error and not deliberate

alyst · 2021-02-16T10:46:49Z

@bkamins Thanks for checking out R behaviour!

I still would suggest not throwing from ranking functions by default, unless there's a real NaN-related bug, which could only be properly addressed from the ranking code, and not in the calling context. It's just my (very subjective) observation that, while Julia equivalents are generally stricter than tidyverse flavour of R, it doesn't necessarily improve the code writing efficiency for me (but this is going OT, I just provide it here to explain my motivation). So I'm a bit concerned with adding more throw-by-default cases.

bkamins · 2021-02-16T10:55:49Z

Yes - this is a typical tension between "interactive work friendly" vs "production code friendly" behavior. Julia in general follows the "production code friendly" approach, and R the opposite. However, maybe this is not a right approach in this case.

I would not die for throwing an error by default (especially that this is breaking). If we have a kwarg users will be prompted about the potential problem with NaN they might hit.

nalimilan · 2021-02-25T18:57:59Z

I've reverted controversial changes to concentrate on corspearman for now so that we can discuss other issues separately. Technically, adding a new error is breaking, but given that it will only throw when the code was incorrect maybe it's OK not to tag a new breaking release?

src/rankcorr.jl

bkamins · 2021-02-26T20:35:16Z

OK - so I understand we produce NaN if any NaN is encountered. This looks OK.

src/rankcorr.jl

PGS62 · 2021-04-13T09:29:04Z

I think the handling of NaN in functions cor, corspearman and corkendall is not quite correct in the cor(RealMatrix) case, and is in fact inconsistent with the cor(RealMatrix,RealMatrix) case. When called with a single matrix argument, the functions always return 1.0 on the diagonal even when NaN is (in my opinion) the correct result:

julia> X = fill(NaN,10,3);

julia> StatsBase.cor(X)
3×3 Matrix{Float64}:
   1.0  NaN    NaN
 NaN      1.0  NaN
 NaN    NaN      1.0

julia> StatsBase.cor(X,X)
3×3 Matrix{Float64}:
 NaN  NaN  NaN
 NaN  NaN  NaN
 NaN  NaN  NaN

In the case of corkendall, the fix would be:

function corkendall(X::RealMatrix)
    n = size(X, 2)
    C = Matrix{Float64}(I, n, n)
    for j = 1:n                        #THREE NEW LINES
        C[j,j] = any(isnan,X[:,j]) ? NaN : 1.0
    end
    for j = 2:n
        permx = sortperm(X[:,j])
        for i = 1:j - 1
            C[j,i] = corkendall!(X[:,j], X[:,i], permx)
            C[i,j] = C[j,i]
        end
    end
    return C
end

but I have not looked at what would need to change in cor or corspearman.

nalimilan · 2021-04-13T10:11:51Z

Always using 1 for the diagonal is indeed no completely correct, but it has the advantage of saving one computation for each column. Typically, diagonal entries are not very useful, and if there's a NaN in one column then other entries in the matrix will be NaN, so its presence is unlikely to go unnoticed.

I guess we could use tricks to make this efficient, e.g. by checking whether other entries in the same column of the correlation matrix are NaN. But cor is defined in Statistics, not in StatsBase, so we can't change it until Julia 2.0.

PGS62 · 2021-04-13T11:02:21Z

Fair point that the diagonal entries are not often useful. But an alternative solution is, once we detect a NaN in a column of X, to squirt NaNs into an entire row and column of C and avoid calling corkendall! for those elements at all, hence reduce computation, as well as being more correct.

Something like this:

function corkendall(X::RealMatrix)
    n = size(X, 2)
    C = Matrix{Float64}(I, n, n)
    for j = 1:n
        if any(isnan,X[:,j])
            C[:,j] .= NaN
            C[j,:] .= NaN
        elseif j > 1
            permx = sortperm(X[:,j])
            for i = 1:j - 1
                C[j,i] = corkendall!(X[:,j], X[:,i], permx)
                C[i,j] = C[j,i]
            end
        end
    end
    return C
end

Having inconsistency between corkendall(X) and corkendall(X,X) just feels a bit "disappointing" somehow!

bkamins · 2021-04-13T11:59:01Z

corkendall(X) and corkendall(X,X) do not have to be consistent on a diagonal on a philosophical level (but maybe it would be good to have them consistent for practical reason).

The philosophical perspective is that in corkendall(X) we know objectively that the diagonal is the same vector with itself (no matter what is stored inside it). It is a stronger information than in corkendall(X, X) in which case at most we can check that they store the same values.

See e.g. the difference in:

julia> cor([1 1; 1 1])
2×2 Matrix{Float64}:
   1.0  NaN
 NaN      1.0

julia> cor([1 1; 1 1], [1 1; 1 1])
2×2 Matrix{Float64}:
 NaN  NaN
 NaN  NaN

PGS62 · 2021-04-13T12:35:48Z

That's all correct, in cor(X) we do indeed know that the on-diagonal elements are the correlation of some vector (a column of X) with itself. But the correlation of a vector with itself may be 1.0 or it may be NaN. It will be NaN if there is a NaN present in the vector or else if all elements of the vector are identical. As currently coded, cor (which is not in StatsBase) and corkendall and corspearman (which are) all hard-wire the on-diagonal terms to 1.0. Surely that's just wrong.

On the other hand, if we (Julia) get cor([1 1;1 1]) wrong (as I think we do) then we are in company. I just checked and R does the same thing:

> x
     [,1] [,2]
[1,]    1    1
[2,]    1    1
> cor(x)
     [,1] [,2]
[1,]    1   NA
[2,]   NA    1
Warning message:
In cor(x) : the standard deviation is zero

Confession: I was hoping that R would "agree with me" but it didn't!

Improve handling of NaN in ranks and rank correlations

c596aa7

Throw an error when NaN is encountered by rank functions, and make `corspearman` return `NaN`, instead of silently sorting them at the end. This is consistent with what `corkendall` and `cor` do.

nalimilan requested review from alyst and ararslan February 14, 2021 17:55

nalimilan mentioned this pull request Feb 14, 2021

corspearman doesn't deal with NaNs #657

Closed

More tests, change corkendall dimensions

f8116c0

Another missing test

480ef05

nalimilan added 2 commits February 25, 2021 19:42

Revert changes to ranking functions

e09cbd2

Revert corkendall changes

e7f16e3

bkamins reviewed Feb 26, 2021

View reviewed changes

src/rankcorr.jl Outdated Show resolved Hide resolved

Cache any(isnan, Xi)

84cfb9c

bkamins reviewed Feb 28, 2021

View reviewed changes

src/rankcorr.jl Outdated Show resolved Hide resolved

Simplify anynan

2437e07

bkamins approved these changes Mar 1, 2021

View reviewed changes

nalimilan merged commit 870712d into master Mar 7, 2021

nalimilan deleted the nl/nanrank branch March 7, 2021 16:59

adknudson mentioned this pull request Mar 20, 2021

fixed corkendall(RealMatrix, RealVector) #673

Closed

devmotion mentioned this pull request Sep 25, 2023

Fix test errors on Julia nightly #895

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve handling of NaN in ranks and rank correlations #659

Improve handling of NaN in ranks and rank correlations #659

nalimilan commented Feb 14, 2021

nalimilan commented Feb 15, 2021

alyst commented Feb 15, 2021

PGS62 commented Feb 15, 2021

piever commented Feb 15, 2021

nalimilan commented Feb 15, 2021

bkamins commented Feb 16, 2021 •

edited

Loading

alyst commented Feb 16, 2021

bkamins commented Feb 16, 2021

nalimilan commented Feb 25, 2021

bkamins commented Feb 26, 2021

PGS62 commented Apr 13, 2021

nalimilan commented Apr 13, 2021

PGS62 commented Apr 13, 2021

bkamins commented Apr 13, 2021

PGS62 commented Apr 13, 2021

Improve handling of NaN in ranks and rank correlations #659

Improve handling of NaN in ranks and rank correlations #659

Conversation

nalimilan commented Feb 14, 2021

nalimilan commented Feb 15, 2021

alyst commented Feb 15, 2021

PGS62 commented Feb 15, 2021

piever commented Feb 15, 2021

nalimilan commented Feb 15, 2021

bkamins commented Feb 16, 2021 • edited Loading

alyst commented Feb 16, 2021

bkamins commented Feb 16, 2021

nalimilan commented Feb 25, 2021

bkamins commented Feb 26, 2021

PGS62 commented Apr 13, 2021

nalimilan commented Apr 13, 2021

PGS62 commented Apr 13, 2021

bkamins commented Apr 13, 2021

PGS62 commented Apr 13, 2021

bkamins commented Feb 16, 2021 •

edited

Loading