Spearman correlation gives difference result as Scipy #9407

Closed
2 tasks done
qqlearn123 opened this issue Jun 17, 2023 · 2 comments · Fixed by #9415
Labels
bug Something isn't working python Related to Python Polars

Comments

qqlearn123 commented Jun 17, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Spearman correlation in Polars gives a different result than the equivalent in SciPy.

The difference comes from the ranking method for ties: Polars uses min, whereas SciPy uses average.

Is this behavior a deliberate design choice? If so, would it be possible to provide an alternative that matches SciPy's behavior?

Note: Spearman correlation in R also uses average as its ranking method.
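To make the tie-handling difference concrete, here is a minimal pure-Python sketch of the two ranking methods applied to column "a" from the example in this issue. The helper name `rank_values` is hypothetical and is not part of Polars or SciPy; it only illustrates what "min" and "average" mean for tied values.

```python
def rank_values(values, method="average"):
    """Return 1-based ranks; tied values share the 'min' or 'average' rank.

    Hypothetical helper for illustration only, not a library function.
    """
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # 1-based position of the first tied value
        count = ordered.count(v)       # how many values are tied with v
        if method == "min":
            ranks.append(float(first))
        elif method == "average":
            # mean of the positions first, first + 1, ..., first + count - 1
            ranks.append(first + (count - 1) / 2)
        else:
            raise ValueError(f"unknown method: {method!r}")
    return ranks


a = [1, 1, 1, 2, 3, 7, 4]
print(rank_values(a, "min"))      # [1.0, 1.0, 1.0, 4.0, 5.0, 7.0, 6.0]
print(rank_values(a, "average"))  # [2.0, 2.0, 2.0, 4.0, 5.0, 7.0, 6.0]
```

Only the three tied 1s are ranked differently (1 versus 2), which is enough to change the resulting correlation.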

Reproducible example

import polars as pl
import scipy.stats  # import the submodule explicitly so scipy.stats.spearmanr resolves

df = pl.DataFrame({"a": [1, 1, 1, 2, 3, 7, 4], "b": [4, 3, 2, 2, 4, 3, 1]})

df.select(
    pl.corr("a", "b", method="spearman"),
    pl.corr(pl.col("a").rank("min"), pl.col("b").rank("min")).alias("a2"),
    pl.corr(pl.col("a").rank(), pl.col("b").rank()).alias("a3"),
)

# a           a2          a3
# ---         ---         ---
# f64         f64         f64
# -0.172237   -0.172237   -0.190485

Expected behavior

scipy.stats.spearmanr([1, 1, 1, 2, 3, 7, 4], [4, 3, 2, 2, 4, 3, 1])
# -0.1904848294
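As a cross-check that depends on neither library, the sketch below computes Spearman correlation as the Pearson correlation of the ranks, once with "min" ranks and once with "average" ranks. The function names (`rank_values`, `pearson`, `spearman`) are hypothetical helpers written for this illustration; this is not how Polars or SciPy implement the calculation internally.

```python
from math import sqrt


def rank_values(values, method="average"):
    """1-based ranks; tied values share the 'min' or 'average' rank (illustrative helper)."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # 1-based position of the first tied value
        count = ordered.count(v)       # number of values tied with v
        if method == "min":
            ranks.append(float(first))
        elif method == "average":
            ranks.append(first + (count - 1) / 2)
        else:
            raise ValueError(f"unknown method: {method!r}")
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y, method="average"):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(rank_values(x, method), rank_values(y, method))


a = [1, 1, 1, 2, 3, 7, 4]
b = [4, 3, 2, 2, 4, 3, 1]
print(spearman(a, b, "min"))      # approx. -0.172237, the value reported for Polars above
print(spearman(a, b, "average"))  # approx. -0.190485, the SciPy value
```

With average ranks the result works out to -5 / sqrt(689), which agrees with the SciPy value quoted above.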

Installed versions

Polars: 0.17.15

qqlearn123 added the bug and python labels Jun 17, 2023

stinodego (Contributor) commented

Wikipedia states that taking the average is the most common approach. We should probably update the calculation.

zundertj (Collaborator) commented

Fixed this in #9415. Thank you for the clear code example; I could use it directly as a unit test.
