Pairwise correlation #759

Merged
1 change: 1 addition & 0 deletions lib/explorer/backend/data_frame.ex
@@ -203,6 +203,7 @@ defmodule Explorer.Backend.DataFrame do
  @callback nil_count(df) :: df()
  @callback explode(df, out_df :: df(), columns :: [column_name()]) :: df()
  @callback unnest(df, out_df :: df(), columns :: [column_name()]) :: df()
  @callback correlation(df, out_df :: df(), ddof :: integer()) :: df()

# Two or more table verbs

53 changes: 53 additions & 0 deletions lib/explorer/data_frame.ex
@@ -5657,6 +5657,59 @@ defmodule Explorer.DataFrame do

def frequencies(_df, []), do: raise(ArgumentError, "columns cannot be empty")

  @doc """
  Calculates the pairwise Pearson's correlation of numeric columns.

  ## Supported dtypes

  Only columns with the following dtypes are taken into account.

    * `:integer`
    * `{:f, 32}`
    * `{:f, 64}`

  The resultant columns are always `{:f, 64}`.

  ## Options

    * `:columns` - the selection of columns to calculate. Defaults to all numeric columns.
    * `:column_name` - the name of the column with column names. Defaults to "names".
    * `:ddof` - the 'delta degrees of freedom' - the divisor used in the correlation
      calculation. Defaults to 1.

  ## Examples

      iex> df = Explorer.DataFrame.new(dogs: [1, 8, 3], cats: [4, 5, 2])
      iex> Explorer.DataFrame.correlation(df)
      #Explorer.DataFrame<
        Polars[2 x 3]
        names string ["dogs", "cats"]
        dogs f64 [1.0000000000000002, 0.5447047794019219]
        cats f64 [0.5447047794019219, 1.0]
      >
  """
  @doc type: :single
  @spec correlation(df :: DataFrame.t(), opts :: Keyword.t()) :: df :: DataFrame.t()
  def correlation(df, opts \\ []) do
    opts = Keyword.validate!(opts, column_name: "names", columns: names(df), ddof: 1)

    column_name = to_column_name(opts[:column_name])

    cols =
      df
      |> to_existing_columns(opts[:columns])
      |> Enum.filter(fn name -> numeric_column?(df, name) end)

    out_dtypes = for col <- cols, into: %{column_name => :string}, do: {col, {:f, 64}}
    out_df = %{df | dtypes: out_dtypes, names: [column_name | cols]}

    Shared.apply_impl(df, :correlation, [out_df, opts[:ddof]])
  end

  defp numeric_column?(df, name) do
    Series.dtype(df[name]) in [:integer | Explorer.Shared.float_types()]
  end

# Helpers

defp backend_from_options!(opts) do
20 changes: 20 additions & 0 deletions lib/explorer/polars_backend/data_frame.ex
@@ -764,6 +764,26 @@ defmodule Explorer.PolarsBackend.DataFrame do
    Shared.apply_dataframe(df, out_df, :df_unnest, [columns])
  end

  @impl true
  def correlation(df, out_df, ddof) do
    [column_name | cols] = out_df.names

    correlations =
      Enum.map(cols, fn left ->
        corr_series =
          cols
          |> Enum.map(fn right -> PolarsSeries.correlation(df[left], df[right], ddof) end)
          |> Shared.from_list({:f, 64})
          |> Shared.create_series()

        {left, corr_series}
      end)

    names_series = cols |> Shared.from_list(:string) |> Shared.create_series()

    from_series([{column_name, names_series} | correlations])
  end

Do we need it in the backend? I think the implementation within Explorer.DataFrame was better, no? Or is this faster in any way?

@josevalim we did it in the backend because I wanted to raise for the lazy version, since we cannot implement it for lazy frames right now. Do you think it is worth reverting anyway?

@philss maybe we can implement it with lazy by implementing it with a mutate + select/discard?

This way it works with lazy and within Explorer.DataFrame as well! Something like:

mutate_with(df, fn ldf ->
  Enum.map(columns, ...)
end)
|> select(existing_columns -- new_columns, :discard)

WDYT?

Oh, I'm going to try.

@josevalim I think this won't work the way we want, because mutate_with expects a single lazy series/expression as each value, and here we would be passing a list of lazy series. On the other hand, we could try to create a column for each pair and pivot the results later. But again, pivoting does not work with lazy frames.

Maybe there is another way to reshape this DF, but I don't know yet.
I'm going to investigate more tomorrow :)
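
To make the shape problem concrete, here is a minimal sketch in plain Elixir with no Explorer calls. `PairwiseSketch` and its functions are hypothetical names used only for illustration. Step 1 produces one scalar per pair, which is the kind of result a mutate-style call can express; step 2 is the reshape into a "names" column plus one column per variable, which plays the role of the pivot mentioned above.

```elixir
defmodule PairwiseSketch do
  # Hypothetical helper module, not part of Explorer.

  # Pearson correlation of two equally sized lists of numbers.
  def pearson(xs, ys) do
    n = length(xs)
    mean_x = Enum.sum(xs) / n
    mean_y = Enum.sum(ys) / n

    {sxy, sxx, syy} =
      xs
      |> Enum.zip(ys)
      |> Enum.reduce({0.0, 0.0, 0.0}, fn {x, y}, {sxy, sxx, syy} ->
        dx = x - mean_x
        dy = y - mean_y
        {sxy + dx * dy, sxx + dx * dx, syy + dy * dy}
      end)

    sxy / :math.sqrt(sxx * syy)
  end

  # Step 1: one scalar per ordered pair of columns (the "wide" shape).
  def per_pair(columns) do
    for {left, xs} <- columns, {right, ys} <- columns do
      {{left, right}, pearson(xs, ys)}
    end
  end

  # Step 2: reshape the per-pair scalars into a names column plus
  # one column of correlations per variable.
  def to_matrix(pairs, names, column_name \\ "names") do
    lookup = Map.new(pairs)

    columns =
      for left <- names do
        {left, for(right <- names, do: Map.fetch!(lookup, {left, right}))}
      end

    [{column_name, names} | columns]
  end
end

columns = [{"dogs", [1, 2, 3]}, {"cats", [3, 2, 1]}]
names = Enum.map(columns, &elem(&1, 0))

columns
|> PairwiseSketch.per_pair()
|> PairwiseSketch.to_matrix(names)
|> IO.inspect()
```

The eager backend implementation in this PR avoids the intermediate per-pair shape by building each output column directly from a list of values.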

Hrm, now that you mention it, I think you are right. Feel free to ship it. :)

Actually, don't ship it yet. I would like Chris' approval on the API before. :D
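
One point that may be useful when reviewing the `:ddof` option: if the same `ddof` is applied to the covariance and to both standard deviations, the `n - ddof` divisor cancels algebraically in Pearson's formula. The sketch below checks this in plain Elixir. `DdofSketch` is a hypothetical module written for this note, and it makes no claim about how the Polars backend applies `ddof` internally.

```elixir
defmodule DdofSketch do
  # Hypothetical helper module, not part of Explorer or Polars.

  # Covariance of two equally sized lists, using the divisor `n - ddof`.
  def covariance(xs, ys, ddof) do
    n = length(xs)
    mean_x = Enum.sum(xs) / n
    mean_y = Enum.sum(ys) / n

    sum =
      xs
      |> Enum.zip(ys)
      |> Enum.reduce(0.0, fn {x, y}, acc -> acc + (x - mean_x) * (y - mean_y) end)

    sum / (n - ddof)
  end

  # Pearson correlation: covariance divided by the product of the standard
  # deviations, all computed with the same `ddof`.
  def correlation(xs, ys, ddof \\ 1) do
    std_x = :math.sqrt(covariance(xs, xs, ddof))
    std_y = :math.sqrt(covariance(ys, ys, ddof))
    covariance(xs, ys, ddof) / (std_x * std_y)
  end
end

dogs = [1, 8, 3]
cats = [4, 5, 2]

# Both calls should agree up to floating-point rounding.
IO.inspect(DdofSketch.correlation(dogs, cats, 0))
IO.inspect(DdofSketch.correlation(dogs, cats, 1))
```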


# Two or more table verbs

@impl true
1 change: 1 addition & 0 deletions lib/explorer/polars_backend/lazy_frame.ex
@@ -488,6 +488,7 @@ defmodule Explorer.PolarsBackend.LazyFrame do
  end

  not_available_funs = [
    correlation: 3,
    describe: 2,
    nil_count: 1,
    dummies: 3,
55 changes: 55 additions & 0 deletions test/explorer/data_frame_test.exs
@@ -3855,4 +3855,59 @@ defmodule Explorer.DataFrameTest do
fn -> DF.unnest(df, [:a, :b]) end
end
end

  describe "correlation/2" do
    test "two integer columns" do
      df = DF.new(dogs: [1, 8, 3], cats: [4, 5, 2])
      df1 = DF.correlation(df)

      assert DF.to_columns(df1, atom_keys: true) == %{
               names: ["dogs", "cats"],
               dogs: [1.0000000000000002, 0.5447047794019219],
               cats: [0.5447047794019219, 1.0]
             }
    end

    test "three integer columns and custom column name" do
      df = DF.new(dogs: [1, 2, 3], cats: [3, 2, 1], frogs: [7, 8, 9])
      df1 = DF.correlation(df, column_name: "variables")

      assert DF.to_columns(df1, atom_keys: true) == %{
               variables: ["dogs", "cats", "frogs"],
               dogs: [1.0, -1.0, 1.0],
               cats: [-1.0, 1.0, -1.0],
               frogs: [1.0, -1.0, 1.0]
             }
    end

    test "two float columns" do
      df = DF.new(dogs: [1.4, 8.6, 3.7], cats: [4.1, 5.3, 2.2])
      df1 = DF.correlation(df)

      assert DF.to_columns(df1, atom_keys: true) == %{
               names: ["dogs", "cats"],
               dogs: [0.9999999999999999, 0.5642328261411999],
               cats: [0.5642328261411999, 0.9999999999999998]
             }
    end

    test "one column" do
      df = DF.new(cats: [4, 5, 2])
      df1 = DF.correlation(df)

      assert DF.to_columns(df1, atom_keys: true) == %{
               names: ["cats"],
               cats: [1.0]
             }
    end

    test "no numeric columns" do
      df = DF.new(cats: ["susie", "tuka", "tobias", "terror"])
      df1 = DF.correlation(df)

      assert DF.to_columns(df1, atom_keys: true) == %{
               names: []
             }
    end
  end
end