Add DF.transform #912

Merged: 4 commits merged into main from bl-df-transform on Jul 3, 2024

Conversation

@billylanchantin (Contributor) commented May 25, 2024

Description

Adds DF.transform/3, the DataFrame analogue of S.transform/2. I've needed a version of this function many times in my own work.

Example

alias Explorer.DataFrame, as: DF

df = DF.new(
  numbers: [1, 2],
  datetime_local: [~N[2024-01-01 00:00:00], ~N[2024-01-01 00:00:00]],
  timezone: ["Etc/UTC", "America/New_York"]
)

DF.transform(df, [names: ["datetime_local", "timezone"]], fn row ->
  datetime_utc =
    row["datetime_local"]
    |> DateTime.from_naive!(row["timezone"])
    |> DateTime.shift_zone!("Etc/UTC")

  %{datetime_utc: datetime_utc}
end)

# #Explorer.DataFrame<
#   Polars[2 x 4]
#   numbers s64 [1, 2]
#   datetime_local naive_datetime[μs] [2024-01-01 00:00:00.000000, 2024-01-01 00:00:00.000000]
#   timezone string ["Etc/UTC", "America/New_York"]
#   datetime_utc datetime[μs, Etc/UTC] [2024-01-01 00:00:00.000000Z, 2024-01-01 05:00:00.000000Z]
# >

@josevalim (Member)

This operation is doing three things at the moment (a manual decomposition is sketched after the list):

  1. selecting
  2. converting to rows
  3. merging the columns (which we call concat_columns)
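
For reference, the same result can be assembled by hand from those three steps. A sketch of the decomposition, not the actual implementation:

selected = DF.select(df, ["datetime_local", "timezone"])

datetimes_utc =
  selected
  |> DF.to_rows()
  |> Enum.map(fn row ->
    row["datetime_local"]
    |> DateTime.from_naive!(row["timezone"])
    |> DateTime.shift_zone!("Etc/UTC")
  end)

# Merge the computed column back in.
DF.concat_columns([df, DF.new(datetime_utc: datetimes_utc)])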

It also has the limitation that it computes only a single column. For example, we could instead have:

DF.transform(df, [columns: ["datetime_local", "timezone"]], fn row ->
  datetime_utc =
    row["datetime_local"]
    |> DateTime.from_naive!(row["timezone"])
    |> DateTime.shift_zone!("Etc/UTC")

  [datetime_utc: datetime_utc]
end)

We could also emit a custom row struct that accepts both string and atom keys and converts fields as necessary. For example, imagine we had a %Explorer.DataFrame.Row{index: index, df: df}. When you called row["datetime_local"], it would fetch that particular column and access it at index. Does Polars guarantee constant-time access to all of its rows? If it does, then we can provide both atom/string ergonomics and only convert the necessary keys lazily.
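
A minimal sketch of what such a row struct could look like, using the Access behaviour with Explorer.Series.at/2 as the single column+row read; the struct and its fields are assumptions for illustration, not the merged API:

defmodule Explorer.DataFrame.Row do
  # Hypothetical lazy row: holds a reference to the dataframe plus a row
  # index, instead of materialized values.
  defstruct [:df, :index]

  @behaviour Access

  @impl Access
  def fetch(%__MODULE__{df: df, index: index}, key) do
    # Accept atom or string keys by normalizing to Explorer's string names.
    name = to_string(key)

    if name in Explorer.DataFrame.names(df) do
      # One column+row access; only the requested key is ever converted.
      {:ok, Explorer.Series.at(df[name], index)}
    else
      :error
    end
  end

  @impl Access
  def get_and_update(_row, _key, _fun), do: raise(ArgumentError, "rows are read-only")

  @impl Access
  def pop(_row, _key), do: raise(ArgumentError, "rows are read-only")
end

With that in place, row["datetime_local"] and row[:datetime_local] would both resolve through fetch/2, and columns that are never read would never be deserialized.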

@josevalim (Member)

However, we should benchmark the approaches. The lazy one may end up being less efficient if we make too many trips to Rust. We should certainly have a single operation to access a given column+row.
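
For example, a quick Benchee comparison along these lines could settle it; both anonymous functions are stand-ins for the real implementations:

series = df["datetime_local"]

Benchee.run(%{
  # Stand-in for the eager approach: convert the selected columns to rows.
  "eager to_rows" => fn ->
    df |> DF.to_rows() |> Enum.map(fn row -> row["datetime_local"] end)
  end,
  # Stand-in for the lazy approach: one column+row access per value.
  "lazy column+row" => fn ->
    for i <- 0..(DF.n_rows(df) - 1), do: Explorer.Series.at(series, i)
  end
})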

@billylanchantin (Contributor, Author) commented May 25, 2024

José, I didn't want to use my brain today :P

It also has the limitation that it computes only a single column.

👍 Yeah we could definitely get multiple columns with concat_columns. Great suggestion.

EDIT: removed a comment about validation.

Does Polars guarantee constant-time access to all of its rows?

I don't think so, but I'm not sure. I couldn't find a definitive answer in the docs.

They seem to support several kinds of index-based access, and I'm not sure which is the "right" one. Following some source code led me to this file:

If this is the right place, I see several references to binary searches. That makes me think it's $O(k \cdot \log(n))$. Maybe they can get good amortized performance?

However, we should benchmark the approaches. The lazy one may end up being less efficient if we make too many trips to Rust. We should certainly have a single operation to access a given column+row.

Yeah, some benchmarks are definitely in order. I suspect the most expensive part is the deserialization step required to feed the Elixir functions. I'll try your lazy approach and get back with some numbers.

I also want to try to leverage Arrow's chunking. If deserializing a single chunk is fast, it may be worth parallelizing over chunks on the Elixir side rather than trying to trick Polars into doing what we want (a rough sketch is below). IDK how easy that level of control will be, though.
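
Roughly the shape I have in mind, approximating Arrow chunks with fixed-size slices since Explorer doesn't expose chunk boundaries directly; all names here are hypothetical:

defmodule ChunkedTransform do
  alias Explorer.Series

  # Apply fun to every element of series, one slice at a time, in parallel.
  def map_slices(series, fun, slice_size \\ 10_000) do
    size = Series.size(series)

    0..(size - 1)//slice_size
    |> Task.async_stream(fn start ->
      # Deserialize just this slice, then run the Elixir function over it.
      series
      |> Series.slice(start, min(slice_size, size - start))
      |> Series.to_list()
      |> Enum.map(fun)
    end, ordered: true)
    |> Enum.flat_map(fn {:ok, values} -> values end)
  end
end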

@josevalim (Member)

My understanding from the Rust code is that they do a binary search only if there are several chunks. What we may want to do is to rechunk the dataframe before using it. Another potential concern here is doing the bounds check on every operation, but they do have an _unchecked version.

@billylanchantin changed the title from "Add DF.transform/4" to "Add DF.transform" on Jun 1, 2024
@billylanchantin merged commit 66cc50c into main on Jul 3, 2024
3 checks passed
@billylanchantin deleted the bl-df-transform branch on Jul 3, 2024 at 17:55