Add DF.transform #912

Merged: 4 commits merged into main from bl-df-transform on Jul 3, 2024

Conversation

@billylanchantin (Contributor) commented May 25, 2024

Description

Adds DF.transform/3, the DataFrame analogue of S.transform/2. I've needed a version of this function many times in my own work.

Example

alias Explorer.DataFrame, as: DF

df = DF.new(
  numbers: [1, 2],
  datetime_local: [~N[2024-01-01 00:00:00], ~N[2024-01-01 00:00:00]],
  timezone: ["Etc/UTC", "America/New_York"]
)

DF.transform(df, [names: ["datetime_local", "timezone"]], fn row ->
  datetime_utc =
    row["datetime_local"]
    |> DateTime.from_naive!(row["timezone"])
    |> DateTime.shift_zone!("Etc/UTC")

  %{datetime_utc: datetime_utc}
end)

# #Explorer.DataFrame<
#   Polars[2 x 4]
#   numbers s64 [1, 2]
#   datetime_local naive_datetime[μs] [2024-01-01 00:00:00.000000, 2024-01-01 00:00:00.000000]
#   timezone string ["Etc/UTC", "America/New_York"]
#   datetime_utc datetime[μs, Etc/UTC] [2024-01-01 00:00:00.000000Z, 2024-01-01 05:00:00.000000Z]
# >

@josevalim (Member)

This operation is doing three things at the moment (a manual decomposition is sketched after the list):

  1. selecting
  2. converting to rows
  3. merging the columns (which we call concat_columns)
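
For reference, the same result can be assembled by hand from those three steps. A sketch of the decomposition, not the actual implementation:

selected = DF.select(df, ["datetime_local", "timezone"])

datetimes_utc =
  selected
  |> DF.to_rows()
  |> Enum.map(fn row ->
    row["datetime_local"]
    |> DateTime.from_naive!(row["timezone"])
    |> DateTime.shift_zone!("Etc/UTC")
  end)

# Merge the computed column back in.
DF.concat_columns([df, DF.new(datetime_utc: datetimes_utc)])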

It also has the limitation that it computes only a single column. For example, we could instead have:

DF.transform(df, [columns: ["datetime_local", "timezone"]], fn row ->
  datetime_utc =
    row["datetime_local"]
    |> DateTime.from_naive!(row["timezone"])
    |> DateTime.shift_zone!("Etc/UTC")

  [datetime_utc: datetime_utc]
end)

We could also emit a custom row struct that accepts both string and atom keys and converts fields as necessary. For example, imagine we had a %Explorer.DataFrame.Row{index: index, df: df}. When you called row["datetime_local"], it would fetch that particular column and access it at index. Does Polars guarantee constant-time access to all of its rows? If it does, then we can provide both atom/string ergonomics and only convert the necessary keys lazily.
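
A minimal sketch of what such a row struct could look like, using the Access behaviour with Explorer.Series.at/2 as the single column+row read; the struct and its fields are assumptions for illustration, not the merged API:

defmodule Explorer.DataFrame.Row do
  # Hypothetical lazy row: holds a reference to the dataframe plus a row
  # index, instead of materialized values.
  defstruct [:df, :index]

  @behaviour Access

  @impl Access
  def fetch(%__MODULE__{df: df, index: index}, key) do
    # Accept atom or string keys by normalizing to Explorer's string names.
    name = to_string(key)

    if name in Explorer.DataFrame.names(df) do
      # One column+row access; only the requested key is ever converted.
      {:ok, Explorer.Series.at(df[name], index)}
    else
      :error
    end
  end

  @impl Access
  def get_and_update(_row, _key, _fun), do: raise(ArgumentError, "rows are read-only")

  @impl Access
  def pop(_row, _key), do: raise(ArgumentError, "rows are read-only")
end

With that in place, row["datetime_local"] and row[:datetime_local] would both resolve through fetch/2, and columns that are never read would never be deserialized.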

@josevalim (Member)

However, we should benchmark the approaches. The lazy one may end up being less efficient if we make too many trips to Rust. We should certainly have a single operation to access a given column+row.
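
For example, a quick Benchee comparison along these lines could settle it; both anonymous functions are stand-ins for the real implementations:

series = df["datetime_local"]

Benchee.run(%{
  # Stand-in for the eager approach: convert the selected columns to rows.
  "eager to_rows" => fn ->
    df |> DF.to_rows() |> Enum.map(fn row -> row["datetime_local"] end)
  end,
  # Stand-in for the lazy approach: one column+row access per value.
  "lazy column+row" => fn ->
    for i <- 0..(DF.n_rows(df) - 1), do: Explorer.Series.at(series, i)
  end
})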

@billylanchantin (Contributor, Author) commented May 25, 2024

José, I didn't want to use my brain today :P

It also has the limitation that it computes only a single column.

👍 Yeah we could definitely get multiple columns with concat_columns. Great suggestion.

EDIT: removed a comment about validation.

Does Polars guarantee constant-time access to all of its rows?

I don't think so, but I'm not sure. I couldn't find a definitive answer in the docs.

They seem to support several kinds of index-based access, and I'm not sure which is the "right" one. Following some source code led me to this file:

If this is the right place, I see several references to binary searches. That makes me think it's $O(k \cdot \log(n))$. Maybe they can get good amortized performance?

However, we should benchmark the approaches. The lazy one may end up being less efficient if we make too many trips to Rust. We should certainly have a single operation to access a given column+row.

Yeah, some benchmarks are definitely in order. I suspect the most expensive part is the deserialization step required to feed the Elixir functions. I'll try your lazy approach and get back with some numbers.

I also want to try to leverage Arrow's chunking. If deserializing a single chunk is fast, it may be worth parallelizing over chunks on the Elixir side rather than trying to trick Polars into doing what we want (a rough sketch is below). IDK how easy that level of control will be, though.
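
Roughly the shape I have in mind, approximating Arrow chunks with fixed-size slices since Explorer doesn't expose chunk boundaries directly; all names here are hypothetical:

defmodule ChunkedTransform do
  alias Explorer.Series

  # Apply fun to every element of series, one slice at a time, in parallel.
  def map_slices(series, fun, slice_size \\ 10_000) do
    size = Series.size(series)

    0..(size - 1)//slice_size
    |> Task.async_stream(fn start ->
      # Deserialize just this slice, then run the Elixir function over it.
      series
      |> Series.slice(start, min(slice_size, size - start))
      |> Series.to_list()
      |> Enum.map(fun)
    end, ordered: true)
    |> Enum.flat_map(fn {:ok, values} -> values end)
  end
end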

@josevalim (Member)

My understanding from the Rust code is that they do a binary search only if there are several chunks. What we may want to do is to rechunk the dataframe before using it. Another potential concern here is doing the bounds check on every operation, but they do have an _unchecked version.

@billylanchantin changed the title from "Add DF.transform/4" to "Add DF.transform" on Jun 1, 2024
@billylanchantin merged commit 66cc50c into main on Jul 3, 2024
3 checks passed
@billylanchantin deleted the bl-df-transform branch on Jul 3, 2024 at 17:55