Add `DataFrame.explode/2` #751

costaraphael · 2023-11-30T19:20:29Z

Hey folks!

With this one list processing becomes essentially complete IMO, as if there is no convenient list-specific function for some operation (like string trimming, or arithmetic operations), you can just explode the column, process the elements, and then aggregate them back into lists.

One thing that caught me by surprise while implementing this is that I assumed that passing multiple columns to explode would perform a cartesian-style expansion on the data frame, which is not the case. It instead matches the list elements one by one for each row of the exploded columns.

Since I can see how this can be a bit unexpected, I've addressed this behavior in the function docs. Let me know if you think it should be given more emphasis, or if you think it should force a different behavior.

Regarding the name, explode sounds nice and it matches exactly what happens in Polars, so I personally like it. It also gives us a nice error message for free when the list lengths don't match:

** (RuntimeError) Polars Error: lengths don't match: exploded columns must have matching element counts

The other name I could think of is unnest, which is the name of a similar operation in PostgreSQL, but the semantics are a bit different, and there would be extra work translating the error message. Let me know your thoughts on it!

philss

LGTM! I like the Polars behaviour with multiple columns explosion, and that you kept it :)

I added one suggestion of validation, but the other one - about the out_df - is optional for now.

Also, I'm going to release a new version without this, but we can work in another patch version soon.

Thank you for the PR!

philss · 2023-11-30T19:58:51Z

lib/explorer/data_frame.ex

+  @doc type: :single
+  @spec explode(df :: DataFrame.t(), column :: column_name() | [column_name()]) :: DataFrame.t()
+  def explode(%DataFrame{} = df, column_or_columns) do
+    columns = to_existing_columns(df, List.wrap(column_or_columns))


I think we could validate that columns is a list of {:list, any()}. Does it return error if we pass a column that is not a list?

It just silently ignores the non-list columns and the operation is successful. Do you think we should raise on the Elixir side in this case?

@costaraphael personally I think it should raise if one of the columns is not of a list dtype.

@philss @josevalim done in a1ad581!

Since it seems like this is the first time this kind of validation is performed at the DataFrame level (checking if the given columns are of a certain type), I've chosen not to introduce any new abstractions. I figured this could be done once there were a couple more examples of this. Let me know if you disagree!

philss · 2023-11-30T20:02:32Z

lib/explorer/polars_backend/data_frame.ex

@@ -748,6 +748,11 @@ defmodule Explorer.PolarsBackend.DataFrame do
    |> Shared.create_dataframe()
  end

+  @impl true
+  def explode(df, columns) do
+    Shared.apply_dataframe(df, df, :df_explode, [columns])


I think the callback could receive the out_df with the "correct" dtype. It's OK to have the first version without it, but I think we can say what dtypes the out DF is going to have.

@philss done it in 6622caa

I also found a bug when aggregating the series elements into lists while implementing this. The commit above should also fix that bug (I've added a test for it).

@costaraphael excellent! ✨
thanks!

josevalim · 2023-12-01T12:01:19Z

💚 💙 💜 💛 ❤️

Add DataFrame.explode/2

76f5991

philss reviewed Nov 30, 2023

View reviewed changes

costaraphael added 2 commits November 30, 2023 20:30

Assert correct out type for explode

6622caa

Validate columns are of the list type

a1ad581

josevalim merged commit 0955a1d into elixir-explorer:main Dec 1, 2023
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `DataFrame.explode/2` #751

Add `DataFrame.explode/2` #751

costaraphael commented Nov 30, 2023

philss left a comment

philss Nov 30, 2023

costaraphael Nov 30, 2023

philss Nov 30, 2023

josevalim Dec 1, 2023

costaraphael Dec 1, 2023

philss Nov 30, 2023

costaraphael Nov 30, 2023

philss Nov 30, 2023

josevalim commented Dec 1, 2023

Add DataFrame.explode/2 #751

Add DataFrame.explode/2 #751

Conversation

costaraphael commented Nov 30, 2023

philss left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

josevalim commented Dec 1, 2023

Add `DataFrame.explode/2` #751

Add `DataFrame.explode/2` #751