Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DataFrame.explode/2 #751

Merged
merged 3 commits into from
Dec 1, 2023

Conversation

costaraphael
Copy link
Contributor

Hey folks!

With this one list processing becomes essentially complete IMO, as if there is no convenient list-specific function for some operation (like string trimming, or arithmetic operations), you can just explode the column, process the elements, and then aggregate them back into lists.

One thing that caught me by surprise while implementing this is that I assumed that passing multiple columns to explode would perform a cartesian-style expansion on the data frame, which is not the case. It instead matches the list elements one by one for each row of the exploded columns.

Since I can see how this can be a bit unexpected, I've addressed this behavior in the function docs. Let me know if you think it should be given more emphasis, or if you think it should force a different behavior.

Regarding the name, explode sounds nice and it matches exactly what happens in Polars, so I personally like it. It also gives us a nice error message for free when the list lengths don't match:

** (RuntimeError) Polars Error: lengths don't match: exploded columns must have matching element counts

The other name I could think of is unnest, which is the name of a similar operation in PostgreSQL, but the semantics are a bit different, and there would be extra work translating the error message. Let me know your thoughts on it!

Copy link
Member

@philss philss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I like the Polars behaviour with multiple columns explosion, and that you kept it :)

I added one suggestion of validation, but the other one - about the out_df - is optional for now.

Also, I'm going to release a new version without this, but we can work in another patch version soon.

Thank you for the PR!

@doc type: :single
@spec explode(df :: DataFrame.t(), column :: column_name() | [column_name()]) :: DataFrame.t()
def explode(%DataFrame{} = df, column_or_columns) do
columns = to_existing_columns(df, List.wrap(column_or_columns))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could validate that columns is a list of {:list, any()}. Does it return error if we pass a column that is not a list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just silently ignores the non-list columns and the operation is successful. Do you think we should raise on the Elixir side in this case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@costaraphael personally I think it should raise if one of the columns is not of a list dtype.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philss @josevalim done in a1ad581!

Since it seems like this is the first time this kind of validation is performed at the DataFrame level (checking if the given columns are of a certain type), I've chosen not to introduce any new abstractions. I figured this could be done once there were a couple more examples of this. Let me know if you disagree!

@@ -748,6 +748,11 @@ defmodule Explorer.PolarsBackend.DataFrame do
|> Shared.create_dataframe()
end

@impl true
def explode(df, columns) do
Shared.apply_dataframe(df, df, :df_explode, [columns])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the callback could receive the out_df with the "correct" dtype. It's OK to have the first version without it, but I think we can say what dtypes the out DF is going to have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philss done it in 6622caa

I also found a bug when aggregating the series elements into lists while implementing this. The commit above should also fix that bug (I've added a test for it).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@costaraphael excellent! ✨
thanks!

@josevalim josevalim merged commit 0955a1d into elixir-explorer:main Dec 1, 2023
4 checks passed
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants