Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: parse pandas pivot null values #29898

Merged
merged 3 commits into from
Sep 25, 2024
Merged

Conversation

eschutho
Copy link
Member

@eschutho eschutho commented Aug 8, 2024

fixes: #27499

SUMMARY

If a result has a null value for a pivot table with totals, we are seeing this error:

pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "NULL"

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@dosubot dosubot bot added python Dependabot - Pull requests that update Python code viz:charts:pivot Related to the Pivot Table charts labels Aug 8, 2024
Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this conversion shouldn't be optional.

@@ -150,6 +151,8 @@ def pivot_df( # pylint: disable=too-many-locals, too-many-arguments, too-many-s
if show_rows_total:
# add subtotal for each group and overall total; we start from the
# overall group, and iterate deeper into subgroups
# Ensure "NULL" strings are replaced with NaN
df.replace("NULL", np.nan, inplace=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should do this automatically, ideally there should be an option when building the pivot table to have this conversion. Or people could do it as a derived column or virtual dataset.

Imagine I have a table of users and someone has the username 'NULL'. I don't think we should do this conversion in that case. This is not hypothetical, Instagram's data infra once went down because someone created a user called null.

(Not as bad as when an employee broke the Facebook intranet by using their initials as their username — www.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of breaking things, looks like the GitHub auto-linker is broken...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, Bobby Tables.
Are you suggesting this chart has a user-defined option to fill nulls with ? or maybe a drop-down of some value?

Copy link
Member

@mistercrunch mistercrunch Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, my name is NULL, and my last name is '); DROP TABLE users; --

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@betodealmeida and I met and talked about using a different placeholder string that we thought would be an unlikely "real" value: SUPERSET_PANDAS_NAN

@eschutho eschutho force-pushed the elizabeth/fix-pandas-pivot-null branch 2 times, most recently from 0e44172 to c34d0bb Compare August 9, 2024 00:04
Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think I misunderstood this. The 'NULL' string is being introduced here:

# pivoting with null values will create an empty df
df = df.fillna("NULL")

I wonder if (1) we should do the conversion back to nan in the same function (pivot_df), and if (2) we should use a different sentinel value?

@@ -171,7 +174,7 @@ def pivot_df( # pylint: disable=too-many-locals, too-many-arguments, too-many-s
for subgroup in subgroups:
slice_ = df.index.get_loc(subgroup)
subtotal = pivot_v2_aggfunc_map[aggfunc](
df.iloc[slice_, :].apply(pd.to_numeric), axis=0
df.iloc[slice_, :].apply(pd.to_numeric, errors="coerce"), axis=0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this might be the only change needed to deal with this issue: #27499

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas is so opaque, especially when you haven't touched it for years - .iloc[]?, "coerce"!?!? might be worth adding a comment that explains what it's doing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's locs and ilocs all the way down 🐢

aggfunc="Sum",
transpose_pivot=False,
combine_metrics=False,
show_rows_total=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weirdly, the case I have where i replicate this NULL issue only occurs when I have one of the column or row total set to True.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah, I don't think it would fail in this case, but I added a test for all the combos.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test coverage for nulls here was well-needed! great

@eschutho eschutho force-pushed the elizabeth/fix-pandas-pivot-null branch from c34d0bb to 92206f2 Compare September 18, 2024 21:07
@eschutho eschutho force-pushed the elizabeth/fix-pandas-pivot-null branch from 92206f2 to 0ae7342 Compare September 18, 2024 21:15
@@ -86,7 +87,8 @@ def pivot_df( # pylint: disable=too-many-locals, too-many-arguments, too-many-s
# pivot data; we'll compute totals and subtotals later
if rows or columns:
# pivoting with null values will create an empty df
df = df.fillna("NULL")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmh, seems the frontend should be doing this ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mistercrunch I'm not sure I got this..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mistercrunch the problem here is that we do the pivot in Pandas (for reports and CSV download), and it will fail if the dataframe has NaNs.

else:
# when we applied metrics on rows, we switched the columns and rows
# so checking column type doesn't apply. Replace everything with np.nan
df.replace("SUPERSET_PANDAS_NAN", np.nan, inplace=True)
Copy link
Member Author

@eschutho eschutho Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@betodealmeida we need this section here (from line 156) when totaling so that we 1) can sum with numbers (by converting the string "SUPERSET_PANDAS_NAN" with np.nan or 2) can sum with a string value. I'm using "nan" so that we don't print "SUPERSET_PANDAS_NAN".

columns={"SUPERSET_PANDAS_NAN": np.nan},
inplace=True,
)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting the values back so that we don't print "SUPERSET_PANDAS_NAN"

if pd.api.types.is_numeric_dtype(df[col]):
df[col].replace("SUPERSET_PANDAS_NAN", np.nan, inplace=True)
else:
df[col].replace("SUPERSET_PANDAS_NAN", "nan", inplace=True)
Copy link
Member Author

@eschutho eschutho Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose the string "nan" here because that is the default behavior when there is a null value when pivoting without sums.

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome job! ❤️

@eschutho eschutho merged commit 0e8fa54 into master Sep 25, 2024
33 of 34 checks passed
@rusackas rusackas deleted the elizabeth/fix-pandas-pivot-null branch September 27, 2024 20:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
preset-io python Dependabot - Pull requests that update Python code size/XL viz:charts:pivot Related to the Pivot Table charts
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Export to pivoted .csv: internal server error 500 (Apache Superset version 3.0.1)
4 participants