fix: parse pandas pivot null values #29898

eschutho · 2024-08-08T23:50:02Z

SUMMARY

If a pivot table with totals has a null value in its result, we see this error:

pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "NULL"
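
A minimal sketch of how this error can arise, assuming the totals step ends up calling `pd.to_numeric` on values that contain the literal string "NULL". The series below is made up for illustration and is not taken from Superset.

```python
import pandas as pd

# Illustrative data: numeric strings plus the "NULL" placeholder.
values = pd.Series(["1", "2", "NULL"])

raised = False
message = ""
try:
    # With the default errors="raise", an unparseable string raises ValueError.
    pd.to_numeric(values)
except ValueError as exc:
    raised = True
    message = str(exc)

print(raised, message)
```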

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

Has associated issue:
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

betodealmeida

Not sure if this conversion shouldn't be optional.

betodealmeida · 2024-08-08T23:58:08Z

superset/charts/post_processing.py

@@ -150,6 +151,8 @@ def pivot_df(  # pylint: disable=too-many-locals, too-many-arguments, too-many-s
    if show_rows_total:
        # add subtotal for each group and overall total; we start from the
        # overall group, and iterate deeper into subgroups
+        # Ensure "NULL" strings are replaced with NaN
+        df.replace("NULL", np.nan, inplace=True)


I don't think we should do this automatically. Ideally there should be an option when building the pivot table to enable this conversion, or people could do it with a derived column or virtual dataset.

Imagine I have a table of users and someone has the username 'NULL'. I don't think we should do this conversion in that case. This is not hypothetical: Instagram's data infra once went down because someone created a user called null.

(Not as bad as when an employee broke the Facebook intranet by using their initials as their username — www.)

Speaking of breaking things, looks like the GitHub auto-linker is broken...

lol, Bobby Tables.
Are you suggesting this chart has a user-defined option to fill nulls with ? or maybe a drop-down of some value?

Hi, my name is NULL, and my last name is '); DROP TABLE users; --

@betodealmeida and I met and talked about using a different placeholder string that we thought would be an unlikely "real" value: SUPERSET_PANDAS_NAN
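
A rough sketch of the concern and of the sentinel idea, using a made-up users table. The column names and data are assumptions for illustration; only the placeholder string comes from this thread.

```python
import numpy as np
import pandas as pd

SENTINEL = "SUPERSET_PANDAS_NAN"  # placeholder string proposed in this thread

# Hypothetical data: one user is really called "NULL", one username is missing.
df = pd.DataFrame({"username": ["alice", "NULL", None], "visits": [3, 5, 7]})

# Filling with "NULL" and converting back also wipes out the real "NULL" user.
naive = df.fillna("NULL").replace("NULL", np.nan)

# A distinctive sentinel round-trips without touching the real value.
safe = df.fillna(SENTINEL).replace(SENTINEL, np.nan)

print(naive["username"].tolist())
print(safe["username"].tolist())
```

The sentinel does not make a collision impossible, it only makes one much less likely than with "NULL".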

betodealmeida

Ah, I think I misunderstood this. The 'NULL' string is being introduced here:

superset/superset/charts/post_processing.py

Lines 88 to 89 in fb6efb9

        # pivoting with null values will create an empty df
        df = df.fillna("NULL")

I wonder if (1) we should do the conversion back to nan in the same function (pivot_df), and if (2) we should use a different sentinel value?
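
For context on why the fill is there at all, here is a small sketch using plain `pd.pivot_table` (an assumption; `pivot_df` may build the pivot differently). With default settings, rows whose grouping key is null are left out of the result, and filling the nulls with a sentinel first keeps them.

```python
import pandas as pd

SENTINEL = "SUPERSET_PANDAS_NAN"  # placeholder string discussed in this PR

# Hypothetical data with one null dimension value.
df = pd.DataFrame({"region": ["east", None, "west"], "sales": [10, 20, 30]})

# The row with the null key is dropped from the pivot.
dropped = pd.pivot_table(df, index="region", values="sales", aggfunc="sum")

# Filling nulls first keeps that row under the sentinel label.
kept = pd.pivot_table(
    df.fillna(SENTINEL), index="region", values="sales", aggfunc="sum"
)

print(len(dropped), len(kept))
```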

andy-clapson · 2024-08-16T14:05:27Z

superset/charts/post_processing.py

@@ -171,7 +174,7 @@ def pivot_df(  # pylint: disable=too-many-locals, too-many-arguments, too-many-s
            for subgroup in subgroups:
                slice_ = df.index.get_loc(subgroup)
                subtotal = pivot_v2_aggfunc_map[aggfunc](
-                    df.iloc[slice_, :].apply(pd.to_numeric), axis=0
+                    df.iloc[slice_, :].apply(pd.to_numeric, errors="coerce"), axis=0


I think this might be the only change needed to deal with this issue: #27499

pandas is so opaque, especially when you haven't touched it for years: .iloc[]? "coerce"!?!? It might be worth adding a comment that explains what it's doing.

it's locs and ilocs all the way down 🐢
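
In case it helps whoever writes that comment, a small annotated sketch of what the two pieces do. The frame and the slice are invented for illustration; only `.iloc`, `apply`, and `pd.to_numeric(..., errors="coerce")` mirror the diff.

```python
import pandas as pd

# Hypothetical frame where one cell holds a non-numeric placeholder.
df = pd.DataFrame(
    {"sales": ["1", "NULL", "3"], "cost": ["4", "5", "6"]},
    index=["a", "b", "c"],
)

# .iloc selects by integer position: here the first two rows, all columns.
block = df.iloc[0:2, :]

# DataFrame.apply forwards extra keyword arguments to the function, so each
# column is passed to pd.to_numeric(column, errors="coerce"), which turns
# unparseable strings into NaN instead of raising.
numeric = block.apply(pd.to_numeric, errors="coerce")

# sum() skips NaN by default, so the subtotal is still computed.
subtotal = numeric.sum(axis=0)

print(subtotal)
```

One trade-off of `errors="coerce"` is that any unparseable value is silently treated as missing, not just the placeholder.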

andy-clapson · 2024-08-16T14:06:01Z

tests/unit_tests/charts/test_post_processing.py

+        aggfunc="Sum",
+        transpose_pivot=False,
+        combine_metrics=False,
+        show_rows_total=False,


Weirdly, the case where I can reproduce this NULL issue only occurs when one of the column or row totals is set to True.

yah, I don't think it would fail in this case, but I added a test for all the combos.
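
A sketch of what covering all the combos could look like, using `itertools.product` over the two totals flags. The `totals` helper is a hypothetical stand-in written for this example, not Superset's `pivot_df`, and the real tests may be structured differently.

```python
from itertools import product

import pandas as pd


def totals(df, show_rows_total, show_columns_total):
    """Hypothetical stand-in for the totals step (not Superset's pivot_df)."""
    numeric = df.apply(pd.to_numeric, errors="coerce")
    out = numeric.copy()
    if show_columns_total:
        out["Total"] = numeric.sum(axis=1)  # one total per row
    if show_rows_total:
        out.loc["Total"] = out.sum(axis=0)  # one total per column
    return out


# Made-up input containing a placeholder string in a numeric column.
df = pd.DataFrame({"a": ["1", "NULL"], "b": ["2", "3"]}, index=["x", "y"])

# Run every (show_rows_total, show_columns_total) combination.
results = {flags: totals(df, *flags) for flags in product([False, True], repeat=2)}

for flags, result in results.items():
    print(flags, result.shape)
```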

andy-clapson · 2024-08-16T14:06:39Z

tests/unit_tests/charts/test_post_processing.py

Test coverage for nulls here was much needed! Great.

mistercrunch · 2024-09-19T06:12:38Z

superset/charts/post_processing.py

@@ -86,7 +87,8 @@ def pivot_df(  # pylint: disable=too-many-locals, too-many-arguments, too-many-s
    # pivot data; we'll compute totals and subtotals later
    if rows or columns:
        # pivoting with null values will create an empty df
-        df = df.fillna("NULL")


mmmh, seems the frontend should be doing this ...

@mistercrunch I'm not sure I got this..

@mistercrunch the problem here is that we do the pivot in Pandas (for reports and CSV download), and it will fail if the dataframe has NaNs.

eschutho · 2024-09-20T20:55:29Z

superset/charts/post_processing.py

+        else:
+            # when we applied metrics on rows, we switched the columns and rows
+            # so checking column type doesn't apply. Replace everything with np.nan
+            df.replace("SUPERSET_PANDAS_NAN", np.nan, inplace=True)


@betodealmeida we need this section here (from line 156) when totaling so that we can 1) sum with numbers (by converting the string "SUPERSET_PANDAS_NAN" to np.nan) or 2) sum with a string value. I'm using "nan" so that we don't print "SUPERSET_PANDAS_NAN".

eschutho · 2024-09-20T20:55:49Z

superset/charts/post_processing.py

+        columns={"SUPERSET_PANDAS_NAN": np.nan},
+        inplace=True,
+    )
+


Converting the values back so that we don't print "SUPERSET_PANDAS_NAN"
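
A small sketch of converting sentinel labels back, assuming the call in the hunk above is `DataFrame.rename` (the hunk does not show the function name). The pivot below is made up for illustration.

```python
import numpy as np
import pandas as pd

SENTINEL = "SUPERSET_PANDAS_NAN"

# Hypothetical pivot result where a null dimension value became the sentinel.
pivot = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=["east", SENTINEL],
    columns=["2023", SENTINEL],
)

# Map the sentinel label back to NaN on both axes so it is never displayed.
restored = pivot.rename(index={SENTINEL: np.nan}, columns={SENTINEL: np.nan})

print(restored)
```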

eschutho · 2024-09-20T21:02:56Z

superset/charts/post_processing.py

+                if pd.api.types.is_numeric_dtype(df[col]):
+                    df[col].replace("SUPERSET_PANDAS_NAN", np.nan, inplace=True)
+                else:
+                    df[col].replace("SUPERSET_PANDAS_NAN", "nan", inplace=True)


I chose the string "nan" here because that is the default behavior when there is a null value when pivoting without sums.
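
A sketch of the per-column branch on a made-up frame. It follows the same dtype check as the diff but assigns the result back to the column, which is one way to avoid relying on in-place modification of a column selection; that difference is a choice made for this example, not what the PR does.

```python
import numpy as np
import pandas as pd

SENTINEL = "SUPERSET_PANDAS_NAN"

# Hypothetical frame: one numeric column and one string column with the sentinel.
df = pd.DataFrame({"metric": [1.5, np.nan], "label": ["a", SENTINEL]})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        # Numeric columns get a real NaN so they can still be summed.
        df[col] = df[col].replace(SENTINEL, np.nan)
    else:
        # String columns get the text "nan" instead of the sentinel.
        df[col] = df[col].replace(SENTINEL, "nan")

print(df)
```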

betodealmeida

Awesome job! ❤️

pull-request-size bot added the size/L label Aug 8, 2024

dosubot bot added the python and viz:charts:pivot labels Aug 8, 2024

betodealmeida requested changes Aug 9, 2024

eschutho force-pushed the elizabeth/fix-pandas-pivot-null branch 2 times, most recently from 0e44172 to c34d0bb Compare August 9, 2024 00:04

betodealmeida reviewed Aug 9, 2024

sfirke mentioned this pull request Aug 16, 2024

Export to pivoted .csv: internal server error 500 (Apache Superset version 3.0.1) #27499

Closed

3 tasks

andy-clapson reviewed Aug 16, 2024

fix pandas pivot null values

45e1d51

eschutho force-pushed the elizabeth/fix-pandas-pivot-null branch from c34d0bb to 92206f2 Compare September 18, 2024 21:07

apply replacement to dimensions only

0ae7342

eschutho force-pushed the elizabeth/fix-pandas-pivot-null branch from 92206f2 to 0ae7342 Compare September 18, 2024 21:15

mistercrunch reviewed Sep 19, 2024

make tests pass

4aa2899

pull-request-size bot added size/XL and removed size/L labels Sep 20, 2024

eschutho commented Sep 20, 2024

betodealmeida approved these changes Sep 25, 2024

eschutho merged commit 0e8fa54 into master Sep 25, 2024
33 of 34 checks passed

rusackas deleted the elizabeth/fix-pandas-pivot-null branch September 27, 2024 20:53

github-actions bot added the preset-io label Dec 12, 2024
