ENH: change get_dummies default dtype to bool #48022

kianelbo · 2022-08-10T11:51:16Z

Added a future warning when no dtype is passed to get_dummies, stating that the default dtype will change to bool from np.uint8

Closes ENH: pd.get_dummies should not default to dtype np.uint8 #45848
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

MarcoGorelli

This is off to a good start!

There are some CI failures in the doc build, could you fix them up please? e.g.

Warning in /home/runner/work/pandas/pandas/doc/source/whatsnew/v0.13.0.rst at block ending on line 526
Specify :okwarning: as an option in the ipython:: block to suppress this message

This would also require a whatsnew note

pandas/core/reshape/encoding.py

pandas/tests/reshape/test_get_dummies.py

MarcoGorelli

Nice, just suggesting a minor re-wording, rest looks good to me

cc @bashtage @WillAyd @jreback who'd commented on the issue

doc/source/whatsnew/v1.5.0.rst

Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>

MarcoGorelli

looks good to me

EDIT: holding off til discussion in pandas-dev meeting

WillAyd · 2022-08-23T20:57:06Z

pandas/tests/reshape/test_get_dummies.py

@@ -169,7 +177,7 @@ def test_get_dummies_unicode(self, sparse):
        e = "e"
        eacute = unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE")
        s = [e, eacute, eacute]
-        res = get_dummies(s, prefix="letter", sparse=sparse)
+        res = get_dummies(s, dtype=np.uint8, prefix="letter", sparse=sparse)


Wouldn't we rather just catch the warnings for these? Wondering how we remember in the future to go back and update these tests when we make the change to the dtype

Could do, I just thought that would be a lot of warnings to catch

Regarding updating tests - I wouldn't have thought they needed updating, I'd have thought just having a test which called .get_dummies() (without specifying dtype) would be enough

Yea I agree. That's why I was thinking it is better to catch the warning for now and not change the argument. Otherwise we lose test coverage of the default argument's behavior in the future, unless someone comes back and reverts what was changed here

no all tests should be fixed now
and then u have an explicit test of the warning

it's not better to defer fixing something like this

OK good points, thanks for raising

I've added this to the agenda for the next dev meeting

@kianelbo let's hold off further changes til after there's been discussion

MarcoGorelli · 2022-08-23T21:51:46Z

pandas/tests/reshape/test_get_dummies.py

+        # https://github.com/pandas-dev/pandas/issues/45848
+        msg = "In a future version of pandas the default dtype will change"
+        with tm.assert_produces_warning(FutureWarning, match=msg):
+            get_dummies(df)


sorry to go back on the approval, but can we check the return value here?
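A minimal sketch of a test that checks both the warning and the return value, as asked above. The wrapper `get_dummies_warn_on_default` is hypothetical and only imitates the deprecation proposed at this stage of the PR (the warning was later dropped in favour of changing the default directly); the standard-library `warnings` module stands in for `tm.assert_produces_warning`.

```python
import warnings

import numpy as np
import pandas as pd


def get_dummies_warn_on_default(data, dtype=None, **kwargs):
    """Hypothetical wrapper: warn when dtype is omitted, then use the old uint8 default."""
    if dtype is None:
        warnings.warn(
            "In a future version of pandas the default dtype will change to bool",
            FutureWarning,
            stacklevel=2,
        )
        dtype = np.uint8
    return pd.get_dummies(data, dtype=dtype, **kwargs)


df = pd.DataFrame({"x": ["a", "b", "a"]})

# Record warnings instead of letting them propagate, so both the warning
# and the returned frame can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = get_dummies_warn_on_default(df)

# Check the warning itself.
warned = any(
    issubclass(w.category, FutureWarning) and "default dtype" in str(w.message)
    for w in caught
)

# Check the return value, not only that a warning was raised.
columns = list(result.columns)
values = result.to_numpy().tolist()
```

Checking `columns` and `values` alongside `warned` keeps the default-argument path covered even after the explicit `dtype=` arguments are added to the other tests.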

jreback

i think this change (even the deprecation) needs discussion as this is quite some long standing behavior

pls schedule it for the next dev meeting

MarcoGorelli · 2022-08-24T18:52:17Z

OK good points, thanks for raising

I've added this to the agenda for the next dev meeting

@kianelbo let's hold off further changes til after there's been discussion

MarcoGorelli

Hey @kianelbo

Apologies for conflicting instructions here. In the end, at the last dev meeting, we decided it would be best to treat this as a bug, and just change the default dtype without going through the deprecation cycle

Changing unsigned data to signed is unlikely to cause any issues

So, this'd be a much simpler fix on your side - just change the default dtype to bool, and make sure to either update tests or make sure this behaviour is tested

It's too late to get this into v1.5.0, so for now let's go with v1.6.0 or v1.5.1
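As a rough illustration of what the simpler fix means for callers: the default result now depends on the installed pandas version, while an explicit `dtype` behaves the same everywhere. The sample data is made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a"])

# Default dtype: uint8 before this change, bool once it is released.
default = pd.get_dummies(s)
print(default.dtypes)

# Passing dtype explicitly gives the same result on either side of the change.
explicit_bool = pd.get_dummies(s, dtype=bool)
explicit_uint8 = pd.get_dummies(s, dtype=np.uint8)
```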

WillAyd

lgtm - thanks!

doc/source/whatsnew/v1.5.1.rst

doc/source/whatsnew/v1.6.0.rst

mroeschke · 2022-10-11T16:29:23Z

Awesome, thanks for sticking with this @kianelbo

* ENH: Warn when dtype is not passed to get_dummies
* Edit get_dummies' dtype warning
* Add whatsnew entry for issue pandas-dev#45848
* Fix dtype warning test
* Suppress warnings in docs
* Edit whatsnew entry

  Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
* Fix find_stack_level in get_dummies dtype warning
* Change the default dtype of get_dummies to bool
* Revert dtype(bool) change
* Move the changelog entry to v1.6.0.rst
* Move whatsnew entry to 'Other API changes'

Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
Co-authored-by: Marco Edward Gorelli <33491632+MarcoGorelli@users.noreply.github.com>

tylerjereddy · 2023-02-28T19:40:15Z

FWIW, this does cause some confusion for downstream consumers--for example the common approach of using pd.get_dummies(pd.cut(...)) to get one-hot encoded data (common for ML applications) might more naturally be expected to continue to return a numeric type, which I believe is also the default for sklearn.preprocessing.OneHotEncoder. I actually plucked that approach right out of Wes' book I think.

Well, I'm not really that annoyed minus 90 minutes of debugging, and the fix is trivial for the consuming code (just specify the old default dtype to get_dummies), but I'll just place this comment here in case it helps others adapt their downstream code.
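A small sketch of the downstream fix described above, assuming made-up values and bin edges: pinning `dtype=np.uint8` (the old default) keeps `pd.get_dummies(pd.cut(...))` numeric on both pandas 1.5.x and 2.0.

```python
import numpy as np
import pandas as pd

# Illustrative data and bin edges; not taken from any real project.
values = pd.Series([1, 5, 9])
binned = pd.cut(values, bins=[0, 3, 6, 10])

# Pin the old default dtype so the one-hot columns stay numeric.
one_hot = pd.get_dummies(binned, dtype=np.uint8)
print(one_hot.dtypes)
```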

Fixes darshan-hpc#909
* make the library compatible with both `pandas 1.5.x` and `pandas 2.0.0rc0` by pinning the dtype we use for one-hot encoding our heatmap data
* see related upstream comment and release notes (`get_dummies()` change):
  - pandas-dev/pandas#48022 (comment)
  - https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v2.0.0.html

bashtage · 2023-03-01T07:26:05Z

It also led to some non-obvious errors in statsmodels when testing against the pre-release. It wasn't that hard to change since I knew about this change, but it would have been hard to diagnose had one not. When making the changes I half wondered if the default should have become double, which would have always prevented the math issues that this change was meant to fix, albeit at the cost of 8x storage (although one could always choose bool if storage was more important than simplicity of use).
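To make the trade-off concrete, here is a hedged sketch of the kind of math issue uint8 dummies can cause, next to the float64 alternative mentioned above. The data is invented for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a"])

as_uint8 = pd.get_dummies(s, dtype=np.uint8)
as_float = pd.get_dummies(s, dtype=np.float64)

# Unsigned arithmetic wraps around: 0 - 1 becomes 255 instead of -1.
diff_uint8 = (as_uint8["a"] - as_uint8["b"]).tolist()

# float64 gives the arithmetic result one would expect.
diff_float = (as_float["a"] - as_float["b"]).tolist()

# The storage cost of float64: 8 bytes per element versus 1 for uint8 or bool.
bytes_per_element = {
    "uint8": np.dtype(np.uint8).itemsize,
    "bool": np.dtype(bool).itemsize,
    "float64": np.dtype(np.float64).itemsize,
}
```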

This PR changes the default dtype for get_dummies to bool from uint8 to match pandas-2.0: pandas-dev/pandas#48022

ENH: Warn when dtype is not passed to get_dummies

1eb5cd6

kianelbo force-pushed the getdummies-default-dtype branch from ed5136a to 1eb5cd6 Compare August 10, 2022 12:20

MarcoGorelli requested changes Aug 10, 2022

pandas/core/reshape/encoding.py

pandas/tests/reshape/test_get_dummies.py

kianelbo added 2 commits August 10, 2022 18:33

Edit get_dummies' dtype warning

efa678b

Add whatsnew entry for issue pandas-dev#45848

472fa28

MarcoGorelli reviewed Aug 10, 2022

doc/source/whatsnew/v1.5.0.rst

kianelbo and others added 3 commits August 10, 2022 20:02

Fix dtype warning test

2ead750

Suppress warnings in docs

ddcc7d3

Edit whatsnew entry

81dbb87

Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>

mroeschke added the Warnings label Aug 10, 2022

kianelbo added 2 commits August 23, 2022 12:55

Merge branch 'main' into 'getdummies-default-dtype'

45d9c79

Fix find_stack_level in get_dummies dtype warning

f97df66

kianelbo requested a review from MarcoGorelli August 23, 2022 12:24

MarcoGorelli approved these changes Aug 23, 2022

WillAyd reviewed Aug 23, 2022

MarcoGorelli requested changes Aug 23, 2022

jreback requested changes Aug 24, 2022

MarcoGorelli requested changes Aug 24, 2022

datapythonista mentioned this pull request Sep 8, 2022

RLS: 1.5 #45223

Closed

MarcoGorelli requested changes Sep 18, 2022

Merge branch 'main' into getdummies-default-dtype

707a222

kianelbo force-pushed the getdummies-default-dtype branch 4 times, most recently from 2ef8633 to 9f7fbc4 Compare September 22, 2022 16:04

Change the default dtype of get_dummies to bool

15aeb3e

kianelbo force-pushed the getdummies-default-dtype branch from 9f7fbc4 to 15aeb3e Compare September 22, 2022 16:07

Merge branch 'main' into 'getdummies-default-dtype'

a5f709d

kianelbo requested a review from MarcoGorelli September 24, 2022 09:32

MarcoGorelli changed the title ~~ENH: Warn when dtype is not passed to get_dummies~~ ENH: change get_dummies default dtype to bool Sep 25, 2022

WillAyd approved these changes Oct 6, 2022

Merge branch 'main' into getdummies-default-dtype

6e90b45

mroeschke reviewed Oct 6, 2022

doc/source/whatsnew/v1.5.1.rst

MarcoGorelli modified the milestones: 1.5.1, 1.6 Oct 7, 2022

kianelbo and others added 3 commits October 7, 2022 12:44

Merge branch 'main' into getdummies-default-dtype

ce37f33

Move the changelog entry to v1.6.0.rst

7cef2fc

Merge branch 'main' into getdummies-default-dtype

9285bf1

mroeschke reviewed Oct 10, 2022

doc/source/whatsnew/v1.6.0.rst

kianelbo added 2 commits October 11, 2022 10:57

Merge branch 'main' into getdummies-default-dtype

d7e6490

Move whatsnew entry to 'Other API changes'

8a93cc9

kianelbo requested review from mroeschke and removed request for jreback October 11, 2022 13:40

mroeschke approved these changes Oct 11, 2022

mroeschke merged commit bfdf223 into pandas-dev:main Oct 11, 2022

kianelbo deleted the getdummies-default-dtype branch October 11, 2022 16:56

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022

jtilly mentioned this pull request Oct 20, 2022

fixing pandas 1.6.0 and 2.0.0 breaking changes Quantco/glum#580

Merged

1 task

jrbourbeau mentioned this pull request Dec 12, 2022

get_dummies compatibility for pandas 2.0 dask/dask#9752

Merged

tylerjereddy mentioned this pull request Feb 28, 2023

MAINT, BUG: pandas 2.0 compatibility darshan-hpc/darshan#912

Merged

galipremsagar mentioned this pull request Apr 19, 2023

[REVIEW] Change default dtype for get_dummies to bool rapidsai/cudf#13174

Merged

3 tasks

galipremsagar added a commit to rapidsai/cudf that referenced this pull request Apr 22, 2023

Change default dtype for get_dummies to bool (#13174)

27e18c8

This PR changes the default dtype for get_dummies to bool from uint8 to match pandas-2.0: pandas-dev/pandas#48022

hcho3 mentioned this pull request Apr 13, 2024

ArrayInterface handler for cuDF DataFrame cannot yet handle Boolean columns dmlc/xgboost#10181

Open

komo-fr mentioned this pull request Jan 2, 2025

ENH: Change default dtype of str.get_dummies() to bool #60641

Open

5 tasks
