BUG: Fix get dummies unicode error #22131

Merged: jreback merged 5 commits into pandas-dev:master from the fix_get_dummies_unicode_error branch on Aug 2, 2018

@Scorpil Scorpil commented Jul 30, 2018

df = pd.DataFrame({'x': [u'ä']})
result = pd.get_dummies(df)
expected = pd.DataFrame({u'x_ä': [1]}, dtype=np.uint8)
assert_frame_equal(result, expected)
Contributor Author

This one would pass even without a fix, but I've included it for completeness.

Member

@gfyoung gfyoung Aug 2, 2018

More coverage is (almost) never a problem for us 🙂

@Scorpil Scorpil changed the title Fix get dummies unicode error BUG: Fix get dummies unicode error Jul 30, 2018
@codecov

codecov bot commented Jul 30, 2018

Codecov Report

Merging #22131 into master will decrease coverage by <.01%.
The diff coverage is 87.5%.

@@            Coverage Diff             @@
##           master   #22131      +/-   ##
==========================================
- Coverage   92.06%   92.06%   -0.01%     
==========================================
  Files         169      169              
  Lines       50689    50693       +4     
==========================================
+ Hits        46667    46670       +3     
- Misses       4022     4023       +1

| Flag | Coverage Δ |
|---|---|
| #multiple | 90.47% <87.5%> (-0.01%) ⬇️ |
| #single | 42.32% <12.5%> (-0.01%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/reshape/reshape.py | 99.57% <87.5%> (-0.22%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 615615a...662cac3.

@Scorpil Scorpil force-pushed the fix_get_dummies_unicode_error branch from b6995f9 to 07975d6 Compare July 30, 2018 20:14
@Scorpil
Contributor Author

Scorpil commented Jul 31, 2018

@gfyoung @jreback this PR is ready for merge, please check.

expected = pd.DataFrame({u'x_ä': [1]}, dtype=np.uint8)
assert_frame_equal(result, expected)

df = pd.DataFrame({'x': ['a']})
Contributor

can you parametrize this test?
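
One way to act on this request is `pytest.mark.parametrize`, with each unicode placement (column name, value, `prefix`, `prefix_sep`) as its own case. The sketch below is illustrative and not necessarily the test that was merged: the test name and the case list are assumptions, and it compares only the generated column labels so that it does not depend on the default dummy dtype of a particular pandas version.

```python
# -*- coding: utf-8 -*-
import pandas as pd
import pytest


# Each case: keyword arguments for pd.get_dummies and the expected labels.
# The case list is hypothetical; it mirrors the inputs discussed in this PR.
@pytest.mark.parametrize('get_dummies_kwargs, expected_columns', [
    ({'data': pd.DataFrame({u'ä': ['a']})}, [u'ä_a']),
    ({'data': pd.DataFrame({'x': [u'ä']})}, [u'x_ä']),
    ({'data': pd.DataFrame({'x': ['a']}), 'prefix': u'ä'}, [u'ä_a']),
    ({'data': pd.DataFrame({'x': ['a']}), 'prefix_sep': u'ä'}, [u'xäa']),
])
def test_get_dummies_unicode_columns(get_dummies_kwargs, expected_columns):
    result = pd.get_dummies(**get_dummies_kwargs)
    assert list(result.columns) == expected_columns
```

Keeping the inputs in the parameter list means a new unicode edge case is one extra tuple instead of another copy of the test body.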

              else '{prefix}{sep}{level}' for v in levels]
dummy_cols = [dummy_str.format(prefix=prefix, sep=prefix_sep, level=v)
              for dummy_str, v in zip(dummy_strs, levels)]
py2_prefix_is_unicode = isinstance(prefix, text_type)
Contributor

can you make a little helper function here so that you can do this as a list-comprehension
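
A rough sketch of what such a helper could look like, assuming the goal is to pick a text (unicode) template whenever any piece of the column name is text, and then build `dummy_cols` in a single list comprehension. The name `make_col_name` and the `text_type = type(u'')` shortcut are assumptions made for this example, not the code that was merged.

```python
# -*- coding: utf-8 -*-

# unicode on Python 2, str on Python 3 (stand-in for a compat text_type)
text_type = type(u'')


def make_col_name(prefix, prefix_sep, level):
    """Build one dummy column label (hypothetical helper)."""
    template = '{prefix}{sep}{level}'
    # If any part is text, format with a text template so that Python 2
    # does not try to encode non-ASCII characters as ASCII bytes.
    if any(isinstance(part, text_type)
           for part in (prefix, prefix_sep, level)):
        template = text_type(template)
    return template.format(prefix=prefix, sep=prefix_sep, level=level)


levels = [u'ä', 'b']
dummy_cols = [make_col_name('x', '_', level) for level in levels]
```

With the template choice inside the helper, the separate `dummy_strs` list and the `zip` are no longer needed.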

@@ -923,11 +923,17 @@ def get_empty_Frame(data, sparse):

number_of_cols = len(levels)

py2_prefix_sep_is_unicode = isinstance(prefix_sep, text_type)
Contributor

make this explicit by also using PY2
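
The reason for the request, as far as it can be read from the variable name: on Python 3 `text_type` is `str`, so `isinstance(prefix_sep, text_type)` is true for every ordinary string and a flag named `py2_...` would be misleading. A minimal sketch of an explicit guard follows; `PY2` and `text_type` are defined locally here as stand-ins for the compat helpers, and `needs_unicode_template` is a hypothetical name.

```python
# -*- coding: utf-8 -*-
import sys

PY2 = sys.version_info[0] == 2
text_type = type(u'')  # unicode on Python 2, str on Python 3


def needs_unicode_template(*parts):
    """True only on Python 2, and only if some part is a unicode object."""
    return PY2 and any(isinstance(part, text_type) for part in parts)
```

On Python 3 this always returns `False`, which makes it obvious that the special case exists only for Python 2.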

@jreback jreback added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and 2/3 Compat labels Jul 31, 2018

@@ -302,6 +302,26 @@ def test_dataframe_dummies_with_categorical(self, df, sparse, dtype):
        expected.sort_index(axis=1)
        assert_frame_equal(result, expected)

    def test_dataframe_dummies_unicode(self):
        df = pd.DataFrame(({u'ä': ['a']}))
Member

Reference issue number as a comment above this line.

@Scorpil Scorpil force-pushed the fix_get_dummies_unicode_error branch from 50baa9a to 15f2946 Compare August 2, 2018 12:41
@jreback jreback added this to the 0.24.0 milestone Aug 2, 2018
@jreback jreback merged commit 5076ebe into pandas-dev:master Aug 2, 2018
@jreback
Contributor

jreback commented Aug 2, 2018

thanks @Scorpil !

dberenbaum pushed a commit to dberenbaum/pandas that referenced this pull request Aug 3, 2018
minggli added a commit to minggli/pandas that referenced this pull request Aug 5, 2018
* master: (47 commits)
  Run tests in conda build [ci skip] (pandas-dev#22190)
  TST: Check DatetimeIndex.drop on DST boundary (pandas-dev#22165)
  CI: Fix Travis failures due to lint.sh on pandas/core/strings.py (pandas-dev#22184)
  Documentation: typo fixes in MultiIndex / Advanced Indexing (pandas-dev#22179)
  DOC: added .join to 'see also' in Series.str.cat (pandas-dev#22175)
  DOC: updated Series.str.contains see also section (pandas-dev#22176)
  0.23.4 whatsnew (pandas-dev#22177)
  fix: scalar timestamp assignment (pandas-dev#19843) (pandas-dev#19973)
  BUG: Fix get dummies unicode error (pandas-dev#22131)
  Fixed py36-only syntax [ci skip] (pandas-dev#22167)
  DEPR: pd.read_table (pandas-dev#21954)
  DEPR: Removing previously deprecated datetools module (pandas-dev#6581) (pandas-dev#19119)
  BUG: Matplotlib scatter datetime (pandas-dev#22039)
  CLN: Use public method to capture UTC offsets (pandas-dev#22164)
  implement tslibs/src to make tslibs self-contained (pandas-dev#22152)
  Fix categorical from codes nan 21767 (pandas-dev#21775)
  BUG: Better handling of invalid na_option argument for groupby.rank(pandas-dev#22124) (pandas-dev#22125)
  use memoryviews instead of ndarrays (pandas-dev#22147)
  Remove depr. warning in SeriesGroupBy.count (pandas-dev#22155)
  API: Default to_* methods to compression='infer' (pandas-dev#22011)
  ...
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Successfully merging this pull request may close these issues.

pd.get_dummies incorrectly encodes unicode characters in dataframe column names
3 participants