Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Categorical data fails to load from hdf when all columns are NaN #18652

Merged
merged 13 commits into from
Dec 10, 2017
Merged

BUG: Categorical data fails to load from hdf when all columns are NaN #18652

merged 13 commits into from
Dec 10, 2017

Conversation

ssche
Copy link
Contributor

@ssche ssche commented Dec 6, 2017

ssche and others added 2 commits December 5, 2017 13:47
* Handle all-NaN columns differently when building metadata for categorical axes on saving hdf5 file
* Categorical axes fail test case comparison due to type difference (even though there isn't a visibly type difference)
* Change empty category to `Index([], dtype=np.float64)` instead of `[]`.
* Remove printouts in test case.
@pep8speaks
Copy link

pep8speaks commented Dec 6, 2017

Hello @ssche! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 10, 2017 at 15:54 Hours UTC

@jschendel
Copy link
Member

Thanks. The whatsnew entry should be added to pandas/doc/source/whatsnew/v0.22.0.txt under the I/O subsection within the Bug Fixes section.

@jreback jreback changed the title Gh18413 BUG: Categorical data fails to load from hdf when all columns are NaN Dec 6, 2017
@jreback jreback added Categorical Categorical Data Type IO HDF5 read_hdf, HDFStore Bug labels Dec 6, 2017
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. you could add this to 0.21.1 bug fixes in IO section.

})
df['a'] = df.a.astype('category')
df['b'] = df.b.astype('category')
expected = df.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to copy here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v0.22.0.txt or 0.21.1? - I added it to 0.22.0 as per @jschendel's request.

Copy link
Member

@jschendel jschendel Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go with what @jreback said, he'd know better than me. I was just basing 0.22.0 off the issue originally having the "Next Major Release" milestone, but that can change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ssche
Copy link
Contributor Author

ssche commented Dec 6, 2017

I’m keen to get this accepted. What’s the next step?

@@ -91,6 +91,7 @@ I/O
- Bug in :meth:`DataFrame.to_msgpack` when serializing data of the numpy.bool_ datatype (:issue:`18390`)
- Bug in :func:`read_json` not decoding when reading line deliminted JSON from S3 (:issue:`17200`)
- Bug in :func:`pandas.io.json.json_normalize` to avoid modification of ``meta`` (:issue:`18610`)
- Bug when storing NaN-only categorical columns in hdf5 store (:issue:`18413`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug when reading ......in a :class:`HDFStore`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done.

# be read back.
df = pd.DataFrame({
'a': ['a', 'b', 'c', np.nan],
'b': [np.nan, np.nan, np.nan, np.nan],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add another column with an array like pd.Series([None]* 3, dtype=object). this might fail your test because the original array was an all-null object type (and no float). but let's see.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, seems to work fine.

* Added additional all-None Series
* Provided more detail in whatsnew description
@codecov
Copy link

codecov bot commented Dec 8, 2017

Codecov Report

Merging #18652 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18652      +/-   ##
==========================================
- Coverage   91.59%   91.57%   -0.02%     
==========================================
  Files         153      153              
  Lines       51221    51223       +2     
==========================================
- Hits        46917    46910       -7     
- Misses       4304     4313       +9
Flag Coverage Δ
#multiple 89.43% <0%> (-0.01%) ⬇️
#single 40.68% <100%> (-0.1%) ⬇️
Impacted Files Coverage Δ
pandas/io/pytables.py 92.84% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 13f6267...e6aad40. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 8, 2017

Codecov Report

Merging #18652 into master will decrease coverage by 0.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18652      +/-   ##
==========================================
- Coverage   91.61%   91.57%   -0.05%     
==========================================
  Files         153      153              
  Lines       51305    51307       +2     
==========================================
- Hits        47001    46982      -19     
- Misses       4304     4325      +21
Flag Coverage Δ
#multiple 89.43% <0%> (-0.03%) ⬇️
#single 40.71% <100%> (-0.1%) ⬇️
Impacted Files Coverage Δ
pandas/io/pytables.py 92.84% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py 64.78% <0%> (-1.74%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1355df6...b2ac7c4. Read the comment docs.

@jreback jreback added this to the 0.21.1 milestone Dec 10, 2017
@jreback
Copy link
Contributor

jreback commented Dec 10, 2017

@ssche had a linting error. this should be ok. ping on green.

@jreback jreback merged commit 2db1cc0 into pandas-dev:master Dec 10, 2017
@jreback
Copy link
Contributor

jreback commented Dec 10, 2017

thanks @ssche

@ssche ssche deleted the gh18413 branch December 10, 2017 20:42
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 11, 2017
TomAugspurger pushed a commit that referenced this pull request Dec 12, 2017
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Feb 22, 2018
Version 0.22.0

* tag 'v0.22.0': (777 commits)
  RLS: v0.22.0
  DOC: Fix min_count docstring (pandas-dev#19005)
  DOC: More 0.22.0 updates (pandas-dev#19002)
  TST: Remove pow test in expressions
  COMPAT: Avoid td.skip decorator
  DOC: 0.22.0 release docs (pandas-dev#18983)
  DOC: Include 0.22.0 whatsnew
  Breaking changes for sum / prod of empty / all-NA (pandas-dev#18921)
  ENH: Added a min_count keyword to stat funcs (pandas-dev#18876)
  RLS: v0.21.1
  DOC: Add date to whatsnew (pandas-dev#18740)
  DOC: Include 0.21.1 whatsnew
  DOC: Update relase notes (pandas-dev#18739)
  CFG: Ignore W503
  DOC: fix options table (pandas-dev#18730)
  ENH: support non default indexes in writing to Parquet (pandas-dev#18629)
  BUG: Fix to_latex with longtable (pandas-dev#17959) (pandas-dev#17960)
  Parquet: Add error message for no engine found (pandas-dev#18717)
  BUG: Categorical data fails to load from hdf when all columns are NaN (pandas-dev#18652)
  DOC: clean-up whatsnew file for 0.21.1 (pandas-dev#18690)
  ...
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Feb 22, 2018
* releases: (777 commits)
  RLS: v0.22.0
  DOC: Fix min_count docstring (pandas-dev#19005)
  DOC: More 0.22.0 updates (pandas-dev#19002)
  TST: Remove pow test in expressions
  COMPAT: Avoid td.skip decorator
  DOC: 0.22.0 release docs (pandas-dev#18983)
  DOC: Include 0.22.0 whatsnew
  Breaking changes for sum / prod of empty / all-NA (pandas-dev#18921)
  ENH: Added a min_count keyword to stat funcs (pandas-dev#18876)
  RLS: v0.21.1
  DOC: Add date to whatsnew (pandas-dev#18740)
  DOC: Include 0.21.1 whatsnew
  DOC: Update relase notes (pandas-dev#18739)
  CFG: Ignore W503
  DOC: fix options table (pandas-dev#18730)
  ENH: support non default indexes in writing to Parquet (pandas-dev#18629)
  BUG: Fix to_latex with longtable (pandas-dev#17959) (pandas-dev#17960)
  Parquet: Add error message for no engine found (pandas-dev#18717)
  BUG: Categorical data fails to load from hdf when all columns are NaN (pandas-dev#18652)
  DOC: clean-up whatsnew file for 0.21.1 (pandas-dev#18690)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type IO HDF5 read_hdf, HDFStore
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Categorical data fails to load from hdf when all columns are NaN
5 participants