
ENH: allow saving wide dataframes to hdf with format table #26135

Closed
wants to merge 34 commits into from

Conversation

P-Tillmann
Contributor

based on #11788
closes #6245

This PR allows saving wide DataFrames to HDF with format='table'. It breaks forward compatibility: old versions of pandas will not be able to read HDF files written by new versions.
The column index is saved to a VLArray with an object atom. This is rather slow for very wide DataFrames but does not require any type checks.
One test fails because the saved column node is not compressed while the compression level is checked for all nodes. I think this is fine, but I wanted to consult with you before rewriting the test.

Two tests were added to check that a wide DataFrame can be saved and that it is appendable.

Please let me know what you think and whether additional test coverage is needed.
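For context, a minimal sketch of the use case this PR targets (not code from the PR itself): a DataFrame too wide for the current table format, written and read back with format='table'. The column count and file name here are illustrative.

```python
import numpy as np
import pandas as pd

# A frame with thousands of columns; on released pandas versions, writing this
# with format="table" can fail because the column metadata is stored in node
# attributes (see #6245), which this PR moves into a VLArray instead.
wide = pd.DataFrame(np.random.randn(10, 5000),
                    columns=[f"c{i}" for i in range(5000)])

wide.to_hdf("wide.h5", key="df", format="table")
roundtrip = pd.read_hdf("wide.h5", key="df")
pd.testing.assert_frame_equal(roundtrip, wide)
```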

@jreback jreback added the IO HDF5 read_hdf, HDFStore label Apr 19, 2019
@jreback
Contributor

jreback left a comment

In principle this change is ok, but it must be able to read existing arrays, so you would need to commit some sample files to the repo to test for this. Once you can pass all of these tests I can have a look.
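A minimal sketch (not the PR's actual test code) of the kind of backwards-compatibility check being requested: a fixture file written by a released pandas version is committed to the repo, and the new code must still read it. The file name and expected frame below are hypothetical.

```python
import pandas as pd
import pandas.testing as tm


def test_read_legacy_wide_table():
    # "legacy_wide_table.h5" stands in for a sample file written by a released
    # pandas version and committed to the test data directory (hypothetical name).
    expected = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})
    result = pd.read_hdf("legacy_wide_table.h5", key="df")
    tm.assert_frame_equal(result, expected)
```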

@codecov

codecov bot commented Apr 23, 2019

Codecov Report

Merging #26135 into master will increase coverage by 0.01%.
The diff coverage is 97.82%.


@@            Coverage Diff             @@
##           master   #26135      +/-   ##
==========================================
+ Coverage   91.99%      92%   +0.01%     
==========================================
  Files         175      175              
  Lines       52387    52414      +27     
==========================================
+ Hits        48191    48222      +31     
+ Misses       4196     4192       -4
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.55% <89.13%> (+0.01%) ⬆️ |
| #single | 40.79% <97.82%> (-0.09%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/io/pytables.py | 90.36% <97.82%> (+0.13%) ⬆️ |
| pandas/io/gbq.py | 75% <0%> (-12.5%) ⬇️ |
| pandas/core/frame.py | 96.9% <0%> (-0.12%) ⬇️ |
| pandas/io/excel/_base.py | 92.82% <0%> (-0.07%) ⬇️ |
| pandas/tseries/holiday.py | 93.17% <0%> (-0.04%) ⬇️ |
| pandas/core/base.py | 98.2% <0%> (ø) ⬆️ |
| pandas/core/reshape/tile.py | 97.67% <0%> (ø) ⬆️ |
| pandas/core/internals/blocks.py | 94.08% <0%> (ø) ⬆️ |
| pandas/core/dtypes/cast.py | 91.5% <0%> (+0.13%) ⬆️ |
| pandas/core/groupby/ops.py | 95.97% <0%> (+1.73%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 872a23b...05aac5b.

@codecov

codecov bot commented Apr 23, 2019

Codecov Report

Merging #26135 into master will decrease coverage by 1.26%.
The diff coverage is 97.82%.


@@            Coverage Diff             @@
##           master   #26135      +/-   ##
==========================================
- Coverage      93%   91.74%   -1.27%     
==========================================
  Files         182      174       -8     
  Lines       50311    50808     +497     
==========================================
- Hits        46793    46615     -178     
- Misses       3518     4193     +675
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.25% <89.13%> (-1.42%) ⬇️ |
| #single | 41.75% <97.82%> (-0.74%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/io/pytables.py | 90.36% <97.82%> (+0.01%) ⬆️ |
| pandas/plotting/_misc.py | 38.23% <0%> (-26.63%) ⬇️ |
| pandas/io/gbq.py | 78.94% <0%> (-21.06%) ⬇️ |
| pandas/io/gcs.py | 80% <0%> (-20%) ⬇️ |
| pandas/io/s3.py | 89.47% <0%> (-10.53%) ⬇️ |
| pandas/core/groupby/base.py | 91.83% <0%> (-8.17%) ⬇️ |
| pandas/io/excel/_xlrd.py | 94.54% <0%> (-5.46%) ⬇️ |
| pandas/core/indexing.py | 90.53% <0%> (-4.53%) ⬇️ |
| pandas/util/_decorators.py | 91.34% <0%> (-4.01%) ⬇️ |
| pandas/plotting/_core.py | 83.89% <0%> (-3.76%) ⬇️ |
| ... and 171 more | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ebcfee4...f725d20.

@P-Tillmann
Contributor Author

@jreback
I included tests for several different column types that were stored with the current release version of pandas.
But I think I need some help with the failing tests. There are two failures:

  1. "Docstring validation" and "Testing docstring validation script" are failing. code_checks.sh runs fine locally, and I don't understand how my changes affect these.

  2. In one of the Travis environments most PyTables tests fail with "ValueError: cannot set WRITEABLE flag to True of this array". This seems to be related to ValueError: cannot set WRITEABLE flag to True of this array #24839. I can reproduce it locally by creating an environment with the corresponding environment.yml; the tests pass fine if I manually upgrade PyTables to 3.5.1 (which was fixed to work with numpy 1.16).
    If I understand correctly, xfail_non_writeable was introduced in TST: xfail non-writeable pytables tests with numpy 1.16x #25517 to prevent code that triggers this bug from failing the tests. By using VLArrays for all columns when format='table', this is now triggered for most tests. Does it make sense to mark all the failing tests with xfail_non_writeable? Or is it more reasonable to not merge this PR until PyTables 3.5.1 is the standard in all Travis environments? (A minimal illustration of the underlying numpy behaviour is sketched below.)
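For reference, a minimal sketch of the numpy behaviour behind the error quoted in item 2: an array backed by a read-only buffer (similar to what older PyTables hands back) cannot have its WRITEABLE flag switched back on under numpy >= 1.16.

```python
import numpy as np

# np.frombuffer over a bytes object yields an array that does not own writable
# memory, so flipping the flag raises the same ValueError seen in the CI runs.
arr = np.frombuffer(b"\x00" * 24, dtype=np.int64)
try:
    arr.flags.writeable = True
except ValueError as err:
    print(err)  # cannot set WRITEABLE flag to True of this array
```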

@jreback
Contributor

jreback commented May 12, 2019

can you merge master and update

@P-Tillmann
Contributor Author

@jreback Updated.

  1. "Docstring validation" and "Testing docstring validation script" are ok now.
  2. "ValueError: cannot set WRITEABLE flag to True of this array" still leads to lots of failures in one of the test environments.

@jreback
Contributor

jreback commented May 19, 2019

something is not right with your patch; merge upstream/master

@P-Tillmann
Contributor Author

Hi Jeff,
thanks for your feedback. I need some help understanding what the issue is. I just pulled upstream master and it merged without conflict. I don't have any local changes. Can you specify where the problem is?

@jreback
Contributor

jreback commented May 20, 2019

maybe u didn’t push
the PR has 377 changed files

@P-Tillmann
Contributor Author

I see, thanks for the help. You were right, I didn't push.

@jreback
Contributor

jreback commented Jun 27, 2019

can you merge master; note we moved the test_pytables to a subdirectory

@jbrockmendel
Member

@P-Tillmann can you rebase

@P-Tillmann
Contributor Author

@jreback @jbrockmendel I updated to current master. But Travis still has an environment with the PyTables "cannot set WRITEABLE flag" bug for VLArrays, and since this PR uses a lot of VLArrays it fails almost all tests.

@WillAyd
Member

WillAyd commented Aug 28, 2019

@P-Tillmann is there a min version for pytables where that was fixed?
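One possible way to handle this (a hedged sketch, not the actual pandas CI setup) would be to skip or xfail the affected tests whenever the installed PyTables predates the fix; 3.5.1 is the version mentioned earlier in this thread, not independently verified here.

```python
import pytest
from distutils.version import LooseVersion

tables = pytest.importorskip("tables")

# Marker for tests that read VLArrays back as numpy arrays; the version cutoff
# (3.5.1) is taken from this discussion.
needs_fixed_pytables = pytest.mark.skipif(
    LooseVersion(tables.__version__) < LooseVersion("3.5.1"),
    reason="PyTables < 3.5.1 returns read-only arrays under numpy >= 1.16",
)
```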

@WillAyd
Member

WillAyd commented Sep 13, 2019

@P-Tillmann is this still active?

@WillAyd
Member

WillAyd commented Sep 20, 2019

Closing as stale but if this is still relevant please ping and can reopen

@WillAyd WillAyd closed this Sep 20, 2019
Labels
IO HDF5 read_hdf, HDFStore
Development

Successfully merging this pull request may close these issues.

ENH: write Table meta-data (non_index_axes) as a CArray (rather than as meta-data)