
ENH: allow saving wide dataframes to hdf with format table #26135

Closed
wants to merge 34 commits into from

Conversation

P-Tillmann
Contributor

based on #11788
closes #6245

This PR allows saving wide DataFrames to HDF with format='table'. It breaks forward compatibility: old versions of pandas will not be able to read HDF files written by new versions.
The column index is saved to a VLArray with an object atom. This is rather slow for very wide DataFrames but does not require any type checks.
One test fails because the saved column node is not compressed while the compression level is checked for all nodes. I think this is fine, but I wanted to consult with you before rewriting the test.

Two tests were added to check that a wide DataFrame can be saved and that it is appendable.

Please let me know what you think and whether additional test coverage is needed.
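For context, a minimal sketch of the use case this PR targets (not code from the PR itself): a DataFrame too wide for the current table format, written and read back with format='table'. The column count and file name here are illustrative.

```python
import numpy as np
import pandas as pd

# A frame with thousands of columns; on released pandas versions, writing this
# with format="table" can fail because the column metadata is stored in node
# attributes (see #6245), which this PR moves into a VLArray instead.
wide = pd.DataFrame(np.random.randn(10, 5000),
                    columns=[f"c{i}" for i in range(5000)])

wide.to_hdf("wide.h5", key="df", format="table")
roundtrip = pd.read_hdf("wide.h5", key="df")
pd.testing.assert_frame_equal(roundtrip, wide)
```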

@jreback jreback added the IO HDF5 read_hdf, HDFStore label Apr 19, 2019
@jreback
Contributor

jreback left a comment

In principle this change is ok, but it must be able to read existing arrays, so you would need to commit some sample files to the repo to test for this. Once you can pass all of these tests I can have a look.
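A minimal sketch (not the PR's actual test code) of the kind of backwards-compatibility check being requested: a fixture file written by a released pandas version is committed to the repo, and the new code must still read it. The file name and expected frame below are hypothetical.

```python
import pandas as pd
import pandas.testing as tm


def test_read_legacy_wide_table():
    # "legacy_wide_table.h5" stands in for a sample file written by a released
    # pandas version and committed to the test data directory (hypothetical name).
    expected = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})
    result = pd.read_hdf("legacy_wide_table.h5", key="df")
    tm.assert_frame_equal(result, expected)
```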

@codecov

codecov bot commented Apr 23, 2019

Codecov Report

Merging #26135 into master will increase coverage by 0.01%.
The diff coverage is 97.82%.


@@            Coverage Diff             @@
##           master   #26135      +/-   ##
==========================================
+ Coverage   91.99%      92%   +0.01%     
==========================================
  Files         175      175              
  Lines       52387    52414      +27     
==========================================
+ Hits        48191    48222      +31     
+ Misses       4196     4192       -4
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.55% <89.13%> (+0.01%) ⬆️ |
| #single | 40.79% <97.82%> (-0.09%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/io/pytables.py | 90.36% <97.82%> (+0.13%) ⬆️ |
| pandas/io/gbq.py | 75% <0%> (-12.5%) ⬇️ |
| pandas/core/frame.py | 96.9% <0%> (-0.12%) ⬇️ |
| pandas/io/excel/_base.py | 92.82% <0%> (-0.07%) ⬇️ |
| pandas/tseries/holiday.py | 93.17% <0%> (-0.04%) ⬇️ |
| pandas/core/base.py | 98.2% <0%> (ø) ⬆️ |
| pandas/core/reshape/tile.py | 97.67% <0%> (ø) ⬆️ |
| pandas/core/internals/blocks.py | 94.08% <0%> (ø) ⬆️ |
| pandas/core/dtypes/cast.py | 91.5% <0%> (+0.13%) ⬆️ |
| pandas/core/groupby/ops.py | 95.97% <0%> (+1.73%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 872a23b...05aac5b.

@codecov

codecov bot commented Apr 23, 2019

Codecov Report

Merging #26135 into master will decrease coverage by 1.26%.
The diff coverage is 97.82%.


@@            Coverage Diff             @@
##           master   #26135      +/-   ##
==========================================
- Coverage      93%   91.74%   -1.27%     
==========================================
  Files         182      174       -8     
  Lines       50311    50808     +497     
==========================================
- Hits        46793    46615     -178     
- Misses       3518     4193     +675
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.25% <89.13%> (-1.42%) ⬇️ |
| #single | 41.75% <97.82%> (-0.74%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/io/pytables.py | 90.36% <97.82%> (+0.01%) ⬆️ |
| pandas/plotting/_misc.py | 38.23% <0%> (-26.63%) ⬇️ |
| pandas/io/gbq.py | 78.94% <0%> (-21.06%) ⬇️ |
| pandas/io/gcs.py | 80% <0%> (-20%) ⬇️ |
| pandas/io/s3.py | 89.47% <0%> (-10.53%) ⬇️ |
| pandas/core/groupby/base.py | 91.83% <0%> (-8.17%) ⬇️ |
| pandas/io/excel/_xlrd.py | 94.54% <0%> (-5.46%) ⬇️ |
| pandas/core/indexing.py | 90.53% <0%> (-4.53%) ⬇️ |
| pandas/util/_decorators.py | 91.34% <0%> (-4.01%) ⬇️ |
| pandas/plotting/_core.py | 83.89% <0%> (-3.76%) ⬇️ |
| ... and 171 more | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ebcfee4...f725d20.

@P-Tillmann
Contributor Author

@jreback
I included tests for several different column types that were stored with the current release version of pandas.
But I think I need some help with the failing tests. There are two failures:

  1. "Docstring validation" and "Testing docstring validation script" are failing. code_checks.sh runs fine locally, and I don't understand how my changes affect these.

  2. In one of the Travis environments most PyTables tests fail with "ValueError: cannot set WRITEABLE flag to True of this array". This seems to be related to ValueError: cannot set WRITEABLE flag to True of this array #24839. I can reproduce it locally by creating an environment with the corresponding environment.yml; the tests pass fine if I manually upgrade PyTables to 3.5.1 (which was fixed to work with numpy 1.16).
    If I understand correctly, xfail_non_writeable was introduced in TST: xfail non-writeable pytables tests with numpy 1.16x #25517 to prevent code that triggers this bug from failing the tests. By using VLArrays for all columns when format='table', this is now triggered for most tests. Does it make sense to mark all the failing tests with xfail_non_writeable? Or is it more reasonable to not merge this PR until PyTables 3.5.1 is the standard in all Travis environments? (A minimal illustration of the underlying numpy behaviour is sketched below.)
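For reference, a minimal sketch of the numpy behaviour behind the error quoted in item 2: an array backed by a read-only buffer (similar to what older PyTables hands back) cannot have its WRITEABLE flag switched back on under numpy >= 1.16.

```python
import numpy as np

# np.frombuffer over a bytes object yields an array that does not own writable
# memory, so flipping the flag raises the same ValueError seen in the CI runs.
arr = np.frombuffer(b"\x00" * 24, dtype=np.int64)
try:
    arr.flags.writeable = True
except ValueError as err:
    print(err)  # cannot set WRITEABLE flag to True of this array
```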

@jreback
Contributor

jreback commented May 12, 2019

can you merge master and update

@P-Tillmann
Contributor Author

@jreback Updated.

  1. "Docstring validation" and "Testing docstring validation script" are ok now.
  2. "ValueError: cannot set WRITEABLE flag to True of this array" still leads to lots of failures in one of the test environments.

@jreback
Contributor

jreback commented May 19, 2019

something is not right with your patch; merge upstream/master

@P-Tillmann
Contributor Author

Hi Jeff,
thanks for your feedback. I need some help understanding what the issue is. I just pulled upstream master and it merged without conflict. I don't have any local changes. Can you specify where the problem is?

@jreback
Contributor

jreback commented May 20, 2019

maybe u didn’t push
the PR has 377 changed files

@P-Tillmann
Contributor Author

I see, thanks for the help. You were right, I didn't push.

@jreback
Contributor

jreback commented Jun 27, 2019

can you merge master; note we moved the test_pytables to a subdirectory

@jbrockmendel
Member

@P-Tillmann can you rebase

@P-Tillmann
Contributor Author

@jreback @jbrockmendel I updated to current master. But Travis still has an environment with the PyTables "cannot set WRITEABLE flag" bug for VLArrays, and since this PR uses a lot of VLArrays it fails almost all tests.

@WillAyd
Member

WillAyd commented Aug 28, 2019

@P-Tillmann is there a min version for pytables where that was fixed?
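One possible way to handle this (a hedged sketch, not the actual pandas CI setup) would be to skip or xfail the affected tests whenever the installed PyTables predates the fix; 3.5.1 is the version mentioned earlier in this thread, not independently verified here.

```python
import pytest
from distutils.version import LooseVersion

tables = pytest.importorskip("tables")

# Marker for tests that read VLArrays back as numpy arrays; the version cutoff
# (3.5.1) is taken from this discussion.
needs_fixed_pytables = pytest.mark.skipif(
    LooseVersion(tables.__version__) < LooseVersion("3.5.1"),
    reason="PyTables < 3.5.1 returns read-only arrays under numpy >= 1.16",
)
```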

@WillAyd
Member

WillAyd commented Sep 13, 2019

@P-Tillmann is this still active?

@WillAyd
Member

WillAyd commented Sep 20, 2019

Closing as stale but if this is still relevant please ping and can reopen

@WillAyd WillAyd closed this Sep 20, 2019
Labels
IO HDF5 read_hdf, HDFStore
Development

Successfully merging this pull request may close these issues.

ENH: write Table meta-data (non_index_axes) as a CArray (rather than as meta-data)