
REF: Refactor sparse HDF5 read / write #28456

Closed · wants to merge 5 commits

Conversation

@TomAugspurger (Contributor)

In preparation for the removal of SparseSeries and SparseDataFrame (#28425), we
read into a Series[sparse] / DataFrame[sparse] instead.

cc @jreback @jorisvandenbossche.

I've just done the minimal amount needed to get this working. It may shed some light on how we can design an API for dispatching HDF5 writing for third-party EAs: I think we should ask the EA for the values (arrays) to write, plus the bits of extra metadata, while pandas takes care of the actual writing. But that's for another day.
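As a very rough sketch of that idea (nothing here is an API this PR adds; the function name and the returned shapes are hypothetical), pandas would ask the array for its pieces and then own the PyTables calls itself:

```python
import numpy as np
import pandas as pd


def hdf5_components(values):
    """Hypothetical dispatch: ask the EA which arrays and metadata to persist.

    The writer (HDFStore) would then create one node per returned array and
    stash the metadata on the group's attrs, without knowing the dtype.
    """
    if isinstance(values, pd.arrays.SparseArray):
        arrays = {
            "sp_values": np.asarray(values.sp_values),
            "sp_index": values.sp_index.to_int_index().indices,
        }
        metadata = {"fill_value": values.fill_value, "length": len(values)}
        return arrays, metadata
    raise TypeError("no HDF5 components registered for {}".format(type(values)))
```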

@TomAugspurger added the IO HDF5 (read_hdf, HDFStore) and Sparse (Sparse Data Type) labels on Sep 16, 2019
@TomAugspurger added this to the 1.0 milestone on Sep 16, 2019
```diff
@@ -3021,6 +3026,15 @@ def write_array(self, key, value, items=None):
         elif is_timedelta64_dtype(value.dtype):
             self._handle.create_array(self.group, key, value.view("i8"))
             getattr(self.group, key)._v_attrs.value_type = "timedelta64"
+        elif is_sparse(value):
+            # TODO: think about EA API for this.
+            # value._write_hdf5(self)
```
Member

is `value._write_hdf5(self)` a commented-out line or part of the TODO comment above?

Contributor Author

Part of the TODO.
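For context, a sketch of what that commented-out hook might do if pursued, mirroring the read path quoted further down. The TODO spells it `value._write_hdf5(self)`; the extra `key` parameter and everything else here are assumptions, not part of this PR:

```python
import pandas as pd


class SparseArrayWithHook(pd.arrays.SparseArray):
    """Sketch of the TODO's hook; the name and signature are hypothetical."""

    def _write_hdf5(self, fixed, key):
        # Dense values go under `key`; the sparse index under a prefixed
        # node so multiple DataFrame columns don't clobber each other.
        fixed.write_array(key, self.sp_values)
        fixed.write_index(
            "{}_sp_index".format(key),
            pd.Index(self.sp_index.to_int_index().indices),
        )
        # fill_value rides along as an attribute (see the discussion below
        # about whether mutating attrs at write time is a good idea).
        fixed.attrs.fill_value = self.fill_value
```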

@jorisvandenbossche (Member) left a comment

Thanks! I added a question to better understand the changes.

```python
sp_index = self.read_index("{}_sp_index".format(key))
ret = SparseArray(
    ret, sparse_index=sp_index, fill_value=self.attrs.fill_value
)
```
Member

It seems the old `SparseSeriesFixed` used the `sp_values` key to write the sparse values. How does this code work with that? (Or why does it not need to read this key specifically?)

Contributor Author

Let me check (perhaps we don't have test coverage for it).

> or why does it not need to read this key specifically?

IIRC, I needed this because DataFrame.write eventually calls this on a sparse ExtensionBlock. When I didn't prefix this with the key, I got name errors from reading the wrong value (I kept overwriting sp_index on each column of the dataframe).
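Concretely, for a sparse DataFrame with columns A and B stored under a hypothetical group /sdf, the prefixed keys give each column its own index node:

```
/sdf/A            <- sp_values for column A
/sdf/A_sp_index   <- sparse index for column A
/sdf/B            <- sp_values for column B
/sdf/B_sp_index   <- sparse index for column B
```

Without the prefix, both columns would write to a single /sdf/sp_index node, each write overwriting the previous one.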

Contributor Author

Hmm, not tested...

```python
# write
>>> s = pd.Series([1, None, 2, 3]).to_sparse()
>>> s.to_hdf("foo.h5", "ss")
```

Reading

```
In [2]: pd.read_hdf("foo.h5", "ss")
---------------------------------------------------------------------------
NoSuchNodeError                           Traceback (most recent call last)
<ipython-input-2-a3883e54184d> in <module>
----> 1 pd.read_hdf("foo.h5", "ss")

~/sandbox/pandas/pandas/io/pytables.py in read_hdf(path_or_buf, key, mode, **kwargs)
    399                     )
    400             key = candidate_only_group._v_pathname
--> 401         return store.select(key, auto_close=auto_close, **kwargs)
    402     except (ValueError, TypeError, KeyError):
    403         # if there is an error, close the store

~/sandbox/pandas/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    780         )
    781
--> 782         return it.get_result()
    783
    784     def select_as_coordinates(self, key, where=None, start=None, stop=None, **kwargs):

~/sandbox/pandas/pandas/io/pytables.py in get_result(self, coordinates)
   1642
   1643         # directly return the result
-> 1644         results = self.func(self.start, self.stop, where)
   1645         self.close()
   1646         return results

~/sandbox/pandas/pandas/io/pytables.py in func(_start, _stop, _where)
    764         # function to call on iteration
    765         def func(_start, _stop, _where):
--> 766             return s.read(start=_start, stop=_stop, where=_where, columns=columns)
    767
    768         # create the iterator

~/sandbox/pandas/pandas/io/pytables.py in read(self, **kwargs)
   3083         kwargs = self.validate_read(kwargs)
   3084         index = self.read_index("index", **kwargs)
-> 3085         values = self.read_array("values", **kwargs)
   3086         return Series(values, index=index, name=self.name)
   3087

~/sandbox/pandas/pandas/io/pytables.py in read_array(self, key, start, stop)
   2725         import tables
   2726
-> 2727         node = getattr(self.group, key)
   2728         attrs = node._v_attrs
   2729

~/Envs/pandas-dev/lib/python3.7/site-packages/tables/group.py in __getattr__(self, name)
    837             self._g_add_children_names()
    838             return self.__dict__[name]
--> 839         return self._f_get_child(name)
    840
    841     def __setattr__(self, name, value):

~/Envs/pandas-dev/lib/python3.7/site-packages/tables/group.py in _f_get_child(self, childname)
    709         self._g_check_open()
    710
--> 711         self._g_check_has_child(childname)
    712
    713         childpath = join_path(self._v_pathname, childname)

~/Envs/pandas-dev/lib/python3.7/site-packages/tables/group.py in _g_check_has_child(self, name)
    396             raise NoSuchNodeError(
    397                 "group ``%s`` does not have a child named ``%s``"
--> 398                 % (self._v_pathname, name))
    399         return node_type
    400

NoSuchNodeError: group ``/ss`` does not have a child named ``values``
```

That's raising on `values`, but it's the same issue. Previously we used `sp_values`; now we use `{key}`. Will investigate and add a failing test.
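For the record, one way to see the mismatch is to list the group's children directly with PyTables (the `sp_values` name for the legacy node is inferred from the discussion above):

```python
import tables

with tables.open_file("foo.h5") as h5:
    print(list(h5.root.ss._v_children))
# The new reader expects a 'values' child (plus 'values_sp_index');
# the legacy SparseSeriesFixed writer created 'sp_values' instead.
```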

Member

It might be worth keeping the SparseSeriesFixed / SparseFrameFixed classes that you removed (or at least the read part), but adjusting them to create a Series[sparse] instead of a SparseSeries. That seems to me the easiest way to handle legacy HDF5 files that contain such sparse data.
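A minimal sketch of that suggestion, assuming the legacy layout stored `sp_values` plus a sparse index and a `fill_value` attribute (the class below is hypothetical, the import paths follow pandas 0.25's layout, and reconstructing the stored SparseIndex is hand-waved):

```python
from pandas import Series
from pandas.core.arrays.sparse import SparseArray
from pandas.io.pytables import SeriesFixed


class LegacySparseSeriesFixed(SeriesFixed):
    """Read legacy SparseSeries storage into a Series[sparse] (sketch)."""

    def read(self, **kwargs):
        kwargs = self.validate_read(kwargs)
        index = self.read_index("index")
        sp_values = self.read_array("sp_values")
        # How the legacy sp_index node maps back onto a SparseIndex is
        # glossed over here; the PR's new reader uses read_index similarly.
        sp_index = self.read_index("sp_index")
        arr = SparseArray(
            sp_values, sparse_index=sp_index, fill_value=self.attrs.fill_value
        )
        return Series(arr, index=index, name=self.name)
```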

@TomAugspurger (Contributor Author)

Hmm, can someone confirm something for me on master / pandas 0.25.1?

```python
df = pd.DataFrame({"A": [1, None, 2, 3]}).to_sparse()
df.to_hdf("foo.h5", "sdf")
pd.read_hdf("foo.h5", "sdf")
```

I see `NotImplementedError: start and/or stop are not supported in fixed Sparse reading`, so reading / writing SparseDataFrame is already broken?

@jorisvandenbossche (Member)

Hmm, yes, I see the same (with another example). If it is actually not working (and, as you noticed, not even tested), then that simplifies things, I think ;)

@TomAugspurger (Contributor Author)

Oh, SparseSeries reading is broken too, not just SparseFrame? So things are simplified quite a bit indeed :)

I have a local fix for SparseSeries, but it's quite ugly. I think for now I'll ignore reading old sparse HDF5 files and only support reading new files. I'll add additional tests for newly created files, though.
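Those newly-created-file tests could be simple round-trips along these lines (a sketch; the test name and placement are illustrative, and a DataFrame[sparse] variant would look the same):

```python
import pandas as pd
import pandas.util.testing as tm


def test_sparse_series_hdf_roundtrip():
    ser = pd.Series([1.0, None, 2.0, 3.0], dtype="Sparse[float64]")
    with tm.ensure_clean("sparse.h5") as path:
        ser.to_hdf(path, "ss")
        result = pd.read_hdf(path, "ss")
    tm.assert_series_equal(result, ser)
```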

This reverts commit 2d5c299.
@jorisvandenbossche (Member)

Given that it didn't work before, I would say this isn't necessarily a blocker for #28425. But if this is already mostly working for the new way, I'm fine with adding this capability.

@TomAugspurger (Contributor Author)

Good point.

This solution doesn't feel 100% correct. In particular, I don't like dynamically modifying `self.attrs` when we write the array.
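To spell out the objection (a simplified sketch; the per-node alternative at the end is just one possibility, not something this PR implements):

```python
def write_sparse(fixed, key, value):
    # What the sparse branch effectively does: fill_value lands on the
    # store-level attrs and is overwritten on every write_array call.
    fixed.attrs.fill_value = value.fill_value

    # A per-node alternative: the metadata travels with the array it
    # describes, so writes for other keys can't clobber it.
    node = getattr(fixed.group, key)
    node._v_attrs.fill_value = value.fill_value
```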

@jreback (Contributor) commented Sep 16, 2019

I would likely just remove sparse entirely from HDF; it was barely supported to begin with.

@TomAugspurger (Contributor Author)

Yes, my preference right now is probably not to merge this PR and to follow up with a proper API for dispatching array / attribute writing for all ExtensionArrays.

@jorisvandenbossche (Member)

Sounds good to me.

@TomAugspurger (Contributor Author)

Going to close this for now. This should be covered as part of #20612 when that's picked up.
