
REF: Refactor sparse HDF5 read / write #28456

Closed · wants to merge 5 commits

Conversation

@TomAugspurger (Contributor)

In preparation for the removal of SparseSeries and SparseDataFrame (#28425), we
read into a Series[sparse] / DataFrame[sparse] instead.

cc @jreback @jorisvandenbossche.

I've just done the minimal amount needed to get this working. It may shed some light on how we can design an API for dispatching HDF5 writing for third-party EAs: I think we should ask the EA for the values (arrays) to write, plus the bits of extra metadata, while pandas takes care of the actual writing. But that's for another day.
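As a very rough sketch of that idea (nothing here is an API this PR adds; the function name and the returned shapes are hypothetical), pandas would ask the array for its pieces and then own the PyTables calls itself:

```python
import numpy as np
import pandas as pd


def hdf5_components(values):
    """Hypothetical dispatch: ask the EA which arrays and metadata to persist.

    The writer (HDFStore) would then create one node per returned array and
    stash the metadata on the group's attrs, without knowing the dtype.
    """
    if isinstance(values, pd.arrays.SparseArray):
        arrays = {
            "sp_values": np.asarray(values.sp_values),
            "sp_index": values.sp_index.to_int_index().indices,
        }
        metadata = {"fill_value": values.fill_value, "length": len(values)}
        return arrays, metadata
    raise TypeError("no HDF5 components registered for {}".format(type(values)))
```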

@TomAugspurger added the IO HDF5 (read_hdf, HDFStore) and Sparse (Sparse Data Type) labels on Sep 16, 2019
@TomAugspurger added this to the 1.0 milestone on Sep 16, 2019
```diff
@@ -3021,6 +3026,15 @@ def write_array(self, key, value, items=None):
         elif is_timedelta64_dtype(value.dtype):
             self._handle.create_array(self.group, key, value.view("i8"))
             getattr(self.group, key)._v_attrs.value_type = "timedelta64"
+        elif is_sparse(value):
+            # TODO: think about EA API for this.
+            # value._write_hdf5(self)
```
Member

is `value._write_hdf5(self)` a commented-out line or part of the TODO comment above?

Contributor Author

Part of the TODO.
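For context, a sketch of what that commented-out hook might do if pursued, mirroring the read path quoted further down. The TODO spells it `value._write_hdf5(self)`; the extra `key` parameter and everything else here are assumptions, not part of this PR:

```python
import pandas as pd


class SparseArrayWithHook(pd.arrays.SparseArray):
    """Sketch of the TODO's hook; the name and signature are hypothetical."""

    def _write_hdf5(self, fixed, key):
        # Dense values go under `key`; the sparse index under a prefixed
        # node so multiple DataFrame columns don't clobber each other.
        fixed.write_array(key, self.sp_values)
        fixed.write_index(
            "{}_sp_index".format(key),
            pd.Index(self.sp_index.to_int_index().indices),
        )
        # fill_value rides along as an attribute (see the discussion below
        # about whether mutating attrs at write time is a good idea).
        fixed.attrs.fill_value = self.fill_value
```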

@jorisvandenbossche (Member) left a comment

Thanks! I added a question to better understand the changes.

```python
sp_index = self.read_index("{}_sp_index".format(key))
ret = SparseArray(
    ret, sparse_index=sp_index, fill_value=self.attrs.fill_value
)
```
Member

It seems the old `SparseSeriesFixed` used the `sp_values` key to write the sparse values. How does this code work with that? (Or why does it not need to read this key specifically?)

Contributor Author

Let me check (perhaps we don't have test coverage for it).

> or why does it not need to read this key specifically?

IIRC, I needed this because DataFrame.write eventually calls this on a sparse ExtensionBlock. When I didn't prefix this with the key, I got name errors from reading the wrong value (I kept overwriting sp_index on each column of the dataframe).
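Concretely, for a sparse DataFrame with columns A and B stored under a hypothetical group /sdf, the prefixed keys give each column its own index node:

```
/sdf/A            <- sp_values for column A
/sdf/A_sp_index   <- sparse index for column A
/sdf/B            <- sp_values for column B
/sdf/B_sp_index   <- sparse index for column B
```

Without the prefix, both columns would write to a single /sdf/sp_index node, each write overwriting the previous one.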

Contributor Author

Hmm, not tested...

```python
# write
>>> s = pd.Series([1, None, 2, 3]).to_sparse()
>>> s.to_hdf("foo.h5", "ss")
```

Reading

```
In [2]: pd.read_hdf("foo.h5", "ss")
---------------------------------------------------------------------------
NoSuchNodeError                           Traceback (most recent call last)
<ipython-input-2-a3883e54184d> in <module>
----> 1 pd.read_hdf("foo.h5", "ss")

~/sandbox/pandas/pandas/io/pytables.py in read_hdf(path_or_buf, key, mode, **kwargs)
    399                     )
    400             key = candidate_only_group._v_pathname
--> 401         return store.select(key, auto_close=auto_close, **kwargs)
    402     except (ValueError, TypeError, KeyError):
    403         # if there is an error, close the store

~/sandbox/pandas/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    780         )
    781
--> 782         return it.get_result()
    783
    784     def select_as_coordinates(self, key, where=None, start=None, stop=None, **kwargs):

~/sandbox/pandas/pandas/io/pytables.py in get_result(self, coordinates)
   1642
   1643         # directly return the result
-> 1644         results = self.func(self.start, self.stop, where)
   1645         self.close()
   1646         return results

~/sandbox/pandas/pandas/io/pytables.py in func(_start, _stop, _where)
    764         # function to call on iteration
    765         def func(_start, _stop, _where):
--> 766             return s.read(start=_start, stop=_stop, where=_where, columns=columns)
    767
    768         # create the iterator

~/sandbox/pandas/pandas/io/pytables.py in read(self, **kwargs)
   3083         kwargs = self.validate_read(kwargs)
   3084         index = self.read_index("index", **kwargs)
-> 3085         values = self.read_array("values", **kwargs)
   3086         return Series(values, index=index, name=self.name)
   3087

~/sandbox/pandas/pandas/io/pytables.py in read_array(self, key, start, stop)
   2725         import tables
   2726
-> 2727         node = getattr(self.group, key)
   2728         attrs = node._v_attrs
   2729

~/Envs/pandas-dev/lib/python3.7/site-packages/tables/group.py in __getattr__(self, name)
    837             self._g_add_children_names()
    838             return self.__dict__[name]
--> 839         return self._f_get_child(name)
    840
    841     def __setattr__(self, name, value):

~/Envs/pandas-dev/lib/python3.7/site-packages/tables/group.py in _f_get_child(self, childname)
    709         self._g_check_open()
    710
--> 711         self._g_check_has_child(childname)
    712
    713         childpath = join_path(self._v_pathname, childname)

~/Envs/pandas-dev/lib/python3.7/site-packages/tables/group.py in _g_check_has_child(self, name)
    396             raise NoSuchNodeError(
    397                 "group ``%s`` does not have a child named ``%s``"
--> 398                 % (self._v_pathname, name))
    399         return node_type
    400

NoSuchNodeError: group ``/ss`` does not have a child named ``values``
```

That's raising on `values`, but it's the same issue. Previously we used `sp_values`; now we use `{key}`. Will investigate and add a failing test.
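For the record, one way to see the mismatch is to list the group's children directly with PyTables (the `sp_values` name for the legacy node is inferred from the discussion above):

```python
import tables

with tables.open_file("foo.h5") as h5:
    print(list(h5.root.ss._v_children))
# The new reader expects a 'values' child (plus 'values_sp_index');
# the legacy SparseSeriesFixed writer created 'sp_values' instead.
```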

Member

It might be worth keeping the SparseSeriesFixed / SparseFrameFixed classes that you removed (or at least the read part), but adjusting them to create a Series[sparse] instead of a SparseSeries. That seems to me the easiest way to handle legacy HDF5 files that contain such sparse data.
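A minimal sketch of that suggestion, assuming the legacy layout stored `sp_values` plus a sparse index and a `fill_value` attribute (the class below is hypothetical, the import paths follow pandas 0.25's layout, and reconstructing the stored SparseIndex is hand-waved):

```python
from pandas import Series
from pandas.core.arrays.sparse import SparseArray
from pandas.io.pytables import SeriesFixed


class LegacySparseSeriesFixed(SeriesFixed):
    """Read legacy SparseSeries storage into a Series[sparse] (sketch)."""

    def read(self, **kwargs):
        kwargs = self.validate_read(kwargs)
        index = self.read_index("index")
        sp_values = self.read_array("sp_values")
        # How the legacy sp_index node maps back onto a SparseIndex is
        # glossed over here; the PR's new reader uses read_index similarly.
        sp_index = self.read_index("sp_index")
        arr = SparseArray(
            sp_values, sparse_index=sp_index, fill_value=self.attrs.fill_value
        )
        return Series(arr, index=index, name=self.name)
```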

@TomAugspurger (Contributor Author)

Hmm, can someone confirm something for me on master / pandas 0.25.1?

```python
df = pd.DataFrame({"A": [1, None, 2, 3]}).to_sparse()
df.to_hdf("foo.h5", "sdf")
pd.read_hdf("foo.h5", "sdf")
```

I see `NotImplementedError: start and/or stop are not supported in fixed Sparse reading`, so reading / writing SparseDataFrame is already broken?

@jorisvandenbossche (Member)

Hmm, yes, I see the same (with another example). If it is actually not working (and, as you noticed, not even tested), then that simplifies things, I think ;)

@TomAugspurger (Contributor Author)

Oh, SparseSeries reading is broken too, not just SparseFrame? So things are simplified quite a bit indeed :)

I have a local fix for SparseSeries, but it's quite ugly. I think for now I'll ignore reading old sparse HDF5 files and only support reading new files. I'll add additional tests for newly created files, though.
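Those newly-created-file tests could be simple round-trips along these lines (a sketch; the test name and placement are illustrative, and a DataFrame[sparse] variant would look the same):

```python
import pandas as pd
import pandas.util.testing as tm


def test_sparse_series_hdf_roundtrip():
    ser = pd.Series([1.0, None, 2.0, 3.0], dtype="Sparse[float64]")
    with tm.ensure_clean("sparse.h5") as path:
        ser.to_hdf(path, "ss")
        result = pd.read_hdf(path, "ss")
    tm.assert_series_equal(result, ser)
```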

This reverts commit 2d5c299.
@jorisvandenbossche (Member)

Given that it didn't work before, I would say this isn't necessarily a blocker for #28425. But if this is already mostly working for the new way, I'm fine with adding this capability.

@TomAugspurger (Contributor Author)

Good point.

This solution doesn't feel 100% correct. In particular, I don't like dynamically modifying `self.attrs` when we write the array.
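To spell out the objection (a simplified sketch; the per-node alternative at the end is just one possibility, not something this PR implements):

```python
def write_sparse(fixed, key, value):
    # What the sparse branch effectively does: fill_value lands on the
    # store-level attrs and is overwritten on every write_array call.
    fixed.attrs.fill_value = value.fill_value

    # A per-node alternative: the metadata travels with the array it
    # describes, so writes for other keys can't clobber it.
    node = getattr(fixed.group, key)
    node._v_attrs.fill_value = value.fill_value
```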

@jreback (Contributor) commented Sep 16, 2019

I would likely just remove sparse entirely from HDF; it was barely supported to begin with.

@TomAugspurger (Contributor Author)

Yes, my preference right now is probably not to merge this PR and to follow up with a proper API for dispatching array / attribute writing for all ExtensionArrays.

@jorisvandenbossche (Member)

Sounds good to me.

@TomAugspurger (Contributor Author)

Going to close this for now. This should be covered as part of #20612 when that's picked up.
