REF: Refactor sparse HDF5 read / write #28456
Conversation
In preparation for the removal of SparseSeries and SparseDataFrame, we read into a Series[sparse] / DataFrame[sparse].
@@ -3021,6 +3026,15 @@ def write_array(self, key, value, items=None):
        elif is_timedelta64_dtype(value.dtype):
            self._handle.create_array(self.group, key, value.view("i8"))
            getattr(self.group, key)._v_attrs.value_type = "timedelta64"
        elif is_sparse(value):
            # TODO: think about EA API for this.
            # value._write_hdf5(self)
Is `value._write_hdf5(self)` a commented-out line, or part of the TODO comment above?
Part of the TODO.
Thanks! Added a question to better understand the changes.
        sp_index = self.read_index("{}_sp_index".format(key))
        ret = SparseArray(
            ret, sparse_index=sp_index, fill_value=self.attrs.fill_value
        )
It seems the old "SparseSeriesFixed" used the "sp_values" key to write the sparse values. How does this code work with that? (or why does it not need to read this key specifically?)
Let me check (perhaps we don't have test coverage for it).

> or why does it not need to read this key specifically?

IIRC, I needed this because `DataFrame.write` eventually calls this on a sparse ExtensionBlock. When I didn't prefix this with the key, I got name errors from reading the wrong value (I kept overwriting `sp_index` on each column of the dataframe).
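To illustrate the collision described above, here is a minimal sketch using plain dicts as a stand-in for HDF5 group children (the names are illustrative, not the actual storage layout): writing every column's sparse index under the same unprefixed name clobbers the previous column, while a `{key}_sp_index` prefix keeps one node per column.

```python
# Unprefixed: each column overwrites the previous column's index node.
naive = {}
for col in ["A", "B"]:
    naive["sp_index"] = f"index-for-{col}"

# Prefixed with the column key: one distinct node per column survives.
prefixed = {}
for col in ["A", "B"]:
    prefixed[f"{col}_sp_index"] = f"index-for-{col}"

print(len(naive), len(prefixed))  # 1 2
```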
Hmm, not tested...
# write
>>> s = pd.Series([1, None, 2, 3]).to_sparse()
>>> s.to_hdf("foo.h5", "ss")
Reading
In [2]: pd.read_hdf("foo.h5", "ss")
---------------------------------------------------------------------------
NoSuchNodeError Traceback (most recent call last)
<ipython-input-2-a3883e54184d> in <module>
----> 1 pd.read_hdf("foo.h5", "ss")
~/sandbox/pandas/pandas/io/pytables.py in read_hdf(path_or_buf, key, mode, **kwargs)
399 )
400 key = candidate_only_group._v_pathname
--> 401 return store.select(key, auto_close=auto_close, **kwargs)
402 except (ValueError, TypeError, KeyError):
403 # if there is an error, close the store
~/sandbox/pandas/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
780 )
781
--> 782 return it.get_result()
783
784 def select_as_coordinates(self, key, where=None, start=None, stop=None, **kwargs):
~/sandbox/pandas/pandas/io/pytables.py in get_result(self, coordinates)
1642
1643 # directly return the result
-> 1644 results = self.func(self.start, self.stop, where)
1645 self.close()
1646 return results
~/sandbox/pandas/pandas/io/pytables.py in func(_start, _stop, _where)
764 # function to call on iteration
765 def func(_start, _stop, _where):
--> 766 return s.read(start=_start, stop=_stop, where=_where, columns=columns)
767
768 # create the iterator
~/sandbox/pandas/pandas/io/pytables.py in read(self, **kwargs)
3083 kwargs = self.validate_read(kwargs)
3084 index = self.read_index("index", **kwargs)
-> 3085 values = self.read_array("values", **kwargs)
3086 return Series(values, index=index, name=self.name)
3087
~/sandbox/pandas/pandas/io/pytables.py in read_array(self, key, start, stop)
2725 import tables
2726
-> 2727 node = getattr(self.group, key)
2728 attrs = node._v_attrs
2729
~/Envs/pandas-dev/lib/python3.7/site-packages/tables/group.py in __getattr__(self, name)
837 self._g_add_children_names()
838 return self.__dict__[name]
--> 839 return self._f_get_child(name)
840
841 def __setattr__(self, name, value):
~/Envs/pandas-dev/lib/python3.7/site-packages/tables/group.py in _f_get_child(self, childname)
709 self._g_check_open()
710
--> 711 self._g_check_has_child(childname)
712
713 childpath = join_path(self._v_pathname, childname)
~/Envs/pandas-dev/lib/python3.7/site-packages/tables/group.py in _g_check_has_child(self, name)
396 raise NoSuchNodeError(
397 "group ``%s`` does not have a child named ``%s``"
--> 398 % (self._v_pathname, name))
399 return node_type
400
NoSuchNodeError: group ``/ss`` does not have a child named ``values``
That's raising on `values`, but it's the same issue. Previously we used `sp_values`, now we use `{key}`. Will investigate, and add a failing test.
It might be worth keeping the SparseSeries/SparseFrameFixed classes that you removed (or at least the read part of them), but adjusting them to create a Series[sparse] instead of a SparseSeries. That seems the easiest way to handle legacy hdf5 files that contain such sparse data.
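As a hedged sketch of what such a legacy-read shim could do (the payload layout below is assumed for illustration, not the actual SparseSeriesFixed on-disk format): reconstruct a dense array from the stored values and positions, then wrap it in a modern Series with a sparse dtype.

```python
import numpy as np
import pandas as pd

# Illustrative legacy payload: dense length, positions of the stored
# values, and the stored values themselves (layout assumed).
length = 4
indices = np.array([0, 2, 3])
sp_values = np.array([1.0, 2.0, 3.0])

# Rebuild the dense values, then return a Series[sparse] instead of
# the removed SparseSeries.
dense = np.full(length, np.nan)
dense[indices] = sp_values
s = pd.Series(pd.arrays.SparseArray(dense))
```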
Hmm, can someone confirm something for me on master / pandas 0.25.1?

df = pd.DataFrame({"A": [1, None, 2, 3]}).to_sparse()
df.to_hdf("foo.h5", "sdf")
pd.read_hdf("foo.h5", "sdf")

I see
Hmm, yes, I see the same (with another example). If it is actually not working (and also not even tested, as you noticed), then that simplifies things, I think ;)
Oh, SparseSeries reading is broken too, not just SparseFrame? So things are simplified quite a bit indeed :) I have a local fix for SparseSeries, but it's quite ugly. I think for now I'll ignore reading old sparse HDF5 files, and only support reading new files. I'll add additional tests for newly created files though.
This reverts commit 2d5c299.
Given that it didn't work before, I would say this is not necessarily a blocker for #28425. But if this is already mostly working for the new way, I'm fine with adding this capability.
Good point. This solution doesn't feel 100% correct. In particular, I don't like dynamically modifying
I would likely just remove sparse entirely from HDF.
Yes, my preference right now is probably not to merge this PR, and to follow up with a proper API for dispatching array / attribute writing for all ExtensionArrays.
Sounds good to me.
Going to close this for now. This should be covered as part of #20612 when that's picked up.
In preparation for the removal of SparseSeries and SparseDataFrame, we
read into a Series[sparse] / DataFrame[sparse]. #28425
cc @jreback @jorisvandenbossche.
I've just done the minimal amount to get this working. It may shed some light on how we can design an API for dispatching writing for 3rd-party EAs. (I think we should ask the EA for the values (arrays) to write, plus the bits of extra metadata; pandas should take care of the actual writing. But that's for another day.)
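As a rough illustration of that dispatch idea, here is a hypothetical hook (`decompose_for_hdf` is invented for this sketch, not a pandas or ExtensionArray API): the sparse array hands back plain ndarrays plus metadata, and the writer would handle the actual HDF5 I/O.

```python
import numpy as np
import pandas as pd

def decompose_for_hdf(arr):
    """Hypothetical helper: split a SparseArray into writable pieces.

    Returns plain ndarrays and scalars; the store (not the EA) would
    be responsible for actually writing them.
    """
    return {
        "sp_values": np.asarray(arr.sp_values),
        "indices": np.asarray(arr.sp_index.indices),
        "length": len(arr),
        "fill_value": arr.fill_value,
    }

arr = pd.arrays.SparseArray([1.0, None, 2.0, 3.0])
parts = decompose_for_hdf(arr)
```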