-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Allow storing ExtensionArrays in containers #19520
ENH: Allow storing ExtensionArrays in containers #19520
Conversation
pandas/core/arrays/base.py
Outdated
@@ -219,7 +217,7 @@ def _formatting_values(self): | |||
# type: () -> np.ndarray | |||
# At the moment, this has to be an array since we use result.dtype | |||
"""An array of values to be printed in, e.g. the Series repr""" | |||
raise AbstractMethodError(self) | |||
raise np.array(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested by @jorisvandenbossche in #19268, this should be an OK default implementation.
pandas/core/dtypes/missing.py
Outdated
|
||
values = getattr(obj, 'values', obj) | ||
dtype = values.dtype | ||
|
||
if is_string_dtype(dtype): | ||
if isinstance(values, ExtensionArray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be able to replace this with is_extension_array_dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will break if the extension array implements a .values
(due to two lines above). Eg if you would use that name instead of .data
. I won't find that the best name for it, so ideally an ExtensionArray author does not do that, but we maybe shouldn't assume that?
(or either document in the interface that that attribute should not be present)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing: won't this catch Categorical? (because there is still handling for categoricals some lines below, which might be removable)
@@ -2826,7 +2827,7 @@ def reindexer(value): | |||
value = maybe_cast_to_datetime(value, value.dtype) | |||
|
|||
# return internal types directly | |||
if is_extension_type(value): | |||
if is_extension_type(value) or is_extension_array_dtype(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an unfortunate intermediate stage where some of our internal extension types (sparse, datetime w/ tz) are not yet actual extension array types. We'll be able to clean this up this eventually.
pandas/core/indexes/base.py
Outdated
@@ -1038,6 +1039,31 @@ def _to_embed(self, keep_tz=False, dtype=None): | |||
|
|||
return self.values.copy() | |||
|
|||
def _as_best_array(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a need for a method like this. It may be good to add to Series as well, and maybe make public.
Essentially, we want a clear way to get the "best" .values
. For Indexes with ExtensionArray types (Categorical
for now, Interval
, Period
soon) that's the extension array. Otherwise, it's an ndarray (or I think `DatetimeIndex for datetime w/ tz right now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this will also depend on decision if we will change return value for .values
for series with period and interval objects)
For Series we already have something like ._values
. Won't that be the equivalent there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so. Maybe I should re-use that convention.
Unfortunately, DatetimeIndex[tz]._values
returns an ndarray
, w/o the TZ, where it "should" return a DatetimeIndex
(and eventually an ExtensionArray). I'll see how badly changing that breaks things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes ths should be equiv of _values
. DTI should be fixed as a pre-cursor PR. no problem making a different method name for this, but then simply rename _values to that name (I don't personally like _as_best_array), maybe (_as_array) is better. Should again be a pre-cursor independent PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #19548
pandas/core/indexes/base.py
Outdated
[a, a, b] | ||
Categories (2, object): [a, b] | ||
|
||
>>> pd.IntervalIndex.from_breaks([0, 1, 2])._as_best_array() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, this example isn't true (yet TomAugspurger/pandas@pandas-array-upstream+fu1...TomAugspurger:pandas-array-upstream+fu1+interval)
pandas/core/internals.py
Outdated
values = self._maybe_coerce_values(values) | ||
super().__init__(values, placement, ndim) | ||
|
||
def _maybe_coerce_values(self, values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xref #19492
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some brief comment on first part of the diff (battery is running out on the train :-))
pandas/core/dtypes/missing.py
Outdated
|
||
values = getattr(obj, 'values', obj) | ||
dtype = values.dtype | ||
|
||
if is_string_dtype(dtype): | ||
if isinstance(values, ExtensionArray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will break if the extension array implements a .values
(due to two lines above). Eg if you would use that name instead of .data
. I won't find that the best name for it, so ideally an ExtensionArray author does not do that, but we maybe shouldn't assume that?
(or either document in the interface that that attribute should not be present)
pandas/core/dtypes/missing.py
Outdated
|
||
values = getattr(obj, 'values', obj) | ||
dtype = values.dtype | ||
|
||
if is_string_dtype(dtype): | ||
if isinstance(values, ExtensionArray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing: won't this catch Categorical? (because there is still handling for categoricals some lines below, which might be removable)
pandas/core/indexes/base.py
Outdated
@@ -1038,6 +1039,31 @@ def _to_embed(self, keep_tz=False, dtype=None): | |||
|
|||
return self.values.copy() | |||
|
|||
def _as_best_array(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this will also depend on decision if we will change return value for .values
for series with period and interval objects)
For Series we already have something like ._values
. Won't that be the equivalent there?
Codecov Report
@@ Coverage Diff @@
## master #19520 +/- ##
==========================================
+ Coverage 91.58% 91.63% +0.04%
==========================================
Files 150 150
Lines 48892 48931 +39
==========================================
+ Hits 44780 44840 +60
+ Misses 4112 4091 -21
Continue to review full report at Codecov.
|
Hello @TomAugspurger! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on February 22, 2018 at 16:31 Hours UTC |
pandas/core/arrays/base.py
Outdated
@@ -18,19 +20,20 @@ class ExtensionArray(object): | |||
|
|||
* __getitem__ | |||
* __len__ | |||
* __iter__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche we removed this earlier since collections implementing __len__
and __getitem__
are iterable. But (strangely?) isinstance(arr, collections.Iterable)
is False, so these aren't considered "list-like" unless they actually implement __iter__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But do we use this somewhere? (the expectation that isinstance(arr, collections.Iterable) is True)
As the docs https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable says this is not a correct way to test that something is iterable, exactly for the reason you just mentioned.
Anyway, if we add it, I would add a default implementation (return iter(self)
?) and not require subclasses to provide one manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But do we use this somewhere?
Yes, sorry. We use it in Series.__init__
, via is_list_like(data)
.
Will add the default implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, iter(self)
will cause infinite recursion, and we can't assume anything about how the data is stored. That's unfortunate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could fall back to
for i in range(len(self)):
yield self[i]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry. We use it in Series.init, via is_list_like(data).
Or we could fix is_list_like
..
Wait, iter(self) will cause infinite recursion, and we can't assume anything about how the data is stored.
Why would it cause infinite recursion?
With iter(self)
we don't assume anything about how the data is stored, I don't know how iter exactly works under the hood, but I assume it just uses __getitem__
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we could fix is_list_like
Prior to #17654 we did that, but it was fragile to things like str
(the class, not an instance) being considered iterable. Not sure if changing it back is an improvement.
Why would it cause infinite recursion?
def __iter__(self):
return iter(self)
will call self.__iter__
, which calls self.__iter__
, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, yes of course :-)
Anyhow, the other way is a fine default implementation
Added a test-case "JSONArray" that stores mappings. |
elif isinstance(item, np.ndarray) and item.dtype == 'bool': | ||
return type(self)([x for x, m in zip(self, item) if m]) | ||
elif isinstance(item, collections.Sequence): | ||
return type(self)([self.data[i] for i in item]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment (I think) this is not in our requirements / specification (although numpy arrays do this as well). Was this needed for test to get passed? (i.e. do we use this somewhere internally, instead of using take
?)
Anyway, I would either add to specs, or remove it here to test that it is not required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I accidentally relied on this for the test
def test_take_sequence(self, data):
result = pd.Series(data[[0, 1, 3]])
Note that it's calling data.__getitem__
, not Series.__getitem__
.
I don't think pandas uses it anywhere. Will rewrite that test.
Meta-discussion, what methods should we consider "necessary" to work. Currently we have
My current TODO list includes
Am I missing anything obvious? (cc @shoyer, @chris-b1 if you have opinions on this point). |
For any kind of ops, my "minimum" is a (decent) error message for now. I think Further:
|
A good one, yes :) But for now the only correctness issue is NA-handling. For now we can (maybe) drop missing values, convert to an object array, and then do |
@@ -236,6 +245,7 @@ def _concat_same_type(cls, to_concat): | |||
""" | |||
raise AbstractMethodError(cls) | |||
|
|||
@property | |||
def _can_hold_na(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interface change: This should have been a property.
pandas/core/algorithms.py
Outdated
@@ -542,7 +543,7 @@ def value_counts(values, sort=True, ascending=False, normalize=False, | |||
|
|||
else: | |||
|
|||
if is_categorical_dtype(values) or is_sparse(values): | |||
if is_extension_array_dtype(values) or is_sparse(values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_sparse should be recognized as an extension array type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's on my todo list. I'll do it after interval.
pandas/core/arrays/base.py
Outdated
def __iter__(self): | ||
"""Iterate over elements. | ||
|
||
This needs to be implemented so that pandas recognizes extension arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be a better doc-string here
pandas/core/arrays/base.py
Outdated
as list-like. The default implementation makes successive calls to | ||
``__getitem__``, which may be slower than necessary. | ||
""" | ||
for i in range(len(self)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you not just raising AbstractMethodError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is tied intimately with the boxing concept of scalars, but this should be decoupled from getitem (yes its a valid implementation, but you would never actually do it this way), rather you would iterate on some underlying value and box it, not index into and as a side effect use getitem to box.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather you would iterate on some underlying value
We don't know how the subclass is storing things though, so there's no .values
to iterate over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so my point is that you cannot implement a default here, make the sub-classes do it. yes it might repeat some code, but its much better there i think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you cannot
Yes, we can (see the code). You "don't want" it, that something different :)
But I don't really understand why the above is not a good default implementation. I think a lot subclasses will do exactly this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And some context: If I remove __iter__
, the object will still be iterable with essentially this implementation since it defines __getitem__
and __len__
. We just need to implement it explicitly so that is_iterable
works correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it still might be better raise here by default and force the subclass to implement this (even if this impl). It is SO important for subclasses to pay attention to this, it should just be required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't do much better than this implementation.
To give you some context: in geopandas, I have a cython method that converts a GeometryArray to numpy object array of Geometry objects, which takes ca 80ms on an array of 10000 points. Simply doing np.array([a[i] for i in range(len(a))], dtype=object)
is ca 108 ms. This small difference is because by definition in each iteration I need to create a python object.
@@ -219,7 +228,7 @@ def _formatting_values(self): | |||
# type: () -> np.ndarray | |||
# At the moment, this has to be an array since we use result.dtype | |||
"""An array of values to be printed in, e.g. the Series repr""" | |||
raise AbstractMethodError(self) | |||
return np.array(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NO, this should be implemented only by subclasses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very sensible default IMO, subclasses can always override if needed
pandas/core/arrays/base.py
Outdated
if dropna: | ||
self = self[~self.isna()] | ||
|
||
return value_counts(np.array(self)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too tied to materializing using ndarary. rather this should raise AbstractMethodError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may have a difference in philosophy here then. I'm trying to follow the lead of standard library in providing default implementations for as much as possible, even if they're sub-optimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I find a bit "strange" about this method in adding it to the required interface of an extension array, is that it returns a Series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's strange about that? The fact that it's being added, or that it's returning a Series? I'm not sure what else it would return.
Ideally, we'll someday have an Index[extension_dtype]
, so that you get a Series with an exetension-array backed index. That's seems like the most natural return value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to wrap this on the return, IOW call EA(....) here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, not real happy with this interface, but I guess ok for now. let's revisit this at somepoint in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger I suppose you want this in here, so you can implement a more efficient method on IPArray?
Because we could also, for now, keep this implementation in pd.value_counts
(so for people who want this functionality, they can use it) instead of adding it as a method on ExtensionArray interface itself. And we can revisit when talking about unique/factorize/duplicated/.. algos support for ExtensionArrays in general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly added it since that's what Categorical
does. I can see if removing it breaks anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure, we're OK with pd.value_counts
and Series[EA]
currently failing for 3rd party EAs? By removing that from the interface, we'll see
AttributeError: 'DecimalArray' object has no attribute 'value_counts'
I could make that error message better, but I'd prefer waiting on that till we flesh out what we want pd.value_counts(EA)
to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But in principle you can easily add this fallback to do a value_counts on the array of objects there?
pandas/core/indexes/base.py
Outdated
@@ -1038,6 +1039,31 @@ def _to_embed(self, keep_tz=False, dtype=None): | |||
|
|||
return self.values.copy() | |||
|
|||
def _as_best_array(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes ths should be equiv of _values
. DTI should be fixed as a pre-cursor PR. no problem making a different method name for this, but then simply rename _values to that name (I don't personally like _as_best_array), maybe (_as_array) is better. Should again be a pre-cursor independent PR.
pandas/core/internals.py
Outdated
|
||
def _maybe_coerce_values(self, values): | ||
# Unboxes Series / Index | ||
# Doesn't change any underlying dtypes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this not just _as_best_array()
(or new name), this kind of if logic is polluting (yes its all over the place, a major goal of this ExtensionArray is to clean this up, I would argue this is THE goal).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make a real doc-string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, since there's no parent.
We only use I'm not sure what the history is there. Looking more closely, there's another difference between our In [13]: DecimalDtype().is_dtype(DecimalArray([]))
Out[13]: False
In [14]: pd.Categorical([]).dtype.is_dtype(pd.Categorical([]))
Out[14]: True
In [15]: pd.Categorical([]).dtype.is_dtype(pd.Series(pd.Categorical([])))
Out[15]: True So the differences are that
I can fix the |
Moved from PandasExtensionDtype to ExtensionDtype with one modification: catch TypeError explicitly.
Latest commit refactored |
How does that new version check for class objects? |
Ugh, it doesn't. I meant to remove that docstring. We weren't actually
testing for that AFAICT, so I'm not sure
why I added it to begin with.
…On Wed, Feb 21, 2018 at 9:19 AM, Joris Van den Bossche < ***@***.***> wrote:
How does that new version check for class objects?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19520 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIsGj7qvPEdlxKFhY0IG04xJNKJJ7ks5tXDP9gaJpZM4R30Wl>
.
|
CI passes now with this setup.
@jreback to summarize the changes since you lasted reviewed
- ExtensionArray.value_counts was removed (which IIRC you weren't a fan of
the implementation anyway)
- ExtensionDtype.__eq__ was defined to support Series.apply and
astype(object)
- Made ExtensionDtype.is_dtype match how our `is_dtype` is supposed to work
1. Unboxes dtype from things with a `.dtype` attribute
2. Doesn't return True for *classes*, only instances.
On Wed, Feb 21, 2018 at 9:24 AM, Tom Augspurger <tom.augspurger88@gmail.com>
wrote:
… Ugh, it doesn't. I meant to remove that docstring. We weren't actually
testing for that AFAICT, so I'm not sure
why I added it to begin with.
On Wed, Feb 21, 2018 at 9:19 AM, Joris Van den Bossche <
***@***.***> wrote:
> How does that new version check for class objects?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#19520 (comment)>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/ABQHIsGj7qvPEdlxKFhY0IG04xJNKJJ7ks5tXDP9gaJpZM4R30Wl>
> .
>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the code just a few comments and some unanwered queries that I had before. The tests needs reorg as I have suggested.
pandas/core/algorithms.py
Outdated
Parameters | ||
---------- | ||
arr : ndarray | ||
Input array | ||
arr : ndarray, ExtensionArray, DatetimeIndex, IntervalIndex, SparseArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
pandas/core/algorithms.py
Outdated
|
||
Returns | ||
------- | ||
subarray : object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
array
|
||
.. code-block:: python | ||
|
||
def take(self, indexer, allow_fill=True, fill_value=None): | ||
mask = indexer == -1 | ||
result = self.data.take(indexer) | ||
result[mask] = self._fill_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, is there a reason we don't define _na_value
? (e.g we do on indexes) as a property?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally there was a self._fill_value
, but because this was not used internally, we left it out for the moment. The discussion about that is a bit spread, but see eg #19520 (comment) (4th bullet point is about fill_value, and comment below with Tom's answer) and #19520 (comment) and the comments below that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that's fine. I think we do need / want a way to set this and a property is good, and _na_value
is used elsewhere (eg. in Index)
pandas/core/frame.py
Outdated
value = value.copy() | ||
|
||
elif isinstance(value, Index) or is_sequence(value): | ||
from pandas.core.series import _sanitize_index | ||
|
||
# turn me into an ndarray | ||
value = _sanitize_index(value, self.index, copy=False) | ||
if not isinstance(value, (np.ndarray, Index)): | ||
if not isinstance(value, (np.ndarray, Index, ExtensionArray)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
values = self._maybe_coerce_values(values) | ||
super(ExtensionBlock, self).__init__(values, placement, ndim) | ||
|
||
def _maybe_coerce_values(self, values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this called anywhere else? (e.g. IOW I think can eliminate some code if maybe this was defined in the superclass)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is only called in __init__
, and was made into a separate method on your question. It's not defined in a superclass, so cannot be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually Block.init will call self._maybe_coerce_values
and we can remove every subclasses's init. But there's some other work to be done first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
pandas/core/series.py
Outdated
if dtype is not None: | ||
data = data.astype(dtype) | ||
|
||
# need to copy to avoid aliasing issues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I think you can move it (IOW if dtype is None), astype will copy (so if dtype is specified this would copy twice)
"Categorical unless " | ||
"dtype='category'") | ||
if not data.dtype.is_dtype(dtype): | ||
raise ValueError("Cannot specify a dtype '{}' with an " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is much nicer
pandas/core/series.py
Outdated
@@ -2558,8 +2571,12 @@ def _reindex_indexer(self, new_index, indexer, copy): | |||
return self.copy() | |||
return self | |||
|
|||
# be subclass-friendly | |||
new_values = algorithms.take_1d(self.get_values(), indexer) | |||
if is_sparse(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no my point is it should be handled there. I would simply move it there now (where you can do the is_sparse check). The point is the .get_values()
should be there not in this part of the code.
pandas/tests/extension/base.py
Outdated
from pandas.core.dtypes.dtypes import ExtensionDtype | ||
|
||
|
||
class BaseDtypeTests(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do this for indexes now. this is WAY too cumbersome to have tests like this all in one file. We have spent an enormous amount of time splitting them out. Please just split them out. You can have classes if you want. What I am after is short conscise test files that are completely obvious.
pandas/tests/extension/decimal/test_indexing.py
pandas/tests/extension/decimal/test_construction.py
etc.
pandas/tests/extension/test_json.py
Outdated
else: | ||
if not isinstance(value, (type(self), | ||
collections.Sequence)): | ||
# broadcast value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a crazy test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That isn't a test, it's the implementation.
@@ -5,12 +5,12 @@ | |||
|
|||
import pandas as pd | |||
from pandas.core.internals import ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to the extension modules
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some small comments. tests look better. I guess since the actual testnig code in an extension array subclass is small (as it calles the base class) its ok in 1 file for now, and can always be expanded.
|
||
.. code-block:: python | ||
|
||
def take(self, indexer, allow_fill=True, fill_value=None): | ||
mask = indexer == -1 | ||
result = self.data.take(indexer) | ||
result[mask] = self._fill_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that's fine. I think we do need / want a way to set this and a property is good, and _na_value
is used elsewhere (eg. in Index)
pandas/core/dtypes/missing.py
Outdated
if is_string_like_dtype(dtype): | ||
result = np.zeros(values.shape, dtype=bool) | ||
else: | ||
result = np.empty(shape, dtype=bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment on what the expected is in the else, e.g. an object array of non-strings
values = self._maybe_coerce_values(values) | ||
super(ExtensionBlock, self).__init__(values, placement, ndim) | ||
|
||
def _maybe_coerce_values(self, values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
||
import numpy as np | ||
|
||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason you don't name this array.py?
comments for object vs. str missing values
looks fine. merge on green. maybe liberally use TODO(EA) where anything is in question so can come back later. |
All green. Thanks all. |
Woohoo! It seems you already have branches for all follow-ups? :) (#19696) Anything we can still help with, or mainly reviewing upcoming PRs? |
I'll post there what I have.
…On Fri, Feb 23, 2018 at 6:13 AM, Joris Van den Bossche < ***@***.***> wrote:
Woohoo!
It seems you already have branches for all follow-ups? :) (#19696
<#19696>) Anything we can
still help with, or mainly reviewing upcoming PRs?
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
<#19520 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIt2ATl5rVagWiB6TwL6K56z4OILTks5tXqtmgaJpZM4R30Wl>
.
|
* ENH: non-interval changes * COMPAT: py2 Super * BUG: Use original object for extension array * Consistent boxing / unboxing NumPy compat * 32-bit compat * Add a test array * linting * Default __iter__ * Tests for value_counts * Implement value_counts * Py2 compat * Fixed dropna * Test fixups * Started setitem * REF/Clean: Internal / External values * Move to index base * Setitem tests, decimal example * Compat * Fixed extension block tests. The only "API change" was that you can't just inherit from NonConsolidatableMixin, which is OK since 1. it's a mixin 2. geopandas also inherits from Block * Clarify binop tests Make it clearer which bit might raise * TST: Removed ops tests * Cleanup unique handling * Simplify object concat * Use values for intersection I think eventually we'll want to ndarray_values for this, but it'll require a bit more work to support. Currently, using ndarary_values causes occasional failures on categorical. * hmm * More failing tests * remove bad test * better setitem * Dropna works. * Restore xfail test * Test Categorical * Xfail setitem tests * TST: Skip JSON tests on py2 * Additional testing * More tests * ndarray_values * API: Default ExtensionArray.astype (cherry picked from commit 943a915562b72bed147c857de927afa0daf31c1a) (cherry picked from commit fbf0a06) * Simplify concat_as_object * Py2 compat (cherry picked from commit b20e12c) * Set-ops ugliness * better docstrings * tolist * linting * Moved dtypes (cherry picked from commit d136227) * clean * cleanup * NumPy compat * Use base _values for CategoricalIndex * Update dev docs * cleanup * cleanup (cherry picked from commit 2425621) * cleanup * Linting * Precision in tests * Linting * Move to extension * Push _ndarray_values to ExtensionArray Now IndexOpsMixin._ndarray_values will dispatch all the way down to the EA. Subclasses like Categorical can override it as they see fit. * Clean up tolist * Move test locations * Fixed test * REF: Update per comments * lint * REF: Use _values for size and shape * PERF: Implement size, shape for IntervalIndex * PERF: Avoid materializing values for PeriodIndex shape, size * Cleanup * Override nbytes * Remove unused change * Docs * Test cleanpu * Always set PANDAS_TESTING_MODE * Revert "Always set PANDAS_TESTING_MODE" This reverts commit a312ba5. * Explicitly catch warnings or not * fastparquet warnings * Unicode literals strikes again. Only catch fp warning for newer numpy * Restore circle env var * More parquet test catching * No stacklevel * Lower bound on FP * Exact bound for FP * Don't use fastpath for ExtensionBlock make_block * Consistently use _values * TST: Additional constructor tests * CLN: de-nested a bit * _fill_value handling * Handle user provided dtype in constructors. When the dtype matches, we allow it to proceed. When the dtype would require coercion, we raise. * Document ExtensionBlock._maybe_coerce_values Also changes to use _values as we should * Created ABCExtensionArray * TST: Tests for is_object_dtype and is_string_dtype and EAs * fixup! Handle user provided dtype in constructors. * Doc for setitem * Split base tests * Revert test_parquet changes * API: Removed _fill_value from the interface * Push coercion to extension dtype till later * Linting * ERR: Better error message for coercion to 3rd party dtypes * CLN: Make take_nd EA aware * Revert sparse changes * Other _typ for ABCExtensionArray * Test cleanup and expansion. Tests for concating and aligning frames * Copy if copy * TST: remove self param for fixture * Remove unnescessary EA handling in Series ctor * API: Removed value_counts Moved setitem notes to comment * More doc notes * Handle expanding a DataFrame with an EA * Added ExtensionDtype.__eq__ Support for astype * linting * REF: is_dtype_equal refactor Moved from PandasExtensionDtype to ExtensionDtype with one modification: catch TypeError explicitly. * Remove reference to dtype being a class * move * Moved sparse check to take_nd * Docstring * Split tests * Revert index change * Copy changes * Simplify EA implementation names comments for object vs. str missing values * Linting
Followup number 1 to #19268
This adds changes to frame, series, dtypes, etc. (all the places other that
core/internals.py
) to support 3rd party extension arrays.These changes were developed to support IntervalArrays (TomAugspurger/pandas@pandas-array-upstream+fu1...TomAugspurger:pandas-array-upstream+fu1+interval separate PR) and https://github.com/continuumio/pandas-ip (external).
Also closes #19585