ENH: Allow storing ExtensionArrays in containers #19520

TomAugspurger · 2018-02-02T21:43:26Z

Followup number 1 to #19268

This adds changes to frame, series, dtypes, etc. (all the places other that core/internals.py) to support 3rd party extension arrays.

These changes were developed to support IntervalArrays (TomAugspurger/pandas@pandas-array-upstream+fu1...TomAugspurger:pandas-array-upstream+fu1+interval separate PR) and https://github.com/continuumio/pandas-ip (external).

Also closes #19585

TomAugspurger · 2018-02-02T21:44:52Z

pandas/core/arrays/base.py

@@ -219,7 +217,7 @@ def _formatting_values(self):
        # type: () -> np.ndarray
        # At the moment, this has to be an array since we use result.dtype
        """An array of values to be printed in, e.g. the Series repr"""
-        raise AbstractMethodError(self)
+        raise np.array(self)


Suggested by @jorisvandenbossche in #19268, this should be an OK default implementation.

TomAugspurger · 2018-02-02T21:45:38Z

pandas/core/dtypes/missing.py


    values = getattr(obj, 'values', obj)
    dtype = values.dtype

-    if is_string_dtype(dtype):
+    if isinstance(values, ExtensionArray):


Should be able to replace this with is_extension_array_dtype

This will break if the extension array implements a .values (due to two lines above). Eg if you would use that name instead of .data. I won't find that the best name for it, so ideally an ExtensionArray author does not do that, but we maybe shouldn't assume that?
(or either document in the interface that that attribute should not be present)

Another thing: won't this catch Categorical? (because there is still handling for categoricals some lines below, which might be removable)

TomAugspurger · 2018-02-02T21:46:41Z

pandas/core/frame.py

@@ -2826,7 +2827,7 @@ def reindexer(value):
            value = maybe_cast_to_datetime(value, value.dtype)

        # return internal types directly
-        if is_extension_type(value):
+        if is_extension_type(value) or is_extension_array_dtype(value):


This is an unfortunate intermediate stage where some of our internal extension types (sparse, datetime w/ tz) are not yet actual extension array types. We'll be able to clean this up this eventually.

TomAugspurger · 2018-02-02T21:49:13Z

pandas/core/indexes/base.py

@@ -1038,6 +1039,31 @@ def _to_embed(self, keep_tz=False, dtype=None):

        return self.values.copy()

+    def _as_best_array(self):


I found a need for a method like this. It may be good to add to Series as well, and maybe make public.

Essentially, we want a clear way to get the "best" .values. For Indexes with ExtensionArray types (Categorical for now, Interval, Period soon) that's the extension array. Otherwise, it's an ndarray (or I think `DatetimeIndex for datetime w/ tz right now).

(this will also depend on decision if we will change return value for .values for series with period and interval objects)
For Series we already have something like ._values. Won't that be the equivalent there?

Yes, I think so. Maybe I should re-use that convention.

Unfortunately, DatetimeIndex[tz]._values returns an ndarray, w/o the TZ, where it "should" return a DatetimeIndex (and eventually an ExtensionArray). I'll see how badly changing that breaks things.

yes ths should be equiv of _values. DTI should be fixed as a pre-cursor PR. no problem making a different method name for this, but then simply rename _values to that name (I don't personally like _as_best_array), maybe (_as_array) is better. Should again be a pre-cursor independent PR.

Opened #19548

TomAugspurger · 2018-02-02T21:49:39Z

pandas/core/indexes/base.py

+        [a, a, b]
+        Categories (2, object): [a, b]
+
+        >>> pd.IntervalIndex.from_breaks([0, 1, 2])._as_best_array()


Whoops, this example isn't true (yet TomAugspurger/pandas@pandas-array-upstream+fu1...TomAugspurger:pandas-array-upstream+fu1+interval)

TomAugspurger · 2018-02-02T21:50:56Z

pandas/core/internals.py

+        values = self._maybe_coerce_values(values)
+        super().__init__(values, placement, ndim)
+
+    def _maybe_coerce_values(self, values):


xref #19492

jorisvandenbossche

some brief comment on first part of the diff (battery is running out on the train :-))

jorisvandenbossche · 2018-02-02T22:16:15Z

pandas/core/dtypes/missing.py


    values = getattr(obj, 'values', obj)
    dtype = values.dtype

-    if is_string_dtype(dtype):
+    if isinstance(values, ExtensionArray):


This will break if the extension array implements a .values (due to two lines above). Eg if you would use that name instead of .data. I won't find that the best name for it, so ideally an ExtensionArray author does not do that, but we maybe shouldn't assume that?
(or either document in the interface that that attribute should not be present)

jorisvandenbossche · 2018-02-02T22:17:22Z

pandas/core/dtypes/missing.py


    values = getattr(obj, 'values', obj)
    dtype = values.dtype

-    if is_string_dtype(dtype):
+    if isinstance(values, ExtensionArray):


Another thing: won't this catch Categorical? (because there is still handling for categoricals some lines below, which might be removable)

jorisvandenbossche · 2018-02-02T22:22:04Z

pandas/core/indexes/base.py

@@ -1038,6 +1039,31 @@ def _to_embed(self, keep_tz=False, dtype=None):

        return self.values.copy()

+    def _as_best_array(self):


(this will also depend on decision if we will change return value for .values for series with period and interval objects)
For Series we already have something like ._values. Won't that be the equivalent there?

codecov · 2018-02-03T13:40:21Z

Codecov Report

Merging #19520 into master will increase coverage by 0.04%.
The diff coverage is 94.79%.

@@            Coverage Diff             @@
##           master   #19520      +/-   ##
==========================================
+ Coverage   91.58%   91.63%   +0.04%     
==========================================
  Files         150      150              
  Lines       48892    48931      +39     
==========================================
+ Hits        44780    44840      +60     
+ Misses       4112     4091      -21

Flag	Coverage Δ
#multiple	`90.01% <94.79%> (+0.05%)`	⬆️
#single	`41.77% <61.45%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/dtypes/common.py	`94.16% <ø> (-1.22%)`	⬇️
pandas/core/dtypes/dtypes.py	`95.9% <ø> (-0.19%)`	⬇️
pandas/core/indexing.py	`93.11% <ø> (ø)`	⬆️
pandas/core/indexes/base.py	`96.36% <100%> (-0.09%)`	⬇️
pandas/core/dtypes/base.py	`91.66% <100%> (+44.04%)`	⬆️
pandas/core/internals.py	`95.53% <100%> (+0.47%)`	⬆️
pandas/core/dtypes/generic.py	`100% <100%> (ø)`	⬆️
pandas/core/algorithms.py	`94.18% <100%> (+0.01%)`	⬆️
pandas/core/frame.py	`97.23% <100%> (ø)`	⬆️
pandas/core/arrays/base.py	`75.67% <100%> (+15.67%)`	⬆️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cd1b168...ea5562b. Read the comment docs.

NumPy compat

pep8speaks · 2018-02-05T12:56:51Z

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 22, 2018 at 16:31 Hours UTC

TomAugspurger · 2018-02-05T12:58:18Z

pandas/core/arrays/base.py

@@ -18,19 +20,20 @@ class ExtensionArray(object):

    * __getitem__
    * __len__
+    * __iter__


@jorisvandenbossche we removed this earlier since collections implementing __len__ and __getitem__ are iterable. But (strangely?) isinstance(arr, collections.Iterable) is False, so these aren't considered "list-like" unless they actually implement __iter__.

But do we use this somewhere? (the expectation that isinstance(arr, collections.Iterable) is True)

As the docs https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable says this is not a correct way to test that something is iterable, exactly for the reason you just mentioned.

Anyway, if we add it, I would add a default implementation (return iter(self) ?) and not require subclasses to provide one manually.

But do we use this somewhere?

Yes, sorry. We use it in Series.__init__, via is_list_like(data).

Will add the default implementation.

Wait, iter(self) will cause infinite recursion, and we can't assume anything about how the data is stored. That's unfortunate.

I guess we could fall back to

for i in range(len(self)): yield self[i]

Yes, sorry. We use it in Series.init, via is_list_like(data).

Or we could fix is_list_like ..

Wait, iter(self) will cause infinite recursion, and we can't assume anything about how the data is stored.

Why would it cause infinite recursion?
With iter(self) we don't assume anything about how the data is stored, I don't know how iter exactly works under the hood, but I assume it just uses __getitem__ ?

Or we could fix is_list_like

Prior to #17654 we did that, but it was fragile to things like str (the class, not an instance) being considered iterable. Not sure if changing it back is an improvement.

Why would it cause infinite recursion?

def __iter__(self): return iter(self)

will call self.__iter__, which calls self.__iter__, ...

ah, yes of course :-)
Anyhow, the other way is a fine default implementation

TomAugspurger · 2018-02-05T13:02:33Z

Added a test-case "JSONArray" that stores mappings.

jorisvandenbossche · 2018-02-05T13:44:58Z

pandas/tests/extension_arrays/test_json.py

+        elif isinstance(item, np.ndarray) and item.dtype == 'bool':
+            return type(self)([x for x, m in zip(self, item) if m])
+        elif isinstance(item, collections.Sequence):
+            return type(self)([self.data[i] for i in item])


At the moment (I think) this is not in our requirements / specification (although numpy arrays do this as well). Was this needed for test to get passed? (i.e. do we use this somewhere internally, instead of using take ?)

Anyway, I would either add to specs, or remove it here to test that it is not required.

Sorry, I accidentally relied on this for the test

def test_take_sequence(self, data): result = pd.Series(data[[0, 1, 3]])

Note that it's calling data.__getitem__, not Series.__getitem__.

I don't think pandas uses it anywhere. Will rewrite that test.

TomAugspurger · 2018-02-05T14:11:07Z

Meta-discussion, what methods should we consider "necessary" to work. Currently we have

constructors
slicing (__getitem__, loc, iloc)
reindex
isna
concat
memory_usage

My current TODO list includes

dropna (will do here)
fillna
value_counts
groupby (not too hard, but separate PR I think). This include algos like factorize and unique.
reductions (separate PR)
ops (separate PR)
cumulative ops (separate PR)

Am I missing anything obvious? (cc @shoyer, @chris-b1 if you have opinions on this point).

jorisvandenbossche · 2018-02-05T14:20:10Z

For any kind of ops, my "minimum" is a (decent) error message for now.
Although with my block-based implementation in geopandas, I had working comparison ops (at least equality, as greater/smaller than doesn't make sense for geometries), and only errors for arithmetic or reducing ops.

I think value_counts will also need some kind of factorize/unique implementation?

Further:

astype: for now a good error message, but do we want a way to enable this in the future? (an astype on ExtensionArray ?)

TomAugspurger · 2018-02-05T21:38:14Z

I think value_counts will also need some kind of factorize/unique implementation?

A good one, yes :) But for now the only correctness issue is NA-handling. For now we can (maybe) drop missing values, convert to an object array, and then do value_counts.

TomAugspurger · 2018-02-05T21:39:04Z

pandas/core/arrays/base.py

@@ -236,6 +245,7 @@ def _concat_same_type(cls, to_concat):
        """
        raise AbstractMethodError(cls)

+    @property
    def _can_hold_na(self):


Interface change: This should have been a property.

jreback · 2018-02-06T11:02:47Z

pandas/core/algorithms.py

@@ -542,7 +543,7 @@ def value_counts(values, sort=True, ascending=False, normalize=False,

    else:

-        if is_categorical_dtype(values) or is_sparse(values):
+        if is_extension_array_dtype(values) or is_sparse(values):


is_sparse should be recognized as an extension array type

It's on my todo list. I'll do it after interval.

jreback · 2018-02-06T11:03:12Z

pandas/core/arrays/base.py

+    def __iter__(self):
+        """Iterate over elements.
+
+        This needs to be implemented so that pandas recognizes extension arrays


can be a better doc-string here

jreback · 2018-02-06T11:03:44Z

pandas/core/arrays/base.py

+        as list-like. The default implementation makes successive calls to
+        ``__getitem__``, which may be slower than necessary.
+        """
+        for i in range(len(self)):


why are you not just raising AbstractMethodError?

this is tied intimately with the boxing concept of scalars, but this should be decoupled from getitem (yes its a valid implementation, but you would never actually do it this way), rather you would iterate on some underlying value and box it, not index into and as a side effect use getitem to box.

rather you would iterate on some underlying value

We don't know how the subclass is storing things though, so there's no .values to iterate over.

so my point is that you cannot implement a default here, make the sub-classes do it. yes it might repeat some code, but its much better there i think.

you cannot

Yes, we can (see the code). You "don't want" it, that something different :)
But I don't really understand why the above is not a good default implementation. I think a lot subclasses will do exactly this.

And some context: If I remove __iter__, the object will still be iterable with essentially this implementation since it defines __getitem__ and __len__. We just need to implement it explicitly so that is_iterable works correctly.

I think it still might be better raise here by default and force the subclass to implement this (even if this impl). It is SO important for subclasses to pay attention to this, it should just be required.

You can't do much better than this implementation.
To give you some context: in geopandas, I have a cython method that converts a GeometryArray to numpy object array of Geometry objects, which takes ca 80ms on an array of 10000 points. Simply doing np.array([a[i] for i in range(len(a))], dtype=object) is ca 108 ms. This small difference is because by definition in each iteration I need to create a python object.

jreback · 2018-02-06T11:06:19Z

pandas/core/arrays/base.py

@@ -219,7 +228,7 @@ def _formatting_values(self):
        # type: () -> np.ndarray
        # At the moment, this has to be an array since we use result.dtype
        """An array of values to be printed in, e.g. the Series repr"""
-        raise AbstractMethodError(self)
+        return np.array(self)


NO, this should be implemented only by subclasses.

This is a very sensible default IMO, subclasses can always override if needed

jreback · 2018-02-06T11:07:03Z

pandas/core/arrays/base.py

+        if dropna:
+            self = self[~self.isna()]
+
+        return value_counts(np.array(self))


This is too tied to materializing using ndarary. rather this should raise AbstractMethodError

We may have a difference in philosophy here then. I'm trying to follow the lead of standard library in providing default implementations for as much as possible, even if they're sub-optimal.

What I find a bit "strange" about this method in adding it to the required interface of an extension array, is that it returns a Series

What's strange about that? The fact that it's being added, or that it's returning a Series? I'm not sure what else it would return.

Ideally, we'll someday have an Index[extension_dtype], so that you get a Series with an exetension-array backed index. That's seems like the most natural return value.

you need to wrap this on the return, IOW call EA(....) here

hmm, not real happy with this interface, but I guess ok for now. let's revisit this at somepoint in the future.

@TomAugspurger I suppose you want this in here, so you can implement a more efficient method on IPArray?
Because we could also, for now, keep this implementation in pd.value_counts (so for people who want this functionality, they can use it) instead of adding it as a method on ExtensionArray interface itself. And we can revisit when talking about unique/factorize/duplicated/.. algos support for ExtensionArrays in general?

I mostly added it since that's what Categorical does. I can see if removing it breaks anything.

Just to make sure, we're OK with pd.value_counts and Series[EA] currently failing for 3rd party EAs? By removing that from the interface, we'll see

AttributeError: 'DecimalArray' object has no attribute 'value_counts'

I could make that error message better, but I'd prefer waiting on that till we flesh out what we want pd.value_counts(EA) to do.

But in principle you can easily add this fallback to do a value_counts on the array of objects there?

jreback · 2018-02-06T11:10:48Z

pandas/core/indexes/base.py

@@ -1038,6 +1039,31 @@ def _to_embed(self, keep_tz=False, dtype=None):

        return self.values.copy()

+    def _as_best_array(self):


yes ths should be equiv of _values. DTI should be fixed as a pre-cursor PR. no problem making a different method name for this, but then simply rename _values to that name (I don't personally like _as_best_array), maybe (_as_array) is better. Should again be a pre-cursor independent PR.

jreback · 2018-02-06T11:12:26Z

pandas/core/internals.py

+
+    def _maybe_coerce_values(self, values):
+        # Unboxes Series / Index
+        # Doesn't change any underlying dtypes.


why is this not just _as_best_array() (or new name), this kind of if logic is polluting (yes its all over the place, a major goal of this ExtensionArray is to clean this up, I would argue this is THE goal).

can you make a real doc-string

Good point, since there's no parent.

TomAugspurger · 2018-02-21T13:20:42Z

We only use is_dtype in a handful of places currently (59 times, 46 of which are tests).

I'm not sure what the history is there.

Looking more closely, there's another difference between our is_dtype like Categorical and the default ExtensionArray: ours unbox things with a .dtype attribute.

In [13]: DecimalDtype().is_dtype(DecimalArray([]))
Out[13]: False

In [14]: pd.Categorical([]).dtype.is_dtype(pd.Categorical([]))
Out[14]: True

In [15]: pd.Categorical([]).dtype.is_dtype(pd.Series(pd.Categorical([])))
Out[15]: True

So the differences are that

is_dtype (should) unbox things with a .dtype attribute while __eq__ doesn't
is_dtype can be equal to a class, while __eq__ can only be equal to instances.

I can fix the EDT.is_dtype not unboxing things here or elsewhere. Probably best to do it here.

Moved from PandasExtensionDtype to ExtensionDtype with one modification: catch TypeError explicitly.

TomAugspurger · 2018-02-21T15:11:04Z

Latest commit refactored EDT.is_Dtype to just be what we had for PandasExtensionDtype. I'm not sure why we didn't do that originally. The only change I made when moving it from PandasExtensionDtype to ExtensionDtype was change a bare except: to an except TypeError. The docstring notes that construct_from_string is only expected to raise a TypeError, and all of our internal EDTs make sure to catch and re-raise exceptions as TypeErrors.

jorisvandenbossche · 2018-02-21T15:19:21Z

How does that new version check for class objects?

TomAugspurger · 2018-02-21T15:25:21Z

Ugh, it doesn't. I meant to remove that docstring. We weren't actually testing for that AFAICT, so I'm not sure why I added it to begin with.

…

On Wed, Feb 21, 2018 at 9:19 AM, Joris Van den Bossche < ***@***.***> wrote: How does that new version check for class objects? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#19520 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIsGj7qvPEdlxKFhY0IG04xJNKJJ7ks5tXDP9gaJpZM4R30Wl> .

TomAugspurger · 2018-02-21T19:52:34Z

CI passes now with this setup. @jreback to summarize the changes since you lasted reviewed - ExtensionArray.value_counts was removed (which IIRC you weren't a fan of the implementation anyway) - ExtensionDtype.__eq__ was defined to support Series.apply and astype(object) - Made ExtensionDtype.is_dtype match how our `is_dtype` is supposed to work 1. Unboxes dtype from things with a `.dtype` attribute 2. Doesn't return True for *classes*, only instances. On Wed, Feb 21, 2018 at 9:24 AM, Tom Augspurger <tom.augspurger88@gmail.com> wrote:

…

Ugh, it doesn't. I meant to remove that docstring. We weren't actually testing for that AFAICT, so I'm not sure why I added it to begin with. On Wed, Feb 21, 2018 at 9:19 AM, Joris Van den Bossche < ***@***.***> wrote: > How does that new version check for class objects? > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#19520 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/ABQHIsGj7qvPEdlxKFhY0IG04xJNKJJ7ks5tXDP9gaJpZM4R30Wl> > . >

…ream+fu1

jreback

on the code just a few comments and some unanwered queries that I had before. The tests needs reorg as I have suggested.

jreback · 2018-02-22T01:22:14Z

pandas/core/algorithms.py

    Parameters
    ----------
-    arr : ndarray
-        Input array
+    arr : ndarray, ExtensionArray, DatetimeIndex, IntervalIndex, SparseArray


jreback · 2018-02-22T01:22:24Z

pandas/core/algorithms.py

+
+    Returns
+    -------
+    subarray : object


jreback · 2018-02-22T01:23:53Z

pandas/core/arrays/base.py


        .. code-block:: python

           def take(self, indexer, allow_fill=True, fill_value=None):
               mask = indexer == -1
               result = self.data.take(indexer)
-               result[mask] = self._fill_value


hmm, is there a reason we don't define _na_value? (e.g we do on indexes) as a property?

Originally there was a self._fill_value, but because this was not used internally, we left it out for the moment. The discussion about that is a bit spread, but see eg #19520 (comment) (4th bullet point is about fill_value, and comment below with Tom's answer) and #19520 (comment) and the comments below that

ok that's fine. I think we do need / want a way to set this and a property is good, and _na_value is used elsewhere (eg. in Index)

jreback · 2018-02-22T01:25:46Z

pandas/core/frame.py

            value = value.copy()

        elif isinstance(value, Index) or is_sequence(value):
            from pandas.core.series import _sanitize_index

            # turn me into an ndarray
            value = _sanitize_index(value, self.index, copy=False)
-            if not isinstance(value, (np.ndarray, Index)):
+            if not isinstance(value, (np.ndarray, Index, ExtensionArray)):


jreback · 2018-02-22T01:26:51Z

pandas/core/internals.py

+        values = self._maybe_coerce_values(values)
+        super(ExtensionBlock, self).__init__(values, placement, ndim)
+
+    def _maybe_coerce_values(self, values):


is this called anywhere else? (e.g. IOW I think can eliminate some code if maybe this was defined in the superclass)

No, this is only called in __init__, and was made into a separate method on your question. It's not defined in a superclass, so cannot be removed.

#19492

Eventually Block.init will call self._maybe_coerce_values and we can remove every subclasses's init. But there's some other work to be done first.

jreback · 2018-02-22T01:28:10Z

pandas/core/series.py

+                if dtype is not None:
+                    data = data.astype(dtype)
+
+                # need to copy to avoid aliasing issues


yes I think you can move it (IOW if dtype is None), astype will copy (so if dtype is specified this would copy twice)

jreback · 2018-02-22T01:28:34Z

pandas/core/series.py

-                                     "Categorical unless "
-                                     "dtype='category'")
+                if not data.dtype.is_dtype(dtype):
+                    raise ValueError("Cannot specify a dtype '{}' with an "


this is much nicer

jreback · 2018-02-22T01:29:20Z

pandas/core/series.py

@@ -2558,8 +2571,12 @@ def _reindex_indexer(self, new_index, indexer, copy):
                return self.copy()
            return self

-        # be subclass-friendly
-        new_values = algorithms.take_1d(self.get_values(), indexer)
+        if is_sparse(self):


no my point is it should be handled there. I would simply move it there now (where you can do the is_sparse check). The point is the .get_values() should be there not in this part of the code.

jreback · 2018-02-22T01:31:42Z

pandas/tests/extension/base.py

+from pandas.core.dtypes.dtypes import ExtensionDtype
+
+
+class BaseDtypeTests(object):


we do this for indexes now. this is WAY too cumbersome to have tests like this all in one file. We have spent an enormous amount of time splitting them out. Please just split them out. You can have classes if you want. What I am after is short conscise test files that are completely obvious.

pandas/tests/extension/decimal/test_indexing.py
pandas/tests/extension/decimal/test_construction.py

etc.

jreback · 2018-02-22T01:33:22Z

pandas/tests/extension/test_json.py

+        else:
+            if not isinstance(value, (type(self),
+                                      collections.Sequence)):
+                # broadcast value


this is a crazy test.

That isn't a test, it's the implementation.

jreback · 2018-02-22T01:33:44Z

pandas/tests/internals/test_external_block.py

@@ -5,12 +5,12 @@

 import pandas as pd
 from pandas.core.internals import (


move to the extension modules

jreback

some small comments. tests look better. I guess since the actual testnig code in an extension array subclass is small (as it calles the base class) its ok in 1 file for now, and can always be expanded.

jreback · 2018-02-22T15:03:27Z

pandas/core/arrays/base.py


        .. code-block:: python

           def take(self, indexer, allow_fill=True, fill_value=None):
               mask = indexer == -1
               result = self.data.take(indexer)
-               result[mask] = self._fill_value


ok that's fine. I think we do need / want a way to set this and a property is good, and _na_value is used elsewhere (eg. in Index)

jreback · 2018-02-22T15:04:36Z

pandas/core/dtypes/missing.py

+        if is_string_like_dtype(dtype):
+            result = np.zeros(values.shape, dtype=bool)
+        else:
+            result = np.empty(shape, dtype=bool)


can you add a comment on what the expected is in the else, e.g. an object array of non-strings

jreback · 2018-02-22T15:04:59Z

pandas/core/internals.py

+        values = self._maybe_coerce_values(values)
+        super(ExtensionBlock, self).__init__(values, placement, ndim)
+
+    def _maybe_coerce_values(self, values):


jreback · 2018-02-22T15:06:41Z

pandas/tests/extension/decimal/decimal_array.py

+
+import numpy as np
+
+import pandas as pd


is there a reason you don't name this array.py?

comments for object vs. str missing values

jreback · 2018-02-22T15:22:38Z

looks fine. merge on green. maybe liberally use TODO(EA) where anything is in question so can come back later.

TomAugspurger · 2018-02-23T11:47:17Z

All green. Thanks all.

jorisvandenbossche · 2018-02-23T12:13:20Z

Woohoo!

It seems you already have branches for all follow-ups? :) (#19696) Anything we can still help with, or mainly reviewing upcoming PRs?

TomAugspurger · 2018-02-23T12:26:36Z

I'll post there what I have.

…

On Fri, Feb 23, 2018 at 6:13 AM, Joris Van den Bossche < ***@***.***> wrote: Woohoo! It seems you already have branches for all follow-ups? :) (#19696 <#19696>) Anything we can still help with, or mainly reviewing upcoming PRs? — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub <#19520 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIt2ATl5rVagWiB6TwL6K56z4OILTks5tXqtmgaJpZM4R30Wl> .

* ENH: non-interval changes * COMPAT: py2 Super * BUG: Use original object for extension array * Consistent boxing / unboxing NumPy compat * 32-bit compat * Add a test array * linting * Default __iter__ * Tests for value_counts * Implement value_counts * Py2 compat * Fixed dropna * Test fixups * Started setitem * REF/Clean: Internal / External values * Move to index base * Setitem tests, decimal example * Compat * Fixed extension block tests. The only "API change" was that you can't just inherit from NonConsolidatableMixin, which is OK since 1. it's a mixin 2. geopandas also inherits from Block * Clarify binop tests Make it clearer which bit might raise * TST: Removed ops tests * Cleanup unique handling * Simplify object concat * Use values for intersection I think eventually we'll want to ndarray_values for this, but it'll require a bit more work to support. Currently, using ndarary_values causes occasional failures on categorical. * hmm * More failing tests * remove bad test * better setitem * Dropna works. * Restore xfail test * Test Categorical * Xfail setitem tests * TST: Skip JSON tests on py2 * Additional testing * More tests * ndarray_values * API: Default ExtensionArray.astype (cherry picked from commit 943a915562b72bed147c857de927afa0daf31c1a) (cherry picked from commit fbf0a06) * Simplify concat_as_object * Py2 compat (cherry picked from commit b20e12c) * Set-ops ugliness * better docstrings * tolist * linting * Moved dtypes (cherry picked from commit d136227) * clean * cleanup * NumPy compat * Use base _values for CategoricalIndex * Update dev docs * cleanup * cleanup (cherry picked from commit 2425621) * cleanup * Linting * Precision in tests * Linting * Move to extension * Push _ndarray_values to ExtensionArray Now IndexOpsMixin._ndarray_values will dispatch all the way down to the EA. Subclasses like Categorical can override it as they see fit. * Clean up tolist * Move test locations * Fixed test * REF: Update per comments * lint * REF: Use _values for size and shape * PERF: Implement size, shape for IntervalIndex * PERF: Avoid materializing values for PeriodIndex shape, size * Cleanup * Override nbytes * Remove unused change * Docs * Test cleanpu * Always set PANDAS_TESTING_MODE * Revert "Always set PANDAS_TESTING_MODE" This reverts commit a312ba5. * Explicitly catch warnings or not * fastparquet warnings * Unicode literals strikes again. Only catch fp warning for newer numpy * Restore circle env var * More parquet test catching * No stacklevel * Lower bound on FP * Exact bound for FP * Don't use fastpath for ExtensionBlock make_block * Consistently use _values * TST: Additional constructor tests * CLN: de-nested a bit * _fill_value handling * Handle user provided dtype in constructors. When the dtype matches, we allow it to proceed. When the dtype would require coercion, we raise. * Document ExtensionBlock._maybe_coerce_values Also changes to use _values as we should * Created ABCExtensionArray * TST: Tests for is_object_dtype and is_string_dtype and EAs * fixup! Handle user provided dtype in constructors. * Doc for setitem * Split base tests * Revert test_parquet changes * API: Removed _fill_value from the interface * Push coercion to extension dtype till later * Linting * ERR: Better error message for coercion to 3rd party dtypes * CLN: Make take_nd EA aware * Revert sparse changes * Other _typ for ABCExtensionArray * Test cleanup and expansion. Tests for concating and aligning frames * Copy if copy * TST: remove self param for fixture * Remove unnescessary EA handling in Series ctor * API: Removed value_counts Moved setitem notes to comment * More doc notes * Handle expanding a DataFrame with an EA * Added ExtensionDtype.__eq__ Support for astype * linting * REF: is_dtype_equal refactor Moved from PandasExtensionDtype to ExtensionDtype with one modification: catch TypeError explicitly. * Remove reference to dtype being a class * move * Moved sparse check to take_nd * Docstring * Split tests * Revert index change * Copy changes * Simplify EA implementation names comments for object vs. str missing values * Linting

ENH: non-interval changes

b7069ef

TomAugspurger added Enhancement Dtype Conversions Unexpected or buggy dtype conversions Internals Related to non-user accessible pandas implementation labels Feb 2, 2018

TomAugspurger commented Feb 2, 2018

View reviewed changes

jorisvandenbossche reviewed Feb 2, 2018

View reviewed changes

TomAugspurger added 2 commits February 3, 2018 07:00

COMPAT: py2 Super

9cd92c7

BUG: Use original object for extension array

9211bbd

TomAugspurger added 3 commits February 3, 2018 14:21

Consistent boxing / unboxing

80f83a6

NumPy compat

32-bit compat

ca004d8

Add a test array

5d4a686

TomAugspurger commented Feb 5, 2018

View reviewed changes

linting

9f4ad42

jorisvandenbossche reviewed Feb 5, 2018

View reviewed changes

Default __iter__

b1db4e8

TomAugspurger added 4 commits February 5, 2018 10:29

Tests for value_counts

00d6bb3

Implement value_counts

1608e3d

Py2 compat

52e2180

Fixed dropna

e6d06e2

TomAugspurger commented Feb 5, 2018

View reviewed changes

Test fixups

d356f19

jreback requested changes Feb 6, 2018

View reviewed changes

TomAugspurger mentioned this pull request Feb 6, 2018

CLN: Single way to get underlying values for Index / Series #19548

Closed

Started setitem

a6ae340

REF: is_dtype_equal refactor

f07c166

Moved from PandasExtensionDtype to ExtensionDtype with one modification: catch TypeError explicitly.

Remove reference to dtype being a class

79d43b1

Merge remote-tracking branch 'upstream/master' into pandas-array-upst…

91c629b

…ream+fu1

jreback requested changes Feb 22, 2018

View reviewed changes

TomAugspurger added 6 commits February 21, 2018 19:55

move

a1ebf53

Moved sparse check to take_nd

aa57cad

Docstring

c82748c

Split tests

e919343

Revert index change

1ea74da

Copy changes

0c41a34

jreback requested changes Feb 22, 2018

View reviewed changes

Simplify EA implementation names

009bece

comments for object vs. str missing values

jreback approved these changes Feb 22, 2018

View reviewed changes

jorisvandenbossche approved these changes Feb 22, 2018

View reviewed changes

Linting

ea5562b

mrocklin mentioned this pull request Feb 22, 2018

Refactor GeoPandas as an extension array type in Pandas geopandas/geopandas#680

Open

TomAugspurger merged commit 01e99de into pandas-dev:master Feb 23, 2018

TomAugspurger deleted the pandas-array-upstream+fu1 branch February 23, 2018 11:47

tswast mentioned this pull request Jan 25, 2022

FR: dtype for Decimal data googleapis/python-db-dtypes-pandas#49

Open

		@@ -1038,6 +1039,31 @@ def _to_embed(self, keep_tz=False, dtype=None):

		return self.values.copy()

		def _as_best_array(self):

		from pandas.core.dtypes.dtypes import ExtensionDtype


		class BaseDtypeTests(object):

		@@ -5,12 +5,12 @@

		import pandas as pd
		from pandas.core.internals import (

ENH: Allow storing ExtensionArrays in containers #19520

ENH: Allow storing ExtensionArrays in containers #19520

Conversation

TomAugspurger commented Feb 2, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Feb 3, 2018 • edited Loading

Codecov Report

pep8speaks commented Feb 5, 2018 • edited Loading

Comment last updated on February 22, 2018 at 16:31 Hours UTC

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche Feb 5, 2018 • edited Loading

Choose a reason for hiding this comment

TomAugspurger commented Feb 5, 2018

Choose a reason for hiding this comment

TomAugspurger Feb 5, 2018 • edited Loading

Choose a reason for hiding this comment

TomAugspurger commented Feb 5, 2018 • edited Loading

jorisvandenbossche commented Feb 5, 2018

TomAugspurger commented Feb 5, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Feb 21, 2018 • edited Loading

TomAugspurger commented Feb 21, 2018

jorisvandenbossche commented Feb 21, 2018

TomAugspurger commented Feb 21, 2018 via email

TomAugspurger commented Feb 21, 2018 via email

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Feb 2, 2018 •

edited

Loading

codecov bot commented Feb 3, 2018 •

edited

Loading

pep8speaks commented Feb 5, 2018 •

edited

Loading

jorisvandenbossche Feb 5, 2018 •

edited

Loading

TomAugspurger Feb 5, 2018 •

edited

Loading

TomAugspurger commented Feb 5, 2018 •

edited

Loading

TomAugspurger commented Feb 21, 2018 •

edited

Loading