Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow storing ExtensionArrays in containers #19520

Merged
merged 139 commits into from
Feb 23, 2018

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Feb 2, 2018

Followup number 1 to #19268

This adds changes to frame, series, dtypes, etc. (all the places other that core/internals.py) to support 3rd party extension arrays.

These changes were developed to support IntervalArrays (TomAugspurger/pandas@pandas-array-upstream+fu1...TomAugspurger:pandas-array-upstream+fu1+interval separate PR) and https://github.com/continuumio/pandas-ip (external).

Also closes #19585

@TomAugspurger TomAugspurger added Enhancement Dtype Conversions Unexpected or buggy dtype conversions Internals Related to non-user accessible pandas implementation labels Feb 2, 2018
@@ -219,7 +217,7 @@ def _formatting_values(self):
# type: () -> np.ndarray
# At the moment, this has to be an array since we use result.dtype
"""An array of values to be printed in, e.g. the Series repr"""
raise AbstractMethodError(self)
raise np.array(self)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested by @jorisvandenbossche in #19268, this should be an OK default implementation.


values = getattr(obj, 'values', obj)
dtype = values.dtype

if is_string_dtype(dtype):
if isinstance(values, ExtensionArray):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be able to replace this with is_extension_array_dtype

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break if the extension array implements a .values (due to two lines above). Eg if you would use that name instead of .data. I won't find that the best name for it, so ideally an ExtensionArray author does not do that, but we maybe shouldn't assume that?
(or either document in the interface that that attribute should not be present)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing: won't this catch Categorical? (because there is still handling for categoricals some lines below, which might be removable)

@@ -2826,7 +2827,7 @@ def reindexer(value):
value = maybe_cast_to_datetime(value, value.dtype)

# return internal types directly
if is_extension_type(value):
if is_extension_type(value) or is_extension_array_dtype(value):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unfortunate intermediate stage where some of our internal extension types (sparse, datetime w/ tz) are not yet actual extension array types. We'll be able to clean this up this eventually.

@@ -1038,6 +1039,31 @@ def _to_embed(self, keep_tz=False, dtype=None):

return self.values.copy()

def _as_best_array(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a need for a method like this. It may be good to add to Series as well, and maybe make public.

Essentially, we want a clear way to get the "best" .values. For Indexes with ExtensionArray types (Categorical for now, Interval, Period soon) that's the extension array. Otherwise, it's an ndarray (or I think `DatetimeIndex for datetime w/ tz right now).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this will also depend on decision if we will change return value for .values for series with period and interval objects)
For Series we already have something like ._values. Won't that be the equivalent there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. Maybe I should re-use that convention.

Unfortunately, DatetimeIndex[tz]._values returns an ndarray, w/o the TZ, where it "should" return a DatetimeIndex (and eventually an ExtensionArray). I'll see how badly changing that breaks things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes ths should be equiv of _values. DTI should be fixed as a pre-cursor PR. no problem making a different method name for this, but then simply rename _values to that name (I don't personally like _as_best_array), maybe (_as_array) is better. Should again be a pre-cursor independent PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #19548

[a, a, b]
Categories (2, object): [a, b]

>>> pd.IntervalIndex.from_breaks([0, 1, 2])._as_best_array()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values = self._maybe_coerce_values(values)
super().__init__(values, placement, ndim)

def _maybe_coerce_values(self, values):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref #19492

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some brief comment on first part of the diff (battery is running out on the train :-))


values = getattr(obj, 'values', obj)
dtype = values.dtype

if is_string_dtype(dtype):
if isinstance(values, ExtensionArray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break if the extension array implements a .values (due to two lines above). Eg if you would use that name instead of .data. I won't find that the best name for it, so ideally an ExtensionArray author does not do that, but we maybe shouldn't assume that?
(or either document in the interface that that attribute should not be present)


values = getattr(obj, 'values', obj)
dtype = values.dtype

if is_string_dtype(dtype):
if isinstance(values, ExtensionArray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing: won't this catch Categorical? (because there is still handling for categoricals some lines below, which might be removable)

@@ -1038,6 +1039,31 @@ def _to_embed(self, keep_tz=False, dtype=None):

return self.values.copy()

def _as_best_array(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this will also depend on decision if we will change return value for .values for series with period and interval objects)
For Series we already have something like ._values. Won't that be the equivalent there?

@codecov
Copy link

codecov bot commented Feb 3, 2018

Codecov Report

Merging #19520 into master will increase coverage by 0.04%.
The diff coverage is 94.79%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19520      +/-   ##
==========================================
+ Coverage   91.58%   91.63%   +0.04%     
==========================================
  Files         150      150              
  Lines       48892    48931      +39     
==========================================
+ Hits        44780    44840      +60     
+ Misses       4112     4091      -21
Flag Coverage Δ
#multiple 90.01% <94.79%> (+0.05%) ⬆️
#single 41.77% <61.45%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/common.py 94.16% <ø> (-1.22%) ⬇️
pandas/core/dtypes/dtypes.py 95.9% <ø> (-0.19%) ⬇️
pandas/core/indexing.py 93.11% <ø> (ø) ⬆️
pandas/core/indexes/base.py 96.36% <100%> (-0.09%) ⬇️
pandas/core/dtypes/base.py 91.66% <100%> (+44.04%) ⬆️
pandas/core/internals.py 95.53% <100%> (+0.47%) ⬆️
pandas/core/dtypes/generic.py 100% <100%> (ø) ⬆️
pandas/core/algorithms.py 94.18% <100%> (+0.01%) ⬆️
pandas/core/frame.py 97.23% <100%> (ø) ⬆️
pandas/core/arrays/base.py 75.67% <100%> (+15.67%) ⬆️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cd1b168...ea5562b. Read the comment docs.

@pep8speaks
Copy link

pep8speaks commented Feb 5, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 22, 2018 at 16:31 Hours UTC

@@ -18,19 +20,20 @@ class ExtensionArray(object):

* __getitem__
* __len__
* __iter__
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche we removed this earlier since collections implementing __len__ and __getitem__ are iterable. But (strangely?) isinstance(arr, collections.Iterable) is False, so these aren't considered "list-like" unless they actually implement __iter__.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do we use this somewhere? (the expectation that isinstance(arr, collections.Iterable) is True)

As the docs https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable says this is not a correct way to test that something is iterable, exactly for the reason you just mentioned.

Anyway, if we add it, I would add a default implementation (return iter(self) ?) and not require subclasses to provide one manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do we use this somewhere?

Yes, sorry. We use it in Series.__init__, via is_list_like(data).

Will add the default implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, iter(self) will cause infinite recursion, and we can't assume anything about how the data is stored. That's unfortunate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could fall back to

for i in range(len(self)):
   yield self[i]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry. We use it in Series.init, via is_list_like(data).

Or we could fix is_list_like ..

Wait, iter(self) will cause infinite recursion, and we can't assume anything about how the data is stored.

Why would it cause infinite recursion?
With iter(self) we don't assume anything about how the data is stored, I don't know how iter exactly works under the hood, but I assume it just uses __getitem__ ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we could fix is_list_like

Prior to #17654 we did that, but it was fragile to things like str (the class, not an instance) being considered iterable. Not sure if changing it back is an improvement.

Why would it cause infinite recursion?

def __iter__(self):
    return iter(self)

will call self.__iter__, which calls self.__iter__, ...

Copy link
Member

@jorisvandenbossche jorisvandenbossche Feb 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, yes of course :-)
Anyhow, the other way is a fine default implementation

@TomAugspurger
Copy link
Contributor Author

Added a test-case "JSONArray" that stores mappings.

elif isinstance(item, np.ndarray) and item.dtype == 'bool':
return type(self)([x for x, m in zip(self, item) if m])
elif isinstance(item, collections.Sequence):
return type(self)([self.data[i] for i in item])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment (I think) this is not in our requirements / specification (although numpy arrays do this as well). Was this needed for test to get passed? (i.e. do we use this somewhere internally, instead of using take ?)

Anyway, I would either add to specs, or remove it here to test that it is not required.

Copy link
Contributor Author

@TomAugspurger TomAugspurger Feb 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I accidentally relied on this for the test

    def test_take_sequence(self, data):
        result = pd.Series(data[[0, 1, 3]])

Note that it's calling data.__getitem__, not Series.__getitem__.

I don't think pandas uses it anywhere. Will rewrite that test.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Feb 5, 2018

Meta-discussion, what methods should we consider "necessary" to work. Currently we have

  • constructors
  • slicing (__getitem__, loc, iloc)
  • reindex
  • isna
  • concat
  • memory_usage

My current TODO list includes

  • dropna (will do here)
  • fillna
  • value_counts
  • groupby (not too hard, but separate PR I think). This include algos like factorize and unique.
  • reductions (separate PR)
  • ops (separate PR)
  • cumulative ops (separate PR)

Am I missing anything obvious? (cc @shoyer, @chris-b1 if you have opinions on this point).

@jorisvandenbossche
Copy link
Member

For any kind of ops, my "minimum" is a (decent) error message for now.
Although with my block-based implementation in geopandas, I had working comparison ops (at least equality, as greater/smaller than doesn't make sense for geometries), and only errors for arithmetic or reducing ops.

I think value_counts will also need some kind of factorize/unique implementation?

Further:

  • astype: for now a good error message, but do we want a way to enable this in the future? (an astype on ExtensionArray ?)

@TomAugspurger
Copy link
Contributor Author

I think value_counts will also need some kind of factorize/unique implementation?

A good one, yes :) But for now the only correctness issue is NA-handling. For now we can (maybe) drop missing values, convert to an object array, and then do value_counts.

@@ -236,6 +245,7 @@ def _concat_same_type(cls, to_concat):
"""
raise AbstractMethodError(cls)

@property
def _can_hold_na(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interface change: This should have been a property.

@@ -542,7 +543,7 @@ def value_counts(values, sort=True, ascending=False, normalize=False,

else:

if is_categorical_dtype(values) or is_sparse(values):
if is_extension_array_dtype(values) or is_sparse(values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_sparse should be recognized as an extension array type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's on my todo list. I'll do it after interval.

def __iter__(self):
"""Iterate over elements.

This needs to be implemented so that pandas recognizes extension arrays
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be a better doc-string here

as list-like. The default implementation makes successive calls to
``__getitem__``, which may be slower than necessary.
"""
for i in range(len(self)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you not just raising AbstractMethodError?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is tied intimately with the boxing concept of scalars, but this should be decoupled from getitem (yes its a valid implementation, but you would never actually do it this way), rather you would iterate on some underlying value and box it, not index into and as a side effect use getitem to box.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather you would iterate on some underlying value

We don't know how the subclass is storing things though, so there's no .values to iterate over.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so my point is that you cannot implement a default here, make the sub-classes do it. yes it might repeat some code, but its much better there i think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you cannot

Yes, we can (see the code). You "don't want" it, that something different :)
But I don't really understand why the above is not a good default implementation. I think a lot subclasses will do exactly this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And some context: If I remove __iter__, the object will still be iterable with essentially this implementation since it defines __getitem__ and __len__. We just need to implement it explicitly so that is_iterable works correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it still might be better raise here by default and force the subclass to implement this (even if this impl). It is SO important for subclasses to pay attention to this, it should just be required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't do much better than this implementation.
To give you some context: in geopandas, I have a cython method that converts a GeometryArray to numpy object array of Geometry objects, which takes ca 80ms on an array of 10000 points. Simply doing np.array([a[i] for i in range(len(a))], dtype=object) is ca 108 ms. This small difference is because by definition in each iteration I need to create a python object.

@@ -219,7 +228,7 @@ def _formatting_values(self):
# type: () -> np.ndarray
# At the moment, this has to be an array since we use result.dtype
"""An array of values to be printed in, e.g. the Series repr"""
raise AbstractMethodError(self)
return np.array(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NO, this should be implemented only by subclasses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very sensible default IMO, subclasses can always override if needed

if dropna:
self = self[~self.isna()]

return value_counts(np.array(self))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too tied to materializing using ndarary. rather this should raise AbstractMethodError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have a difference in philosophy here then. I'm trying to follow the lead of standard library in providing default implementations for as much as possible, even if they're sub-optimal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I find a bit "strange" about this method in adding it to the required interface of an extension array, is that it returns a Series

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's strange about that? The fact that it's being added, or that it's returning a Series? I'm not sure what else it would return.

Ideally, we'll someday have an Index[extension_dtype], so that you get a Series with an exetension-array backed index. That's seems like the most natural return value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to wrap this on the return, IOW call EA(....) here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, not real happy with this interface, but I guess ok for now. let's revisit this at somepoint in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger I suppose you want this in here, so you can implement a more efficient method on IPArray?
Because we could also, for now, keep this implementation in pd.value_counts (so for people who want this functionality, they can use it) instead of adding it as a method on ExtensionArray interface itself. And we can revisit when talking about unique/factorize/duplicated/.. algos support for ExtensionArrays in general?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly added it since that's what Categorical does. I can see if removing it breaks anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure, we're OK with pd.value_counts and Series[EA] currently failing for 3rd party EAs? By removing that from the interface, we'll see

AttributeError: 'DecimalArray' object has no attribute 'value_counts'

I could make that error message better, but I'd prefer waiting on that till we flesh out what we want pd.value_counts(EA) to do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in principle you can easily add this fallback to do a value_counts on the array of objects there?

@@ -1038,6 +1039,31 @@ def _to_embed(self, keep_tz=False, dtype=None):

return self.values.copy()

def _as_best_array(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes ths should be equiv of _values. DTI should be fixed as a pre-cursor PR. no problem making a different method name for this, but then simply rename _values to that name (I don't personally like _as_best_array), maybe (_as_array) is better. Should again be a pre-cursor independent PR.


def _maybe_coerce_values(self, values):
# Unboxes Series / Index
# Doesn't change any underlying dtypes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not just _as_best_array() (or new name), this kind of if logic is polluting (yes its all over the place, a major goal of this ExtensionArray is to clean this up, I would argue this is THE goal).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make a real doc-string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, since there's no parent.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Feb 21, 2018

We only use is_dtype in a handful of places currently (59 times, 46 of which are tests).

I'm not sure what the history is there.

Looking more closely, there's another difference between our is_dtype like Categorical and the default ExtensionArray: ours unbox things with a .dtype attribute.

In [13]: DecimalDtype().is_dtype(DecimalArray([]))
Out[13]: False

In [14]: pd.Categorical([]).dtype.is_dtype(pd.Categorical([]))
Out[14]: True

In [15]: pd.Categorical([]).dtype.is_dtype(pd.Series(pd.Categorical([])))
Out[15]: True

So the differences are that

  • is_dtype (should) unbox things with a .dtype attribute while __eq__ doesn't
  • is_dtype can be equal to a class, while __eq__ can only be equal to instances.

I can fix the EDT.is_dtype not unboxing things here or elsewhere. Probably best to do it here.

Moved from PandasExtensionDtype to ExtensionDtype with one modification:
catch TypeError explicitly.
@TomAugspurger
Copy link
Contributor Author

Latest commit refactored EDT.is_Dtype to just be what we had for PandasExtensionDtype. I'm not sure why we didn't do that originally. The only change I made when moving it from PandasExtensionDtype to ExtensionDtype was change a bare except: to an except TypeError. The docstring notes that construct_from_string is only expected to raise a TypeError, and all of our internal EDTs make sure to catch and re-raise exceptions as TypeErrors.

@jorisvandenbossche
Copy link
Member

How does that new version check for class objects?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Feb 21, 2018 via email

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Feb 21, 2018 via email

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the code just a few comments and some unanwered queries that I had before. The tests needs reorg as I have suggested.

Parameters
----------
arr : ndarray
Input array
arr : ndarray, ExtensionArray, DatetimeIndex, IntervalIndex, SparseArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?


Returns
-------
subarray : object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array


.. code-block:: python

def take(self, indexer, allow_fill=True, fill_value=None):
mask = indexer == -1
result = self.data.take(indexer)
result[mask] = self._fill_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, is there a reason we don't define _na_value? (e.g we do on indexes) as a property?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally there was a self._fill_value, but because this was not used internally, we left it out for the moment. The discussion about that is a bit spread, but see eg #19520 (comment) (4th bullet point is about fill_value, and comment below with Tom's answer) and #19520 (comment) and the comments below that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that's fine. I think we do need / want a way to set this and a property is good, and _na_value is used elsewhere (eg. in Index)

value = value.copy()

elif isinstance(value, Index) or is_sequence(value):
from pandas.core.series import _sanitize_index

# turn me into an ndarray
value = _sanitize_index(value, self.index, copy=False)
if not isinstance(value, (np.ndarray, Index)):
if not isinstance(value, (np.ndarray, Index, ExtensionArray)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

values = self._maybe_coerce_values(values)
super(ExtensionBlock, self).__init__(values, placement, ndim)

def _maybe_coerce_values(self, values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this called anywhere else? (e.g. IOW I think can eliminate some code if maybe this was defined in the superclass)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is only called in __init__, and was made into a separate method on your question. It's not defined in a superclass, so cannot be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#19492

Eventually Block.init will call self._maybe_coerce_values and we can remove every subclasses's init. But there's some other work to be done first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

if dtype is not None:
data = data.astype(dtype)

# need to copy to avoid aliasing issues
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I think you can move it (IOW if dtype is None), astype will copy (so if dtype is specified this would copy twice)

"Categorical unless "
"dtype='category'")
if not data.dtype.is_dtype(dtype):
raise ValueError("Cannot specify a dtype '{}' with an "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is much nicer

@@ -2558,8 +2571,12 @@ def _reindex_indexer(self, new_index, indexer, copy):
return self.copy()
return self

# be subclass-friendly
new_values = algorithms.take_1d(self.get_values(), indexer)
if is_sparse(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no my point is it should be handled there. I would simply move it there now (where you can do the is_sparse check). The point is the .get_values() should be there not in this part of the code.

from pandas.core.dtypes.dtypes import ExtensionDtype


class BaseDtypeTests(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do this for indexes now. this is WAY too cumbersome to have tests like this all in one file. We have spent an enormous amount of time splitting them out. Please just split them out. You can have classes if you want. What I am after is short conscise test files that are completely obvious.

pandas/tests/extension/decimal/test_indexing.py
pandas/tests/extension/decimal/test_construction.py

etc.

else:
if not isinstance(value, (type(self),
collections.Sequence)):
# broadcast value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a crazy test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That isn't a test, it's the implementation.

@@ -5,12 +5,12 @@

import pandas as pd
from pandas.core.internals import (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to the extension modules

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some small comments. tests look better. I guess since the actual testnig code in an extension array subclass is small (as it calles the base class) its ok in 1 file for now, and can always be expanded.


.. code-block:: python

def take(self, indexer, allow_fill=True, fill_value=None):
mask = indexer == -1
result = self.data.take(indexer)
result[mask] = self._fill_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that's fine. I think we do need / want a way to set this and a property is good, and _na_value is used elsewhere (eg. in Index)

if is_string_like_dtype(dtype):
result = np.zeros(values.shape, dtype=bool)
else:
result = np.empty(shape, dtype=bool)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment on what the expected is in the else, e.g. an object array of non-strings

values = self._maybe_coerce_values(values)
super(ExtensionBlock, self).__init__(values, placement, ndim)

def _maybe_coerce_values(self, values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok


import numpy as np

import pandas as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you don't name this array.py?

comments for object vs. str missing values
@jreback
Copy link
Contributor

jreback commented Feb 22, 2018

looks fine. merge on green. maybe liberally use TODO(EA) where anything is in question so can come back later.

@TomAugspurger
Copy link
Contributor Author

All green. Thanks all.

@TomAugspurger TomAugspurger merged commit 01e99de into pandas-dev:master Feb 23, 2018
@TomAugspurger TomAugspurger deleted the pandas-array-upstream+fu1 branch February 23, 2018 11:47
@jorisvandenbossche
Copy link
Member

Woohoo!

It seems you already have branches for all follow-ups? :) (#19696) Anything we can still help with, or mainly reviewing upcoming PRs?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Feb 23, 2018 via email

harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018
* ENH: non-interval changes

* COMPAT: py2 Super

* BUG: Use original object for extension array

* Consistent boxing / unboxing

NumPy compat

* 32-bit compat

* Add a test array

* linting

* Default __iter__

* Tests for value_counts

* Implement value_counts

* Py2 compat

* Fixed dropna

* Test fixups

* Started setitem

* REF/Clean: Internal / External values

* Move to index base

* Setitem tests, decimal example

* Compat

* Fixed extension block tests.

The only "API change" was that you can't just inherit from
NonConsolidatableMixin, which is OK since

1. it's a mixin
2. geopandas also inherits from Block

* Clarify binop tests

Make it clearer which bit might raise

* TST: Removed ops tests

* Cleanup unique handling

* Simplify object concat

* Use values for intersection

I think eventually we'll want to ndarray_values for this, but it'll
require a bit more work to support. Currently, using ndarary_values
causes occasional failures on categorical.

* hmm

* More failing tests

* remove bad test

* better setitem

* Dropna works.

* Restore xfail test

* Test Categorical

* Xfail setitem tests

* TST: Skip JSON tests on py2

* Additional testing

* More tests

* ndarray_values

* API: Default ExtensionArray.astype

(cherry picked from commit 943a915562b72bed147c857de927afa0daf31c1a)
(cherry picked from commit fbf0a06)

* Simplify concat_as_object

* Py2 compat

(cherry picked from commit b20e12c)

* Set-ops ugliness

* better docstrings

* tolist

* linting

* Moved dtypes

(cherry picked from commit d136227)

* clean

* cleanup

* NumPy compat

* Use base _values for CategoricalIndex

* Update dev docs

* cleanup

* cleanup

(cherry picked from commit 2425621)

* cleanup

* Linting

* Precision in tests

* Linting

* Move to extension

* Push _ndarray_values to ExtensionArray

Now IndexOpsMixin._ndarray_values will dispatch all the way down to the EA.
Subclasses like Categorical can override it as they see fit.

* Clean up tolist

* Move test locations

* Fixed test

* REF: Update per comments

* lint

* REF: Use _values for size and shape

* PERF: Implement size, shape for IntervalIndex

* PERF: Avoid materializing values for PeriodIndex shape, size

* Cleanup

* Override nbytes

* Remove unused change

* Docs

* Test cleanpu

* Always set PANDAS_TESTING_MODE

* Revert "Always set PANDAS_TESTING_MODE"

This reverts commit a312ba5.

* Explicitly catch warnings or not

* fastparquet warnings

* Unicode literals strikes again.

Only catch fp warning for newer numpy

* Restore circle env var

* More parquet test catching

* No stacklevel

* Lower bound on FP

* Exact bound for FP

* Don't use fastpath for ExtensionBlock make_block

* Consistently use _values

* TST: Additional constructor tests

* CLN: de-nested a bit

* _fill_value handling

* Handle user provided dtype in constructors.

When the dtype matches, we allow it to proceed.

When the dtype would require coercion, we raise.

* Document ExtensionBlock._maybe_coerce_values

Also changes to use _values as we should

* Created ABCExtensionArray

* TST: Tests for is_object_dtype and is_string_dtype and EAs

* fixup! Handle user provided dtype in constructors.

* Doc for setitem

* Split base tests

* Revert test_parquet changes

* API: Removed _fill_value from the interface

* Push coercion to extension dtype till later

* Linting

* ERR: Better error message for coercion to 3rd party dtypes

* CLN: Make take_nd EA aware

* Revert sparse changes

* Other _typ for ABCExtensionArray

* Test cleanup and expansion.

Tests for concating and aligning frames

* Copy if copy

* TST: remove self param for fixture

* Remove unnescessary EA handling in Series ctor

* API: Removed value_counts

Moved setitem notes to comment

* More doc notes

* Handle expanding a DataFrame with an EA

* Added ExtensionDtype.__eq__

Support for astype

* linting

* REF: is_dtype_equal refactor

Moved from PandasExtensionDtype to ExtensionDtype with one modification:
catch TypeError explicitly.

* Remove reference to dtype being a class

* move

* Moved sparse check to take_nd

* Docstring

* Split tests

* Revert index change

* Copy changes

* Simplify EA implementation names

comments for object vs. str missing values

* Linting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test Order Dependency between parquet tests and internals
5 participants