Add default repr for EAs #23601

TomAugspurger · 2018-11-09T15:31:00Z

In [4]: pd.core.arrays.period_array(['2000', '2001', None], freq='D')
Out[4]:
<PeriodArray>
['2000-01-01', '2001-01-01', 'NaT']
Length: 3, dtype: period[D]

In [5]: pd.core.arrays.period_array(['2000', '2001', None] * 100, freq='D')
Out[5]:
<PeriodArray>
['2000-01-01', '2001-01-01',        'NaT', '2000-01-01', '2001-01-01',
        'NaT', '2000-01-01', '2001-01-01',        'NaT', '2000-01-01',
 ...
        'NaT', '2000-01-01', '2001-01-01',        'NaT', '2000-01-01',
 '2001-01-01',        'NaT', '2000-01-01', '2001-01-01',        'NaT']
Length: 300, dtype: period[D]

In [6]: pd.core.arrays.integer_array([1, 2, None])
Out[6]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64

In [7]: pd.core.arrays.integer_array([1, 2, None] * 1000)
Out[7]:
<IntegerArray>
[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN,   1,
 ...
 NaN,   1,   2, NaN,   1,   2, NaN,   1,   2, NaN]
Length: 3000, dtype: Int64

pep8speaks · 2018-11-09T15:31:11Z

Hello @TomAugspurger! Thanks for submitting the PR.

There are no PEP8 issues in the file pandas/core/arrays/base.py !
There are no PEP8 issues in the file pandas/core/arrays/categorical.py !
There are no PEP8 issues in the file pandas/core/arrays/integer.py !
There are no PEP8 issues in the file pandas/core/arrays/interval.py !
There are no PEP8 issues in the file pandas/core/arrays/period.py !
There are no PEP8 issues in the file pandas/io/formats/printing.py !
There are no PEP8 issues in the file pandas/tests/arrays/interval/test_interval.py !
There are no PEP8 issues in the file pandas/tests/arrays/test_integer.py !
There are no PEP8 issues in the file pandas/tests/arrays/test_period.py !
There are no PEP8 issues in the file pandas/tests/extension/base/__init__.py !
There are no PEP8 issues in the file pandas/tests/extension/base/interface.py !
There are no PEP8 issues in the file pandas/tests/extension/base/printing.py !
There are no PEP8 issues in the file pandas/tests/extension/decimal/array.py !
There are no PEP8 issues in the file pandas/tests/extension/decimal/test_decimal.py !
There are no PEP8 issues in the file pandas/tests/extension/json/array.py !
There are no PEP8 issues in the file pandas/tests/extension/json/test_json.py !
There are no PEP8 issues in the file pandas/tests/extension/test_integer.py !
There are no PEP8 issues in the file pandas/tests/extension/test_interval.py !
There are no PEP8 issues in the file pandas/tests/extension/test_period.py !
There are no PEP8 issues in the file pandas/tests/extension/test_sparse.py !

TomAugspurger · 2018-11-09T15:38:15Z

In [3]: integer_array([1, 2, 3])
Out[3]:
<IntegerArray>
[1, 2, 3]
Length: 3, dtype: Int64

In [4]: period_array(['2000', '2001'], freq='D')
Out[4]:
<PeriodArray>
[2000-01-01, 2001-01-01]
Length: 2, dtype: period[D]

In [5]: IntervalArray.from_breaks([1, 2, 3])
Out[5]:
<IntervalArray>
[(1, 2], (2, 3]]
Length: 2, dtype: interval[int64]

In [6]: integer_array([1, 2, 3] * 1000)
Out[6]:
<IntegerArray>
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1,
 ...
 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
Length: 3000, dtype: Int64

jorisvandenbossche

Do you want to add one for DatetimeArray and TimedeltaArray here as well?

Do we need a mechanism to indicate which attributes to print in addition to length and dtype? (in case we want to keep printing the freq for DatetimeArray/TimedeltaArray)

Is there any control over the number of elements shown?

pandas/core/arrays/base.py

TomAugspurger · 2018-11-09T16:40:30Z

For DatetimeArray & Timedelta I was waiting to see what happens on #23587. If @jbrockmendel reverts the repr changes before merging then I'll add it here. Otherwise I'll just delete them after that's merged.

Do we need a mechanism to indicate which attributes to print in addition to length and dtype? (in case we want to keep printing the freq for DatetimeArray/TimedeltaArray)

Sure. I suppose that info doesn't belong in the dtype (right?) so we can add hooks for extra attrs.

Is there any control over the number of elements shown?

At the array level, no. But I think following the option at pd.options.display.max_seq_items is the right thing to do. I'll add a note to the docs.
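To make the effect of a sequence-length cap concrete, here is a simplified sketch in plain Python. It is not pandas' implementation: `summarize` and its `max_seq_items` parameter are hypothetical names chosen to mirror the option mentioned above, and the even head/tail split is an assumption.

```python
def summarize(values, max_seq_items=10):
    """Return a bracketed summary, eliding the middle of long sequences.

    Hypothetical helper: the name, signature, and head/tail split are
    illustrative assumptions, not pandas API.
    """
    items = [str(v) for v in values]
    if len(items) <= max_seq_items:
        return "[" + ", ".join(items) + "]"
    # Keep an equal number of items from each end.
    half = max(max_seq_items // 2, 1)
    head = ", ".join(items[:half])
    tail = ", ".join(items[-half:])
    # The elided middle gets a line of its own.
    return "[" + head + ",\n ...\n " + tail + "]"


short = summarize([1, 2, 3])
truncated = summarize(list(range(100)), max_seq_items=4)
print(short)
print(truncated)
```

In this sketch the cap only affects display; the underlying values are untouched.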

jorisvandenbossche · 2018-11-09T16:44:29Z

The main difference for PeriodArray seems to be that the values are no longer quoted? (not a strong opinion here though, for Index reprs we use quotes and also numpy quotes dates)

TomAugspurger · 2018-11-09T16:49:55Z

diff --git a/pandas/core/arrays/period.py b/pandas/core/arrays/period.py
index f6996f8e6..4bc7841fe 100644
--- a/pandas/core/arrays/period.py
+++ b/pandas/core/arrays/period.py
@@ -330,6 +330,10 @@ class PeriodArray(dtl.DatetimeLikeArrayMixin, ExtensionArray):
     def end_time(self):
         return self.to_timestamp(how='end')
 
+    @property
+    def _formatter(self):
+        return "'{}'".format
+
     def __setitem__(
             self,
             key,   # type: Union[int, Sequence[int], Sequence[bool]]

will quote. Going to fix up the py2 issues.
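For readers unfamiliar with the idiom in that diff, a small standalone illustration: the bound `format` method of a template string is itself a callable, so it can serve directly as a per-value formatter. The variable names here are only for illustration.

```python
# The formatter is just the bound ``format`` method of a template string,
# so calling it wraps the str() of any value in single quotes.
quote = "'{}'".format

quoted = [quote(v) for v in ["2000-01-01", "2001-01-01", "NaT"]]
print("[" + ", ".join(quoted) + "]")
```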

TomAugspurger · 2018-11-09T18:10:31Z

Update:

Deprecated ExtensionArray._formatting_values
Changed _formatter to a function that takes a flag for whether or not we're printing inside a Series / DataFrame

In [5]: integer_array([1, 2, None] * 1000)
Out[5]:
<IntegerArray>
[  1,   2, nan,   1,   2, nan,   1,   2, nan,   1,
 ...
 nan,   1,   2, nan,   1,   2, nan,   1,   2, nan]
Length: 3000, dtype: Int64

In [6]: IntervalArray.from_breaks(list(range(1000)))
Out[6]:
<IntervalArray>
[    (0, 1],     (1, 2],     (2, 3],     (3, 4],     (4, 5],     (5, 6],
     (6, 7],     (7, 8],     (8, 9],    (9, 10],
 ...
 (989, 990], (990, 991], (991, 992], (992, 993], (993, 994], (994, 995],
 (995, 996], (996, 997], (997, 998], (998, 999]]
Length: 999, dtype: interval[int64]

In [7]: period_array(['2000', '2001'] * 1000, freq='D')
Out[7]:
<PeriodArray>
['2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01',
 '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01',
 ...
 '2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01',
 '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01']
Length: 2000, dtype: period[D]

jreback · 2018-11-09T18:24:57Z

deprecated: ExtensionArray._formatting_values

has this been around for a while? it's a private attribute, why deprecate?

TomAugspurger · 2018-11-09T18:35:08Z

It's part of the interface and was around since 0.23.

TomAugspurger · 2018-11-09T20:49:38Z

This turned into a somewhat larger refactor... I removed {Categorical,Period,Interval}ArrayFormatter in favor of a generic ExtensionArrayFormatter.

EAs will get control over formatting of individual values by overriding ExtensionArray._formatter.
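A minimal sketch of what overriding such a hook could look like, using plain Python stand-ins rather than real pandas classes. `ToyArray` and `MoneyArray` are hypothetical, and the repr layout simply imitates the examples earlier in this thread.

```python
class ToyArray:
    """Stand-in for an extension array; not a real pandas class."""

    dtype = "toy"

    def __init__(self, values):
        self._values = list(values)

    def _formatter(self, boxed=False):
        # Default hook: return a callable mapping one scalar to a string.
        return repr

    def __repr__(self):
        fmt = self._formatter()
        body = ", ".join(fmt(v) for v in self._values)
        return "<{}>\n[{}]\nLength: {}, dtype: {}".format(
            type(self).__name__, body, len(self._values), self.dtype
        )


class MoneyArray(ToyArray):
    """Hypothetical subclass that customizes only per-value formatting."""

    dtype = "money"

    def _formatter(self, boxed=False):
        return "${:.2f}".format


print(repr(MoneyArray([1, 2.5])))
```

The subclass changes how each element is rendered without touching the surrounding header, brackets, or footer.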

pandas/tests/frame/test_repr_info.py

pandas/io/formats/format.py

jorisvandenbossche · 2018-12-01T14:13:58Z

@TomAugspurger I was thinking for a moment again on the quoting issue.

It would be nice to have a somewhat general rule about this, and also to reflect this in the base repr.

From observing the results, it seems that, mostly, we use a str representation in Series / DataFrame, and an (abbreviated) repr in Index / Array. By 'abbreviated' I mean that for e.g. Timestamp, we use '2012-01-01' instead of Timestamp('2012-01-01') (so leaving out the object type), since the dtype already makes it clear they are all of that object type.

If we like this general pattern, we could also do this for the default repr (and sorry, that goes against what I commented earlier #23601 (comment) and what you changed). So instead of

def _formatter(self, boxed=False):
    return repr

we could have

def _formatter(self, boxed=False):
    if boxed:
        return str
    return repr

That would also fit a possible StringArray to not quote it in Series/DataFrame, but to show it quoted in the array repr.

Anyway, since this is configurable, and I think none of the internal ones inherits the base _formatter implementation, it is not that important, and certainly does not need to hold up this PR.
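To see what the str / repr split would mean for string values, here is a standalone illustration. `default_formatter` is a free function written only for this example, not the method itself.

```python
def default_formatter(boxed=False):
    # Proposed rule: str inside a Series/DataFrame, repr for the bare array.
    return str if boxed else repr


values = ["a", "b"]
in_array = [default_formatter(boxed=False)(v) for v in values]
in_series = [default_formatter(boxed=True)(v) for v in values]
print(in_array)
print(in_series)
```

With repr each string keeps its quotes, while str drops them, which matches the quoted-in-array, unquoted-in-Series behaviour described above.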

TomAugspurger · 2018-12-02T13:16:56Z

Implemented #23601 (comment) in 2a60c15.

doc/source/whatsnew/v0.24.0.rst

pandas/core/arrays/base.py

jreback · 2018-12-02T16:32:32Z

pandas/core/arrays/base.py

+        ----------
+        boxed: bool, default False
+            An indicated for whether or not your array is being printed
+            within a Series, DataFrame, or Index (True), or just by


I would add this arg to the Index formatters as well for compatibility.

pandas/core/arrays/categorical.py

pandas/core/arrays/sparse.py

pandas/core/arrays/period.py

pandas/io/formats/printing.py

jreback · 2018-12-02T16:36:41Z

pandas/io/formats/printing.py

        defaults to the class name of the obj

+        Pass ``False`` to indicate that subsequent lines should


can this be another parameter then? it seems like it is used for 2 purposes

jreback · 2018-12-02T16:37:51Z

pandas/io/formats/printing.py

-        summary += '],'
+
+        # right now close is either '' or ', '
+        # Now we want to include the ']', but not the maybe space.


so another difference this is highlighting is that EAs have the attributes on another line, while the Index does not (as they are args).

* docs * removed overloading of name=False * added indent_for_name

jreback · 2018-12-03T15:05:03Z

can you rebase

jreback · 2018-12-03T23:59:16Z

pandas/core/indexes/period.py

@@ -503,7 +503,7 @@ def __array_wrap__(self, result, context=None):

    @property
    def _formatter_func(self):
-        return lambda x: "'%s'" % x
+        return self.array._formatter(boxed=False)


why is this the only index subclass that you need to do this for?

It's the only extension-array backed index that's using the formatting right now.

CategoricalIndex always uses the default formatter for the underlying categories (since it can be a container for any type, it dispatches the formatting).

IntervalIndex / IntervalArray don't use this formatting.

Datetime / Timedelta will use this too, so this will be pushed up into DatetimeIndexOpsMixin later.

jreback · 2018-12-04T00:00:07Z

pandas/io/formats/format.py

-    def __init__(self, values, *args, **kwargs):
-        GenericArrayFormatter.__init__(self, values, *args, **kwargs)
+        if is_categorical_dtype(values.dtype):
+            # Categorical is special for now, so that we can preserve tzinfo


do we need a TODO here? this is until DatetimeArray is fully pushed?

That depends on whether we're willing to change __array__ for datetime-backed Series / Index. I'm writing up an issue now to discuss that specific point.

See #23569 (comment) for that.

jreback · 2018-12-04T00:01:09Z

pandas/tests/arrays/test_integer.py

+    data = integer_array([1, 2, None] * 1000)
+    expected = (
+        "<IntegerArray>\n"
+        "[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN,   1,\n"


these are somehow justified?

Justified, as in "vertically aligned"? Here's the repr

In [10]: data
Out[10]:
<IntegerArray>
[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN,   1,
 ...
 NaN,   1,   2, NaN,   1,   2, NaN,   1,   2, NaN]
Length: 3000, dtype: Int64

The NaN pattern there makes the formatting a bit strange, but I think it's unavoidable.

no this is ok, was just wondering about
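The extra spaces before 1 and 2 are what you get when every entry is padded to the width of the widest one. A tiny sketch of that idea follows; `justify` is a hypothetical helper, not the pandas function.

```python
def justify(strings):
    # Right-align every entry to the width of the widest one.
    width = max(len(s) for s in strings)
    return [s.rjust(width) for s in strings]


cells = justify(["1", "2", "NaN"])
line = "[" + ", ".join(cells) + "]"
print(line)
```

Because 'NaN' is three characters wide, the single-digit values each pick up two leading spaces.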

jreback

couple of comments

TomAugspurger · 2018-12-04T12:41:15Z

I'm not sure how hard it would be, but in

<IntegerArray>
[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN,   1,
 ...
 NaN,   1,   2, NaN,   1,   2, NaN,   1,   2, NaN]
Length: 3000, dtype: Int64

I don't like that we devote an entire line just to "...". I wonder if we could instead do

<IntegerArray>
[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN, ...,
 ...,   1,   2, NaN,   1,   2, NaN,   1,   2, NaN]
Length: 3000, dtype: Int64

Although, that makes the continuation harder to see (for me)...

Edit: thinking about it more, I prefer the one with ... on its own line.

I'll try to followup with a short one-line repr before the release.

jreback · 2018-12-04T12:57:50Z

this is fine: #23601 (comment)

we do this elsewhere, try this with a large Array and the own line makes a lot of sense.

TomAugspurger · 2018-12-04T13:03:54Z

👍 all good then?

jreback · 2018-12-04T13:09:37Z

thanks a lot @TomAugspurger ; this consolidates a lot of disparate code!

simonjayhawkins · 2019-04-05T16:33:55Z

pandas/io/formats/format.py


-    def _format_strings(self):
-        fmt_values = format_array(self.values.get_values(), self.formatter,
+        fmt_values = format_array(array,


@TomAugspurger: I'm struggling to resolve some formatting issues. What is the reason for calling format_array here? As far as I can tell, it is looping back round to create a GenericArrayFormatter instance with a formatter specified to pick up the display options.

I guess, to be more succinct, why is super()._format_strings() not used?

I am not that familiar with this code, but from a quick look: calling super()._format_strings() would be different, as this would call GenericArrayFormatter._format_strings, while the generic format_array can still result in using custom formatters like Datetime64(TZ)Formatter or Timedelta64Formatter, depending on what the values of the underlying EA are.

Although most of those custom Formatter classes don't do much special if formatter is specified.

Eg Datetime64Formatter has this in _format_strings:

pandas/pandas/io/formats/format.py, lines 1174 to 1175 in 181f972:

    if self.formatter is not None and callable(self.formatter):
        return [self.formatter(x) for x in values]

so if ExtensionArrayFormatter is not inheriting from GenericArrayFormatter but calling format_array to dispatch to another ...ArrayFormatter class, why wouldn't the logic in ExtensionArrayFormatter be in format_array?
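The distinction being discussed (calling a dispatching function versus calling the base class method directly) can be sketched with toy classes. Everything below is hypothetical and heavily simplified; the names `GenericFormatter`, `DateFormatter`, and `format_values` only loosely echo the pandas ones and make no claim about how pandas is actually structured.

```python
import datetime


class GenericFormatter:
    """Toy base formatter (illustrative; not the pandas class)."""

    def __init__(self, values, formatter=None):
        self.values = values
        self.formatter = formatter

    def format_strings(self):
        fmt = self.formatter or str
        return [fmt(v) for v in self.values]


class DateFormatter(GenericFormatter):
    """Toy specialized formatter used when no formatter is supplied."""

    def format_strings(self):
        # An explicit formatter short-circuits the specialized behaviour.
        if self.formatter is not None and callable(self.formatter):
            return [self.formatter(v) for v in self.values]
        return [v.strftime("%Y/%m/%d") for v in self.values]


def format_values(values, formatter=None):
    # Dispatch on the values, so a specialized class can be chosen.
    if values and all(isinstance(v, datetime.date) for v in values):
        cls = DateFormatter
    else:
        cls = GenericFormatter
    return cls(values, formatter).format_strings()


dates = [datetime.date(2000, 1, 1)]
print(format_values(dates))                      # specialized path
print(format_values(dates, formatter=repr))      # explicit formatter wins
print(GenericFormatter(dates).format_strings())  # no dispatch
```

In this toy setup, going through the dispatcher can pick a specialized class, whereas calling the generic class directly never does; once a formatter is passed explicitly, the specialized class adds little.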

simonjayhawkins · 2019-04-05T18:43:38Z

pandas/core/arrays/integer.py

+    def _formatter(self, boxed=False):
+        def fmt(x):
+            if isna(x):
+                return 'NaN'


should NaN have been hardcoded here?

Why not? This is used for the Integer data display, where we currently use NaN.

(whether we should use rather 'NA' instead of 'NaN', that's another question)

Just wondering whether it would be a problem with to_string(na_rep=...). Will do some tests.

Ah, yes, that's a good reason. But in general, this _formatter does not follow display options at all, is that correct?
In which case this is something to think about in general.
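One way to picture the concern: a formatter with a hardcoded missing-value label cannot honour a caller-supplied one, whereas a formatter built from a parameter can. The sketch below is purely illustrative. `make_formatter` is a hypothetical helper, and its `na_rep` argument only mirrors the to_string keyword mentioned above; it is not how pandas wires things up.

```python
import math


def make_formatter(na_rep="NaN"):
    """Build a per-value formatter with a configurable missing-value label.

    Hypothetical helper, shown only to illustrate the design question.
    """
    def fmt(x):
        if x is None or (isinstance(x, float) and math.isnan(x)):
            return na_rep
        return str(x)
    return fmt


default = make_formatter()
custom = make_formatter(na_rep="<NA>")
default_out = [default(v) for v in [1, None]]
custom_out = [custom(v) for v in [1, float("nan")]]
print(default_out)
print(custom_out)
```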

wip

0fdbfd3

TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Nov 9, 2018

TomAugspurger added this to the 0.24.0 milestone Nov 9, 2018

jorisvandenbossche reviewed Nov 9, 2018

pandas/core/arrays/base.py

Deprecate formatting_values

ace62aa

test for warning

6e76b51

jsexauer mentioned this pull request Nov 9, 2018

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

compat

fef04e6

jbrockmendel mentioned this pull request Nov 9, 2018

CLN: datetimelike arrays: isort, small reorg #23587

Merged

TomAugspurger added 5 commits November 9, 2018 13:22

na formatter

1885a97

clean

ecfcd72

Merge remote-tracking branch 'upstream/master' into ea-repr

4e0d91f

wip

37638cc

more cleanup

6e64b7b

TomAugspurger added 3 commits November 9, 2018 15:01

update docs, type

193747e

format

5a2e1e4

try this

1635b73

TomAugspurger commented Nov 9, 2018

pandas/tests/frame/test_repr_info.py

TomAugspurger commented Nov 9, 2018

pandas/io/formats/format.py

TomAugspurger added 2 commits November 9, 2018 16:05

updates

e2b1941

fixup interval

48e55cc

TomAugspurger added 3 commits December 2, 2018 07:11

Merge remote-tracking branch 'upstream/master' into ea-repr

c79ba0b

Use Array formatter in PeriodIndex

3825aeb

Use repr / str

2a60c15

jreback requested changes Dec 2, 2018

TomAugspurger added 3 commits December 3, 2018 07:34

Merge remote-tracking branch 'upstream/master' into ea-repr

bccf40d

Update for review

a7ef104

* docs * removed overloading of name=False * added indent_for_name

REF: removed trailing_comma argument

a3b1c92

Merge remote-tracking branch 'upstream/master' into ea-repr

e080023

TomAugspurger mentioned this pull request Dec 3, 2018

Implement DatetimeArray._from_sequence #24074

Merged

Merge remote-tracking branch 'upstream/master' into ea-repr

6ad113b

jreback reviewed Dec 3, 2018

jreback reviewed Dec 4, 2018

jreback approved these changes Dec 4, 2018

jreback merged commit 1573340 into pandas-dev:master Dec 4, 2018

datapythonista mentioned this pull request Dec 10, 2018

DOC: Fixed implicit imports for whatsnew (v >= version 20.0) #24199

Merged

4 tasks

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Add default repr for EAs (pandas-dev#23601)

ce774cc

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Add default repr for EAs (pandas-dev#23601)

d1b9134

simonjayhawkins reviewed Apr 5, 2019

jorisvandenbossche mentioned this pull request Jun 14, 2019

TST: test custom _formatter for ExtensionArray + revert ExtensionArrayFormatter removal #26845

Merged

jreback mentioned this pull request Nov 21, 2019

DEPR: deprecations log for removed issues #13777

Closed
