Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add default repr for EAs #23601

Merged
merged 54 commits into from
Dec 4, 2018
Merged

Add default repr for EAs #23601

merged 54 commits into from
Dec 4, 2018

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Nov 9, 2018

Closes #22846
Closes #23590

In [4]: pd.core.arrays.period_array(['2000', '2001', None], freq='D')
Out[4]:
<PeriodArray>
['2000-01-01', '2001-01-01', 'NaT']
Length: 3, dtype: period[D]

In [5]: pd.core.arrays.period_array(['2000', '2001', None] * 100, freq='D')
Out[5]:
<PeriodArray>
['2000-01-01', '2001-01-01',        'NaT', '2000-01-01', '2001-01-01',
        'NaT', '2000-01-01', '2001-01-01',        'NaT', '2000-01-01',
 ...
        'NaT', '2000-01-01', '2001-01-01',        'NaT', '2000-01-01',
 '2001-01-01',        'NaT', '2000-01-01', '2001-01-01',        'NaT']
Length: 300, dtype: period[D]

In [6]: pd.core.arrays.integer_array([1, 2, None])
Out[6]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64

In [7]: pd.core.arrays.integer_array([1, 2, None] * 1000)
Out[7]:
<IntegerArray>
[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN,   1,
 ...
 NaN,   1,   2, NaN,   1,   2, NaN,   1,   2, NaN]
Length: 3000, dtype: Int64

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Nov 9, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Nov 9, 2018
@pep8speaks
Copy link

Hello @TomAugspurger! Thanks for submitting the PR.

@TomAugspurger
Copy link
Contributor Author

In [3]: integer_array([1, 2, 3])
Out[3]:
<IntegerArray>
[1, 2, 3]
Length: 3, dtype: Int64

In [4]: period_array(['2000', '2001'], freq='D')
Out[4]:
<PeriodArray>
[2000-01-01, 2001-01-01]
Length: 2, dtype: period[D]

In [5]: IntervalArray.from_breaks([1, 2, 3])
Out[5]:
<IntervalArray>
[(1, 2], (2, 3]]
Length: 2, dtype: interval[int64]

In [6]: integer_array([1, 2, 3] * 1000)
Out[6]:
<IntegerArray>
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1,
 ...
 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
Length: 3000, dtype: Int64

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to add one for DatetimeArray and TimedeltaArray here as well?

Do we need a mechanism to indicate which attributes to print in addition to length and dtype? (in case we want to keep printing the freq for DatetimeArray/Timedelta)Array

Is there any control over the number of elements shown?

pandas/core/arrays/base.py Outdated Show resolved Hide resolved
pandas/core/arrays/base.py Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

For DatetimeArray & Timedelta I was waiting to see what happens on #23587. If @jbrockmendel reverts the repr changes before merging then I'll add it here. Otherwise I'll just delete them after that's merged.

(in case we want to keep printing the freq for DatetimeArray/Timedelta)Array

Do we need a mechanism to indicate which attributes to print in addition to length and dtype? (in case we want to keep printing the freq for DatetimeArray/Timedelta)Array

Sure. I suppose that info doesn't belong in the dtype (right?) so we can add hooks for extra attrs.

Is there any control over the number of elements shown?

At the array level, no. But I think following the option at pd.options.display.max_seq_items is the right thing to do. I'll add a note to the docs.

@jorisvandenbossche
Copy link
Member

The main difference for PeriodArray seems to be that the values are no longer quoted? (not a strong opinion here though, for Index reprs we use quotes and also numpy quotes dates)

@TomAugspurger
Copy link
Contributor Author

diff --git a/pandas/core/arrays/period.py b/pandas/core/arrays/period.py
index f6996f8e6..4bc7841fe 100644
--- a/pandas/core/arrays/period.py
+++ b/pandas/core/arrays/period.py
@@ -330,6 +330,10 @@ class PeriodArray(dtl.DatetimeLikeArrayMixin, ExtensionArray):
     def end_time(self):
         return self.to_timestamp(how='end')
 
+    @property
+    def _formatter(self):
+        return "'{}'".format
+
     def __setitem__(
             self,
             key,   # type: Union[int, Sequence[int], Sequence[bool]]

will quote. Gonna mixup the py2 issues.

@TomAugspurger
Copy link
Contributor Author

Update:

  1. deprecated ExtensionArray._formatting_values
  2. Change _formatter to a function that takes a flag for whether or we're printing inside a Series / DataFrame
In [5]: integer_array([1, 2, None] * 1000)
Out[5]:
<IntegerArray>
[  1,   2, nan,   1,   2, nan,   1,   2, nan,   1,
 ...
 nan,   1,   2, nan,   1,   2, nan,   1,   2, nan]
Length: 3000, dtype: Int64

In [6]: IntervalArray.from_breaks(list(range(1000)))
Out[6]:
<IntervalArray>
[    (0, 1],     (1, 2],     (2, 3],     (3, 4],     (4, 5],     (5, 6],
     (6, 7],     (7, 8],     (8, 9],    (9, 10],
 ...
 (989, 990], (990, 991], (991, 992], (992, 993], (993, 994], (994, 995],
 (995, 996], (996, 997], (997, 998], (998, 999]]
Length: 999, dtype: interval[int64]

In [7]: period_array(['2000', '2001'] * 1000, freq='D')
Out[7]:
<PeriodArray>
['2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01',
 '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01',
 ...
 '2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01',
 '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01']
Length: 2000, dtype: period[D]

@jreback
Copy link
Contributor

jreback commented Nov 9, 2018

deprecated: ExtensionArray._formatting_values

has this been around for a while? its a private attribute, why deprecate?

@TomAugspurger
Copy link
Contributor Author

It's part of the interface and was around since 0.23.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 9, 2018

This are into a bit larger of a refactor... I removed {Categorica,Period,Interval}ArrayFormatter in favor of a generic ExtensionArrayFormatter.

EAs will get control over formatting of individual values by overriding ExtensionArray._formatter.

@jorisvandenbossche
Copy link
Member

@TomAugspurger I was thinking for a moment again on the quoting issue.

It would be nice to have a somewhat general rule about this, and also to reflect this in the base repr.

From observing the results, it seems that - mostly - we use a str representation in Series / DataFrame, and an (abbreviated) repr in Index / Array. With 'abbreviated' I mean for eg Timestamp, we use '2012-01-01' instead of Timestamp('2012-01-01') (so leaving out the object type) since the dtype already makes it clear they are all of that object type.

If we like this general pattern, we could also do this for the default repr (and sorry, that goes against what I commented earlier #23601 (comment) and what you changed). So instead of

def _formatter(self, boxed=False):
    return repr

we could have

def _formatter(self, boxed=False):
    if boxed:
        return str
    return repr

That would also fit a possible StringArray to not quote it in Series/DataFrame, but to show it quoted in the array repr.

Anyway, since this is configurable, and I think none of the internal ones inherits the base _formatter implementation, it is not that important, and certainly does not need to hold up this PR.

@TomAugspurger
Copy link
Contributor Author

Implemented
#23601 (comment) in 2a60c15.

doc/source/whatsnew/v0.24.0.rst Outdated Show resolved Hide resolved
pandas/core/arrays/base.py Outdated Show resolved Hide resolved
----------
boxed: bool, default False
An indicated for whether or not your array is being printed
within a Series, DataFrame, or Index (True), or just by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add this arg to the Index formatters as well for compatiblity.

pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
pandas/core/arrays/sparse.py Show resolved Hide resolved
pandas/core/arrays/period.py Show resolved Hide resolved
pandas/io/formats/printing.py Outdated Show resolved Hide resolved
defaults to the class name of the obj

Pass ``False`` to indicate that subsequent lines should
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be another parameter then? it seems like it is used for 2 purposes

summary += '],'

# right now close is either '' or ', '
# Now we want to include the ']', but not the maybe space.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so another difference this is highliting is that EA have the attributes on another line, while the Index does not (as they are args).

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

can you rebase

@@ -503,7 +503,7 @@ def __array_wrap__(self, result, context=None):

@property
def _formatter_func(self):
return lambda x: "'%s'" % x
return self.array._formatter(boxed=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this the only index sublcass that you need to do this for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the only extension-array backed index that's using the formatting right now.

  • CategoricalIndex always uses the default formatter for the underlying categories (since it can be a container for any type, it dispatches the formatting).
  • IntervalIndex /IntervalArray don't use this formatting

Datetime / TImedelta will use this too, so this will be pushed up into DatetimeIndexOpsMixin later.

def __init__(self, values, *args, **kwargs):
GenericArrayFormatter.__init__(self, values, *args, **kwargs)
if is_categorical_dtype(values.dtype):
# Categorical is special for now, so that we can preserve tzinfo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a TODO here? this is until DatetimeArray is fully pushed?

Copy link
Contributor Author

@TomAugspurger TomAugspurger Dec 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends on whether we're willing to change __array__ for datetime-backed series / index (right now . I'm writing up an issue now to discuss that specific point.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#23569 (comment) for that.

data = integer_array([1, 2, None] * 1000)
expected = (
"<IntegerArray>\n"
"[ 1, 2, NaN, 1, 2, NaN, 1, 2, NaN, 1,\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are somehow justified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Justified, as in "vertically aligned"? Here's the repr

In [10]: data
Out[10]:
<IntegerArray>
[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN,   1,
 ...
 NaN,   1,   2, NaN,   1,   2, NaN,   1,   2, NaN]
Length: 3000, dtype: Int64

The NaN pattern there makes the formatting a bt strange, but I think unavoidable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no this is ok, was just wondering about

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of comments

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 4, 2018

I'm not sure how hard it would be, but in

<IntegerArray>
[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN,   1,
 ...
 NaN,   1,   2, NaN,   1,   2, NaN,   1,   2, NaN]
Length: 3000, dtype: Int64

I don't like that we devote an entire line just to ... I wonder if we could instead do

<IntegerArray>
[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN, ...,
 ...,   1,   2, NaN,   1,   2, NaN,   1,   2, NaN]
Length: 3000, dtype: Int64

Although, that makes the continuation harder to see (for me)...

Edit: thinking about it more, I prefer the one with ... on its own line.


I'll try to followup with a short one-line repr before the release.

@jreback
Copy link
Contributor

jreback commented Dec 4, 2018

this is fine: #23601 (comment)

we do this elsewhere, try this with a large Array and the own line makes a lot of sense.

@TomAugspurger
Copy link
Contributor Author

👍 all good then?

@jreback jreback merged commit 1573340 into pandas-dev:master Dec 4, 2018
@jreback
Copy link
Contributor

jreback commented Dec 4, 2018

thanks a lot @TomAugspurger ; this consolidates a lot of disparate code!

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

def _format_strings(self):
fmt_values = format_array(self.values.get_values(), self.formatter,
fmt_values = format_array(array,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger : i'm struggling to resolve some formatting issues. what is the reason for calling format_array here. As far as I can tell is looping back round to create a GenericArrayFormatter instance with a formatter specified to pick up the display options.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess, to be more succinct, why is super()._format_strings() not used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not that familiar with this code, but from a quick look: calling super()._format_strings() would be different, as this would call GenericArrayFormatter._format_strings, while the generic format_array can still result in using custom formatters like Datetime64(TZ)Formatter or Timedelta64Formatter, depending on what the values of the underlying EA are.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although that most of those custom Formatter classes don't do much special if formatter is specified.

Eg Datetime64Formatter has this in _format_strings:

if self.formatter is not None and callable(self.formatter):
return [self.formatter(x) for x in values]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if ExtensionArrayFormatter is not inheriting from GenericArrayFormatter but calling format_array to dispatch to another ...ArrayFormatter class, why wouldn't the logic in ExtensionArrayFormatter be in format_array?

def _formatter(self, boxed=False):
def fmt(x):
if isna(x):
return 'NaN'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should NaN have been hardcoded here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? This is used for the Integer data display, where we currently use NaN.

(whether we should use rather 'NA' instead of 'NaN', that's another question)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering whether it would be a problem with to_string(na_rep=...). will do some tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, that's a good reason. But in general, this _formatter does not follow display options at all, is that correct?
In which case this is something to think about in general.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants