Tidy representation when truncating dfs #7086

bjonen · 2014-05-09T13:06:30Z

This PR closes #5603

Dataframes are now truncated centrally (similar to pd.Series).

Before:

After:

Ipython notebook:

Simple Index:

MultiLevel Index:

jreback · 2014-05-09T13:15:51Z

can you put up a before/after in the top of the PR?

jreback · 2014-05-09T13:30:27Z

pandas/core/format.py

-            cols_to_show = self.columns
+            if max_cols > 1:
+                col_num = (max_cols // 2)
+                frame = frame.iloc[:,:col_num].join(frame.iloc[:,-col_num:])


hmm, you are joining these here? seems that you should do this all at once with the ... column, e.g.

concat([left,str_sep_columns,right],axis=1) (for truncate_h)

I could do something like this, but this gives me a circular reference because concat requires an NDFrame which imports the format module. Any idea how to get around this?

col_num = (max_cols // 2)
left = frame.iloc[:,:col_num]
right = frame.iloc[:,-col_num:]
sep_col = DataFrame(data=['...'] * len(frame),index=frame.index,columns=['...'])
frame = concat((left,sep_col,right), axis=1)

something like this:

col_num = (max_cols // 2)
left = frame.iloc[:,:col_num]
right = frame.iloc[:,-col_num:]
sep_col = DataFrame(data=['...'] * (len(left)+len(right)),columns=['...'])
frame = concat((left,sep_col,right), axis=1, ignore_index=True)
frame.index = Index(left.index.to_list() + ['...'] + right.index.to_list())

the index is going to change anyhow because you are introducing a string column (so might as well just ignore it and then add it after)

But as far as I can see, I cannot import DataFrame because frame.py imports format.py.
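The idea discussed above (keep the outer columns and put a '...' separator column between them) can be sketched outside the library with the public `pd.concat`, so the circular import concern does not arise here. The helper name `truncate_columns_centrally` is hypothetical and is not the code that ended up in format.py; it only illustrates the left / separator / right concatenation.

```python
import numpy as np
import pandas as pd


def truncate_columns_centrally(frame, max_cols):
    """Keep the outer columns of `frame` and mark the gap with a '...' column.

    Hypothetical helper for illustration only; not the pandas implementation.
    """
    if max_cols < 2 or frame.shape[1] <= max_cols:
        return frame
    col_num = max_cols // 2
    left = frame.iloc[:, :col_num]
    right = frame.iloc[:, -col_num:]
    # Separator column shares the row index so concat aligns row by row.
    sep_col = pd.DataFrame({"...": ["..."] * len(frame)}, index=frame.index)
    return pd.concat([left, sep_col, right], axis=1)


df = pd.DataFrame(np.arange(50).reshape(5, 10))
out = truncate_columns_centrally(df, max_cols=4)
print(out)
```

Because the pieces are concatenated rather than joined, the approach does not depend on the column labels being unique.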

jreback · 2014-05-09T22:42:49Z

how long is the separating line? do we care? @jorisvandenbossche

also, are we changing defaults for max_rows / max_columns ?

jreback · 2014-05-10T12:18:05Z

also going to need a new sub-section in v0.14.0 showing pictures like this (FYI use np.arange when creating the frame instead of NaNs)

include the pictures as pngs (so u can get the before after)

bjonen · 2014-05-10T13:30:48Z

Should I place the pngs under pandas/doc/source and add them to the repo?

jreback · 2014-05-10T13:55:21Z

they go in pandas/doc/source/_static, see http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-13-0-january-3-2014 (look at the code in v0.13.0.txt), to see how to display it (and add to the repo)

bjonen · 2014-05-10T15:17:19Z

Ok added a section in the docs:

jreback · 2014-05-10T15:25:16Z

ok, that looks good

@jorisvandenbossche @TomAugspurger @jseabold @cpcloud

?

jreback · 2014-05-10T15:31:10Z

pandas/tests/test_format.py

@@ -1545,7 +1559,7 @@ def test_info_repr(self):
        # Wide
        h, w = max_rows-1, max_cols+1
        df = pandas.DataFrame(dict((k,np.arange(1,1+h)) for k in np.arange(w)))


can you add several tests here which go thru various indexes for index and columns

e.g. run the test something like:

for index in [tm.makeStringIndex, tm.makeFloatIndex ......]:
    for columns in [tm.makeStringIndex, .... ]:
        df = DataFrame(.....,index=index(h), columns=columns(w))

there are 6 make Index types (String,Period,DateTime,Integer,Float,Unicode)

Going through all indices is quite slow. Takes 6 seconds for just this one test method compared to 3 seconds for the whole file previously.

Is that ok, or should I take out some combinations?

ok, so just do it for the index then (and not the columns too)...
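A minimal sketch of the index-only variant agreed on above. It assumes plain pandas constructors in place of the `tm.make*Index` helpers, and it calls `to_string(max_rows=...)` directly so the result does not depend on terminal size. The dictionary `index_makers` and the sizes `h` and `w` are illustrative names, not taken from the test file.

```python
import numpy as np
import pandas as pd

h, w = 10, 3  # more rows than max_rows below, so vertical truncation kicks in

# One constructor per index flavour; stand-ins for the tm.make*Index helpers.
index_makers = {
    "string": lambda n: pd.Index(["row%d" % i for i in range(n)]),
    "integer": lambda n: pd.Index(np.arange(n)),
    "float": lambda n: pd.Index(np.arange(n) + 0.5),
    "datetime": lambda n: pd.date_range("2014-01-01", periods=n, freq="D"),
    "period": lambda n: pd.period_range("2014-01", periods=n, freq="M"),
}

reprs = {}
for name, make_index in index_makers.items():
    df = pd.DataFrame(np.arange(h * w).reshape(h, w), index=make_index(h))
    reprs[name] = df.to_string(max_rows=4)
    # A truncated frame must contain a row of dots, whatever the index type.
    assert ".." in reprs[name], name

print(reprs["datetime"])
```

Looping over the index only keeps the number of frames linear in the number of index types, which is the speed trade-off discussed above.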

jreback · 2014-05-10T15:31:44Z

@bjonen also, pls squash down to a smaller number of commits

TomAugspurger · 2014-05-10T18:58:57Z

You'll need to push your changes to the docs too.

I'd say change "and a ... signaled the end" to "and an ellipsis (...) signaled the end"

jorisvandenbossche · 2014-05-10T22:04:42Z

Maybe some performance tests are needed? I can remember there were some problems last time this was changed for 0.13?

jreback · 2014-05-10T22:41:36Z

@bjonen see here: https://github.com/pydata/pandas/wiki/Performance-Testing

pls post anything > 1.2

bjonen · 2014-05-10T22:49:00Z

Ok about the performance tests.

What about the other ways to output a df - to_html, to_latex etc.

I'm guessing that at least to_html is important to keep consistent with what to_string is doing. What do you guys think?

jreback · 2014-05-10T23:32:15Z

hmm; normally to_html outputs the full output because it's in a notebook which handles the full output via scrollbars.

I guess down to 3 repr types now:

full (no truncations),
truncate (what this PR is about)
info (summary view)

there are various options to switch from full to info now if the repr is too big, but truncate
is now much more the default

I think we might have to go over all the relevant options and put it out there on the ML
(this issue is already out there: #6547)

@jorisvandenbossche @jseabold ?
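The three repr types listed above can be compared with the public display options. This is a sketch assuming a current pandas, where `display.max_rows`, `display.max_columns` and `display.large_repr` are the relevant option names; the defaults under discussion in this thread may differ from today's.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(500).reshape(50, 10))

# full: no limits, so nothing is cut
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    full = repr(df)

# truncate: the frame exceeds the limits and is cut with dot markers
with pd.option_context("display.max_rows", 4, "display.max_columns", 4,
                       "display.large_repr", "truncate"):
    truncated = repr(df)

# info: the frame exceeds the limits and a summary is shown instead
with pd.option_context("display.max_rows", 4, "display.max_columns", 4,
                       "display.large_repr", "info"):
    info_view = repr(df)

print(truncated)
print(info_view)
```

`option_context` restores the previous settings on exit, so the comparison does not leak into the rest of a session.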

bjonen · 2014-05-11T01:59:50Z

I don't use the notebook much so I don't know what the max_rows/cols defaults are, but from the code it seems that truncation also applies to html representations in the notebook.

format.py l. 852

        if len(self.frame) > self.max_rows:
            row = [''] + (['...'] * ncols)
            self.write_tr(row, indent, self.indent_delta, tags=None,
                          nindex_levels=1)

I am mostly done adapting this, except for the part that takes care of hierarchical indices.

jreback · 2014-05-11T11:15:02Z

hmm I think the html repr needs to be consistent (as it's doing the same sort of truncation)

jreback · 2014-05-12T17:19:22Z

@TomAugspurger @jorisvandenbossche does html need to do the same? or is it normally turned off for html? (as in the notebook you have scroll bars for really large displays), or it just displays the info view?

cpcloud · 2014-05-12T17:34:33Z

this is super minor but how crazy would it be to use the ⋮ (u'\u22ee') for the column ellipsis and ⋯ (u'\u22ef') for the row ellipsis? even crazier might be to put ⋱ (u'\u22f1') on the diagonal
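For reference, the three code points mentioned above can be looked up with the standard library. This only prints their Unicode names; it says nothing about how a terminal or notebook font would render or align them.

```python
import unicodedata

# Code points taken from the comment above.
candidates = ["\u22ee", "\u22ef", "\u22f1"]
names = {ch: unicodedata.name(ch) for ch in candidates}

for ch, name in names.items():
    print("U+%04X %s %s" % (ord(ch), ch, name))
```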

jorisvandenbossche · 2014-05-12T19:47:28Z

Yes the html does the same (same rules for truncating or not, it also just follows the max_rows/max_columns options)

bjonen · 2014-05-12T19:49:12Z

@cpcloud I like the idea. I'll add it once everything is working as expected. Getting closer...

TomAugspurger · 2014-05-13T18:55:18Z

Would you be able to get the ellipsis to center in the truncated area?

In [12]: df
Out[12]: 
                  0         1       ...         8         9
0         -0.047904  0.145739       ... -0.088891  1.131782
1         -2.326429  0.992864       ...  0.222462 -0.281965
.         .........  ........       ...  ........ .........
8          0.587134 -0.042532       ... -0.629570  0.690751
9         -1.339637 -0.612544       ...  0.500358 -1.195189

[10 rows x 10 columns]

(The formatting Github outputs is pretty close to what I actually see.)
It looks like it does center correctly in your examples from the original post.

jreback · 2014-05-14T20:00:05Z

@bjonen rebase and you are up

jorisvandenbossche · 2014-05-14T20:35:05Z

I can only look at it tomorrow, so if I have comments, I will do it after merge.

bjonen · 2014-05-15T08:46:15Z

I have a problem with the u function (I think it should return u'\xd7', see below). To speed up the merging I took out the row col summary from the tests https://github.com/pydata/pandas/pull/7086/files#diff-d01c1548861395ceef4d69029d266a21R785

import pandas.compat
pandas.compat.u('×')  # the multiplication sign
Out[57]: u'\xc3\x97'
print u'\xc3\x97'
Ã�
print u'\xd7'
×

bjonen · 2014-05-15T09:18:08Z

@jreback There seems to be another problem with py3. If you have an idea for a fix please go ahead. Otherwise I can look into it later tonight.

jreback · 2014-05-15T11:53:51Z

u() doesn't do anything in py3 (it just returns the input)
in py2 it just wraps unicode(..,'unicode_escape') around the arg

it doesn't 'convert' anything, just escapes it

Don't worry about doing special escapes (ala @cpcloud suggestion), save that for another version.
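The `u'\xc3\x97'` result reported earlier is what one gets when the two UTF-8 bytes of the multiplication sign are decoded one byte at a time instead of as UTF-8. A short Python 3 sketch of that effect, assuming the `unicode_escape` codec is the one involved, as the comment above describes:

```python
times = "\u00d7"                    # the multiplication sign
raw = times.encode("utf-8")         # two bytes: 0xC3 0x97

# unicode_escape only interprets backslash escapes; every other byte is
# mapped to the code point with the same number, so two characters come out.
as_escape = raw.decode("unicode_escape")

# Decoding the same bytes as UTF-8 recovers the single original character.
as_utf8 = raw.decode("utf-8")

print(repr(raw), repr(as_escape), repr(as_utf8))
```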

jorisvandenbossche · 2014-05-15T13:22:01Z

Quickly tested it, and found a strange effect when using pd.options.display.max_rows = 5 (the ... are inserted between index 8 and 9 instead of between 1 and 8). Just in general for the rows it is inserted one row off.

See the notebook: http://nbviewer.ipython.org/github/jorisvandenbossche/scipy_notebooks/blob/master/pandas-pr-7086-repr_truncate_centrally.ipynb

Update: it's only in the notebook, not in the terminal.

jorisvandenbossche · 2014-05-15T14:40:51Z

And something else:

In the notebook always 3 dots (...) are used, while in the terminal not (there it depends on the width of the column?). Would it make sense to have this consistent? (also in the terminal always 3 dots). It was also like this before:

This PR:

In [19]: df = pd.DataFrame(np.arange(250).reshape(50,5))
In [21]: pd.options.display.max_rows = 2
In [22]: pd.options.display.max_columns = 2
In [23]: df
Out[23]:
        0 ...     4
0       0 ...     4
.       . ...     .
49    245 ...   249

[50 rows x 5 columns]

0.13:

In [7]: df
Out[7]:
   0  1
0  0  1 ...
1  5  6 ...
  .. ..

[50 rows x 5 columns]

Or was it for aesthetic reasons that it was chosen to change this? But I personally find the one dot in the first example a bit odd.

Also a Series always uses 3 dots.

bjonen · 2014-05-15T14:42:16Z

@jorisvandenbossche Thanks for the notebook

Here the +1 is the cause (https://github.com/pydata/pandas/pull/7086/files#diff-23878beaf55672cdc92c119f79fe492aR897). For hierarchical indices it's correct.

I'll fix it and push again.

jorisvandenbossche · 2014-05-15T14:45:55Z

@bjonen see also http://nbviewer.ipython.org/github/jorisvandenbossche/scipy_notebooks/blob/master/pandas-pr-7086-repr_truncate_centrally.ipynb#Here-also-something-strange: for another issue

bjonen · 2014-05-15T14:49:22Z

This is on purpose. If you have a single character column, this column will be expanded to three columns when truncating in the command line. A little wasteful, especially when you are dealing with many such columns. Also, I kind of like it because it underlines the line above where the split happens. In the notebook dots take up very little space so I chose the easy way there.

I see the point about the consistency though. I'm happy to change to 3 dots everywhere if you guys prefer it.

jreback · 2014-05-15T14:50:49Z

@bjonen you make a good point, but then again a single '.' might get confusing. I think let's make it 3 everywhere

jorisvandenbossche · 2014-05-15T14:54:05Z

I think in 0.13 it was as follows: always 3, unless it is a narrow column, then only 2 (as there is always place enough for 2 dots). For me that is also good, as I think 2 dots is better than 1 (it gives more the sense of continuation dots)

jorisvandenbossche · 2014-05-15T15:00:44Z

@bjonen What do you mean with "this column will be expanded to three columns"?

In 0.13, the dots don't determine the width of the column. I mean if you have a column of only 1's, the column is 3 chars wide in the terminal display, and this remains three if it is truncated, although the dots are wider (two chars wide, and a column of two chars (eg 10's) would be 4 characters wide in the terminal display)

bjonen · 2014-05-15T17:47:27Z

Ok, so the minimum column width is 3. I thought it was one and then the dots would "expand" the column.

However, looking at your example of 0.13 behavior from above, it seems that the dots were also reduced to two when the column had only one character. Otherwise one can get a full line of dots which looks weird I think.

Also in this PR, the index has a ..., which the previous version didn't.

bjonen · 2014-05-15T20:13:08Z

What do you guys prefer when truncating vertically

Constant 3 dots

In [9]: pd.DataFrame(index=[1]*10,columns=['a','b'],data=[[1]*2]*10)
Out[9]: 
     a   b 
1    1   1 
1    1   1 
... ... ...
1    1   1 
1    1   1 

[10 rows x 2 columns]

3 dots or if the element above is only one character then 2 dots. In this case, leave the html repr at 3 dots or change that too?

In [13]: pd.DataFrame(index=[1]*10,columns=['a','b'],data=[[1]*2]*10)
Out[13]: 
    a   b 
1   1   1 
1   1   1 
.. ... ...
1   1   1 
1   1   1 

[10 rows x 2 columns]

or something entirely different?

jorisvandenbossche · 2014-05-15T20:16:10Z

@bjonen Maybe follow the logic of 0.13: so always 3 dots, only when the column width is 3, then only 2 dots (so you don't get a continuous full line of dots)?

So in your example above this would be 3 times 2 dots
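The rule proposed above can be written down as a small sketch. The function names `row_ellipsis` and `separator_row` are hypothetical, and treating widths below 3 the same as width 3 is an assumption, since the thread only mentions a minimum column width of 3.

```python
def row_ellipsis(col_width):
    """Marker for one column: 2 dots for a narrow column, otherwise 3.

    Assumption: widths <= 3 count as narrow, so a row of markers never
    turns into one continuous line of dots.
    """
    return ".." if col_width <= 3 else "..."


def separator_row(col_widths):
    """Build the truncation row, right-justifying each marker in its column."""
    return " ".join(row_ellipsis(w).rjust(w) for w in col_widths)


print(separator_row([3, 3, 3]))
print(separator_row([3, 8, 10]))
```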

bjonen · 2014-05-16T07:46:41Z

I need to replace join with concat here https://github.com/pydata/pandas/pull/7086/files#diff-23878beaf55672cdc92c119f79fe492aR327 but I cannot do

from pandas.tools.merge import concat

in format.py because merge imports format. Any idea?
This is important since join will fail on columns with the same label.

jreback · 2014-05-16T10:19:30Z

do it just before u need it and it will work
u cannot do circular imports at the module level
but inside a function it is fine

jreback · 2014-05-16T13:51:16Z

pandas/core/format.py

+        self._chk_truncate()
+
+    def _chk_truncate(self):
+        truncate_h = self.max_cols and (len(self.columns) > self.max_cols)


just do a from pandas.tools.merge import concat inside _chk_truncate, it will work (putting at the TOP of the module will not as that is a circular import)
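A self-contained sketch of the function-level import trick described above. It writes two throwaway modules, `fmt_demo` and `merge_demo` (hypothetical names standing in for format.py and merge.py), into a temporary directory. `merge_demo` imports `fmt_demo` at module level, while `fmt_demo` imports from `merge_demo` only inside the function that needs it.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

(tmp / "fmt_demo.py").write_text(textwrap.dedent("""
    def truncate(values, max_items):
        # Deferred import: resolved at call time, after both modules exist.
        from merge_demo import concat
        half = max_items // 2
        return concat([values[:half], ["..."], values[-half:]])
"""))

(tmp / "merge_demo.py").write_text(textwrap.dedent("""
    import fmt_demo  # module-level import in the other direction

    def concat(parts):
        return [item for part in parts for item in part]
"""))

sys.path.insert(0, str(tmp))
fmt_demo = importlib.import_module("fmt_demo")

result = fmt_demo.truncate(list(range(10)), 4)
print(result)  # [0, 1, '...', 8, 9]
```

By the time `truncate` runs, `fmt_demo` is fully loaded, so the import of `merge_demo` (which in turn imports `fmt_demo`) finds a complete module in `sys.modules`.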

jreback · 2014-05-16T14:31:26Z

@bjonen as soon as you are ready, lmk

or seems ok now (at least for the RC). can you squash?

(can fix the concat/join issue afterwards as it's internal)

bjonen · 2014-05-16T18:52:59Z

@jreback Thanks for the hint with the circular reference - switched to concat. Also updated the docs with the correct pictures and moved to the right section. From my side this is ready.

Anything else?

jreback · 2014-05-16T19:06:30Z

looks good! ping on green

@jorisvandenbossche ?

jreback · 2014-05-16T19:35:00Z

ok...bombs away

Tidy representation when truncating dfs

jreback · 2014-05-16T19:45:20Z

@bjonen thanks...this is really excellent!

pls look at the docs when they are built (travis builds them on master), and if you see anything pls submit a followup PR http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#display-changes

jreback · 2014-05-16T19:58:44Z

ok...docs are updated.....

@TomAugspurger @jorisvandenbossche @cpcloud @bjonen read over whatsnew (esp display) section and lmk if anything unclear!

cpcloud · 2014-05-16T20:04:53Z

@jreback looks good!

@bjonen nice work, liking the new repr!

jreback added Enhancement labels May 9, 2014

jreback modified the milestones: 0.14.1, 0.14.0 May 9, 2014

jreback reviewed May 9, 2014

jreback reviewed May 10, 2014

bjonen mentioned this pull request May 12, 2014

Truncated view for MultiIndex pd.Series broken #7101

Closed

jreback reviewed May 16, 2014

ENH: Centrally truncate DataFrames

d917e07

jreback added a commit that referenced this pull request May 16, 2014

Merge pull request #7086 from bjonen/adj_trunc

48729e2

Tidy representation when truncating dfs

jreback merged commit 48729e2 into pandas-dev:master May 16, 2014

bjonen mentioned this pull request May 18, 2014

ENH: Change truncation ellispsis to unicode character #7167

Closed
