Tidy representation when truncating dfs #7086

Merged
merged 1 commit into from
May 16, 2014

Conversation

bjonen
Contributor

@bjonen bjonen commented May 9, 2014

This PR closes #5603

Dataframes are now truncated centrally (similar to pd.Series).

Before:
trunc_before
After:
trunc_after

Ipython notebook:

Simple Index:
notebook_trunc

MultiLevel Index:
notebook_multi

@jreback
Contributor

jreback commented May 9, 2014

can you put up a before/after in the top of the PR?

@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 9, 2014
cols_to_show = self.columns
if max_cols > 1:
    col_num = (max_cols // 2)
    frame = frame.iloc[:,:col_num].join(frame.iloc[:,-col_num:])
Contributor

hmm, you are joining these here? seems that you should do this all at once with the ... column, e.g.

concat([left,str_sep_columns,right],axis=1) (for truncate_h)

Contributor Author

I could do something like this, but this gives me a circular reference because concat requires an NDFrame which imports the format module. Any idea how to get around this?

    col_num = (max_cols // 2)
    left = frame.iloc[:,:col_num]
    right = frame.iloc[:,-col_num:]
    sep_col = DataFrame(data=['...'] * len(frame),index=frame.index,columns=['...'])
    frame = concat((left,sep_col,right), axis=1)

Contributor

something like this:

  col_num = (max_cols // 2)
  left = frame.iloc[:,:col_num]
  right = frame.iloc[:,-col_num:]
  # one '...' per row; reuse frame.index so the separator aligns with left and right
  sep_col = DataFrame(data=['...'] * len(frame), index=frame.index, columns=['...'])
  frame = concat((left,sep_col,right), axis=1, ignore_index=True)
  # ignore_index=True drops the column labels, so restore them afterwards
  frame.columns = Index(left.columns.tolist() + ['...'] + right.columns.tolist())

the column index is going to change anyhow because you are introducing a string column (so might as well just ignore it and then add it after)
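
The column split discussed above can be tried outside the formatter. The following is a minimal sketch against the public pandas API, not the code that was merged; the helper name `truncate_columns_centrally` is hypothetical and only illustrates the left / separator / right layout.

```python
import numpy as np
import pandas as pd


def truncate_columns_centrally(frame, max_cols):
    """Hypothetical helper: keep the outer columns and put '...' between them."""
    if max_cols < 2 or len(frame.columns) <= max_cols:
        return frame
    col_num = max_cols // 2
    left = frame.iloc[:, :col_num]
    right = frame.iloc[:, -col_num:]
    # The separator shares the frame's row index, so concat needs no realignment.
    sep_col = pd.DataFrame({"...": ["..."] * len(frame)}, index=frame.index)
    return pd.concat([left, sep_col, right], axis=1)


df = pd.DataFrame(np.arange(50).reshape(5, 10))
truncated = truncate_columns_centrally(df, max_cols=4)
print(truncated)
```

Because `concat` simply places the three blocks side by side, it does not need the column labels to be unique, which is the property `join` lacks.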

Contributor Author

But as far as I can see, I cannot import DataFrame because frame.py imports format.py.

@jreback
Contributor

jreback commented May 9, 2014

how long is the separating line? do we care? @jorisvandenbossche

also, are we changing defaults for max_rows / max_columns ?

@jreback
Contributor

jreback commented May 10, 2014

also going to need a new sub-section in the v0.14.0 whatsnew showing pictures like this (FYI use np.arange when creating the frame instead of NaNs)

include the pictures as PNGs (so you can show the before/after)

@bjonen
Contributor Author

bjonen commented May 10, 2014

Should I place the pngs under pandas/doc/source and add them to the repo?

@jreback
Contributor

jreback commented May 10, 2014

they go in pandas/doc/source/_static, see http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-13-0-january-3-2014 (look at the code in v0.13.0.txt), to see how to display it (and add to the repo)

@bjonen
Contributor Author

bjonen commented May 10, 2014

Ok added a section in the docs:

screenshot from 2014-05-10 17 15 52

@jreback
Contributor

jreback commented May 10, 2014

ok, that looks good

@jorisvandenbossche @TomAugspurger @jseabold @cpcloud

?

@@ -1545,7 +1559,7 @@ def test_info_repr(self):
# Wide
h, w = max_rows-1, max_cols+1
df = pandas.DataFrame(dict((k,np.arange(1,1+h)) for k in np.arange(w)))
Contributor

can you add several tests here which go through various index types for both the index and the columns

e.g. run the test something like:

for index in [tm.makeStringIndex, tm.makeFloatIndex ......]:
    for columns in [tm.makeStringIndex, .... ]:
        df = DataFrame(.....,index=index(h), columns=columns(w))

there are 6 make Index types (String, Period, DateTime, Integer, Float, Unicode)
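
A loop of this kind can be sketched with public constructors only. This is an illustration rather than the test that was added to the PR: the `index_makers` mapping is a hypothetical stand-in for the `tm.make...Index` helpers, and `to_string(max_rows=..., max_cols=...)` is used so the result does not depend on terminal detection.

```python
import numpy as np
import pandas as pd

h, w = 10, 10  # larger than the limits used below, so truncation kicks in

# Hypothetical stand-ins for the tm.make...Index helpers named above.
index_makers = {
    "string": lambda n: pd.Index([f"row{i}" for i in range(n)]),
    "integer": lambda n: pd.Index(np.arange(n)),
    "float": lambda n: pd.Index(np.arange(n, dtype="float64")),
    "datetime": lambda n: pd.date_range("2014-05-09", periods=n, freq="D"),
    "period": lambda n: pd.period_range("2014-05-09", periods=n, freq="D"),
}

outputs = {}
for name, make_index in index_makers.items():
    df = pd.DataFrame(np.arange(h * w).reshape(h, w), index=make_index(h))
    # Keep 2 rows/columns at each end; the middle is replaced by dots.
    outputs[name] = df.to_string(max_rows=4, max_cols=4)

for name, text in outputs.items():
    print(name)
    print(text)
```

Looping over the row index only, as agreed later in this thread, keeps the number of combinations (and the runtime) linear in the number of index types.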

Contributor Author

Going through all indices is quite slow. Takes 6 seconds for just this one test method compared to 3 seconds for the whole file previously.

Is that ok, or should I take out some combinations?

Contributor

ok, so just do it for the index then (and not the columns too)...

@jreback
Contributor

jreback commented May 10, 2014

@bjonen also, pls squash down to a smaller number of commits

@TomAugspurger
Contributor

You'll need to push your changes to the docs too.

I'd say change "and a ... signaled the end" to "and an ellipsis (...) signaled the end"

@jorisvandenbossche
Member

Maybe some performance tests are needed? I remember there were some problems the last time this was changed, for 0.13.

@jreback
Contributor

jreback commented May 10, 2014

@bjonen see here: https://github.com/pydata/pandas/wiki/Performance-Testing

pls post anything > 1.2

@bjonen
Contributor Author

bjonen commented May 10, 2014

Ok about the performance tests.

What about the other ways to output a df - to_html, to_latex etc.

I'm guessing that at least to_html is important to keep consistent with what to_string is doing. What do you guys think?

@jreback
Contributor

jreback commented May 10, 2014

hmm; normally to_html outputs the full frame because it's shown in a notebook, which handles large output via scrollbars.

I guess we are down to 3 repr types now:

  • full (no truncations),
  • truncate (what this PR is about)
  • info (summary view)

there are various options to switch from full to info if the repr is too big, but truncate
is now much more the default

I think we might have to go over all the relevant options and put it out there on the ML
(this issue is already out there: #6547)

@jorisvandenbossche @jseabold ?
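
For readers who want to see the three repr types side by side, here is a small sketch using the `display.max_rows`, `display.max_columns` and `display.large_repr` options. The frame and the chosen limits are arbitrary, and the behaviour shown is that of a current pandas release, which may differ in detail from 0.14.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(10, 3), columns=["a", "b", "c"])

# full: limits lifted, so every row is printed
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    full = repr(df)

# truncate: rows beyond max_rows are replaced by a row of dots
with pd.option_context("display.max_rows", 4, "display.max_columns", 20,
                       "display.large_repr", "truncate"):
    truncated = repr(df)

# info: an oversized frame falls back to the df.info() summary
with pd.option_context("display.max_rows", 4, "display.max_columns", 20,
                       "display.large_repr", "info"):
    info = repr(df)

for label, text in [("full", full), ("truncate", truncated), ("info", info)]:
    print(f"--- {label} ---")
    print(text)
```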

@bjonen
Contributor Author

bjonen commented May 11, 2014

I don't use the notebook much so I don't know what the max_rows/cols defaults are, but from the code it seems that truncation also applies to html representations in the notebook.

format.py l. 852

        if len(self.frame) > self.max_rows:
            row = [''] + (['...'] * ncols)
            self.write_tr(row, indent, self.indent_delta, tags=None,
                          nindex_levels=1)

I am mostly done adapting this, except for the part that takes care of hierarchical indices.

@jreback
Contributor

jreback commented May 11, 2014

hmm I think the html repr needs to be consistent (as it's doing the same sort of truncation)

@jreback
Contributor

jreback commented May 12, 2014

@TomAugspurger @jorisvandenbossche does html need to do the same? or is it normally turned off for html (as in the notebook you have scroll bars for really large displays), or does it just display the info view?

@cpcloud
Member

cpcloud commented May 12, 2014

this is super minor but how crazy would it be to use the (u'\u22ee') for the column ellipsis and (u'\u22ef') for the row ellipsis? even crazier might be to put (u'\u22f1') on the diagonal
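
The three code points mentioned here can be checked with the standard library. This sketch only looks up their official Unicode names; the role labels in the dictionary are taken from the comment above.

```python
import unicodedata

# Roles as described in the comment; the dict itself is just for illustration.
candidates = {
    "column ellipsis": "\u22ee",
    "row ellipsis": "\u22ef",
    "diagonal": "\u22f1",
}

names = {role: unicodedata.name(ch) for role, ch in candidates.items()}
for role, ch in candidates.items():
    print(f"{role}: {ch} U+{ord(ch):04X} {names[role]}")
```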

@jorisvandenbossche
Member

Yes the html does the same (same rules for truncating or not, it also just follows the max_rows/max_columns options)
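
A quick way to compare the two outputs is to pass the same limits to `to_string` and `to_html` explicitly. This is a sketch against a current pandas release; passing `max_rows` and `max_cols` directly is an assumption made here to avoid depending on the global display options.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape(10, 10))

# Same limits for both renderers: 2 rows/columns are kept at each end.
text = df.to_string(max_rows=4, max_cols=4)
html = df.to_html(max_rows=4, max_cols=4)

print(text)
print(html[:300])
```

Both renderers drop the same central rows and columns and mark the gap with dots, so values from the middle of the frame appear in neither output.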

@bjonen
Contributor Author

bjonen commented May 12, 2014

@cpcloud I like the idea. I'll add it once everything is working as expected. Getting closer...

@TomAugspurger
Contributor

Would you be able to get the ellipsis to center in the truncated area?

In [12]: df
Out[12]: 
                  0         1       ...         8         9
0         -0.047904  0.145739       ... -0.088891  1.131782
1         -2.326429  0.992864       ...  0.222462 -0.281965
.         .........  ........       ...  ........ .........
8          0.587134 -0.042532       ... -0.629570  0.690751
9         -1.339637 -0.612544       ...  0.500358 -1.195189

[10 rows x 10 columns]

(The formatting Github outputs is pretty close to what I actually see.)
It looks like it does center correctly in your examples from the original post.

@jreback
Contributor

jreback commented May 14, 2014

@bjonen rebase and you are up

@jorisvandenbossche
Member

I can only look at it tomorrow, so if I have comments, I will do it after merge.

@bjonen
Contributor Author

bjonen commented May 15, 2014

I have a problem with the u function (I think it should return u'\xd7', see below). To speed up the merging I took out the row col summary from the tests https://github.com/pydata/pandas/pull/7086/files#diff-d01c1548861395ceef4d69029d266a21R785

import pandas.compat
pandas.compat.u('×')  # the multiplication sign
Out[57]: u'\xc3\x97'
print u'\xc3\x97'
Ã
print u'\xd7'
×

@bjonen
Contributor Author

bjonen commented May 15, 2014

@jreback There seems to be another problem with py3. If you have an idea for a fix please go ahead. Otherwise I can look into it later tonight.

@jreback
Contributor

jreback commented May 15, 2014

u() doesn't do anything in py3 (it just returns the input)
in py2 it just wraps unicode(..,'unicode_escape') around the arg

it doesn't 'convert' anything, just escapes it

Don't worry about doing special escapes (à la @cpcloud's suggestion), save that for another version.
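
The `u'\xc3\x97'` result reported above can be reproduced in Python 3 without pandas. This sketch assumes, as the session output suggests, that the Python 2 helper decodes a UTF-8 byte string with the `unicode_escape` codec; that codec maps every non-escape byte to the code point of the same value, so the two UTF-8 bytes of the multiplication sign become two separate characters.

```python
mult = "\u00d7"                      # the multiplication sign
utf8_bytes = mult.encode("utf-8")    # how a UTF-8 source file stores it

# unicode_escape treats each plain byte as one Latin-1 character.
mojibake = utf8_bytes.decode("unicode_escape")

# Decoding as UTF-8 instead gives back the single intended character.
roundtrip = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes), repr(mojibake), repr(roundtrip))
```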

@jorisvandenbossche
Member

Quickly tested it, and found a strange effect when using pd.options.display.max_rows = 5 (the ... are inserted between index 8 and 9 instead of between 1 and 8). Just in general for the rows it is inserted one row off.

See the notebook: http://nbviewer.ipython.org/github/jorisvandenbossche/scipy_notebooks/blob/master/pandas-pr-7086-repr_truncate_centrally.ipynb

Update: it's only in the notebook, not in the terminal.

@jorisvandenbossche
Member

And something else:

  • In the notebook 3 dots (...) are always used, while in the terminal they are not (there it depends on the width of the column?). Would it make sense to make this consistent (so always 3 dots in the terminal too)? It was also like this before:

This PR:

In [19]: df = pd.DataFrame(np.arange(250).reshape(50,5))
In [21]: pd.options.display.max_rows = 2
In [22]: pd.options.display.max_columns = 2
In [23]: df
Out[23]:
        0 ...     4
0       0 ...     4
.       . ...     .
49    245 ...   249

[50 rows x 5 columns]

0.13:

In [7]: df
Out[7]:
   0  1
0  0  1 ...
1  5  6 ...
  .. ..

[50 rows x 5 columns]

Or was this changed for aesthetic reasons? Personally I find the single dot in the first example a bit odd.

Also a Series always uses 3 dots.

@bjonen
Contributor Author

bjonen commented May 15, 2014

@jorisvandenbossche Thanks for the notebook

Here the +1 is the cause (https://github.com/pydata/pandas/pull/7086/files#diff-23878beaf55672cdc92c119f79fe492aR897). For hierarchical indices it's correct.

I'll fix it and push again.

@bjonen
Contributor Author

bjonen commented May 15, 2014

The single dot is on purpose. If you have a single character column, this column will be expanded to three columns when truncating in the command line. That is a little wasteful, especially when you are dealing with many such columns. Also, I kind of like it because it underlines the line above where the split happens. In the notebook dots take up very little space so I chose the easy way there.

I see the point about the consistency though. I'm happy to change to 3 dots everywhere if you guys prefer it.

@jreback
Contributor

jreback commented May 15, 2014

@bjonen you make a good point, but then again a single '.' might get confusing. I think let's make it 3 everywhere

@jorisvandenbossche
Member

I think in 0.13 it was as follows: always 3, unless it is a narrow column, then only 2 (as there is always enough room for 2 dots). For me that is also good, as I think 2 dots is better than 1 (it gives more of a sense of continuation dots).

@jorisvandenbossche
Member

@bjonen What do you mean by "this column will be expanded to three columns"?

In 0.13, the dots don't determine the width of the column. I mean, if you have a column of only 1's, the column is 3 chars wide in the terminal display, and it stays 3 chars wide when truncated, even though the dots are two chars wide (a column of two-char values, e.g. 10's, would be 4 chars wide in the terminal display).

@bjonen
Contributor Author

bjonen commented May 15, 2014

Ok, so the minimum column width is 3. I thought it was one and then the dots would "expand" the column.

However, looking at your example of 0.13 behavior from above, it seems that the dots were also reduced to two when the column had only one character. Otherwise one can get a full line of dots which looks weird I think.

Also in this PR, the index has a ..., which the previous version didn't.

@bjonen
Contributor Author

bjonen commented May 15, 2014

What do you guys prefer when truncating vertically?

  1. Constant 3 dots
In [9]: pd.DataFrame(index=[1]*10,columns=['a','b'],data=[[1]*2]*10)
Out[9]: 
     a   b 
1    1   1 
1    1   1 
... ... ...
1    1   1 
1    1   1 

[10 rows x 2 columns]
  2. 3 dots, or 2 dots if the element above is only one character. In this case, leave the html repr at 3 dots or change that too?
In [13]: pd.DataFrame(index=[1]*10,columns=['a','b'],data=[[1]*2]*10)
Out[13]: 
    a   b 
1   1   1 
1   1   1 
.. ... ...
1   1   1 
1   1   1 

[10 rows x 2 columns]

or something entirely different?

@jorisvandenbossche
Member

@bjonen Maybe follow the logic of 0.13: always 3 dots, except when the column width is 3, in which case only 2 dots (so you don't get a continuous full line of dots)?

So in your example above this would be 3 times 2 dots
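
The rule proposed here is small enough to write down. The sketch below is purely illustrative: `row_dots` and `dotted_row` are hypothetical helpers, not functions from `format.py`, and the single-space column separator is an assumption.

```python
def row_dots(col_width):
    """Hypothetical rule: 2 dots for the minimum column width of 3, else 3 dots."""
    return ".." if col_width <= 3 else "..."


def dotted_row(col_widths, sep=" "):
    """Render the separator row, right-aligning the dots in each column."""
    return sep.join(row_dots(w).rjust(w) for w in col_widths)


narrow = dotted_row([3, 3, 3])   # index plus two single-character columns
wide = dotted_row([3, 9, 9])     # index plus two wider columns
print(repr(narrow))
print(repr(wide))
```

With two dots in a three-character column there is always at least one blank between neighbouring cells, so the separator row never merges into one unbroken line of dots.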

@bjonen
Contributor Author

bjonen commented May 16, 2014

I need to replace join with concat here https://github.com/pydata/pandas/pull/7086/files#diff-23878beaf55672cdc92c119f79fe492aR327 but I cannot do

from pandas.tools.merge import concat

in format.py because merge imports format. Any idea?
This is important since join will fail on columns with the same label.

@jreback
Contributor

jreback commented May 16, 2014

do it just before u need it and it will work
u cannot do circular imports at the module level
but inside a function it is fine

self._chk_truncate()

def _chk_truncate(self):
    truncate_h = self.max_cols and (len(self.columns) > self.max_cols)
Contributor

just do a from pandas.tools.merge import concat inside _chk_truncate, it will work (putting at the TOP of the module will not as that is a circular import)
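
The deferred-import pattern can be demonstrated with two throwaway modules. Everything in this sketch is hypothetical: `format_demo` and `merge_demo` merely play the roles of `format.py` and the merge module, and they are written to a temporary directory so the example is self-contained.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Stand-in for format.py: the import happens inside the function body.
FORMAT_SRC = textwrap.dedent('''
    def truncate(parts):
        # Resolved at call time, after both modules have finished loading.
        from merge_demo import concat
        return concat(parts)
''')

# Stand-in for the merge module: it imports the formatter at module level.
MERGE_SRC = textwrap.dedent('''
    import format_demo

    def concat(parts):
        return " ... ".join(parts)
''')

tmp = tempfile.mkdtemp()
Path(tmp, "format_demo.py").write_text(FORMAT_SRC)
Path(tmp, "merge_demo.py").write_text(MERGE_SRC)
sys.path.insert(0, tmp)
importlib.invalidate_caches()

import format_demo  # loads without touching merge_demo

result = format_demo.truncate(["left", "right"])
print(result)
```

Importing `format_demo` never triggers `merge_demo`, so the cycle is only closed once `truncate` is called, by which time `format_demo` is fully initialised.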

@jreback
Contributor

jreback commented May 16, 2014

@bjonen as soon as you are ready, lmk

or seems ok now (at least for the RC). can you squash?

(can fix the concat/join issue afterwards as it's internal)

@bjonen
Contributor Author

bjonen commented May 16, 2014

@jreback Thanks for the hint with the circular reference - switched to concat. Also updated the docs with the correct pictures and moved to the right section. From my side this is ready.

Anything else?

@jreback
Contributor

jreback commented May 16, 2014

looks good! ping on green

@jorisvandenbossche ?

@jreback
Contributor

jreback commented May 16, 2014

ok...bombs away

jreback added a commit that referenced this pull request May 16, 2014
Tidy representation when truncating dfs
@jreback jreback merged commit 48729e2 into pandas-dev:master May 16, 2014
@jreback
Contributor

jreback commented May 16, 2014

@bjonen thanks...this is really excellent!

pls look at the docs when they are built (travis builds them on master), and if you see anything pls submit a followup PR http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#display-changes

@jreback
Contributor

jreback commented May 16, 2014

ok...docs are updated.....

@TomAugspurger @jorisvandenbossche @cpcloud @bjonen read over whatsnew (esp display) section and lmk if anything unclear!

@cpcloud
Member

cpcloud commented May 16, 2014

@jreback looks good!

@bjonen nice work, liking the new repr!

Labels
Enhancement Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Truncate DataFrames centrally, rather than at one end
5 participants