Tidy representation when truncating dfs #7086
can you put up a before/after in the top of the PR? |
cols_to_show = self.columns
if max_cols > 1:
    col_num = (max_cols // 2)
    frame = frame.iloc[:,:col_num].join(frame.iloc[:,-col_num:])
hmm, you are joining these here? It seems that you should do this all at once with the ... column, e.g.
concat([left, str_sep_columns, right], axis=1)
(for truncate_h)
I could do something like this, but this gives me a circular reference because concat requires an NDFrame which imports the format module. Any idea how to get around this?
col_num = (max_cols // 2)
left = frame.iloc[:,:col_num]
right = frame.iloc[:,-col_num:]
sep_col = DataFrame(data=['...'] * len(frame),index=frame.index,columns=['...'])
frame = concat((left,sep_col,right), axis=1)
something like this:
col_num = (max_cols // 2)
left = frame.iloc[:,:col_num]
right = frame.iloc[:,-col_num:]
sep_col = DataFrame(data=['...'] * (len(left)+len(right)),columns=['...'])
frame = concat((left,sep_col,right), axis=1, ignore_index=True)
frame.index = Index(left.index.to_list() + ['...'] + right.index.to_list())
the index is going to change anyhow because you are introducing a string column (so might as well just ignore it and then add it after)
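A minimal sketch of the "do it all at once with concat" idea from this thread, written against the public `pd.concat` API of a current pandas rather than the internals being discussed. The helper name `truncate_columns` is hypothetical, and the sketch assumes the separator column should share the frame's row index so the three pieces align row by row.

```python
import numpy as np
import pandas as pd


def truncate_columns(frame: pd.DataFrame, max_cols: int) -> pd.DataFrame:
    """Keep the first and last max_cols // 2 columns with a '...' column between.

    Hypothetical helper for illustration; not the code merged in the PR.
    """
    if max_cols < 2 or len(frame.columns) <= max_cols:
        return frame
    col_num = max_cols // 2
    left = frame.iloc[:, :col_num]
    right = frame.iloc[:, -col_num:]
    # The separator reuses the row index, so concat needs no realignment.
    sep_col = pd.DataFrame({"...": ["..."] * len(frame)}, index=frame.index)
    return pd.concat([left, sep_col, right], axis=1)


df = pd.DataFrame(np.arange(30).reshape(3, 10))
truncated = truncate_columns(df, max_cols=4)
print(truncated)
```

Because the separator holds strings, the resulting column index has object dtype, which is the point made above about the index changing anyhow.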
But as far as I can see, I cannot import DataFrame because frame.py imports format.py.
how long is the separating line? do we care? @jorisvandenbossche also, are we changing defaults for |
also going to need a new sub-section in v0.14.0 showing pictures like this (FYI use np.arange when creating the frame instead of Nan's) include the pictures as pngs (so u can get the before after) |
Should I place the pngs under pandas/doc/source and add them to the repo? |
they go in |
ok, that looks good @jorisvandenbossche @TomAugspurger @jseabold @cpcloud ? |
@@ -1545,7 +1559,7 @@ def test_info_repr(self):
# Wide
h, w = max_rows-1, max_cols+1
df = pandas.DataFrame(dict((k,np.arange(1,1+h)) for k in np.arange(w)))
can you add several tests here which go through various index types for the index and columns?
e.g. run the test something like:
for index in [tm.makeStringIndex, tm.makeFloatIndex ......]:
    for columns in [tm.makeStringIndex, .... ]:
        df = DataFrame(.....,index=index(h), columns=columns(w))
there are 6 make*Index types (String,Period,DateTime,Integer,Float,Unicode)
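The loop requested here could look roughly like the sketch below. It assumes a current pandas, where the `tm.make*Index` helpers are not part of the public API, so `make_indexes` is a hypothetical stand-in that builds comparable indexes by hand (string and unicode collapse into one case on Python 3). It only varies the row index, and checks via `to_string(max_rows=...)` that each frame is actually shortened.

```python
import numpy as np
import pandas as pd


def make_indexes(n):
    """Hypothetical stand-ins for the tm.make*Index helpers."""
    return {
        "string": pd.Index([f"k{i}" for i in range(n)]),
        "integer": pd.Index(np.arange(n)),
        "float": pd.Index(np.arange(n, dtype="float64") + 0.5),
        "datetime": pd.date_range("2000-01-01", periods=n, freq="D"),
        "period": pd.period_range("2000-01", periods=n, freq="M"),
    }


def render_truncated(h=10, w=3, max_rows=4):
    """Render one vertically truncated frame per index type."""
    rendered = {}
    for name, index in make_indexes(h).items():
        df = pd.DataFrame(np.arange(h * w).reshape(h, w), index=index)
        rendered[name] = df.to_string(max_rows=max_rows)
    return rendered


rendered = render_truncated()
for name, text in rendered.items():
    print(name)
    print(text)
```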
Going through all indices is quite slow. Takes 6 seconds for just this one test method compared to 3 seconds for the whole file previously.
Is that ok, or should I take out some combinations?
ok, so just do it for the index then (and not the columns too)...
@bjonen also, pls squash down to a smaller number of commits |
You'll need to push your changes to the docs too. I'd say change "and a ... signaled the end" to "and an ellipsis (...) signaled the end"
Maybe some performance tests are needed? I remember there were some problems the last time this was changed, for 0.13.
@bjonen see here: https://github.com/pydata/pandas/wiki/Performance-Testing pls post anything > 1.2 |
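For a quick local sanity check before running the project's performance suite, a rough timing of `repr` on a large frame can be taken with the standard library. This is only a sketch: `time_repr` is a hypothetical helper, the frame size and display options are arbitrary, and the 1.2 threshold mentioned above refers to the ratio reported by the linked tooling, not to anything computed here.

```python
import timeit

import numpy as np
import pandas as pd


def time_repr(n_rows=1000, n_cols=100, number=5):
    """Return the best observed seconds per repr() call for a large frame."""
    df = pd.DataFrame(np.arange(n_rows * n_cols).reshape(n_rows, n_cols))
    with pd.option_context("display.max_rows", 60, "display.max_columns", 20):
        timings = timeit.repeat(lambda: repr(df), number=number, repeat=3)
    return min(timings) / number


seconds = time_repr()
print(f"repr of truncated frame: {seconds * 1e3:.2f} ms per call")
```

Running the same function on the base and the PR branch and dividing the two results gives a crude ratio to compare against.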
Ok about the performance tests. What about the other ways to output a df - to_html, to_latex etc. I'm guessing that at least to_html is important to keep consistent with what to_string is doing. What do you guys think? |
hmm; normally to_html outputs the full output because it's in a notebook, which handles the full output via scrollbars. I guess we're down to 3 repr types now:
there are various options to switch from full to info now if the repr is too big, but for truncate I think we might have to go over all the relevant options and put it out there on the ML
I don't use the notebook much so I don't know what the max_rows/cols defaults are, but from the code it seems that truncation also applies to html representations in the notebook. format.py l. 852:
if len(self.frame) > self.max_rows:
    row = [''] + (['...'] * ncols)
    self.write_tr(row, indent, self.indent_delta, tags=None,
                  nindex_levels=1)
I am mostly done adapting this, except for the part that takes care of hierarchical indices.
hmm I think the html repr needs to be consistent (as it's doing the same sort of truncation) |
@TomAugspurger @jorisvandenbossche does html need to do the same? or is it normally turned off for html? (as in the notebook you have scroll bars for really large displays), or does it just display the info view?
this is super minor but how crazy would it be to use the |
Yes the html does the same (same rules for truncating or not, it also just follows the max_rows/max_columns options) |
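The consistency between the text and HTML output can be checked with a short sketch like the one below. It assumes a current pandas, where both `DataFrame.to_string` and `DataFrame.to_html` accept `max_rows` and `max_cols`; the frame contents are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape(10, 10))

# Same limits for both renderers, passed explicitly instead of via options.
text = df.to_string(max_rows=4, max_cols=4)
html = df.to_html(max_rows=4, max_cols=4)

print(text)
print("ellipsis in text:", "..." in text)
print("ellipsis in html:", "..." in html)
```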
@cpcloud I like the idea. I'll add it once everything is working as expected. Getting closer... |
Would you be able to get the ellipsis to center in the truncated area?
In [12]: df
Out[12]:
           0          1  ...          8          9
0  -0.047904   0.145739  ...  -0.088891   1.131782
1  -2.326429   0.992864  ...   0.222462  -0.281965
.  .........   ........  ...   ........  .........
8   0.587134  -0.042532  ...  -0.629570   0.690751
9  -1.339637  -0.612544  ...   0.500358  -1.195189

[10 rows x 10 columns]
(The formatting GitHub outputs is pretty close to what I actually see.)
@bjonen rebase and you are up |
I can only look at it tomorrow, so if I have comments, I will do it after merge. |
I have a problem with the u function (I think it should return u'\xd7', see below). To speed up the merging I took out the row col summary from the tests https://github.com/pydata/pandas/pull/7086/files#diff-d01c1548861395ceef4d69029d266a21R785
import pandas.compat
pandas.compat.u('×') # the multiplication sign
Out[57]: u'\xc3\x97'
print u'\xc3\x97'
�
print u'\xd7'
×
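One plausible reading of the output above is that the two UTF-8 bytes of the multiplication sign were each treated as a separate code point. The sketch below reproduces that effect in Python 3 using only built-in string methods; it does not call `pandas.compat.u`, and the Latin-1 decoding step is an assumption made for illustration.

```python
times = "\u00d7"  # the multiplication sign as a single code point

# Encoding to UTF-8 yields the two bytes seen in the report above.
utf8_bytes = times.encode("utf-8")

# Reading those bytes one code point per byte gives a two-character string.
mojibake = utf8_bytes.decode("latin-1")

# Decoding with the matching codec recovers the intended character.
round_trip = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes), repr(mojibake), repr(round_trip))
```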
@jreback There seems to be another problem with py3. If you have an idea for a fix please go ahead. Otherwise I can look into it later tonight. |
it doesn't 'convert' anything, just escapes it. Don't worry about doing special escapes (à la @cpcloud's suggestion), save that for another version.
Quickly tested it, and found a strange effect when using See the notebook: http://nbviewer.ipython.org/github/jorisvandenbossche/scipy_notebooks/blob/master/pandas-pr-7086-repr_truncate_centrally.ipynb Update: it's only in the notebook, not in the terminal. |
And something else:
This PR:
0.13:
Or was it for aesthetic reasons that it was chosen to change this? But I personally find the one dot in the first example a bit odd. Also a Series always uses 3 dots.
@jorisvandenbossche Thanks for the notebook Here the +1 is the cause (https://github.com/pydata/pandas/pull/7086/files#diff-23878beaf55672cdc92c119f79fe492aR897). For hierarchical indices it's correct. I'll fix it and push again. |
This is on purpose. If you have a single character column, this column will be expanded to three columns when truncating in the command line. A little wasteful, especially when you are dealing with many such columns. Also, I kind of like it because it underlines the line above where the split happens. In the notebook dots take up very little space so I chose the easy way there. I see the point about the consistency though. I'm happy to change to 3 dots everywhere if you guys prefer it. |
@bjonen you make a good point, but then again a single '.' might get confusing. I think let's make it 3 everywhere |
I think in 0.13 it was as follows: always 3, unless it is a narrow column, then only 2 (as there is always place enough for 2 dots). For me that is also good, as I think 2 dots is better than 1 (it gives more the sense of continuation dots)
@bjonen What do you mean by "this column will be expanded to three columns"? In 0.13, the dots don't determine the width of the column. I mean if you have a column of only 1's, the column is 3 chars wide in the terminal display, and this remains three if it is truncated, although the dots are wider (two chars wide, and a column of two chars (e.g. 10's) would be 4 characters wide in the terminal display)
Ok, so the minimum column width is 3. I thought it was one and then the dots would "expand" the column. However, looking at your example of 0.13 behavior from above, it seems that the dots were also reduced to two when the column had only one character. Otherwise one can get a full line of dots which looks weird I think. Also in this PR, the index has a ..., which the previous version didn't. |
What do you guys prefer when truncating vertically?
In [9]: pd.DataFrame(index=[1]*10,columns=['a','b'],data=[[1]*2]*10)
Out[9]:
       a    b
1      1    1
1      1    1
...  ...  ...
1      1    1
1      1    1

[10 rows x 2 columns]
In [13]: pd.DataFrame(index=[1]*10,columns=['a','b'],data=[[1]*2]*10)
Out[13]:
       a    b
1      1    1
1      1    1
..   ...  ...
1      1    1
1      1    1

[10 rows x 2 columns]
or something entirely different?
@bjonen Maybe follow the logic of 0.13: so always 3 dots, only when the column width is 3, then only 2 dots (so you don't get a continuous full line of dots)? So in your example above this would be 3 times 2 dots |
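The rule proposed here fits in a few lines. The function name `truncation_dots` is hypothetical and the sketch assumes, as stated earlier in the thread, that 3 characters is the minimum column width in the terminal display.

```python
def truncation_dots(col_width: int) -> str:
    """Return the filler for a truncated row in a column of the given width.

    Columns at the minimum width of 3 get two dots, so the row never
    becomes one continuous line of dots; wider columns get three.
    """
    return ".." if col_width <= 3 else "..."


for width in (3, 4, 9):
    print(width, truncation_dots(width))
```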
I need to replace
in |
do it just before u need it and it will work |
    self._chk_truncate()

def _chk_truncate(self):
    truncate_h = self.max_cols and (len(self.columns) > self.max_cols)
just do a from pandas.tools.merge import concat inside _chk_truncate, it will work (putting it at the TOP of the module will not, as that is a circular import)
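The deferred-import pattern suggested here can be sketched as follows. `FormatterSketch` is a hypothetical class, not the formatter in pandas, and the import uses the public `from pandas import concat` because the `pandas.tools.merge` path named above belongs to the pandas version under review. The key point is that the import statement runs when the method is called, by which time every module involved has finished loading.

```python
import numpy as np
import pandas as pd


class FormatterSketch:
    """Hypothetical stand-in for a frame formatter with truncation flags."""

    def __init__(self, frame, max_rows=None, max_cols=None):
        self.frame = frame
        self.max_rows = max_rows
        self.max_cols = max_cols
        self._chk_truncate()

    def _chk_truncate(self):
        # Deferred import: resolved at call time, not at module import time.
        from pandas import concat

        frame = self.frame
        truncate_h = bool(self.max_cols) and len(frame.columns) > self.max_cols
        truncate_v = bool(self.max_rows) and len(frame) > self.max_rows
        if truncate_h:
            col_num = max(self.max_cols // 2, 1)
            frame = concat([frame.iloc[:, :col_num], frame.iloc[:, -col_num:]], axis=1)
        if truncate_v:
            row_num = max(self.max_rows // 2, 1)
            frame = concat([frame.iloc[:row_num], frame.iloc[-row_num:]])
        self.tr_frame = frame
        self.truncate_h = truncate_h
        self.truncate_v = truncate_v


fmt = FormatterSketch(pd.DataFrame(np.arange(100).reshape(10, 10)), max_rows=4, max_cols=4)
print(fmt.tr_frame)
```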
@bjonen as soon as you are ready, lmk; seems ok now (at least for the RC). can you squash? (can fix the concat/join issue afterwards, as it's internal)
@jreback Thanks for the hint with the circular reference - switched to Anything else? |
looks good! ping on green |
ok...bombs away |
Tidy representation when truncating dfs
@bjonen thanks...this is really excellent! pls look at the docs when they are built (travis builds them on master), and if you see anything pls submit a followup PR http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#display-changes |
ok...docs are updated..... @TomAugspurger @jorisvandenbossche @cpcloud @bjonen read over whatsnew (esp display) section and lmk if anything unclear! |
This PR closes #5603
Dataframes are now truncated centrally (similar to pd.Series).
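The central truncation described in this summary can be reproduced in a current pandas with the display options. The sketch below assumes the default `display.large_repr` and `display.show_dimensions` settings; the limits of 4 rows and 4 columns are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape(10, 10))

# Limit the repr to 4 rows and 4 columns for the duration of the block only.
with pd.option_context("display.max_rows", 4, "display.max_columns", 4):
    text = repr(df)

print(text)
```

The head and tail of both axes are kept, with the ellipsis markers in the middle and the dimension summary on the last line.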
[Screenshots not preserved: before and after; IPython notebook output with a simple index and with a MultiLevel index]