Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix df repr troubles #3395

Closed
wants to merge 1 commit into from
Closed

Conversation

lodagro
Copy link
Contributor

@lodagro lodagro commented Apr 18, 2013

My apologies for the mess i created on the dataframe repr. Not only did i get the non-interactive behavior wrong, i also introduced a performance issue in the interactive mode.

This PR should make my wrong, right again. In non-interactive mode i re-enabled concise formats and made sure that auto-terminal-size detection is not used (little test added), also performance issue reported in #3337 and #3373 is fixed (added a benchmark for it).

@ghost
Copy link

ghost commented Apr 19, 2013

@jreback , @hayd ,@lodagro. Quick quiz.

open up qtconsole / ipnb. how many rows and cols fit on your default display?

for width:

print "1"*100

for height:

range(60)

Does one of those scroll off the screen for you? a sample size of 4 ought to be
statistically significant.

what about height=100, does that fit on your screen?

@jreback
Copy link
Contributor

jreback commented Apr 19, 2013

monitor is 24", range(75) fills height, "1"*300 fills width (courier fixed font), this is an xterm

@ghost
Copy link

ghost commented Apr 19, 2013

24"? nice.

but on that size screen you usually have 2 windows paneled, right? so say half that
width, which means width 100 is fine, and height 100 is too much.

@jreback
Copy link
Contributor

jreback commented Apr 19, 2013

actually don't use 2 side by side,I overlap say about 3/4 width, but your assumption is prob about right

@ghost
Copy link

ghost commented Apr 19, 2013

I usually work in something like w=120,h=70. so the current (and new defaults)
are wrong for me too.

@jreback
Copy link
Contributor

jreback commented Apr 19, 2013

isn't this the get_terminal_size feature? (or its not 'accurate'/reliable the issue)?

@ghost
Copy link

ghost commented Apr 19, 2013

Exactly, we use hardwired values for qtconsole (and soon ipnb) because get_terminal_size()
fails for them , and I think 0.10.1 and updated values in this PR both get it wrong.

@cpcloud
Copy link
Member

cpcloud commented Apr 19, 2013

This issue lies with finding the length of the largest row repr which seems difficult to do without introducing a performance hit.

@ghost
Copy link

ghost commented Apr 19, 2013

the great majority of cases where the frame is really wide and this becomes slow,
is due to many columns, not a lot of data in a small number of them. max_cols
catches that, and this PR now checks that first. so, reduced surface area at least.

@lodagro
Copy link
Contributor Author

lodagro commented Apr 19, 2013

I am typically running ipython in a full screen terminal: width x heigth = 260 x 65, 20`` monitor 1600x1200.
And the frames i use often are more than 200 display columns wide. Defaults are/were not ok for me either. I use the auto detect horizontally, which was the default a long time ago, not any more now.

Using default options, get_terminal_size() is never used. Only when requested by user, doing so is only interesting in a real terminal as indicated in the method docstring and display option descriptions. This is what i do.
Hard wired values (out of user control) are only used in non-interactive mode and if user enabled auto-detection. For this case auto-detect values are overruled with hard-coded values (as requested in #1610). In all other scenarios user can set/control dimensions.

Current repr performance is in line/better compared to v0.10.1 (there are only two benchmarks but they cover a frame that is too wide and one too high (a new benchmark i added)).

So what do you think would make more sense to have as defaults for display.width, heigth, max_rows, max_columns? Auto detect as default can not be done, since this only works in a real terminal, other than than i don`t care what defaults are.

@ghost
Copy link

ghost commented Apr 19, 2013

W=100, H=60 seems like it would would work beter for everyone so far ,where
the current default don't work for everyone so far.

aside, I'm pretty sure this PR rebreaks #2275.
if the user sets height and width to auto-detect in qtconsole, you get back the terminal
height and width in which qtconsole was launched.

@lodagro
Copy link
Contributor Author

lodagro commented Apr 19, 2013

If no other proposals i will update defaults to suggestion of @y-p.

#2275 should still be ok, fix to that PR was to disable html repr in qtconsole, this is still the case, however i think this is no longer needed. I will test it later today.

@cpcloud
Copy link
Member

cpcloud commented Apr 19, 2013

@y-p not sure I understand. If you try making a frame with many rows you'll see that it's slow to repr because to_string is called on the frame and then that is split by newlines and then the max of the length of each resulting list is computed. This is horribly slow for a frame with, say, 1e6 rows. Proof:
slow-repr

@ghost
Copy link

ghost commented Apr 19, 2013

Are you using master or this PR branch (yet to be merged?)

In [1]: df =DataFrame(randn(10000,1).T)

In [2]: df.shape
Out[2]: (1, 10000)

In [3]: timeit repr(df)
1000 loops, best of 3: 929 us per loop

In [4]: timeit repr(df.T)
1000 loops, best of 3: 1.05 ms per loop

In [5]: 

@ghost
Copy link

ghost commented Apr 19, 2013

If len(self.index) > reasonable_threshold:
    fail_fast
If len(self.columns) > reasonable_threshold:
    fail_fast
else:
   slow_op

Covers most real cases I should think, but I guess there will be
pathological cases, when cells contains much data. I seem to recall
... here it is:

In [6]: pd.describe_option('max_colwidth')
display.max_colwidth: [default: 50] [currently: 50]
: int
        The maximum width in characters of a column in the repr of
        a pandas data structure. When the column overflows, a "..."
        placeholder is embedded in the output.

Would be glad to learn of a corner case we've missed, if you find one.

@jreback
Copy link
Contributor

jreback commented Apr 19, 2013

if you hit the slow_op part you coudl do a sampling of the data (say 10%) random to see if they are too wide/long (so you can fail_fast)?

@lodagro
Copy link
Contributor Author

lodagro commented Apr 19, 2013

Two screenshots with the qtconsole. For both i launched the qtconsole in a tiny terminal and both behave as expected, fact that qtconsole is launched in tiny terminal has no impact.

Current state of the code.

Imgur

Adapt code to re-enabled the html repr.

Imgur

@ghost
Copy link

ghost commented Apr 20, 2013

That's because of the default values for width/height. set them to 0 and try again.
for the same reason, scripts with wide output now get their output wrapped using expand_repr,
which is a change from 0.10.1, if anyone relied on that behaviour.
And so in a way that's a regression of #1610 .

Note: The expand_repr output in script mode is actually very nice, just concerned
about yet another change in established behavior.

@ghost
Copy link

ghost commented Apr 22, 2013

I'll be merging this today with some fixes.

@ghost
Copy link

ghost commented Apr 22, 2013

  • fixed a Misleading __str__/__repr__ for DatatimeIndex objects with less than 3 items #1611 related regression, the (new) test was passing by accident as
    the df dimensions were too small to trigger the fail with the defaults for height/width.
  • width/height auto-detect is now activated by None rather then 0.
    (made it easier to handle the corner cases, and cleaner I think)
  • re-enabled HTML for qtconsole (I completely agree with @lodagro,
    this should be user controlled)
  • adjusted defaults of height,width to 80,80. Ipython's notebook defaults to an unusually
    large font, so the 80 (rather then 100 as I suggested) in the PR works better.
  • ipnb and ipq are now treated equally. this ia a long overdue oversight.
    The hardcoded default values for dimensions now reuse the defaults
    of disp.height/width.
    ip qtconsole was special cased by mistake, when the original Issue was addressed.
  • optimized repr_fits_horizontal, beyond @lodagro's fix, the width
    check now only generates the repr for the slice that's about to be printed.

Beyond the test suite, I tested in terminal ipython, qtconsole,ipnb and script
form with various options, cleared up whatever I found.

I'm uneasy about so much churn and subtlety going on at the RC phase,
would have preferred to have it all take place in 0.12. But enough is enough.

b9fa04a

@ghost ghost closed this Apr 22, 2013
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants