HTML (and text) reprs for large dataframes. #5550

takluyver · 2013-11-19T19:49:46Z

As discussed in #4886, the HTML representation of DataFrames currently starts off as a table, but switches to the condensed info view if the table exceeds a certain size (by default, more than 60 rows or 20 columns). I've seen this confusing users, who think that they suddenly have a completely different kind of object, and don't understand why.

With these changes, the HTML repr always displays the table, but truncates it when it exceeds a certain size. It reuses the same options, display.max_rows and display.max_columns.
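A minimal sketch of the behaviour described above, assuming a recent pandas release (the exact number of rows shown after truncation differs between versions). `_repr_html_` is the hook IPython calls to obtain the HTML repr; the frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame that exceeds the row limit set below: 100 rows, 3 columns.
df = pd.DataFrame({"a": range(100), "b": range(100), "c": range(100)})

# The same two options control when the HTML repr is truncated.
with pd.option_context("display.max_rows", 60, "display.max_columns", 20):
    html = df._repr_html_()  # what the notebook would render

# The result is still a table, but with fewer rows than the frame has.
row_tags = html.count("<tr")
print("<table" in html, row_tags)
```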

Before:

After:

ghost · 2013-11-20T21:08:12Z

a couple of issues: edge formatting on wide tables, and the empty dataframe
repr looks off.

Also, the PR affects only the HTML repr, so (Qt)console users will see different behavior than
ipnb users.

takluyver · 2013-11-20T22:27:01Z

I've fixed the wide display issue - it wasn't truncating the extra row added for the row index names.

The empty dataframe repr is the same as what's displayed in my system installation of pandas - I agree that it's a bit odd, but I think that's a separate issue.

I agree, the behaviour of plain text reprs should be similar. Do you want me to tackle that in this PR, or separately?

jreback · 2013-11-20T22:38:45Z

this is related to #1889 as well

ghost · 2013-11-20T22:40:28Z

That'd be great, I think it fits here ok.

Not sure about putting this in 0.13, I've had bad luck with last minute changes to
display code. @jreback?

jreback · 2013-11-20T23:08:13Z

I think this is fine with a couple of minor issues:

need a mention in v0.13.0 and main docs about this default and how to use .info() to get the existing summary view

the max rows / columns should come purely from the options rather than being hard-coded in the functions; can you change that?

jorisvandenbossche · 2013-11-21T08:06:48Z

I also mentioned it in the issue, but what do you think of showing first 30 ... last 30 rows/cols instead of first 60 ...? As is done for the Series html repr.
(It's also what is proposed in #1889)

ghost · 2013-11-21T09:11:13Z

Makes sense to me, but can be done in a subsequent PR if need be.
ipnb is solidifying its browser<->kernel message passing, so it may soon be
time to revisit the grid view (with paging) from #2974, this time with
built-in functionality rather than an in-process web server (like Exhibitionist does).

takluyver · 2013-11-21T22:07:50Z

I have:

Made the plain text reprs match the HTML reprs (truncating with ... beyond max_rows/max_columns). To get the info view, you need to call the info() method.
Removed the defaults for max_rows/max_cols from the methods to which they are passed - so calling to_string() or to_html() will by default show the entire DataFrame untruncated. Only the reprs automatically truncate.
Removed the max_info_rows option - this was only used when displaying the info view for a repr.
Documented the changes.

ghost · 2013-11-21T22:43:05Z

Better to deprecate max_info_rows rather than remove it, as we may wish to move its
enforcement over to info(). There are examples of doing that in config_init.py. Actually,
just generally deprecate rather than remove to reduce friction.
you removed some of the ugliest code I've ever written - good omen.
can you have a look at "Console-width detection should be interactive sessions only" (#1610) and
see if that raises any issues with the changes? (regressions)

takluyver · 2013-11-21T23:45:53Z

max_info_rows is back, deprecated.

I've had a brief look at #1610 - I don't think this should cause a regression, because format.get_console_size() checks whether it's in an interactive session.
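The idea of only detecting the console size in interactive sessions can be sketched with the standard library alone. This is not pandas' `get_console_size()`; the function names and the `sys.ps1` heuristic below are assumptions made for illustration.

```python
import shutil
import sys


def in_interactive_session():
    """Heuristic (an assumption here): an interactive interpreter defines sys.ps1."""
    return hasattr(sys, "ps1") or bool(sys.flags.interactive)


def console_size(interactive=None):
    """Return (width, height), or (None, None) when not running interactively."""
    if interactive is None:
        interactive = in_interactive_session()
    if not interactive:
        # Scripts and pipes get no detected size, so output is not reshaped.
        return (None, None)
    size = shutil.get_terminal_size(fallback=(80, 24))
    return (size.columns, size.lines)


print(console_size(interactive=False))
```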

ghost · 2013-11-22T16:46:17Z

I was very wrong with my initial objections, this is just great.

It's impossible to set default values for max_rows and max_columns
that make things look good on both ipnb and qtconsole (which I usually use),
but that's a pre-existing issue. IPython scroller for tall output seems too large
to me, as well.

Regardless - tested this and liked it, +1 to merge.

@jreback, any more issues to address before the green button?

jreback · 2013-11-22T17:03:03Z

is it possible to have an option to do the existing behavior, but default to the new

maybe display.notebook_repr_html = 'info' ?

if it's easy I would add this to provide back compat, if not then ok (w/o going back to @y-p admitted 'ugliest' code)

ghost · 2013-11-22T17:44:53Z

in the terminal, with display.expand_frame_repr=False and display.width=0 so that auto-detection is used,
the truncation doesn't obey the width detection (since it depends on number of columns,
not terminal width). Not a blocker.
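A rough way to see this with a recent pandas. To keep the sketch deterministic it sets `display.width` to a fixed 80 instead of relying on auto-detection, which is a deliberate departure from the setup in the comment; the frame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical wide frame: 3 rows, 30 columns.
wide = pd.DataFrame([[0] * 30] * 3, columns=[f"c{i}" for i in range(30)])

# Column truncation kicks in because 30 > max_columns.
with pd.option_context("display.expand_frame_repr", False,
                       "display.max_columns", 5,
                       "display.width", 80):
    by_count = repr(wide)

# With a generous max_columns nothing is truncated, whatever the width says.
with pd.option_context("display.expand_frame_repr", False,
                       "display.max_columns", 40,
                       "display.width", 80):
    untruncated = repr(wide)

longest_line = max(len(line) for line in untruncated.splitlines())
print("..." in by_count, longest_line)
```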

takluyver · 2013-11-22T19:13:23Z

I think it should be easy enough to have an option to revert to the old behaviour (at least roughly - I'd rather not restore max_info_rows as well). I'll do one option for both the terminal and the notebook: display.large_repr = 'truncate' | 'info'.

Truncation to terminal width is harder, because that would have to propagate down into the actual formatting code, and no doubt deal with various corner cases.

takluyver · 2013-11-22T19:43:54Z

Added the option in the form I described in my last message.

ghost · 2013-11-22T20:09:31Z

When truncating, having a footer with total row count would eliminate the need
to use df.info in many cases and so reduce the impact of the change on existing users.
(For example, after filtering a frame you're often interested in the size of the result).

Edit: as a header is probably better, since in ipnb you may be forced to scroll down manually to
expose that part of the view.

takluyver · 2013-11-23T02:36:29Z

I played around with some different options: showing it below the table looked more natural, and I opted to show it whether or not the table is truncated. The format is "61 rows × 26 columns". In the terminal, it shows up in [square brackets] to highlight that it's not part of the table.
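A sketch of the dimensions line as it appears in a recent pandas, where it is controlled by `display.show_dimensions` (later releases default to showing it only for truncated frames, so the option is set explicitly here). Note that the text repr uses a plain `x` inside the brackets.

```python
import pandas as pd

# Hypothetical small frame: 5 rows, 3 columns, not truncated.
df = pd.DataFrame({"a": range(5), "b": range(5), "c": range(5)})

with pd.option_context("display.show_dimensions", True):
    text = repr(df)
    html = df._repr_html_()

# In the terminal the dimensions are the last line, in square brackets.
footer = text.splitlines()[-1]
print(footer)
```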

takluyver · 2013-11-23T02:57:53Z

The failing test attempts to roundtrip a dataframe to and from the clipboard. It tests various ways of doing this, but one of them (passing excel=False 😕) will simply write str(df) to the clipboard. That would already not work for any dataframe large enough to get the info repr, but the test only uses a 5 × 3 frame.

Should we attempt to fix that, or simply remove the code path that writes str(df) to the clipboard?

ghost · 2013-11-23T08:02:12Z

The size issue is known: #5346, re confusion see #5070.

to_clipboard(excel=False) should probably use show_dimensions=False and use
to_string directly to avoid truncation.

takluyver · 2013-11-23T22:08:17Z

I've made the clipboard use to_string() instead of str(), which should also fix #5346. We'll see what Travis says.
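Why `to_string()` round-trips where `str()` does not can be sketched without a real clipboard. Here an in-memory buffer stands in for the clipboard, and parsing with `read_csv(sep=r"\s+")` is an assumption about how the text would be read back; the frame is illustrative.

```python
import io

import pandas as pd

# Hypothetical frame large enough for str() to truncate it.
df = pd.DataFrame({"a": range(100), "b": range(100, 200)})

lossy = str(df)         # contains a "..." row, so rows are missing
text = df.to_string()   # every row is present

# Parse the whitespace-aligned text back; the extra leading field per row
# is picked up as the index.
restored = pd.read_csv(io.StringIO(text), sep=r"\s+")
print(restored.shape)
```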

However, now I appear to have a merge conflict. What's the preferred strategy for pandas: rebase, merge into my branch, or let whoever merges the PR handle it?

jreback · 2013-11-23T22:19:45Z

you need to clear merge conflicts via rebasing

jreback · 2013-11-24T01:09:41Z

see here: https://github.com/pydata/pandas/wiki/Using-Git

will need you to squash down before merging as well

takluyver · 2013-11-24T07:04:46Z

Rebased, squashing a couple of commits where I had undone some change.

hayd · 2013-11-24T07:44:15Z

Mercilessly squashing to 1 commit will make life easier imo...

@jreback perhaps we should add that to wiki?

jreback · 2013-11-24T13:23:23Z

sure feel free to update/expand wiki

takluyver · 2013-11-24T16:58:54Z

I don't follow why squashing the whole PR to one commit would be useful. It seems to defeat the point of a DVCS.

takluyver · 2013-11-26T00:11:12Z

OK, great. Here's a more prominent section in the release notes, including a little picture.

HTML reprs for large dataframes.

ghost · 2013-11-26T01:05:52Z

Merged. Thanks @takluyver.

takluyver · 2013-11-26T01:20:35Z

:-) Thanks everyone for the review and improvements.

jreback · 2013-11-26T01:30:09Z

@takluyver

docs on the web are built at 5pm est

pls review the changes and make sure they look right

thanks again

takluyver · 2013-11-26T23:09:49Z

Nearly right - there should be an image here: http://pandas.pydata.org/pandas-docs/dev/whatsnew.html#dataframe-repr-changes . I realise now that I didn't check it in. doc/source/_static is ignored by git, so it didn't show up as a new file. Should images for the docs be stored somewhere else?

jreback · 2013-11-26T23:11:45Z

no that's right
but when u checkin you have to use -f
as git normally ignores it

jtratner · 2013-11-26T23:12:45Z

Just check it in (there are a few other static images there). The folder is ignored because all the generated plots are stored there.

ghost · 2013-11-26T23:15:58Z

So, should we change the defaults for max_rows and max_columns?

takluyver · 2013-11-26T23:20:35Z

The image is now PR #5594.

I might consider bumping the default max_columns down a bit, because I think in most real examples, 20 columns is very wide. Then again, when I open a blank spreadsheet, I see 20 columns, and I think it's more annoying to hide columns than to hide rows, so I'm not sure that it should change.

TomAugspurger · 2013-12-05T21:09:36Z

Has anyone had some performance issues with this on large DataFrames in the IPython notebook?
For a DataFrame with 1,536,532 rows and 22 columns, it ran for a minute before I interrupted the kernel.

It doesn't take long at all in terminal, and I don't use the qtconsole.

I don't mind, but I wanted people to be aware.

jreback · 2013-12-05T21:17:44Z

this should be ok on master (as it doesn't display all the rows), unless you have max_rows set to some big number

TomAugspurger · 2013-12-05T21:27:50Z

My display.max_columns is 20 and display.max_rows is 60.

That's why I was surprised it was taking longer on large frames.

TomAugspurger · 2013-12-05T21:33:31Z

I'm doing some timing right now to dig into it (I'll put up a notebook).

TomAugspurger · 2013-12-05T21:46:37Z

I guess it's a bit tricky to profile reprs. I'll come back to this later.

I can say that it's a lot quicker just on a random frame. My example had a MultiIndex.
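One simple way to time a repr on a MultiIndex frame, as a sketch rather than the benchmark used in this thread. The frame size and names are made up, and on a recent pandas this should return quickly.

```python
import time

import pandas as pd

# Hypothetical frame with a two-level MultiIndex and 100,000 rows.
index = pd.MultiIndex.from_product(
    [range(1000), range(100)], names=["outer", "inner"]
)
df = pd.DataFrame({"x": range(len(index))}, index=index)

# Call the HTML repr hook directly, since that is what the notebook triggers.
start = time.perf_counter()
html = df._repr_html_()
elapsed = time.perf_counter() - start

print(f"HTML repr took {elapsed:.3f} s")
```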

ghost · 2013-12-05T22:14:55Z

confirmed, we fixed that bug for the Index case, but I missed the MultiIndex equivalent.
Will fix.

good catch.

ghost · 2013-12-05T22:16:46Z

Once again, the wisdom of not merging things right before a release (and vice versa) shines through.

ghost · 2013-12-05T22:58:46Z

Should be fixed, add vbenches.

ghost · 2014-01-17T19:29:20Z

... I kinda like this phase of the release cycle:

#pandas new output of row and column numbers during every print is surprisingly nice cc @wesmckinn
— Chris (@cdubhland) January 17, 2014

jreback · 2014-01-17T19:30:10Z

:)

takluyver mentioned this pull request Nov 19, 2013

Rethink when HTML repr of DataFrame is displayed #4886

Closed

This was referenced Nov 20, 2013

HTML repr for empty dataframes is ugly #5562

Closed

Series do not display HTML repr #5563

Open

TomAugspurger mentioned this pull request Nov 24, 2013

Pandas Series doesn't display table as HTML in IPython notebook #5580

Closed

ghost pushed a commit that referenced this pull request Nov 26, 2013

Merge pull request #5550 from takluyver/long-repr-html

2e4ca43

HTML reprs for large dataframes.

ghost merged commit 2e4ca43 into pandas-dev:master Nov 26, 2013

michaelaye mentioned this pull request Nov 26, 2013

be80898 breaks display of large dataframes in IPython notebooks #5588

Closed

takluyver mentioned this pull request Nov 26, 2013

Add image for whatsnew docs #5594

Merged

takluyver mentioned this pull request Nov 28, 2013

Truncate DataFrames centrally, rather than at one end #5603

Closed

ghost mentioned this pull request Dec 5, 2013

BUG: repr_html, fix GH5588 for the MultiIndex case #5649

Merged

ghost mentioned this pull request Dec 7, 2013

0.13rc1+ vs. 0.12 vbench, perf regressions. #5660

Closed

ghost mentioned this pull request Jan 16, 2014

ENH: revamp null count supression for large frames in df.info() #5974

Merged

jseabold mentioned this pull request Mar 6, 2014

New DataFrame display information? #6547

Closed

bjonen mentioned this pull request May 26, 2014

max_columns == 0 incorrectly wraps around for wide dfs #7180

Closed

sinhrks mentioned this pull request Apr 9, 2016

Add a DataFrame.show() method pls! #1889

Closed

This pull request was closed.
