Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML (and text) reprs for large dataframes. #5550

Merged
5 commits merged into from
Nov 26, 2013

Conversation

takluyver
Copy link
Contributor

As discussed in #4886, the HTML representation of DataFrames currently starts off as a table, but switches to the condensed info view if the table exceeds a certain size (by default, more than 60 rows or 20 columns). I've seen this confusing users, who think that they suddenly have a completely different kind of object, and don't understand why.

With these changes, the HTML repr always displays the table, but truncates it when it exceeds a certain size. It reuses the same options, display.max_rows and display.max_columns.

Before:

pandas_long_repr_before
pandas_wide_repr_before

After:

pandas_long_repr_after
pandas_wide_repr_after

@ghost
Copy link

ghost commented Nov 20, 2013

a couple of issues: edge formatting on wide tables, and the empty dataframe
repr looks off.

Also, the PR affects only the html repr so (QT)console users will see different behavior then
ipnb users.

html_repr_shot

@takluyver
Copy link
Contributor Author

I've fixed the wide display issue - it wasn't truncating the extra row added for the row index names.

The empty dataframe repr is the same as what's displayed in my system installation of pandas - I agree that it's a bit odd, but I think that's a separate issue.

I agree, the behaviour of plain text reprs should be similar. Do you want me to tackle that in this PR, or separately?

@jreback
Copy link
Contributor

jreback commented Nov 20, 2013

this is related to #1889 as well

@ghost
Copy link

ghost commented Nov 20, 2013

That'd be great, I think it fits here ok.

Not sure about putting this in 0.13, I've had bad luck with last minute changes to
display code. @jreback?

@jreback
Copy link
Contributor

jreback commented Nov 20, 2013

I think this is fine with a couple of minor issues:

need a mention in v0.13.0 and main docs
about this default and how to use .info() to get existing summary view

the hard codes for max rows / columns should come purely from the options and not hard code the functions
can u change that

@jorisvandenbossche
Copy link
Member

I also mentioned it in the issue, but what do you think of showing first 30 ... last 30 rows/cols instead of first 60 ...? As is done for the Series html repr.
(It's also what is proposed in #1889)

@ghost
Copy link

ghost commented Nov 21, 2013

Makes sense to me, but can be done in a subsequent PR if needs be.
ipnb is solidifying it's browser<->kernel message passing, may soon be
time to revisit the grid view from (with paging) #2974, this time with
built-in functionality rather then an in-process web server (like Exhibitionist does).

@takluyver
Copy link
Contributor Author

I have:

  • Made the plain text reprs match the HTML reprs (truncating with ... beyond max_rows/max_columns). To get the info view, you need to call the info() method.
  • Removed the defaults for max_rows/max_cols from the methods to which they are passed - so calling to_string() or to_html() will by default show the entire DataFrame untruncated. Only the reprs automatically truncate.
  • Removed the max_info_rows option - this was only used when displaying the info view for a repr.
  • Documented the changes.

@ghost
Copy link

ghost commented Nov 21, 2013

  • Better deprecate max_info_rows rather then remove it, we may wish to move it's
    enforcement over to info(). There are examples of doing that in config_init.py. Actually,
    just generally deprecate rather then remove to reduce friction.
  • you removed some of the ugliest code I've ever written - good omen.
    can you have a look at Console-width detection should be interactive sessions only #1610 and
    see if that raises any issues with the changes? (regressions)

@takluyver
Copy link
Contributor Author

max_info_rows is back, deprecated.

I've had a brief look at #1610 - I don't think this should cause a regression, because format.get_console_size() checks whether it's in an interactive session.

@ghost
Copy link

ghost commented Nov 22, 2013

I was very wrong with my initial objections, this is just great.

It's impossible to set default values for for max_rows and max_columns
that make things look good on both ipnb and qtconsole (which I usually use),
but that's a pre-existing issue. IPython scroller for tall output seems too large
to me, as well.

Regardless - tested this and liked it, +1 to merge.

@jreback, any more issues to address before the green button?

@jreback
Copy link
Contributor

jreback commented Nov 22, 2013

is it possible to have an option to do the exisiting behavior , but default to the new

maybe display.notebook_repr_html = 'info' ?

if its easy I would add this to provide back compat, if not then ok (w/o going back to @y-p admitted 'ugliest' code)

@ghost
Copy link

ghost commented Nov 22, 2013

in the terminal, with display.expand_frame_repr=False and display.width=0 so that auto-detection is used,
the truncation doesn't obey the width detection (since it depends on number of columns,
not terminal width). Not a blocker.

@takluyver
Copy link
Contributor Author

I think it should be easy enough to have an option to revert to the old behaviour (at least roughly - I'd rather not restore max_info_rows as well). I'll do one option for both the terminal and the notebook: display.large_repr = 'truncate' | 'info'.

Truncation to terminal width is harder, because that would have to propagate down into the actual formatting code, and no doubt deal with various corner cases.

@takluyver
Copy link
Contributor Author

Added the option in the form I described in my last message.

@ghost
Copy link

ghost commented Nov 22, 2013

When truncating, having a footer with total row count would eliminate the need
to use df.info in many cases and so reduce the impact of the change on existing users.
(For example, after filtering a frame you're often interested in the size of the result).

Edit: as a header is probably better, since in ipnb you may be forced to scroll down manually to
expose that part of the view.

@takluyver
Copy link
Contributor Author

I played around with some different options: showing it below the table looked more natural, and I opted to show it whether or not the table is truncated. The format is "61 rows × 26 columns". In the terminal, it shows up in [square brackets] to highlight that it's not part of the table.

@takluyver
Copy link
Contributor Author

The failing test attempts to roundtrip a dataframe to and from the clipboard. It tests various ways of doing this, but one of them (passing excel=False 😕) will simply write str(df) to the clipboard. That would already not work for any dataframe large enough to get the info repr, but the test only uses a 5 × 3 frame.

Should we attempt to fix that, or simply remove the code path that writes str(df) to the clipboard.

@ghost
Copy link

ghost commented Nov 23, 2013

The size issue is known: #5346, re confusion see #5070.

to_clipboard(excel=false) should probably use show_dimensions=False and use
to_string directly to avoid truncation.

@takluyver
Copy link
Contributor Author

I've made the clipboard use to_string() instead of str(), which should also fix #5346. We'll see what Travis says.

However, now I appear to have a merge conflict. What's the preferred strategy for pandas: rebase, merge into my branch, or let whoever merges the PR handle it?

@jreback
Copy link
Contributor

jreback commented Nov 23, 2013

you need to clear merge conflicts via rebasing

@jreback
Copy link
Contributor

jreback commented Nov 24, 2013

see here: https://github.com/pydata/pandas/wiki/Using-Git

will need you to squash down before merging a well

@takluyver
Copy link
Contributor Author

Rebased, squashing a couple of commits where I had undone some change.

@hayd
Copy link
Contributor

hayd commented Nov 24, 2013

Mercilessly squashing to 1 commit will make life a easier imo...

@jreback perhaps we should add that to wiki?

@jreback
Copy link
Contributor

jreback commented Nov 24, 2013

sure feel free to update/expand wiki

@takluyver
Copy link
Contributor Author

I don't follow why squashing the whole PR to one commit would be useful. It seems to defeat the point of a DVCS.

@takluyver
Copy link
Contributor Author

OK, great. Here's a more prominent section in the release notes, including a little picture.

ghost pushed a commit that referenced this pull request Nov 26, 2013
HTML reprs for large dataframes.
@ghost ghost merged commit 2e4ca43 into pandas-dev:master Nov 26, 2013
@ghost
Copy link

ghost commented Nov 26, 2013

Merged. Thanks @takluyver.

@takluyver
Copy link
Contributor Author

:-) Thanks everyone for the review and improvements.

@jreback
Copy link
Contributor

jreback commented Nov 26, 2013

@takluyver

docs on the web are built at 5pm est

pls review the changes and make sure they look right

thanks again

@takluyver
Copy link
Contributor Author

Nearly right - there should be an image here: http://pandas.pydata.org/pandas-docs/dev/whatsnew.html#dataframe-repr-changes . I realise now that I didn't check it in. doc/source/_static is ignored by git, so it didn't show up as a new file. Should images for the docs be stored somewhere else?

@jreback
Copy link
Contributor

jreback commented Nov 26, 2013

no that's right
but when u checkin you have to use -f
as git normally ignores it

@jtratner
Copy link
Contributor

Just check it in (there are a few other static images there). The folder is ignored because all the generated plots are stored there.

@ghost
Copy link

ghost commented Nov 26, 2013

So, should we change the defaults for max_rows and max_columns?

@takluyver
Copy link
Contributor Author

The image is now PR #5594.

I might consider bumping the default max_columns down a bit, because I think in most real examples, 20 columns is very wide. Then again, when I open a blank spreadsheet, I see 20 columns, and I think it's more annoying to hide columns than to hide rows, so I'm not sure that it should change.

@TomAugspurger
Copy link
Contributor

Has anyone had some performance issues with this on large DataFrames in the IPython notebook?
For a DataFrame with 1,536,532 rows and 22 columns, it ran for a minute before I interrupted the kernel.

It doesn't take long at all in terminal, and I don't use the qtconsole.

I don't mind, but I wanted people to be aware.

@jreback
Copy link
Contributor

jreback commented Dec 5, 2013

this should be ok on master (as it doesn't display all the rows), unless you have max_rows set to some big number

@TomAugspurger
Copy link
Contributor

My display.max_columns is 20 and display.max_rows is 60.

That's why I was surprised it was taking longer on large frames.

@TomAugspurger
Copy link
Contributor

I'm doing some timing right now to dig into it (I'll put up a notebook).

@TomAugspurger
Copy link
Contributor

I guess it's a bit tricky to profile reprs. I'll come back to this later.

I can say that its a lot quicker just on a random frame. My example a had MultiIndex.

@ghost
Copy link

ghost commented Dec 5, 2013

confirmed, we fixed that bug for the Index case, but I missed the MultiIndex equivalent.
Will fix.

good catch.

@ghost
Copy link

ghost commented Dec 5, 2013

Once again, the wisdom of not merging things right before a release (and vice versa) shines through.

@ghost
Copy link

ghost commented Dec 5, 2013

Should be fixed, add vbenches.

@ghost
Copy link

ghost commented Jan 17, 2014

... I kinda like this phase of the release cycle:

#pandas new output of row and column numbers during every print is surprisingly nice cc @wesmckinn

— Chris (@cdubhland) January 17, 2014
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

@jreback
Copy link
Contributor

jreback commented Jan 17, 2014

:)

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants