FIX? df.__repr__ takes for ever with big data #3337

Closed
dengemann opened this issue Apr 13, 2013 · 22 comments

@dengemann
Contributor

Hi all,

after updating my local branch to the current dev master (pd.__version__ == '0.11.0rc1'; compiling under Mac OS 10.8 went smoothly), some of my scripts that handle large data frames (2.6 million rows, 10 columns) stopped working.
It turned out that calling the df's __repr__ method, and also looking up docstrings, makes IPython unresponsive, as if the call takes forever.

This little snippet reproduces the error with fake data ('tested' with EPD Python and IPython 7.2 (32-bit)):

import numpy as np
import pandas as pd

data = np.random.random((3000000, 10))
df = pd.DataFrame(data)

print(df)  # does the same as `df.__repr__()` or `df?`

Can anyone confirm this?
It must have been some of the more recent changes as I did not have trouble with this earlier.
Any pointers and comments are highly welcome.

Cheers,
Denis
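
One way to check whether rendering is the bottleneck is to time repr() directly. The following is a minimal sketch using only the standard library; `time_repr` is a hypothetical helper name, and the list is merely a stand-in for the DataFrame from the report (pass `df` instead when pandas is installed).

```python
import time


def time_repr(obj):
    """Return (seconds, length) for building repr(obj)."""
    start = time.perf_counter()
    text = repr(obj)
    elapsed = time.perf_counter() - start
    return elapsed, len(text)


# Stand-in object; with pandas installed, pass the DataFrame from the report.
stand_in = list(range(100000))
elapsed, length = time_repr(stand_in)
print("repr took %.4f s and produced %d characters" % (elapsed, length))
```

If the elapsed time grows with the number of rows, the repr is rendering the full data rather than a bounded summary.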

@ghost

ghost commented Apr 13, 2013

see #3326 (comment)
should be fixed by the time 0.11.0 final is out.

In the meantime, git reset --hard 7db1af4^ will get you practically everything new
in 0.11.0.

Thanks for reporting it, will leave it open till it's sorted.

@ghost

ghost commented Apr 13, 2013

Can you provide a test case for the docstring issue you mention? That's new.

@dengemann
Contributor Author

Thanks for your quick reply. Unfortunately I cannot give you a pure Python example, as DataFrame instances don't employ __doc__. But doing df? or df?? in IPython does the 'trick'. Maybe that's the point: IPython will certainly employ some inspection magic to find the docstring. Would it make sense to move the docstring from under DataFrame.__init__ to DataFrame, so that it's possible to refer to it as DataFrame.__doc__?
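
The placement question can be seen without pandas. Below is a minimal sketch with two hypothetical classes (not pandas code): a docstring written under `__init__` leaves the class's `__doc__` empty, while a class-level docstring is reachable from both the class and its instances.

```python
class DocUnderInit(object):
    def __init__(self):
        """Documented under __init__ only."""


class DocOnClass(object):
    """Documented on the class itself."""

    def __init__(self):
        pass


print(DocUnderInit.__doc__)           # None
print(DocUnderInit.__init__.__doc__)  # Documented under __init__ only.
print(DocOnClass.__doc__)             # Documented on the class itself.
print(DocOnClass().__doc__)           # instances see the class docstring
```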

@ghost

ghost commented Apr 13, 2013

The IPython docstring problem seems to be part of the same issue, so
there's probably no need for that, they'll both get fixed together.

@dengemann
Contributor Author

Yes, makes sense. Just out of curiosity, wouldn't it be more consistent with other scientific / data analysis packages in Python to have non-empty .__doc__ attributes? Or is this already being addressed somewhere? (Sorry, I'm quite new to tracking the dev version of pandas and it's not so easy to get an overview of the open issues.)

@ghost

ghost commented Apr 13, 2013

I guess we all use ?? rather than __doc__, and IPython context help finds
it as well, so I never noticed that before.

Sure. PR welcome.

@dengemann
Contributor Author

I guess we all use ?? rather than __doc__, and IPython context help finds it as well, so I never noticed that before.

I agree, neither did I. When using the debugger however, IPython features aren't available and:

print foo.__doc__

is a helpful idiom.

Sure. PR welcome.

ok

@dengemann
Contributor Author

@y-p I will look for this 'issue' across the entire library. Should I also address pep8 violations wherever I see them?

@ghost

ghost commented Apr 13, 2013

Have a read of CONTRIBUTING.md for various notes, including pluses and minuses of
"PEP8 storms".

@dengemann
Contributor Author

I just found that many violations are systematic and rather appear as
'dialects'. Thanks for the pointers, I'll take a look.

@hayd
Contributor

hayd commented Apr 13, 2013

@y-p I always feel like a PEP8 storm... :)

@ghost

ghost commented Apr 13, 2013

I guess it's not bad to hit the 'reset' button once per release, but more than that
becomes disruptive. I use emacs and dangling whitespace from lesser editors
is a constant annoyance too, so "formatting storms" are not all bad.

@dengemann
Contributor Author

After taking a tour I ended up wondering about a few things.
It looks like local style neither separates args in a function call with spaces nor drops the white space around kwargs.
Also, the usual rule of two blank lines between functions is not followed consistently.
Most importantly, in many, many instances lines run far beyond 80 characters.
I think this calls for a separate PR.
It would be good for new contributors to have a more detailed style guideline including a minimum agreed-upon set of pep8 rules. This might also keep down pep8 storms. If, e.g., the two-blank-lines rule is not among the core style rules, then one can simply ignore it (despite eyes bleeding from the editor's pep8 mode).

Wdyt?

@ghost

ghost commented Apr 13, 2013

So CONTRIBUTING.md didn't convince you. oh well. :)

@dengemann
Contributor Author

So CONTRIBUTING.md didn't convince you. oh well. :)

Admittedly not, was more of a deterrent to me ;-)

@dengemann
Contributor Author

(with regard to pep8 interventions)

@dengemann
Contributor Author

Btw. a few exceptions to pep8 which seem to have the status of a local pandas convention are:

  1. White spaces in list-comprehensions and expressions
  • [ k for k in my_list ]
  • [ v ]

see e. g. : https://github.com/pydata/pandas/blob/master/pandas/core/frame.py#L558

  2. Adding white spaces to align a series of related assignments in columns (config file style)

e. g. https://github.com/pydata/pandas/blob/master/pandas/core/frame.py#L840

These patterns emerge only from time to time, but too frequently to look unintended.
So I wouldn't be sure whether to touch them or not in a pep8 storm ;-)

Without complicating things, one could address this in CONTRIBUTING.md, given there are exceptions one would like to keep un-pep8-ted.
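
For reference, the two patterns look like this next to their pep8 form. This is a small illustrative sketch; the variable names are made up and nothing here is copied from frame.py.

```python
my_list = [1, 2, 3]

# Local style: padding inside the brackets.
padded = [ k for k in my_list ]
# PEP8 style: no whitespace immediately inside brackets.
plain = [k for k in my_list]

# Local style: extra spaces to align a group of related assignments.
x      = 1
longer = 2
# PEP8 style: exactly one space on each side of "=".
y = 1
another = 2

print(padded == plain)  # the difference is purely cosmetic
```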

@lodagro
Contributor

lodagro commented Apr 15, 2013

The DataFrame repr behavior is now such that concise formats are only used in interactive mode (see also #3326, discussion not yet closed). Interactive mode can be enforced by setting pd.options.mode.sim_interactive = True.

(pandas)[lodagro@ubuntu][1193][n] cat issue_3337.py                                                  [~/projects/pandas/issues]
import numpy as np
import pandas as pd

pd.options.mode.sim_interactive = True

data  = np.random.random((3000, 10))
df = pd.DataFrame(data)

print df
(pandas)[lodagro@ubuntu][1194][n] python issue_3337.py                                               [~/projects/pandas/issues]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3000 entries, 0 to 2999
Data columns (total 10 columns):
0    3000  non-null values
1    3000  non-null values
2    3000  non-null values
3    3000  non-null values
4    3000  non-null values
5    3000  non-null values
6    3000  non-null values
7    3000  non-null values
8    3000  non-null values
9    3000  non-null values
dtypes: float64(10)
(pandas)[lodagro@ubuntu][1195][n]    

@ghost

ghost commented Apr 15, 2013

I really don't think the change in behaviour makes sense.
repr() is not a report-generation function. A repr() should be bounded in length,
with the user having some control over the bound. IMO, this part should be rolled back.

Even if it were, it's now established behaviour so this change will break people's scripts.
It already has.

sim_interactive is strictly for testing purposes only; people should not use it for anything else.
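
To illustrate the "bounded repr" position, here is a minimal sketch in plain Python, not pandas code. `BoundedTable` and its `max_rows` attribute are hypothetical names: the repr shows at most `max_rows` rows and falls back to a one-line summary above that bound.

```python
class BoundedTable(object):
    """Toy container whose repr never exceeds a user-set row bound."""

    max_rows = 5  # user-adjustable bound (hypothetical setting)

    def __init__(self, rows):
        self.rows = list(rows)

    def __repr__(self):
        if len(self.rows) <= self.max_rows:
            return "\n".join(repr(row) for row in self.rows)
        # Too long to print in full: emit a short summary instead.
        return "<BoundedTable: %d rows (more than max_rows=%d)>" % (
            len(self.rows), self.max_rows)


small = BoundedTable([(0, 1), (2, 3)])
large = BoundedTable((i, i + 1) for i in range(3000))
print(repr(small))
print(repr(large))
```

Because the summary branch never iterates over the rows, the cost of repr() stays independent of the data size once the bound is exceeded.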

@wesm
Member

wesm commented Apr 17, 2013

Status?

@ghost

ghost commented Apr 17, 2013

broken.

@ghost

ghost commented Apr 22, 2013

fixed by b9fa04a

@ghost ghost closed this as completed Apr 22, 2013