ENH: enable Series.info() #37320

Merged Dec 1, 2021 (34 commits)

Member

@ivanovmg ivanovmg commented Oct 21, 2020

I took over #31796 from @MarcoGorelli.
In this PR I took the tests and the docstring from #31796, refactored tests (separated into dataframe and series-related only test classes).
Then on top of the recent changes (#36752) I implemented series info.

New classes:

  • SeriesInfo (store data, which will be used in the outputs)
  • SeriesInfoPrinter (basically creator of the appropriate table builder)
  • SeriesTableBuilder (both Verbose and NonVerbose)
  • TableBuilderVerboseMixin (shared functionality for verbose info builders of both dataframe and series)

It seems to me that the tests are not sufficient.
In particular, it seems that empty series info should be covered.
Currently there is a special empty dataframe info, but for series info there is just a generic verbose info with zero items.
If there is a need for a dedicated empty series info, then I would need to add a _fill_empty_info method to SeriesTableBuilder.

Static typing makes the code quite verbose. In some cases we have the very same methods/properties, but with different type annotations to satisfy type checking (the methods are small, but still). If somebody can suggest a better way to handle this, that would be great.

@ivanovmg
Member Author

Got a CI failure while building the documentation (presumably because of a NumPy-related warning).
I think this is unrelated to the changes. Can anyone restart the job?

In file included from /home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include/numpy/ndarraytypes.h:1822:0,
from /home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
from /home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from /home/runner/.cache/ipython/cython/_cython_magic_1e384fc850b1a0be145d9b7384e71f98.c:630:
/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it with "
^~~~~~~
In file included from /home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include/numpy/ndarraytypes.h:1822:0,
from /home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
from /home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from /home/runner/.cache/ipython/cython/_cython_magic_1f3f4faa63381d31bc6688d149dcf218.c:631:
/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it with "
^~~~~~~

Contributor

@jreback jreback left a comment

pls fix the tests. I can barely tell what changed here.

assert (
df_with_object_index.memory_usage(index=True, deep=True).sum()
== df_with_object_index.memory_usage(index=True).sum()
class TestDataFrameInfo:
Contributor

So this diff is super confusing. I would rather simply make 2 test files than jam them into one. (You can also make a sub-module if that works better.)

Member Author

@ivanovmg ivanovmg Oct 23, 2020

I created a separate module tests/io/formats/tests_series_info.py.
After this PR, we can probably move both test info modules to tests/*/methods/.

@jreback
Contributor

jreback commented Oct 23, 2020

Static typing makes the code quite verbose. In some cases we have the very same methods/properties, but with different type annotations to satisfy type checking (the methods are small, but still). If somebody can suggest a better way to handle this, that would be great.

We have FrameOrSeries or FrameOrSeriesUnion to handle this; otherwise you can make a type alias as well.
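
To illustrate the suggestion, here is a minimal sketch of how a union alias or a constrained TypeVar can replace duplicated, differently annotated helpers. The classes `Frame` and `Column` and the alias names are hypothetical stand-ins, not the pandas definitions.

```python
from typing import TypeVar, Union


class Frame:
    """Hypothetical stand-in for a DataFrame-like class."""

    def __init__(self, n_cols: int) -> None:
        self.n_cols = n_cols


class Column:
    """Hypothetical stand-in for a Series-like class."""

    n_cols = 1


# Alias meaning "either of the two classes"; handy when a helper accepts both.
FrameOrColumnUnion = Union[Frame, Column]

# Constrained TypeVar meaning "the same class comes out as went in".
FrameOrColumn = TypeVar("FrameOrColumn", Frame, Column)


def count_columns(data: FrameOrColumnUnion) -> int:
    """One annotated helper instead of two near-identical ones."""
    return data.n_cols


def identity(data: FrameOrColumn) -> FrameOrColumn:
    """Return the argument unchanged; a type checker keeps the concrete type."""
    return data
```

The union alias fits helpers that only read from the object, while the TypeVar fits helpers whose return type should follow the input type.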

Member

@simonjayhawkins simonjayhawkins left a comment

Thanks @ivanovmg for the PR.

needs a release note in 1.2 and a versionadded tag in Series.info docstring.

@@ -4564,6 +4565,96 @@ def replace(
method=method,
)

@Substitution(
Member

I don't think the Substitution decorator should be necessary with the doc decorator (and I have not seen them used together).

The doc decorator was created to supersede the Appender and Substitution decorators.
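
As a rough illustration of the idea behind a single template-filling decorator, the sketch below formats a shared docstring template with keyword parameters. `fill_doc` is a hypothetical helper written for this example; it is not the pandas `doc` decorator and makes no claim about how that one is implemented.

```python
from textwrap import dedent
from typing import Callable, TypeVar

F = TypeVar("F", bound=Callable[..., object])


def fill_doc(template: str, **params: str) -> Callable[[F], F]:
    """Hypothetical helper: format a shared template and attach it as __doc__."""

    def decorator(func: F) -> F:
        func.__doc__ = dedent(template).format(**params)
        return func

    return decorator


# One template shared by both objects; only the keyword values differ.
INFO_TEMPLATE = """
    Print a concise summary of a {klass}.

    See Also
    --------
    {see_also}
    """


@fill_doc(
    INFO_TEMPLATE,
    klass="Series",
    see_also="Series.memory_usage: Memory usage of Series.",
)
def series_info() -> None:
    pass
```

With this shape, one decorator both supplies the base text and substitutes the per-class keywords, which is the role Appender and Substitution used to share.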

Member Author

@ivanovmg ivanovmg Oct 23, 2020

I guess that Substitution is still necessary if we use one generic docstring for DataFrame and Series info. I could not figure out how to replace some keywords in the base docstring to make it suitable for both frame and series.
Probably I do not know how to use the doc decorator.

Series.memory_usage: Memory usage of Series."""
),
)
@doc(SeriesInfo.to_buffer)
Member

SeriesInfo.to_buffer doesn't have a docstring, so this doesn't render.

>>> help(pd.Series.info)
Help on function info in module pandas.core.series:

info(self, verbose: Union[bool, NoneType] = None, buf: Union[IO[str], NoneType] = None, max_cols: Union[int, NoneType] = None, memory_usage: Union[bool, str, NoneType] = None, null_counts: Union[bool, NoneType] = None) -> None

>>>

And Series.info wouldn't have the memory_usage, max_cols, and null_counts parameters anyway?

(As an aside, there appear to be a few issues with the DataFrame.info docstring on master, such as the alignment of console output and a rogue data parameter. Not sure if it has always been like this or if it comes from recent refactors, so if you get time, it would be great if you could check that out.)

Member Author

Regarding null_counts - does it mean that we do not need series info without non-null counts?

Member Author

(As an aside, there appear to be a few issues with the DataFrame.info docstring on master, such as the alignment of console output and a rogue data parameter. Not sure if it has always been like this or if it comes from recent refactors, so if you get time, it would be great if you could check that out.)

I noticed, not only here but in a couple of other places, that the indentation gets bad when using this kind of construct:

        %(max_cols_sub)s

I never touched the docstring, so probably @MarcoGorelli can comment on the rendering issue.

Member Author

Please note that I just added dedent to some of the parameter docs, which makes the info docstrings render better, without extra indentation.
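
The effect of dedenting a substituted fragment can be sketched with the standard library alone. The template and the description text below are made up for illustration; only the `%(max_cols_sub)s` placeholder name comes from the thread.

```python
from textwrap import dedent

# Shared template in the %-substitution style; the placeholder sits at the
# same indentation level as the other parameters.
TEMPLATE = dedent(
    """\
    Parameters
    ----------
    %(max_cols_sub)s
    memory_usage : bool, str, optional
    """
)

# A fragment written inside indented source code carries that indentation.
RAW_FRAGMENT = """\
        max_cols : int, optional
            Hypothetical description text."""

# Substituting the raw fragment keeps its eight leading spaces.
misaligned = TEMPLATE % {"max_cols_sub": RAW_FRAGMENT}

# Dedenting first strips the common leading whitespace, so the parameter
# name lines up with its neighbours.
aligned = TEMPLATE % {"max_cols_sub": dedent(RAW_FRAGMENT)}
```

Printing `misaligned` and `aligned` side by side shows the extra indentation on the `max_cols` entry in the first and its absence in the second.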

verbose: Optional[bool] = None,
buf: Optional[IO[str]] = None,
max_cols: Optional[int] = None,
memory_usage: Optional[Union[bool, str]] = None,
Member

the docstring for DataFrame.info is

memory_usage: bool, str, optional

I think this should be

memory_usage: bool or 'deep', optional

We might be able to use Literal here (see #37137) and maybe create an alias in typing. A follow-on is OK too.
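
A minimal sketch of what such an alias could look like, assuming Python 3.8 or newer for `typing.Literal`. The alias name `MemoryUsage` and the toy function are hypothetical and only show how the annotation narrows `str` to the single accepted value.

```python
from typing import Literal, Optional, Union

# Hypothetical alias name; the thread only suggests that one could be created.
MemoryUsage = Optional[Union[bool, Literal["deep"]]]


def normalize_memory_usage(memory_usage: MemoryUsage = None) -> str:
    """Toy function: report which mode a given argument selects."""
    if memory_usage == "deep":
        return "deep"
    if memory_usage is None:
        return "default"
    if isinstance(memory_usage, bool):
        return "on" if memory_usage else "off"
    # Reached only when a caller bypasses the type checker.
    raise ValueError(f"memory_usage must be a bool or 'deep', got {memory_usage!r}")
```

A type checker would flag `normalize_memory_usage("shallow")` statically, while the runtime check covers callers that do not run one.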

Member Author

I tried to use Literal, but it looks like it is available only starting from Python 3.8.

"Argument `max_cols` can only be passed "
"in DataFrame.info, not Series.info"
)
return SeriesInfo(self, memory_usage).to_buffer(
Member

it seems odd imo to have parameters other than buf passed to to_buffer()

would it be better to pass verbose and show_counts to SeriesInfo constructor or rename to_buffer?

Member Author

Putting the params in the constructor is possible, but then SeriesInfo would have two more attributes that are used in only one method (lower cohesion within the class).
I would prefer renaming the method. I will look into that.

Member Author

I renamed to_buffer -> render.

However, I had to give DataFrameInfo and SeriesInfo the same function signature to avoid typing errors.
Thus, I pass max_cols into render and raise ValueError there instead of in pandas.core.series.info.
How does it look?
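
The shape described here can be sketched as follows. The class names and the output format are hypothetical; only the shared `render` signature, the rejection of `max_cols`, and the error message wording follow the thread.

```python
import io
import sys
from abc import ABC, abstractmethod
from typing import IO, List, Optional


class BaseInfoSketch(ABC):
    """Hypothetical base class: every subclass exposes the same render signature."""

    @abstractmethod
    def render(
        self,
        *,
        buf: Optional[IO[str]] = None,
        max_cols: Optional[int] = None,
        verbose: Optional[bool] = None,
        show_counts: Optional[bool] = None,
    ) -> None:
        ...


class SeriesInfoSketch(BaseInfoSketch):
    """Hypothetical series-like info object (not the pandas class)."""

    def __init__(self, values: List[object]) -> None:
        self.values = values

    def render(
        self,
        *,
        buf: Optional[IO[str]] = None,
        max_cols: Optional[int] = None,
        verbose: Optional[bool] = None,
        show_counts: Optional[bool] = None,
    ) -> None:
        # max_cols exists only to keep the signature identical to the frame
        # variant, so any explicit value is rejected here.
        if max_cols is not None:
            raise ValueError(
                "Argument `max_cols` can only be passed "
                "in DataFrame.info, not Series.info"
            )
        out = sys.stdout if buf is None else buf
        out.write(f"{len(self.values)} entries\n")
```

Keeping the signatures identical satisfies the type checker's override rules, at the price of a parameter on the series side whose only job is to raise.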

@ivanovmg ivanovmg requested a review from jreback October 23, 2020 14:41
@ivanovmg
Member Author

Thanks @ivanovmg for the PR.

needs a release note in 1.2 and a versionadded tag in Series.info docstring.

I added the versionadded tag.
The problem is that it creates an extra newline in the DataFrame.info() docstring.
Any idea how to solve this? For example, if the substitution string is empty, then do not create a new line.

Or maybe it is better to just create two separate docstrings for DataFrame and Series, at the cost of some duplication?
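
One possible workaround, sketched under the assumption that the docstring is built with %-substitution: let the substituted value carry its own trailing blank line, so an empty value leaves nothing behind. The templates and the `X.Y.Z` version placeholder below are made up for illustration.

```python
# Naive template: the placeholder sits on its own line, so an empty value
# still leaves that line (and its surrounding blank lines) in place.
TEMPLATE_NAIVE = "Summary of a %(klass)s.\n\n%(version_added_sub)s\n\nParameters\n"

# Alternative: no newline after the placeholder; the value supplies it.
TEMPLATE_FIXED = "Summary of a %(klass)s.\n\n%(version_added_sub)sParameters\n"

frame_naive = TEMPLATE_NAIVE % {"klass": "DataFrame", "version_added_sub": ""}
frame_fixed = TEMPLATE_FIXED % {"klass": "DataFrame", "version_added_sub": ""}

# X.Y.Z is a placeholder, not a real version number.
series_fixed = TEMPLATE_FIXED % {
    "klass": "Series",
    "version_added_sub": ".. versionadded:: X.Y.Z\n\n",
}
```

The trade-off is that the template becomes slightly harder to read, since the line break now lives in the substitution value rather than in the template itself.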

@ivanovmg
Member Author

I added the versionadded tag.
The problem is that it creates an extra newline in the DataFrame.info() docstring.

CI/Checks complains just about that.

@jreback jreback added the Output-Formatting __repr__ of pandas objects, to_string label Nov 4, 2020
@jreback
Contributor

jreback commented Nov 4, 2020

@ivanovmg if you merge master, I will have a look

buf: Optional[IO[str]] = None,
max_cols: Optional[int] = None,
memory_usage: Optional[Union[bool, str]] = None,
null_counts: bool = True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per #36805 and #37999, this should be show_counts

Member Author

Right, but one thing at a time. Will wait for the public API update first.

@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 25, 2020
@mroeschke
Member

Going to mark as a draft as this PR depends on #38062

@mroeschke mroeschke marked this pull request as draft March 10, 2021 05:34
@mroeschke mroeschke removed the Stale label Mar 10, 2021
@jreback
Contributor

jreback commented Oct 4, 2021

Would take this; if you can merge master, I will look.

@pep8speaks

pep8speaks commented Oct 4, 2021

Hello @ivanovmg! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-11-30 02:01:11 UTC

@ivanovmg ivanovmg marked this pull request as ready for review October 5, 2021 15:39
@ivanovmg
Member Author

@jreback, I merged master and made several updates.

Contributor

@jreback jreback left a comment

lgtm (as a follow on see if we have the correct entry in api.rst)

@jreback jreback added this to the 1.4 milestone Dec 1, 2021
@jreback jreback merged commit ef3237f into pandas-dev:master Dec 1, 2021
@jreback
Contributor

jreback commented Dec 1, 2021

thanks @ivanovmg

Labels
Output-Formatting __repr__ of pandas objects, to_string
Development

Successfully merging this pull request may close these issues.

API: add Series.info method
6 participants