FIX-#4570: Replace np.bool -> np.bool_ #4571

Merged
merged 5 commits into from
Jun 16, 2022

NickCrews
Contributor

@NickCrews NickCrews commented Jun 11, 2022

I was getting:

  DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

I assume that this is the correct substitution, instead of changing to a simple bool, but I don't understand what this code does here, so I may be wrong.
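For context, here is a minimal sketch of what the warning is pointing at, assuming NumPy 1.20 or later. As a dtype argument, the builtin `bool` and `np.bool_` resolve to the same NumPy dtype, even though they are different Python classes. The variable names are illustrative only and do not come from the Modin code.

```python
import numpy as np

# Both spellings resolve to the same NumPy dtype.
from_builtin = np.zeros(3, dtype=bool)
from_numpy = np.zeros(3, dtype=np.bool_)
same_dtype = from_builtin.dtype == from_numpy.dtype

# The classes themselves are distinct objects.
same_class = np.bool_ is bool

# Elements come back as NumPy scalars regardless of which spelling was used.
element_type = type(from_builtin[0])
```

So for plain dtype arguments either substitution should silence the warning; the choice only matters where the class itself is inspected.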

If this looks good I can do the final TODOs in this checklist

What do these changes do?

  • commit message follows format outlined here
  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves BUG: DeprecationWarning from using np.bool #4570
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date
  • added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

@NickCrews NickCrews requested a review from a team as a code owner June 11, 2022 07:39
@NickCrews NickCrews changed the title FIX-4570: Replace np.bool -> np.bool_ FIX-#4570: Replace np.bool -> np.bool_ Jun 11, 2022
Collaborator

@devin-petersohn devin-petersohn left a comment

Hi @NickCrews, thanks for the contribution! The changes look great!

CI will fail on the commit message, it needs to have a signed off message (git commit -s) and the first line of the message needs to follow this: https://modin.readthedocs.io/en/latest/development/contributing.html#commit-message-formatting . Setting the first line of the commit message to the PR title will fix that.

We use commit messages to generate the release notes. Thanks, and let me know if I can help in any way!

@codecov

codecov bot commented Jun 11, 2022

Codecov Report

Merging #4571 (e33561e) into master (7df6cb3) will increase coverage by 3.10%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4571      +/-   ##
==========================================
+ Coverage   86.25%   89.36%   +3.10%     
==========================================
  Files         228      229       +1     
  Lines       18453    18727     +274     
==========================================
+ Hits        15917    16735     +818     
+ Misses       2536     1992     -544     
Impacted Files Coverage Δ
...odin/core/storage_formats/pandas/query_compiler.py 96.19% <100.00%> (ø)
modin/pandas/base.py 94.81% <100.00%> (+0.08%) ⬆️
modin/pandas/series.py 94.00% <100.00%> (ø)
...ecution/ray/implementations/pandas_on_ray/io/io.py 93.33% <0.00%> (-6.67%) ⬇️
modin/core/storage_formats/pandas/parsers.py 88.95% <0.00%> (-1.27%) ⬇️
modin/experimental/batch/test/test_pipeline.py 100.00% <0.00%> (ø)
...mentations/pandas_on_ray/partitioning/partition.py 93.57% <0.00%> (+1.83%) ⬆️
...tations/pandas_on_python/partitioning/partition.py 93.75% <0.00%> (+2.08%) ⬆️
...entations/pandas_on_dask/partitioning/partition.py 91.46% <0.00%> (+2.43%) ⬆️
... and 18 more

I was getting:
  DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

I assume that this is the correct substitution, instead of changing to a simple `bool`, but I don't understand what this code does here, so I may be wrong.

Signed-off-by: Nick Crews <nicholas.b.crews@gmail.com>
@NickCrews
Contributor Author

@devin-petersohn Thanks, I think I updated the branch correctly. I didn't do the task:

added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

because you are saying that is autogenerated? Or I need to do that?

@YarShev
Collaborator

YarShev commented Jun 12, 2022

@NickCrews, we should probably replace all entries of np.bool in the codebase (search for np.bool -> replace all to np.bool_). Would you be able to handle that?
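One caveat with a plain search-and-replace is that it can also touch names that merely start with `np.bool`, such as `np.bool_` itself. A minimal sketch of a safer pattern, assuming the code always imports NumPy as `np`; the helper name `replace_deprecated_alias` is hypothetical and not part of Modin:

```python
import re

# Word boundaries keep the pattern from touching `np.bool_` or `np.bool8`:
# `_` and digits count as word characters, so `\b` does not match after `bool` there.
DEPRECATED_ALIAS = re.compile(r"\bnp\.bool\b")


def replace_deprecated_alias(source: str, replacement: str = "np.bool_") -> str:
    """Replace the deprecated `np.bool` alias and leave other names alone."""
    return DEPRECATED_ALIAS.sub(replacement, source)


sample = "a = np.bool; b = np.bool_; c = np.bool8; d = np.boolean_thing"
fixed = replace_deprecated_alias(sample)
print(fixed)
```

The pattern also matches inside strings and comments (for example a quoted warning message), so the result would still need a manual review.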

@NickCrews
Contributor Author

@YarShev That followup commit replaced all the other instances of np.bool. But I think in those cases they should get replaced with just bool. Double check me.

@devin-petersohn
Collaborator

because you are saying that is autogenerated? Or I need to do that?

Not exactly, sorry for not being clear. We are working toward that, but for now the message also needs to be manually added here: https://github.com/modin-project/modin/blame/master/docs/release_notes/release_notes-0.16.0.rst#L9

@NickCrews
Contributor Author

Added entry to release notes! That last CI run failed, though it looked unrelated to this PR? IDK we will see this new run again.

I will note, as a new contributor the number of hoops I have to jump through, as compared to contributing to say pandas, is annoying. Autogenerated release notes, and looser requirements on how to format commit messages and issue numbers etc would probably make me more likely to contribute in the future. Just my two cents, I'm sure there are reasons for the structure!

@RehanSD
Collaborator

RehanSD commented Jun 12, 2022

Added entry to release notes! That last CI run failed, though it looked unrelated to this PR? IDK we will see this new run again.

I will note, as a new contributor the number of hoops I have to jump through, as compared to contributing to say pandas, is annoying. Autogenerated release notes, and looser requirements on how to format commit messages and issue numbers etc would probably make me more likely to contribute in the future. Just my two cents, I'm sure there are reasons for the structure!

Hi @NickCrews! That makes a lot of sense, thank you - we'll be sure to keep that in mind moving forward! One of the reasons we like to lint commits is so that the commit history on the main branch remains clear + descriptive!

I've approved the CI to run, and look forward to merging when everything is green!

@NickCrews
Contributor Author

@RehanSD I just had to force-push again to fix an error in the docs build, thanks!

@RehanSD
Collaborator

RehanSD commented Jun 12, 2022

@YarShev That followup commit replaced all the other instances of np.bool. But I think in those cases they should get replaced with just bool. Double check me.

Hi @NickCrews quick question - why are we replacing those instances with bool instead of np.bool_? I did some checking, and it seems that bool and np.bool_ behave similarly when doing dtypes checking (i.e. df.dtypes == bool and df.dtypes == np.bool_ return the same result), but np.bool_ != bool - so unless there's a compelling reason to swap to bool, I'd prefer sticking with np.bool_, in case there are some edge cases we're missing!

(I've also approved the run again)!

@YarShev
Collaborator

YarShev commented Jun 12, 2022

@RehanSD, thanks, that is a great question! Some info to be considered - https://stackoverflow.com/questions/55905690/how-exactly-does-the-behavior-of-python-bool-and-numpy-bool-differ.

@NickCrews
Contributor Author

I made the choice between going to bool vs np.bool_ based on:

  • where possible, use bool. I figured anywhere we could be basic python and not doing stuff with numpy the better. numpy recommends using bool, and I figured it would be the one users are more likely to use. These dtype checks seemed like prime use cases where someone would use bool instead of np.bool_. Per your tests you did above (nice, I didn't even bother doing that explicitly, thanks!), I think that is even more evidence this is an OK thing to do. Also, I read numpy's release notes as saying that going to bool in these cases is fine in cases where we are only interacting with the public API of numpy.
  • In the cases where I kept np.bool_ , it was because it seemed like we were getting into internals, doing some meta-programming to create new types and methods etc. I thought the distinction was more important there so I wanted to let sleeping dogs lie.

Hopefully that reasoning makes sense? Other things I'm missing? I will change everything to np.bool_ if you tell me, but I think I still prefer the way the PR is now.
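To make the "internals" concern concrete, here is a small sketch of places where the two types are not interchangeable at the scalar level. It only uses NumPy and the standard library and does not reflect any specific Modin code path:

```python
import json

import numpy as np

np_true = np.bool_(True)

# A NumPy boolean scalar is not an instance of the builtin bool, and vice versa.
np_is_builtin = isinstance(np_true, bool)
builtin_is_np = isinstance(True, np.bool_)

# Identity checks such as `x is True` also fail for NumPy scalars.
identical = np_true is True

# The standard-library JSON encoder rejects NumPy booleans
# unless they are converted to the builtin type first.
try:
    json.dumps(np_true)
    json_accepts_np = True
except TypeError:
    json_accepts_np = False

converted = json.dumps(bool(np_true))
```

Code that does type introspection or serialization is where the distinction is most likely to bite; plain dtype comparisons are less sensitive to it.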

CI failures still look unrelated? These same workflows are failing on main.

@RehanSD
Collaborator

RehanSD commented Jun 13, 2022

I think that makes sense to me - but that begets the broader question, why use np.bool_ at all? I'm fairly certain that we could just use bool everywhere - the only benefit from using np.bool_ (as far as I can tell) is that it is more memory efficient than bool. If we do decide to use np.bool_ I think it would be good to document why we made that decision (e.g. a comment that goes something like "The following APIs use np.bool_ to maintain backwards compatibility/be more memory efficient/etc.").

cc: @devin-petersohn

Thank you for pointing out the CI errors - I'll take a closer look at those later today/tomorrow!

@RehanSD
Collaborator

RehanSD commented Jun 13, 2022

Just checked and the errors are due to a dependency bug we are currently trying to resolve. #4568 is one particular solution - once we've got something merged for that, you'll be able to pull from master and have CI actually run!

@NickCrews
Contributor Author

Yeah that's true, and I think in a nice simple world we DO only use bool. I'm not sure what the implications are. I just started diving into the internals of what isin = Map.register(pandas.DataFrame.isin, dtypes=np.bool_) actually does, but then I stopped. Too complicated for me to put that much effort in, someone who is more familiar will have to speak to that.

I agree a comment would be good. At my current level of understanding I think I would say something like "Perhaps we could just use bool here?". I don't want to pretend I know more than I do, or it might keep the next maintainer from actually making a change that's good.

@RehanSD
Collaborator

RehanSD commented Jun 13, 2022

Just ran a quick test, which supports the idea that we could probably just swap np.bool_ with bool everywhere - the following code snippet seems to indicate that pandas treats np.bool_ as bool in the backend:

In [1]: import pandas as pd; import numpy as np

In [2]: df = pd.DataFrame([[True, False, True]]*1000, columns=["A", "B", "C"], dtype=np.bool_)

In [3]: df.dtypes
Out[3]:
A    bool
B    bool
C    bool
dtype: object

In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       1000 non-null   bool
 1   B       1000 non-null   bool
 2   C       1000 non-null   bool
dtypes: bool(3)
memory usage: 3.1 KB

In [5]: bool_df = pd.DataFrame([[True, False, True]]*1000, columns=["A", "B", "C"])

In [6]: bool_df.dtypes
Out[6]:
A    bool
B    bool
C    bool
dtype: object

In [7]: bool_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       1000 non-null   bool
 1   B       1000 non-null   bool
 2   C       1000 non-null   bool
dtypes: bool(3)
memory usage: 3.1 KB

In [8]: np_bool_df = pd.DataFrame([[np.bool_(True), np.bool_(False), np.bool_(True)]]*1000, columns=["A", "B", "C"])

In [9]: np_bool_df.dtypes
Out[9]:
A    bool
B    bool
C    bool
dtype: object

In [10]: np_bool_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       1000 non-null   bool
 1   B       1000 non-null   bool
 2   C       1000 non-null   bool
dtypes: bool(3)
memory usage: 3.1 KB

@YarShev
Collaborator

YarShev commented Jun 13, 2022

@YarShev That followup commit replaced all the other instances of np.bool. But I think in those cases they should get replaced with just bool. Double check me.

Hi @NickCrews quick question - why are we replacing those instances with bool instead of np.bool_? I did some checking, and it seems that bool and np.bool_ behave similarly when doing dtypes checking (i.e. df.dtypes == bool and df.dtypes == np.bool_ return the same result), but np.bool_ != bool - so unless there's a compelling reason to swap to bool, I'd prefer sticking with np.bool_, in case there are some edge cases we're missing!

(I've also approved the run again)!

@RehanSD, I am not sure that check is a fair one: comparing df.dtypes == bool or df.dtypes == np.bool_ is a different operation from comparing np.bool_ != bool.

>>> import pandas as pd; import numpy as np

>>> df = pd.DataFrame([[True, False, True]]*1000, columns=["A", "B", "C"], dtype=np.bool_)
>>> df.dtypes == bool
A    True
B    True
C    True
dtype: bool

>>> df.dtypes == np.bool
DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  df.dtypes == np.bool
A    True
B    True
C    True
dtype: bool

>>> df.dtypes == np.bool_
A    True
B    True
C    True
dtype: bool

>>> np.bool_ == bool
False
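A short sketch of why those checks behave differently: comparing a NumPy dtype instance with something else first tries to interpret the other operand as a dtype, whereas comparing the two classes is an ordinary class comparison. The variable names below are illustrative only.

```python
import numpy as np

bool_dtype = np.dtype("bool")

# np.dtype.__eq__ tries to convert the other operand to a dtype first,
# so both spellings compare equal to the dtype instance.
dtype_vs_builtin = bool_dtype == bool
dtype_vs_numpy = bool_dtype == np.bool_

# Comparing the two classes directly is ordinary class comparison.
class_vs_class = np.bool_ == bool
```

This is why `df.dtypes == bool` and `df.dtypes == np.bool_` agree even though the classes are not equal.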

@YarShev
Collaborator

YarShev commented Jun 13, 2022

Yeah that's true, and I think in a nice simple world we DO only use bool. I'm not sure what the implications are. I just started diving into the internals of what isin = Map.register(pandas.DataFrame.isin, dtypes=np.bool_) actually does, but then I stopped. Too complicated for me to put that much effort in, someone who is more familiar will have to speak to that.

@NickCrews, we use isin = Map.register(pandas.DataFrame.isin, dtypes=np.bool_) to pass dtypes in the constructor of the core Modin dataframe. I think we can use the Python bool everywhere, since NumPy itself says this will not modify any behavior and is safe, even though np.bool_ is slightly more memory efficient.

>>> import sys
>>> import numpy as np

>>> a = np.array([True], dtype=np.bool_)
>>> type(a[0])
numpy.bool_

>>> b = True
>>> type(b)
bool

>>> sys.getsizeof(a[0])
25

>>> sys.getsizeof(b)
28

@devin-petersohn, @RehanSD, @NickCrews , thoughts?
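As a complement to the `sys.getsizeof` numbers, which measure a single boxed scalar object, here is a minimal sketch of the storage cost inside an array. It assumes nothing beyond NumPy; the exact `getsizeof` values depend on the interpreter and NumPy build, so they are only printed.

```python
import sys

import numpy as np

n = 1000
as_builtin = np.ones(n, dtype=bool)
as_numpy = np.ones(n, dtype=np.bool_)

# Inside an array each boolean takes one byte under either spelling.
itemsizes = (as_builtin.itemsize, as_numpy.itemsize)
total_bytes = (as_builtin.nbytes, as_numpy.nbytes)

# getsizeof reports the size of a boxed scalar object, not array storage.
print(sys.getsizeof(as_numpy[0]), sys.getsizeof(True))
```

Since the dtype argument ends up as the same array dtype, the per-element storage should not depend on which spelling is passed.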

@devin-petersohn
Collaborator

I will note, as a new contributor the number of hoops I have to jump through, as compared to contributing to say pandas, is annoying. Autogenerated release notes, and looser requirements on how to format commit messages and issue numbers etc would probably make me more likely to contribute in the future. Just my two cents, I'm sure there are reasons for the structure!

Thanks @NickCrews this is good feedback. I think the contribution requirements list has grown and a lot of things have been added to solve one problem or another, but I agree that we should definitely try to soften as many of these requirements. This checklist has grown to be a bit too long:

image

Most of these requirements were reminders for myself/other maintainers to enforce some good discipline, but don't necessarily need to be so strict.

@devin-petersohn
Collaborator

@RehanSD @YarShev I think the conversation is getting a bit too in the weeds. I think the way @NickCrews has structured this is actually fine: np.bool_ is used within the query compiler and bool is used at the higher layers in Modin (where code should be more readable). I don't think we need to necessarily go so deep.

I am fine with the changes as-is.

@YarShev
Collaborator

YarShev commented Jun 13, 2022

@RehanSD @YarShev I think the conversation is getting a bit too in the weeds. I think the way @NickCrews has structured this is actually fine: np.bool_ is used within the query compiler and bool is used at the higher layers in Modin (where code should be more readable). I don't think we need to necessarily go so deep.

I am fine with the changes as-is.

@devin-petersohn, a Modin developer might be confused by this behavior in the future and wonder why np.bool_ is used in some places but Python bool is used in others. I would rather have consistent behavior everywhere if there is no implication or side effect of using a particular boolean dtype.

@RehanSD
Collaborator

RehanSD commented Jun 13, 2022

To be honest, I think @devin-petersohn is right - I'm happy to merge this PR as is and open a new PR to do a refactor.

@YarShev I don't know if np.bool_ is actually more memory efficient when used with pandas. It seems to me that pandas is just converting np.bool_ to bool under the hood, negating any memory savings we would expect.
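A rough way to check this, sketched under the assumption of default (non-nullable) pandas dtypes: build the same frame with and without an explicit `dtype=np.bool_` and compare the backing dtype and the data memory. The frame names are illustrative.

```python
import numpy as np
import pandas as pd

rows = [[True, False, True]] * 1000
explicit = pd.DataFrame(rows, columns=["A", "B", "C"], dtype=np.bool_)
inferred = pd.DataFrame(rows, columns=["A", "B", "C"])

# Both frames expose the same column dtypes.
same_dtypes = list(explicit.dtypes) == list(inferred.dtypes)

# The data is backed by one-byte NumPy booleans in either case.
backing_dtype = explicit["A"].to_numpy().dtype
bytes_explicit = int(explicit.memory_usage(index=False).sum())
bytes_inferred = int(inferred.memory_usage(index=False).sum())
```

If the two byte counts match, there is no memory difference to preserve at the DataFrame level, whichever spelling is used.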

@YarShev
Collaborator

YarShev commented Jun 13, 2022

To be honest, I think @devin-petersohn is right - I'm happy to merge this PR as is and open a new PR to do a refactor.

@RehanSD, I do not see a problem with doing the refactor in this PR. However, if we want to move forward, we can merge this PR as is and open a new PR to do the refactor. Would you be willing to do that?

@RehanSD
Collaborator

RehanSD commented Jun 13, 2022

Fine by me @YarShev! @devin-petersohn thoughts? Thank you for all of your awesome work on this issue @NickCrews !

Collaborator

@devin-petersohn devin-petersohn left a comment

@YarShev I don't necessarily agree that a Modin developer would get confused, especially given the git blame would link back to this discussion, but I have made it easy for @NickCrews to make it all consistent.

modin/pandas/base.py (outdated, resolved)
modin/pandas/base.py (outdated, resolved)
modin/pandas/base.py (outdated, resolved)
modin/pandas/series.py (outdated, resolved)
modin/pandas/test/dataframe/test_reduce.py (outdated, resolved)
modin/pandas/test/test_series.py (outdated, resolved)
modin/pandas/test/test_series.py (outdated, resolved)
modin/pandas/test/test_series.py (outdated, resolved)
@YarShev
Collaborator

YarShev commented Jun 14, 2022

@devin-petersohn, makes sense to me, thanks! @NickCrews, please rebase on current master to fix failures in CI.

See modin-project#4571

Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
@NickCrews
Contributor Author

Suggestions applied. Thank you all! That was easy to apply those suggestions. I haven't used that feature before but that was slick!

YarShev
YarShev previously approved these changes Jun 14, 2022
Collaborator

@YarShev YarShev left a comment

@NickCrews, thanks for the changes! Will look forward to seeing further contributions from your side.

@RehanSD
Collaborator

RehanSD commented Jun 14, 2022

Hi @NickCrews! Thank you so much for getting this in! I just wanted to include a link to join our Modin Slack as well (it's also in the README) - it's where we have longer form discussions + collaborations with other Modin Developers, and do Q+A for users - I think it would be great to have you on there too, to ease future collaboration! Link

@RehanSD
Collaborator

RehanSD commented Jun 14, 2022

Looked into the bug on CI - turns out one of our tests uses np.bool_ for describe, and since the dtypes are now bool, it doesn't match up exactly. This can be fixed by copy-pasting the following for our describe method in base.py. Also happy to push this commit if that's ok with @NickCrews ?

    def describe(
        self, percentiles=None, include=None, exclude=None, datetime_is_numeric=False
    ):  # noqa: PR01, RT01, D200
        """
        Generate descriptive statistics.
        """
        if include is not None and (isinstance(include, np.dtype) or include != "all"):
            if not is_list_like(include):
                include = [include]
            include = [
                np.dtype(i)
                if not (isinstance(i, type) and i.__module__ == "numpy")
                else i
                for i in include
            ]
            if not any(
                (isinstance(inc, np.dtype) and inc == d)
                or (
                    not isinstance(inc, np.dtype)
                    and inc.__subclasscheck__(getattr(np, d.__str__()))
                )
                or (inc in [bool, np.bool_] and d in [bool, np.bool_])
                for d in self._get_dtypes()
                for inc in include
            ):
                # This is the error that pandas throws.
                raise ValueError("No objects to concatenate")
        if exclude is not None:
            if not is_list_like(exclude):
                exclude = [exclude]
            exclude = [np.dtype(e) for e in exclude]
            if all(
                (isinstance(exc, np.dtype) and exc == d)
                or (
                    not isinstance(exc, np.dtype)
                    and exc.__subclasscheck__(getattr(np, d.__str__()))
                )
                for d in self._get_dtypes()
                for exc in exclude
            ):
                # This is the error that pandas throws.
                raise ValueError("No objects to concatenate")
        if percentiles is not None:
            # explicit conversion of `percentiles` to list
            percentiles = list(percentiles)

            # get them all to be in [0, 1]
            validate_percentile(percentiles)

            # median should always be included
            if 0.5 not in percentiles:
                percentiles.append(0.5)
            percentiles = np.asarray(percentiles)
        else:
            percentiles = np.array([0.25, 0.5, 0.75])
        return self.__constructor__(
            query_compiler=self._query_compiler.describe(
                percentiles=percentiles,
                include=include,
                exclude=exclude,
                datetime_is_numeric=datetime_is_numeric,
            )
        )

@YarShev YarShev dismissed their stale review June 14, 2022 18:32

CI is red

@NickCrews
Contributor Author

NickCrews commented Jun 14, 2022

I'm nervous that adding a oneline fixer of or (inc in [bool, np.bool_] and d in [bool, np.bool_]) is a weird exceptional case that separates bool/np.bool_ from all the other dtypes. Seems like a code smell, and we should solve this in a more generalized way.

Looking at how pandas does this, their describe() uses select_dtypes() so that those two methods are guaranteed to have the same behavior and always be in sync. Do we want to / can we do the same thing? Or somehow merge that logic together so they both use the same helper functions?
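For reference, a small sketch of the pandas behavior being described, using plain pandas rather than Modin. The frame and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"flag": [True, False, True], "n": [1, 2, 3]})

# select_dtypes accepts either spelling and picks the same columns.
cols_builtin = list(df.select_dtypes(include=bool).columns)
cols_numpy = list(df.select_dtypes(include=np.bool_).columns)

# describe() with include=... summarizes the same selection.
desc_builtin = df.describe(include=[bool])
desc_numpy = df.describe(include=[np.bool_])
```

Routing both methods through one selection step is what keeps them in sync in pandas; a Modin equivalent would need a shared helper, which is not shown here.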

IDK, could just do this tweak, merge, and then fixup later, but then it runs the risk of being a TODO that never is fixed 😆

YAY, I love it when the scope of a PR explodes.

@RehanSD
Collaborator

RehanSD commented Jun 14, 2022

That makes sense - but I think there may be an easier way to handle this - we can just use pandas.core.dtypes.common.pandas_dtype which converts a type to a pandas or numpy dtype. The result of calling this on np.bool_ is bool, which works well, so we can just map this across include and exclude in both select_dtypes and describe.
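A minimal sketch of that normalization idea, using the public re-export `pandas.api.types.pandas_dtype`. The helper name `normalize_dtypes` is hypothetical and this is not the code that was eventually committed:

```python
import numpy as np
from pandas.api.types import pandas_dtype


def normalize_dtypes(dtypes):
    """Hypothetical helper: map each entry through pandas_dtype."""
    return [pandas_dtype(d) for d in dtypes]


# The builtin, the NumPy scalar type, and the string name all normalize
# to the same NumPy dtype.
include = normalize_dtypes([bool, np.bool_, "bool"])
all_equal = all(d == np.dtype("bool") for d in include)
```

Entries that are not concrete dtypes (for example the string "all" that describe accepts) would need separate handling before this mapping; that part is left out of the sketch.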

@NickCrews
Contributor Author

@RehanSD OK that seems like a nice simple alternative. Do you think you could actually implement it though? I am getting blocked from being able to work on this locally by #4579

@RehanSD
Collaborator

RehanSD commented Jun 15, 2022

Of course - happy to do so (unless you'd prefer to do it, since it seems that Mahesh may have unblocked you)?

…es correctly

Signed-off-by: Rehan Durrani <rehan@ponder.io>
@NickCrews
Contributor Author

@RehanSD thank you, that commit looks great! @YarShev I think this is ready for review again.

Collaborator

@RehanSD RehanSD left a comment

LGTM!

@RehanSD RehanSD merged commit dbc78a9 into modin-project:master Jun 16, 2022
@YarShev
Collaborator

YarShev commented Jun 16, 2022

Thank you guys, LGTM!

@RehanSD
Collaborator

RehanSD commented Jun 16, 2022

Thank you @NickCrews! I've gone ahead and merged this!

Development

Successfully merging this pull request may close these issues.

BUG: DeprecationWarning from using np.bool
4 participants