-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#4570: Replace np.bool -> np.bool_ #4571
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @NickCrews, thanks for the contribution! The changes look great!
CI will fail on the commit message, it needs to have a signed off message (git commit -s
) and the first line of the message needs to follow this: https://modin.readthedocs.io/en/latest/development/contributing.html#commit-message-formatting . Setting the first line of the commit message to the PR title will fix that.
We use commit messages to generate the release notes. Thanks, and let me know if I can help in any way!
Codecov Report
@@ Coverage Diff @@
## master #4571 +/- ##
==========================================
+ Coverage 86.25% 89.36% +3.10%
==========================================
Files 228 229 +1
Lines 18453 18727 +274
==========================================
+ Hits 15917 16735 +818
+ Misses 2536 1992 -544
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
I was getting: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations I assume that this is the correct substitution, instead of chaning to a simple `bool`, but I don't understand what this code does here, so I may be wrong. Signed-off-by: Nick Crews <nicholas.b.crews@gmail.com>
@devin-petersohn Thanks, I think I updated the branch correctly. I didn't do the task:
because you are saying that is autogenerated? Or I need to do that? |
@NickCrews, we should probably replace all entries of |
@YarShev That followup commit replaced all the other instances of np.bool. But I think in those cases they should get replaced with just bool. Double check me. |
Not exactly, sorry for not being clear. We are working toward that, but for now the message also needs to be manually added here: https://github.com/modin-project/modin/blame/master/docs/release_notes/release_notes-0.16.0.rst#L9 |
Added entry to release notes! That last CI run failed, though it looked unrelated to this PR? IDK we will see this new run again. I will note, as a new contributor the number of hoops I have to jump through, as compared to contributing to say pandas, is annoying. Autogenerated release notes, and looser requirements on how to format commit messages and issue numbers etc would probably make me more likely to contribute in the future. Just my two cents, I'm sure there are reasons for the structure! |
Hi @NickCrews! That makes a lot of sense, thank you - we'll be sure to keep that in mind moving forward! One of the reasons we like to lint commits is so that the commit history on the main branch remains clear + descriptive! I've approved the CI to run, and look forward to merging when everything is green! |
@RehanSD I just had to force-push again to fix a error in the docs build, thanks! |
Hi @NickCrews quick question - why are we replacing those instances with (I've also approved the run again)! |
@RehanSD, thanks, that is a great question! Some info to be considered - https://stackoverflow.com/questions/55905690/how-exactly-does-the-behavior-of-python-bool-and-numpy-bool-differ. |
I made the choice between going to bool vs np.bool_ based on:
Hopefully that reasoning makes sense? Other things I'm missing? I will change everything to np.bool_ if you tell me, but I think I still prefer the way the PR is now. CI failures still look unrelated? These same workflows are failing on |
I think that makes sense to me - but that begets the broader question, why use cc: @devin-petersohn Thank you for pointing out the CI errors - I'll take a closer look at those later today/tomorrow! |
Just checked and the errors are due to a dependency bug we are currently trying to resolve. #4568 is one particular solution - once we've got something merged for that, you'll be able to pull from master and have CI actually run! |
Yeah that's true, and I think in a nice simple world we DO only use bool. I'm not sure what the implications are. I just started diving into the internals of what I agree a comment would be good. At my current level of understanding I think I would say something like |
Just ran a quick test, which supports the idea that we could probably just swap In [1]: import pandas as pd; import numpy as np
In [2]: df = pd.DataFrame([[True, False, True]]*1000, columns=["A", "B", "C"], dtype=np.bool_)
In [3]: df.dtypes
Out[3]:
A bool
B bool
C bool
dtype: object
In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 1000 non-null bool
1 B 1000 non-null bool
2 C 1000 non-null bool
dtypes: bool(3)
memory usage: 3.1 KB
In [5]: bool_df = pd.DataFrame([[True, False, True]]*1000, columns=["A", "B", "C"])
In [6]: bool_df.dtypes
Out[6]:
A bool
B bool
C bool
dtype: object
In [7]: bool_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 1000 non-null bool
1 B 1000 non-null bool
2 C 1000 non-null bool
dtypes: bool(3)
memory usage: 3.1 KB
In [8]: np_bool_df = pd.DataFrame([[np.bool_(True), np.bool_(False), np.bool_(True)]]*1000, columns=["A", "B", "C"])
In [9]: np_bool_df.dtypes
Out[9]:
A bool
B bool
C bool
dtype: object
In [10]: np_bool_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 1000 non-null bool
1 B 1000 non-null bool
2 C 1000 non-null bool
dtypes: bool(3)
memory usage: 3.1 KB |
@RehanSD, I am not sure that is the fair check you did because when comparing import pandas as pd; import numpy as np
df = pd.DataFrame([[True, False, True]]*1000, columns=["A", "B", "C"], dtype=np.bool_)
df.dtypes == bool
A True
B True
C True
dtype: bool
df.dtypes == np.bool
DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
df.dtypes == np.bool
A True
B True
C True
dtype: bool
df.dtypes == np.bool_
A True
B True
C True
dtype: bool
np.bool_ == bool
False |
@NickCrews, we use import sys
import numpy as np
a = np.array([True], dtype=np.bool_)
type(a[0])
numpy.bool_
b = True
type(b)
bool
sys.getsizeof(a[0])
25
sys.getsizeof(b)
28 @devin-petersohn, @RehanSD, @NickCrews , thoughts? |
Thanks @NickCrews this is good feedback. I think the contribution requirements list has grown and a lot of things have been added to solve one problem or another, but I agree that we should definitely try to soften as many of these requirements. This checklist has grown to be a bit too long: Most of these requirements were reminders for myself/other maintainers to enforce some good discipline, but don't necessarily need to be so strict. |
@RehanSD @YarShev I think the conversation is getting a bit too in the weeds. I think the way @NickCrews has structured this is actually fine: I am fine with the changes as-is. |
@devin-petersohn, a Modin developer might be confused with this behavior in future and he will have a question why |
To be honest, I think @devin-petersohn is right - I'm happy to merge this PR as is and open a new PR to do a refactor. @YarShev I don't know if |
@RehanSD, I do not see a problem to do the refactor in this PR. However, if we want to move forward, we can merge this PR as is and open a new PR to do the refactor. Would you be willing to do that? |
Fine by me @YarShev! @devin-petersohn thoughts? Thank you for all of your awesome work on this issue @NickCrews ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YarShev I don't necessarily agree that a Modin developer would get confused, especially given the git blame
would link back to this discussion, but I have made it easy for @NickCrews to make it all consistent.
@devin-petersohn, makes sense to me, thanks! @NickCrews, please rebase on current master to fix failures in CI. |
See modin-project#4571 Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Suggestions applied. Thank you all! That was easy to apply those suggestions. I haven't used that feature before but that was slick! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NickCrews, thanks for the changes! Will look forward to seeing further contributions from your side.
Hi @NickCrews! Thank you so much for getting this in! I just wanted to include a link to join our Modin Slack as well (it's also in the README) - its where we have longer form discussions + collaborations with other Modin Developers, and do Q+A for users - I think it would be great to have you on there too, to ease future collaboration! Link |
Looked into the bug on CI - turns out one of our tests uses def describe(
self, percentiles=None, include=None, exclude=None, datetime_is_numeric=False
): # noqa: PR01, RT01, D200
"""
Generate descriptive statistics.
"""
if include is not None and (isinstance(include, np.dtype) or include != "all"):
if not is_list_like(include):
include = [include]
include = [
np.dtype(i)
if not (isinstance(i, type) and i.__module__ == "numpy")
else i
for i in include
]
if not any(
(isinstance(inc, np.dtype) and inc == d)
or (
not isinstance(inc, np.dtype)
and inc.__subclasscheck__(getattr(np, d.__str__()))
)
or (inc in [bool, np.bool_] and d in [bool, np.bool_])
for d in self._get_dtypes()
for inc in include
):
# This is the error that pandas throws.
raise ValueError("No objects to concatenate")
if exclude is not None:
if not is_list_like(exclude):
exclude = [exclude]
exclude = [np.dtype(e) for e in exclude]
if all(
(isinstance(exc, np.dtype) and exc == d)
or (
not isinstance(exc, np.dtype)
and exc.__subclasscheck__(getattr(np, d.__str__()))
)
for d in self._get_dtypes()
for exc in exclude
):
# This is the error that pandas throws.
raise ValueError("No objects to concatenate")
if percentiles is not None:
# explicit conversion of `percentiles` to list
percentiles = list(percentiles)
# get them all to be in [0, 1]
validate_percentile(percentiles)
# median should always be included
if 0.5 not in percentiles:
percentiles.append(0.5)
percentiles = np.asarray(percentiles)
else:
percentiles = np.array([0.25, 0.5, 0.75])
return self.__constructor__(
query_compiler=self._query_compiler.describe(
percentiles=percentiles,
include=include,
exclude=exclude,
datetime_is_numeric=datetime_is_numeric,
)
) |
I'm nervous that adding a oneline fixer of Looking at how pandas does this, their IDK, could just do this tweak, merge, and then fixup later, but then it runs the risk of being a TODO that never is fixed 😆 YAY, I love it when the scope of a PR explodes. |
That makes sense - but I think there may be an easier way to handle this - we can just use |
Of course - happy to do so (unless you'd prefer to do it, since it seems that Mahesh may have unblocked you)? |
…es correctly Signed-off-by: Rehan Durrani <rehan@ponder.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Thank you guys, LGTM! |
Thank you @NickCrews! I've gone ahead and merged this! |
I was getting:
I assume that this is the correct substitution, instead of changing to a simple
bool
, but I don't understand what this code does here, so I may be wrong.If this looks good I can do the final TODOs in this checklist
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date