Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: RecursionError when attempting to replace np.nan values under main branch #45725

Closed
3 tasks done
roib20 opened this issue Jan 30, 2022 · 8 comments · Fixed by #45749 or #48234
Closed
3 tasks done

BUG: RecursionError when attempting to replace np.nan values under main branch #45725

roib20 opened this issue Jan 30, 2022 · 8 comments · Fixed by #45749 or #48234
Labels
Bug Regression Functionality that used to work in a prior pandas version replace replace method
Milestone

Comments

@roib20
Copy link
Contributor

roib20 commented Jan 30, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np


NaN_values = [np.nan, np.nan]
ser = pd.Series(data=NaN_values)
df = pd.DataFrame(data=NaN_values)


try:
    ser = ser.replace({np.nan: pd.NA})
except RecursionError:
    print("RecursionError: maximum recursion depth exceeded while calling a Python object")

try:
    df = df.replace({np.nan: None})
except RecursionError:
    print("RecursionError: maximum recursion depth exceeded while calling a Python object")

Issue Description

Note: this bug is not present in 1.4.x branch, but it is in the main branch and 1.5 milestone.

Let's say we have a Series or DataFrame containing numpy.nan values (a common scenario), and we want to convert them all to Panda's own type pandas.NA, or alternatively to the native Python None type. One way to do this conversion is by using the replace method as shown in the example. From my testing this works in pandas 1.4.0 as well as 1.3.5, but not in the main branch. A RecursionError exception is thrown.

Suggested tags: 1.5 Milestone, Regression, replace, Dtype Conversions

Expected Behavior

The replace method should successfully replace np.nan with the desired value, without errors. This behavior works in pandas 1.4.0 and previous versions and should be fixed for the pandas 1.5 release.

Installed Versions

commit : f77480f

python : 3.10.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.5.0.dev0+234.gf77480f9f5
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.2
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@roib20 roib20 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 30, 2022
@roib20
Copy link
Contributor Author

roib20 commented Jan 30, 2022

From my testing and debugging, I think this regression is caused from PR #45304 (merged to main branch with the 1.5 milestone), specifically the change in the coerce_to_target_dtype method under pandas/core/internals/blocks.py. Reverting this method to its previous code avoids the RecursionError.

@jbrockmendel
Copy link
Member

solution is to add value = self._standardize_fill_value(value) near the top of Block.replace.

@roib20 roib20 changed the title BUG: RecursionError when atempting to replace "np.nan" values under main branch BUG: RecursionError when attemptingto replace "np.nan" values under main branch Feb 1, 2022
@roib20 roib20 changed the title BUG: RecursionError when attemptingto replace "np.nan" values under main branch BUG: RecursionError when attempting to replace "np.nan" values under main branch Feb 1, 2022
@roib20
Copy link
Contributor Author

roib20 commented Feb 1, 2022

Alright working on a PR with Fix+Tests now.

roib20 added a commit to roib20/pandas that referenced this issue Feb 1, 2022
roib20 added a commit to roib20/pandas that referenced this issue Feb 1, 2022
@roib20 roib20 changed the title BUG: RecursionError when attempting to replace "np.nan" values under main branch BUG: RecursionError when attempting to replace np.nan values under main branch Feb 1, 2022
@jreback jreback added this to the 1.5 milestone Feb 1, 2022
@jreback jreback added replace replace method and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 1, 2022
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Mar 16, 2022
@simonjayhawkins
Copy link
Member

From my testing and debugging, I think this regression is caused from PR #45304 (merged to main branch with the 1.5 milestone), specifically the change in the coerce_to_target_dtype method under pandas/core/internals/blocks.py. Reverting this method to its previous code avoids the RecursionError.

hmm, bisect is giving first bad commit: [eae37b0] BUG: Series[Interval[int]][1] = np.nan incorrect coercion/raising (#45568)

@simonjayhawkins
Copy link
Member

hmm, bisect is giving first bad commit: [eae37b0] BUG: Series[Interval[int]][1] = np.nan incorrect coercion/raising (#45568)

can confirm that the regression reported here is the result of the following code added to find_result_type in #45568

    elif is_valid_na_for_dtype(right, left.dtype):
        # e.g. IntervalDtype[int] and None/np.nan
        new_dtype = ensure_dtype_can_hold_na(left.dtype)

@jbrockmendel
Copy link
Member

can confirm that the regression reported here is the result of the following code added to find_result_type

what are right and left.dtype in the relevant cases?

@simonjayhawkins
Copy link
Member

ignoring the None case for now, for the .replace({np.nan: pd.NA}) case

in Block.replace

> /home/simon/pandas/pandas/core/internals/blocks.py(588)replace()
-> if not mask.any():
(Pdb) self._can_hold_element(value)
False
(Pdb) self
NumericBlock: slice(0, 1, 1), 1 x 2, dtype: float64
(Pdb) value
<NA>
(Pdb) 

so we know that we need to enter the

        elif self.ndim == 1 or self.shape[0] == 1:
            blk = self.coerce_to_target_dtype(value)
            return blk.replace(
                to_replace=to_replace,
                value=value,
                inplace=True,
                mask=mask,
            )

block once the blocks have been split and change the dtype of the block so that self._can_hold_element(value) will be True on that next recursive call.

what are right and left.dtype in the relevant cases?

the call to self.coerce_to_target_dtype(value) will reach the code in find_result_type

> /home/simon/pandas/pandas/core/dtypes/cast.py(1482)find_result_type()
-> isinstance(left, np.ndarray)
(Pdb) a
left = array([[nan, nan]])
right = <NA>
(Pdb) left.dtype
dtype('float64')
(Pdb) is_valid_na_for_dtype(right, left.dtype)
True
(Pdb) 

and we set new_dtype = ensure_dtype_can_hold_na(left.dtype)

(Pdb) ensure_dtype_can_hold_na(left.dtype)
dtype('float64')

instead of returning object

(Pdb) dtype, _ = infer_dtype_from(right, pandas_dtype=True)
(Pdb) dtype
dtype('O')
(Pdb) find_common_type([left.dtype, dtype])
dtype('O')

@jbrockmendel
Copy link
Member

jbrockmendel commented Mar 21, 2022

Yikes, there are a couple of unpleasant things interacting here.

In many Block methods we do value = self._standardize_fill_value(value) near the top to convert considered-equivalent NA values. We don't do that in Block.replace. Doing that in Block.replace would prevent the RecursionError, but these would become no-ops, which I don't think is what the user wants.

The alternative is to intercept this particular situation somewhere. I think @simonjayhawkins has a branch that intercepts it at the level of find_common_type, but unless there are other use cases I'm inclined to handle it in Block.replace (though he likely has thought about it more than i have).

There's also an alternative where we do value = self._standardize_fill_value(value) in Block.replace for the sake of cross-method consistency and tell users they need to explicitly cast in order to avoid this being a no-op.

@mroeschke mroeschke removed this from the 1.5 milestone Aug 15, 2022
@phofl phofl mentioned this issue Sep 8, 2022
@phofl phofl added this to the 1.5.1 milestone Oct 7, 2022
@phofl phofl added the Regression Functionality that used to work in a prior pandas version label Oct 7, 2022
@datapythonista datapythonista modified the milestones: 1.5.1, 1.5.2 Oct 20, 2022
damccorm added a commit to apache/beam that referenced this issue Dec 2, 2022
* [WIP] Pandas 1.5 support

* Partial progress

* Exclude pandas 1.5.0 and 1.5.1 because of pandas-dev/pandas#45725

* Plumb group_keys more places

* Fix bad argument

* Fix bad argument

* Debug

* Debug

* Debug

* Debug

* allow list tests

* Update changes

* fmt

* Lint

* Lint

* Fix pd_version

* Fix pd_version

* Remove CHANGES.md since this needs to go in 2.45 section
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Regression Functionality that used to work in a prior pandas version replace replace method
Projects
None yet
7 participants