Skip to content

Commit

Permalink
Backport PR #37499: REGR: fix isin for large series with nan and mixe…
Browse files Browse the repository at this point in the history
…d object dtype (causing regression in read_csv) (#37517)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
  • Loading branch information
meeseeksmachine and jorisvandenbossche authored Oct 30, 2020
1 parent da25ec5 commit b317a48
Show file tree
Hide file tree
Showing 4 changed files with 27 additions and 1 deletion.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.4.rst
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ including other versions of pandas.
Fixed regressions
~~~~~~~~~~~~~~~~~
- Fixed regression in :func:`read_csv` raising a ``ValueError`` when ``names`` was of type ``dict_keys`` (:issue:`36928`)
- Fixed regression in :func:`read_csv` with more than 1M rows and specifying a ``index_col`` argument (:issue:`37094`)
- Fixed regression where attempting to mutate a :class:`DateOffset` object would no longer raise an ``AttributeError`` (:issue:`36940`)
- Fixed regression where :meth:`DataFrame.agg` would fail with :exc:`TypeError` when passed positional arguments to be passed on to the aggregation function (:issue:`36948`).
- Fixed regression in :class:`RollingGroupby` with ``sort=False`` not being respected (:issue:`36889`)
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -440,7 +440,7 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
if len(comps) > 1_000_000 and not is_object_dtype(comps):
# If the the values include nan we need to check for nan explicitly
# since np.nan it not equal to np.nan
if np.isnan(values).any():
if isna(values).any():
f = lambda c, v: np.logical_or(np.in1d(c, v), np.isnan(c))
else:
f = np.in1d
Expand Down
15 changes: 15 additions & 0 deletions pandas/tests/io/parser/test_index_col.py
Original file line number Diff line number Diff line change
Expand Up @@ -207,3 +207,18 @@ def test_header_with_index_col(all_parsers):

result = parser.read_csv(StringIO(data), index_col="I11", header=0)
tm.assert_frame_equal(result, expected)


@pytest.mark.slow
def test_index_col_large_csv(all_parsers):
# https://github.com/pandas-dev/pandas/issues/37094
parser = all_parsers

N = 1_000_001
df = DataFrame({"a": range(N), "b": np.random.randn(N)})

with tm.ensure_clean() as path:
df.to_csv(path, index=False)
result = parser.read_csv(path, index_col=[0])

tm.assert_frame_equal(result, df.set_index("a"))
10 changes: 10 additions & 0 deletions pandas/tests/series/methods/test_isin.py
Original file line number Diff line number Diff line change
Expand Up @@ -89,3 +89,13 @@ def test_isin_read_only(self):
result = s.isin(arr)
expected = Series([True, True, True])
tm.assert_series_equal(result, expected)


@pytest.mark.slow
def test_isin_large_series_mixed_dtypes_and_nan():
# https://github.com/pandas-dev/pandas/issues/37094
# combination of object dtype for the values and > 1_000_000 elements
ser = Series([1, 2, np.nan] * 1_000_000)
result = ser.isin({"foo", "bar"})
expected = Series([False] * 3 * 1_000_000)
tm.assert_series_equal(result, expected)

0 comments on commit b317a48

Please sign in to comment.