Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Make maybe_convert_object respect dtype itemsize #40908

Merged
merged 14 commits into from
Apr 21, 2021

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Apr 12, 2021

  • Ensure all linting tests pass, see here for how to run them

Precursor to #40790

This adds support for e.g. float32 NumPy dtypes to maybe_convert_object. If any non-NumPy scalar is hit, the behavior is the same as master. This is my first foray into the NumPy C-API, so any tips are appreciated. In particular, I couldn't figure out how to use the C API to do the cast:

result = result.astype(result.dtype.kind + str(itemsize))

Not sure if there should also be specific logic for EAs/nullable types.

From a full ASV run:

       before           after         ratio
     [7d4757b4]       [c1288962]
     <maybe_convert_object_itemsize~12>       <maybe_convert_object_itemsize>
+         768±8μs         1.01±0ms     1.32  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'float')
+         779±9μs         1.02±0ms     1.31  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'float')
+         579±5μs         710±20μs     1.23  arithmetic.NumericInferOps.time_divide(<class 'numpy.int8'>)
+        608±30μs         715±20μs     1.18  arithmetic.NumericInferOps.time_divide(<class 'numpy.uint8'>)
+      3.18±0.2μs       3.69±0.3μs     1.16  index_cached_properties.IndexCache.time_engine('UInt64Index')
+     1.58±0.01ms      1.83±0.01ms     1.16  ctors.SeriesConstructors.time_series_constructor(<function arr_dict at 0x7f5964729820>, False, 'float')
+      3.30±0.2μs       3.80±0.5μs     1.15  index_cached_properties.IndexCache.time_engine('TimedeltaIndex')
+        943±80ns       1.08±0.1μs     1.15  index_cached_properties.IndexCache.time_inferred_type('Float64Index')
+     1.73±0.01ms         1.98±0ms     1.14  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'int')
+     1.72±0.01ms         1.96±0ms     1.14  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'int')
+     5.19±0.03μs       5.93±0.3μs     1.14  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, datetime.timezone(datetime.timedelta(seconds=3600)))
+     1.65±0.01ms      1.89±0.01ms     1.14  ctors.SeriesConstructors.time_series_constructor(<function arr_dict at 0x7f5964729820>, True, 'float')
+     6.06±0.06μs       6.91±0.1μs     1.14  tslibs.tz_convert.TimeTZConvert.time_tz_convert_from_utc(1, datetime.timezone(datetime.timedelta(seconds=3600)))
+     7.46±0.04μs       8.47±0.5μs     1.14  tslibs.offsets.OffestDatetimeArithmetic.time_subtract(<BusinessDay>)
+     1.15±0.09μs      1.27±0.08μs     1.11  index_cached_properties.IndexCache.time_values('UInt64Index')
-        11.7±1ms       10.5±0.1ms     0.90  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.object_'>, 18, 'outside')
-        641±30ns         577±20ns     0.90  index_cached_properties.IndexCache.time_is_monotonic('RangeIndex')
-        641±20ns         576±20ns     0.90  index_cached_properties.IndexCache.time_shape('Int64Index')
-     3.37±0.03ms      3.02±0.02ms     0.90  timeseries.ResampleSeries.time_resample('period', '5min', 'ohlc')
-        787±20ns         702±20ns     0.89  index_cached_properties.IndexCache.time_is_monotonic_decreasing('RangeIndex')
-     1.25±0.04μs      1.12±0.04μs     0.89  index_cached_properties.IndexCache.time_is_all_dates('Int64Index')
-      1.50±0.1ms      1.33±0.02ms     0.88  dtypes.SelectDtypes.time_select_dtype_string_exclude('float32')
-     1.19±0.09μs      1.05±0.08μs     0.88  index_cached_properties.IndexCache.time_inferred_type('UInt64Index')
-        57.6±1μs       50.3±0.2μs     0.87  frame_methods.Dtypes.time_frame_dtypes
-         168±3μs        146±0.6μs     0.87  algos.isin.IsinWithArangeSorted.time_isin(<class 'numpy.uint64'>, 8000)
-        552±20ns         480±20ns     0.87  index_cached_properties.IndexCache.time_is_monotonic_increasing('RangeIndex')
-      1.50±0.1ms         1.30±0ms     0.87  dtypes.SelectDtypes.time_select_dtype_string_exclude('complex64')
-        1.89±0ms      1.64±0.01ms     0.87  period.DataFramePeriodColumn.time_set_index
-      11.0±0.2μs      9.50±0.08μs     0.87  period.Indexing.time_series_loc
-        110±20ms      93.1±0.06ms     0.85  algos.isin.IsInLongSeriesLookUpDominates.time_isin('float32', 1000, 'random_hits')
-        392±30ns         330±10ns     0.84  index_cached_properties.IndexCache.time_inferred_type('RangeIndex')
-      4.45±0.4ms      3.68±0.02ms     0.83  algorithms.Factorize.time_factorize(True, True, 'Int64')
-      9.60±0.8ms      7.75±0.01ms     0.81  algorithms.Factorize.time_factorize(True, False, 'string')
-      11.2±0.8ms       8.90±0.1ms     0.80  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.object_'>, 18, 'inside')
-        615±30ns         484±30ns     0.79  index_cached_properties.IndexCache.time_is_monotonic_increasing('Int64Index')
-       73.8±20ms       55.4±0.5ms     0.75  algos.isin.IsInLongSeriesLookUpDominates.time_isin('float32', 1000, 'monotone_hits')
-       180±0.1μs        128±0.5μs     0.71  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int8Engine'>, <class 'numpy.int8'>), 'monotonic_incr')
-        128±40ms      90.8±0.03ms     0.71  algos.isin.IsInLongSeriesLookUpDominates.time_isin('object', 5, 'monotone_hits')
-       186±0.3μs          131±2μs     0.70  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int16Engine'>, <class 'numpy.int16'>), 'monotonic_incr')
-      1.72±0.4μs       1.17±0.1μs     0.68  index_cached_properties.IndexCache.time_inferred_type('TimedeltaIndex')
-       216±0.8μs          140±3μs     0.65  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.UInt32Engine'>, <class 'numpy.uint32'>), 'monotonic_incr')
-     1.55±0.01ms          972±7μs     0.63  algos.isin.IsinWithArangeSorted.time_isin(<class 'numpy.int64'>, 100000)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Specific timings via %timeit on maybe_convert_objects directly:

np.array of python integers:
127 ms ± 80.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <--- PR
120 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <--- master

np.array of int32:
179 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <--- PR
133 ms ± 274 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)    <--- master

np.array of int32 with last one a python int:
178 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <--- PR
133 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)    <--- master
timeit code
values = np.array(range(1_000_000), dtype="object")
print('np.array of python integers:')
%timeit maybe_convert_objects(values)
print()

values = np.array([np.int32(1)] * 1_000_000, dtype="object")
print('np.array of int32:')
%timeit maybe_convert_objects(values)
print()

values = np.array([np.int32(1)] * 999_999 + [1], dtype="object")
print('np.array of int32 with last one a python int:')
%timeit maybe_convert_objects(values)
print()

@rhshadrach rhshadrach added Enhancement Dtype Conversions Unexpected or buggy dtype conversions labels Apr 12, 2021
@@ -195,6 +200,24 @@ cdef inline bint is_nan(object val):
return is_complex_object(val) and val != val


cdef inline int64_t get_itemsize(object val):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is only used in lib.pyx, i think better to put it there

elif seen.is_bool and not seen.nan_:
return bools.view(np.bool_)
result = bools.view(np.bool_)
if result is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be tightened to something like result is floats or result is uints or result is ints? i.e. exclude datetimes/timedeltas/bools

result = bools.view(np.bool_)
if result is not None:
if itemsize_max > 0:
curr_itemsize = cnp.PyArray_ITEMSIZE(result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id just use result.dtype.itemsize and not bother with the C API.

@jreback
Copy link
Contributor

jreback commented Apr 13, 2021

can you add some explict tests for this, maybe around

pandas/tests/dtypes/test_inference.py:        result = lib.maybe_convert_objects(arr)
pandas/tests/dtypes/test_inference.py:    def test_maybe_convert_objects_uint64(self):
pandas/tests/dtypes/test_inference.py:        tm.assert_numpy_array_equal(lib.maybe_convert_objects(arr), exp)
pandas/tests/dtypes/test_inference.py:        tm.assert_numpy_array_equal(lib.maybe_convert_objects(arr), exp)
pandas/tests/dtypes/test_inference.py:        tm.assert_numpy_array_equal(lib.maybe_convert_objects(arr), exp)
pandas/tests/dtypes/test_inference.py:        tm.assert_numpy_array_equal(lib.maybe_convert_objects(arr), exp)
pandas/tests/dtypes/test_inference.py:    def test_maybe_convert_objects_datetime(self):
pandas/tests/dtypes/test_inference.py:        out = lib.maybe_convert_objects(
pandas/tests/dtypes/test_inference.py:        out = lib.maybe_convert_objects(
pandas/tests/dtypes/test_inference.py:        out = lib.maybe_convert_objects(
pandas/tests/dtypes/test_inference.py:    def test_maybe_convert_objects_timedelta64_nat(self):
pandas/tests/dtypes/test_inference.py:        result = lib.maybe_convert_objects(arr, convert_timedelta=True)
pandas/tests/dtypes/test_inference.py:    def test_maybe_convert_objects_nullable_integer(self, exp):
pandas/tests/dtypes/test_inference.py:        result = lib.maybe_convert_objects(arr, convert_to_nullable_integer=True)
pandas/tests/dtypes/test_inference.py:    def test_maybe_convert_objects_bool_nan(self):
pandas/tests/dtypes/test_inference.py:        out = lib.maybe_convert_objects(ind.values, safe=1)
pandas/tests/dtypes/test_inference.py:        result = lib.maybe_convert_objects(arr, convert_datetime=True)

@rhshadrach
Copy link
Member Author

rhshadrach commented Apr 13, 2021

Edit: Turns out, this was an issue with Series.count.

Thanks @jbrockmendel and @jreback - will make changes and tests. But I want to figure out what the right output here is:

        df = DataFrame(
            {"A": [None, 2, 3], "B": [1.0, np.nan, 3.0], "C": ["foo", None, "bar"]}
        )
    
        # Function aggregate
        result = getattr(df, how)({"A": "count"})
        expected = Series({"A": 2})
    
>       tm.assert_series_equal(result, expected)
E       AssertionError: Attributes of Series are different
E       
E       Attribute "dtype" are different
E       [left]:  int32
E       [right]: int64

The count is computed by summing an ndarray of Booleans, which NumPy gives back on a 32-bit machine as an int32. This PR is making it stay an int32 (master would convert to int64).

Do we want Series({"A": 2}) to come back on a 32-bit machine as being an int32 as well? If the answer is no, then I'm thinking this PR should be scrapped. If we do, I think we can implement that here as well:

    if result is uints or result is ints or result is floats:
        if itemsize_max > 0 and itemsize_max != result.dtype.itemsize:
            result = result.astype(result.dtype.kind + str(itemsize_max))
        else:
            result = result.astype(result.dtype.kind + str(np.intp().dtype.itemsize))
        return result

(with perhaps a better way to get at the 4 vs 8 bytes)

@jreback
Copy link
Contributor

jreback commented Apr 14, 2021

we diverge with numpy by always casting a list of python into to int64 (numpy would do int32)

@rhshadrach
Copy link
Member Author

Thanks @jreback - this was actually an issue with Series.count. All paths of DataFrame.count cast to int64, and when level is not None, Series would do this as well. But when using Series.count with level=None, no cast was done. I'm hopeful adding this in (along with some other test fixes) will fix CI.

@rhshadrach rhshadrach marked this pull request as draft April 15, 2021 03:42
@rhshadrach
Copy link
Member Author

rhshadrach commented Apr 15, 2021

The xfailed test on DataFrame.replace is due to a change in NumPy behavior:

values = np.array(['a'], dtype='object')
mask = np.array([True])
value = np.int8(1)
np.putmask(values, mask, value)
print(type(values[0]))

results in <class 'int'> with NumPy 1.19 but <class 'numpy.int8'> with NumPy >= 1.20

@@ -1891,7 +1891,7 @@ def count(self, level=None):
2
"""
if level is None:
return notna(self._values).sum()
return notna(self._values).sum().astype("int64")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All paths for Series.count and DataFrame.count result in int64 except for this one.

@rhshadrach rhshadrach marked this pull request as ready for review April 15, 2021 20:06
@rhshadrach
Copy link
Member Author

@jbrockmendel - changes made, @jreback - test added.

This still needs a whatsnew note, I think at least the impact on Series/DataFrame construction with numpy types should perhaps be in a section of its own. But maybe_convert_objects is used in many places, not sure if there should be more mentioned. When I follow this up with #40790 I'll add on the impact to UDF methods.

@rhshadrach
Copy link
Member Author

Now I'm second guessing making its own section - I'd guess that construction via a list of NumPy scalars is not so common. Maybe just a single line in other enhancements?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment. any perf issues?

elif seen.is_bool and not seen.nan_:
return bools.view(np.bool_)
result = bools.view(np.bool_)
if result is uints or result is ints or result is floats or result is complexes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put a blank line & comment here (e.g. casting to itemsize)

@rhshadrach
Copy link
Member Author

@jreback - changes made

any perf issues?

I updated the ASV in the OP, and included some timeit using maybe_convert_objects directly.

@jreback jreback added this to the 1.3 milestone Apr 20, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm merge on greenish

@rhshadrach
Copy link
Member Author

@jreback - greenish. Failure is:

      raise AssertionError(f"Caused unexpected warning(s): {repr(extra_warnings)}")

E AssertionError: Caused unexpected warning(s): [('ResourceWarning', ResourceWarning('unclosed transport <asyncio.sslproto._SSLProtocolTransport object at 0x000001FA4204DFC8>'), 'C:\Miniconda\envs\pandas-dev\lib\asyncio\sslproto.py', 322)]

@jreback jreback merged commit 549e39b into pandas-dev:master Apr 21, 2021
@jreback
Copy link
Contributor

jreback commented Apr 21, 2021

thanks @rhshadrach

@rhshadrach rhshadrach deleted the maybe_convert_object_itemsize branch April 21, 2021 15:37
yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants