Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/BUG: Series(floating, dtype=intlike) ignores dtype, DataFrame casts #40110

Closed
jbrockmendel opened this issue Feb 27, 2021 · 20 comments
Closed
Labels
Bug Constructors Series/DataFrame/Index/pd.array Constructors Deprecate Functionality to remove in pandas Dtype Conversions Unexpected or buggy dtype conversions Needs Discussion Requires discussion from core team before further action

Comments

@jbrockmendel
Copy link
Member

In most contexts, Series is strict about dtype, so will always either return the given dtype or raise. DataFrame is the opposite, often silently ignoring dtype (xref #24435) (i think on the theory that dtype may be intended to apply to some columns but not others).

With floating data and integer dtypes, its the opposite:

arr = np.random.randn(5)

>>> pd.Series(arr, dtype="int16")
0    1.002695
1    0.259332
2   -1.111468
3   -0.680714
4   -0.008943

>>> pd.DataFrame(arr, dtype="int16")
   0
0  1
1  0
2 -1
3  0
4  0

We have exactly one test that is broken if we change the latter behavior, and that is mostly by coincidence. There are other bugs (e.g. Series(bigints, dtype="int8") silently overflowing) that would be easier to fix if maintaining this behavior weren't a consideration.

Is this intentional? cc @jreback @jorisvandenbossche @TomAugspurger

@jorisvandenbossche
Copy link
Member

No idea what the reason would be if it was intentional.

So I would be fine with changing it. I am only thinking if there is some way to deprecate it (which doesn't seem easy without introducing an extra keyword just to manager the deprecation ..), since it is a breaking change (actaul numbers would change)

@jorisvandenbossche
Copy link
Member

It seems this changed behaviour before, as a long time ago we actually honored the dtype (I checked pd 0.20)

@jbrockmendel
Copy link
Member Author

xref #26919

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 10, 2021
@jreback jreback added this to the 1.3 milestone Mar 10, 2021
@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label May 18, 2021
@jbrockmendel
Copy link
Member Author

Tried changing this to only disallow float-int casting that included NaNs, but it turns out we explicitly test that Series([1, 2, 3.5], dtype=any_int_dtype) raises.

Is there reasoning/precedent for why we would treat list[float] differently from ndarray[float]? cc @jreback

@jreback
Copy link
Contributor

jreback commented May 20, 2021

only recollection is that we didn't want to silently cast floats to ints like numpy (but agreed this is a bit odd that we allow for array but not list)

@jbrockmendel
Copy link
Member Author

any preference for changing/deprecating one or the other?

@jreback
Copy link
Contributor

jreback commented May 20, 2021

yeah i think we should raise
i find numpy a bit odd here

@jbrockmendel
Copy link
Member Author

yeah i think we should raise

hmm not what i expected. in most cases we think of Series(foo, dtype=bar) as behaving like Series(foo).astype(bar), which wouldn't raise

@jbrockmendel
Copy link
Member Author

update: Index behaves like Series in this case

@jbrockmendel
Copy link
Member Author

This actually gets worse:

In [3]: arr = np.random.randn(5)

In [4]: pd.Series(arr, dtype="i8")
Out[4]: 
0   -1.329695
1    0.721636
2    0.781266
3   -0.413694
4   -0.125569
dtype: float64

In [5]: pd.Series(list(arr), dtype="i8")
ValueError: Trying to coerce float values to integers

The latter behavior we have a specific test for added in #21456

@pandas-dev/pandas-core thoughts on making this consistent?

@rhshadrach
Copy link
Member

Instead of raising, when the specified dtype could result in data loss (e.g. float -> int), does a warning make sense? That way it is visible to interactive users and effectively an error for a production job running with -Werror or in a test framework.

@jbrockmendel
Copy link
Member Author

yeah i think we should raise
i find numpy a bit odd here

@jreback for the most part we like to have Series(values, dtype=dtype) === Series(values).astype(dtype). If you're suggesting we raise on the constructor version, would you also want to disallow the astype? Or we could leave the astype as the "do this instead", but sacrifice the equivalency

@jreback
Copy link
Contributor

jreback commented May 31, 2021

yeah i think we should raise
i find numpy a bit odd here

@jreback for the most part we like to have Series(values, dtype=dtype) === Series(values).astype(dtype). If you're suggesting we raise on the constructor version, would you also want to disallow the astype? Or we could leave the astype as the "do this instead", but sacrifice the equivalency

I would still have this axiom hold. My point is that (and maybe this is a bigger change / deprecation), casting to ints (not extension) should fail if we have NaN (i think we don this now in all cases) or if these are not exactly the same. This would be 'safe' casting.

IOW the numpy behavior i view as really bad actully (if you want to round/floor/ceil, then that should be explicit).

So I think that both construction & astype should fail here.

@jbrockmendel
Copy link
Member Author

should fail if we have NaN

totally agreed. the case in question here is where we have e.g. np.array([1.5, 2.5]), i.e. no NaNs, and not round

@jreback
Copy link
Contributor

jreback commented May 31, 2021

should fail if we have NaN

totally agreed. the case in question here is where we have e.g. np.array([1.5, 2.5]), i.e. no NaNs, and not round

I think we should raise on this as well. This is a big departure from numpy, but IMHO totally unexpected to directly cast these to int should raise. (wether by constror dtype or via .astype)

@jbrockmendel
Copy link
Member Author

OK, ive got a branch almost ready that changes the Series constructor (though not DataFrame, which is a separate branch to ...). I'd be OK with deprecating instead of directly changing, too.

@jreback
Copy link
Contributor

jreback commented May 31, 2021

yeah i think have to deprecate

@TomAugspurger
Copy link
Contributor

I don't have a strong opinion here, but one question: What's the performance impact of verifying / validating that we should raise? Does this have to look at each value, or is it equivalent to ndarray.astype(casting="safe")?

@jbrockmendel
Copy link
Member Author

What's the performance impact of verifying / validating that we should raise? Does this have to look at each value, or is it equivalent to ndarray.astype(casting="safe")?

It is not equivalent to casting="safe", no. It goes through maybe_cast_to_integer_array which when passed an ndarray[floaty] boils down to

casted = arr.astype(dtype, copy=copy)
if np.array_equal(arr, casted):
     return casted

if not np.isfinite(arr).all():  # we do this check because our fallback behavior depends on it
    raise IntCastingNaNError(...)
raise ValueError(..)

I've opened numpy/numpy#19146 to discuss how to implement this (well, the uint analogue of this) more efficiently.

@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone Jun 8, 2021
@mroeschke mroeschke added Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action labels Aug 19, 2021
@jbrockmendel
Copy link
Member Author

closed by #45142 + #45136 + #41770

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Constructors Series/DataFrame/Index/pd.array Constructors Deprecate Functionality to remove in pandas Dtype Conversions Unexpected or buggy dtype conversions Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

7 participants