Series constructor logic overhaul #14427
Looks like a very sensible cleanup...
For context, NumPy, pandas and pyarrow all automatically upcast to float64 when there's a mix of integers and floats:

```python
>>> np.array([1, 2, 3.])
array([1., 2., 3.])
>>> pd.Series([1, 2, 3.])
0    1.0
1    2.0
2    3.0
dtype: float64
>>> pa.array([1, 2, 3.])
<pyarrow.lib.DoubleArray object at 0x7f8758fc4940>
[
  1,
  2,
  3
]
```

whereas NumPy and pandas auto-cast strings to integers when setting an integer dtype explicitly, while pyarrow refuses:

```python
>>> np.array(['1', '2', '3'], dtype=int)
array([1, 2, 3])
>>> pd.Series(['1', '2', '3'], dtype=int)
0    1
1    2
2    3
dtype: int64
>>> pa.array(['1', '2', '3'], pa.int64())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to int64
```

Similarly, NumPy and pandas auto-upcast to object when the input mixes types, while pyarrow raises:

```python
>>> np.array([1.0, 2, 'a', date(2022, 1, 1), True, False])
array([1.0, 2, 'a', datetime.date(2022, 1, 1), True, False], dtype=object)
>>> pd.Series([1.0, 2, 'a', date(2022, 1, 1), True, False])
0           1.0
1             2
2             a
3    2022-01-01
4          True
5         False
dtype: object
>>> pa.array([1.0, 2, 'a', date(2022, 1, 1), True, False])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'a' with type str: tried to convert to double
```

None of the three libraries have a `strict` mode.
I had a chance to discuss this with a couple of my students today. We all agreed that silently setting data to null is dangerous. We ultimately converged on the following proposal:
```python
pl.Series([1, 2, 3])                        # works
pl.Series([1, 2, 3.0])                      # fails, inferred Int64 type, encountered float
pl.Series([1.0, 2, 3])                      # fails, inferred Float64 type, encountered integer
pl.Series([1, 2, 3.0], dtype=pl.Float64)    # works, Float64
pl.Series([1, 2, 3.0], dtype=pl.Int64)      # works, Int64
pl.Series([1, 2, 3.5], dtype=pl.Int64)      # works, Int64 (casts 3.5 to 3)
pl.Series(['1', '2', '3'])                  # works, String
pl.Series(['1', '2', '3'], dtype=pl.Int64)  # works, Int64
pl.Series([1.0, 2, 'a'])                    # fails
pl.Series([1.0, 2, 'a'], dtype=pl.String)   # works, String
pl.Series([1.0, 2, 'a'], dtype=pl.Object)   # works, Object
pl.Series([date(2022, 1, 1), True, False])  # fails
pl.Series([date(2022, 1, 1), True, False], dtype=pl.String)  # works, String
pl.Series([date(2022, 1, 1), True, False], dtype=pl.Object)  # works, Object
pl.Series([1, 2, '3', 'x'], dtype=pl.Int32)   # fails: 'x' can't be cast to Int32
pl.Series([1, 2, '3', 'x'], dtype=pl.String)  # works, String
pl.Series([1, 2, '3', 'x'], dtype=pl.Object)  # works, Object
```

If you did want to allow the auto-nulling behavior, you could have a dedicated flag for it.
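The rules above can be summarized in a small pure-Python sketch (names and structure are mine, not the polars API): infer the type from the first non-null element and raise on any mismatch, unless a target dtype is given, in which case cast every element and raise only when a cast is impossible.

```python
def construct(values, dtype=None):
    """Illustrative sketch of the proposed constructor rules."""
    if dtype is not None:
        # Dtype given: cast everything, raise only on impossible casts.
        out = []
        for v in values:
            try:
                out.append(dtype(v))  # e.g. int('3') -> 3, int(3.5) -> 3
            except (TypeError, ValueError):
                raise TypeError(f"{v!r} can't be cast to {dtype.__name__}")
        return out
    # No dtype: infer from the first non-null value, then be strict.
    inferred = next((type(v) for v in values if v is not None), None)
    for v in values:
        if v is not None and type(v) is not inferred:
            raise TypeError(
                f"inferred {inferred.__name__}, encountered {type(v).__name__}"
            )
    return list(values)
```

Under these rules, `construct([1, 2, 3.5], dtype=int)` succeeds with `[1, 2, 3]`, while `construct([1, 2, 3.0])` raises because a float appears after an inferred integer type.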
The problem is that this is very slow. Every value is passed to the Rust side, then we have to figure out which data type it is, then we have to try to cast it, and then check if that worked correctly and potentially raise. A strict mode allows for a much faster way to parse the data, as we can make certain assumptions (all data is the same type / all data matches the given dtype).
I don't think this is correct, but I could be missing something. You still have to check that all data matches the given dtype, even in strict mode, so that you know whether to raise an error. So allowing a cast on mismatch adds no work in the common case where all the data already matches.

In other words, there are no cases that would be made faster by having a strict mode; it just forces people to specify `strict=False` to opt out.

The problem with your proposal is that there's no way to say "I want to allow mixed types, but please still raise an error if it's impossible to cast an element to my desired type". If the user wants to allow mixed types, they're forced to also allow dangerous auto-nulling. In practice, to handle mixed types safely, you would either have to cast ahead of time in Python and use strict construction, or construct non-strictly and then check the result for unexpected nulls.
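The cost argument can be made concrete with a toy sketch (pure Python, not the actual implementation): both the strict path and the casting path inspect every element exactly once, and the cast only runs on the mismatching elements, so well-typed input does no extra work either way.

```python
def build_strict(values, dtype):
    # Strict: one type check per element; raise on the first mismatch.
    for v in values:
        if not isinstance(v, dtype):
            raise TypeError(f"expected {dtype.__name__}, got {type(v).__name__}")
    return list(values)

def build_casting(values, dtype):
    # Casting: the same one check per element; cast only on mismatch.
    # On fully well-typed input this does the same work as build_strict.
    return [v if isinstance(v, dtype) else dtype(v) for v in values]
```
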
Perhaps we could make your proposed design performant. But the problem is that your dtype parameter does two jobs at once: it states the target type and it grants permission to cast.

Let's say I want to create a Series of a known type and have construction fail if the data isn't already of that type. Using your design, this is not possible. Specifying the dtype silently casts mismatched values, and omitting it means I can't state the type I expect.
You make a compelling point. It feels like these three all need to be separate arguments:

1. the target dtype,
2. whether values of other types may be cast to that dtype,
3. whether values that can't be cast become null or raise an error.

My proposal mixes 1 and 2, which you correctly point out has disadvantages. Your proposal mixes 2 and 3, which has different disadvantages.
There is another important point here that should not be missed, despite the theoretical optimizations proposed by the students. Polars has in the last two years become a huge production engineering bastion, used by tons of enterprises in their data science and data engineering processes. So there is an important element called backward compatibility that should take priority over theoretical dissertations, up to a point: for any breaking change, the benefit of a very good theoretical change needs to be weighed against the cost of breaking code with no backward compatibility.

And for every breaking change, all the cases the change affects need to be analyzed, and alternatives offered at the engineering level, in order to not break production workloads without an alternative. Tools are tools, and when they are really good tools (like polars), plenty of use cases are built on top of them, especially in libraries that involve data engineering. The number of such use cases is vast, so breaking changes need to take care of all related use cases, trying not to miss any, or to offer engineering-level alternatives when something is missed.
This is off-topic. You can read up on our policy for breaking changes in the user guide. The functionality of non-strict construction will remain available.
Are there any examples of Rust API calls to `from_any_values_and_dtype`? (It seems that Rust takes a lower priority than the Python bindings when it comes to fleshing out documentation, which is somewhat ironic.)
There have been a host of issues about the constructors of Series/DataFrame not working as expected. I recently sat down with Ritchie to think out the way it should work. We'll start with `Series` and address `DataFrame` later (they are related, of course).

Intended behavior
There are four modes of construction: strict or non-strict, each either with a user-specified dtype or with the dtype inferred from the data.
Implementation

Rust

On the Rust side, these constructors are represented by two methods:

- `from_any_values_and_dtype`
- `from_any_values`

Both have a `strict` parameter. These methods will follow the logic outlined above.
Python

The Python side will not directly dispatch to the two Rust methods outlined above, because interpreting the full input as `AnyValue` is expensive. They will function as a fallback for when fast paths do not work.

There are a few constructors defined in the Rust bindings. These constructors are for a specific dtype and will extract the Python object directly into that dtype. This may fail, hence the need for a fallback.
During strict construction, these fast constructors are used. The data type is determined either by the dtype input of the user, or by the first non-null entry in the data. If the constructor fails due to unexpected data, it raises an error.
During non-strict construction, we also attempt to use these fast constructors, using the same dtype inference. However, if an error occurs due to unexpected data, we fall back to the appropriate Rust constructor. This will be relatively slow, but this can be avoided by sanitizing your data beforehand and using strict construction instead.
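The fast-path-plus-fallback strategy described above can be sketched in pure Python (the function names are illustrative, not the actual bindings): try a specialized extraction for the expected dtype first, and only fall back to slow per-element interpretation when that fails and construction is non-strict.

```python
def fast_extract(values, dtype):
    # Fast path: assume every element already has the expected type.
    # Mimics a specialized constructor that errors on unexpected data.
    out = []
    for v in values:
        if type(v) is not dtype:
            raise TypeError("unexpected data")
        out.append(v)
    return out

def elementwise_extract(values, dtype):
    # Slow path: interpret and cast each element individually,
    # in the spirit of the AnyValue-based Rust constructors.
    return [dtype(v) for v in values]

def construct_series(values, dtype, strict=True):
    try:
        return fast_extract(values, dtype)        # always try the fast path
    except TypeError:
        if strict:
            raise                                 # strict: surface the error
        return elementwise_extract(values, dtype)  # non-strict: fall back
```

As in the design above, the slow path only ever runs for non-strict construction on data that defeated the fast path; sanitizing data beforehand and using strict construction avoids it entirely.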
Examples
*** Large integers may lose precision when interpreted as floats, so strict mode does not allow this.
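The precision concern in the footnote can be demonstrated in plain Python: float64 has a 53-bit mantissa, so not every integer above 2**53 is representable exactly.

```python
big = 2**53 + 1          # 9007199254740993
as_float = float(big)    # rounds to the nearest representable float64
print(as_float == big)   # False: the value silently changed
print(int(as_float))     # 9007199254740992
```

This is why strict mode refuses to interpret large integers as floats rather than risk a silent change of value.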
Pros and cons
Pro:
Con: