Raise an error when constructing a Series or DataFrame with mixed types (e.g. string + number) #11156

Closed
Wainberg opened this issue Sep 16, 2023 · 6 comments
Labels: A-input-parsing (Area: parsing input arguments), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature), P-medium (Priority: medium)

Comments

@Wainberg
Contributor

Wainberg commented Sep 16, 2023

Description

I recently found a bug in my own code where I constructed a DataFrame with a mix of integers and strings, and the integers got set to null. Here's a simple illustration:

>>> pl.Series([1, '2'])
shape: (2,)
Series: '' [str]
[
        null
        "2"
]
>>> pl.DataFrame([1, '2'])
 column_0
 null
 2
shape: (2, 1)

Three other options here are to 1) convert everything to dtype=object (pandas's solution, but highly inefficient), 2) automatically upcast everything to a string, and 3) raise an error. I'm a big fan of raising an error here and letting the user decide whether they want to convert the integers to strings, set them to null, or take some other action.
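To make option 3 concrete, here is a minimal pure-Python sketch of a pre-construction check that raises on mixed types instead of nulling values. The helper name `require_single_type` is hypothetical and is not part of polars; it only illustrates the "raise and let the user decide" behavior proposed above.

```python
def require_single_type(values):
    """Return the one Python type shared by all non-None values, or raise.

    Hypothetical helper for illustration only; not a polars API.
    """
    # None is treated as a missing value and does not count as a type.
    # Sorting by name keeps the error message deterministic.
    seen = sorted(
        {type(v) for v in values if v is not None},
        key=lambda t: t.__name__,
    )
    if len(seen) > 1:
        names = ", ".join(t.__name__ for t in seen)
        raise TypeError(f"mixed types in input: {names}")
    return seen[0] if seen else None


data = [1, "2"]
try:
    require_single_type(data)
except TypeError as exc:
    print(exc)  # mixed types in input: int, str

# The caller then chooses a conversion explicitly, here to strings.
as_strings = [str(v) for v in data]
assert require_single_type(as_strings) is str
```

With a check like this, the decision to stringify, null out, or drop the offending values stays in user code rather than happening silently.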

One of the beautiful things about polars is that it makes it much harder to accidentally introduce missing values than pandas, where pretty much every operation does an implicit outer join! Avoiding implicit conversions to null during Series/DataFrame construction would further reduce the potential for missing value-related bugs.

Edit: this also happens here:

>>> pl.Series([1, 2, 3], dtype=pl.String)
shape: (3,)
Series: '' [str]
[
        null
        null
        null
]

pandas converts to string in this situation:

>>> pd.Series([1, 2, 3], dtype=str)[0]
'1'

Wainberg added the enhancement label on Sep 16, 2023
@orlp
Collaborator

orlp commented Sep 18, 2023

I think this is very similar to #11009.

We really should do a pass on the Python -> Polars parsing to make it more restrictive by default, instead of silently casting/nulling/truncating values.
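As a rough illustration of "restrictive by default", the pure-Python sketch below raises when a value does not match the requested type and only produces nulls when the caller opts in. The function `convert` and its `strict` flag are assumptions made for this example; they do not describe the actual polars parsing code.

```python
def convert(values, target, strict=True):
    """Keep values that are already `target`; raise or null out the rest.

    Hypothetical sketch, not the polars implementation.
    """
    out = []
    for i, v in enumerate(values):
        if v is None or isinstance(v, target):
            out.append(v)
        elif strict:
            # Default: fail loudly instead of silently dropping data.
            raise TypeError(
                f"value {v!r} at index {i} is {type(v).__name__}, "
                f"expected {target.__name__}"
            )
        else:
            # Opt-in lenient mode: mismatches become missing values.
            out.append(None)
    return out


try:
    convert([1, 2, 3], str)
except TypeError as exc:
    print(exc)  # value 1 at index 0 is int, expected str

# Nulling is still available, but only on request.
print(convert([1, 2, 3], str, strict=False))  # [None, None, None]
```

The lenient call reproduces the all-null result reported in this issue, while the default path surfaces the mismatch at construction time.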

@Wainberg
Contributor Author

@stinodego thoughts on polars's behavior of auto-converting pl.Series([1, '2']) to pl.Series([None, '2'])? I'd argue this should be an error.

@stinodego
Member

It should either raise or cast to string, not sure which.

@orlp
Collaborator

orlp commented Jan 9, 2024

@stinodego I would be in favour of raising an error.

Wainberg changed the title from "Should constructing a DataFrame with mixed types (string + number) produce nulls?" to "Raise an error when constructing a Series or DataFrame with mixed types (e.g. string + number)" on Jan 9, 2024
@Wainberg
Contributor Author

Wainberg commented Jan 9, 2024

I'm also in favor of raising an error. If the developers are in agreement, could you accept this issue?

@stinodego
Member

Closing in favor of #14427

stinodego closed this as not planned on Feb 12, 2024