Consistency around the behavior of the schema argument across the API
#11723
Comments
What I think the schema arguments should do: If I pass a
If I pass a
@nameexhaustion could you add reproducible examples to your post? Then we can determine whether there are any bugs here that we should fix.
Added a script under "Reproducible scripts" for the examples in the post.
@stinodego under the proposed syntax, how would one select only columns 4 and 2 of a CSV, give them new names, and specify their dtypes? Do you need to specify the entire schema of the CSV, and then use both the 'columns' and 'new_columns' parameters?
There are a few ways to do this for CSV files.
Description
In short, the schema argument has wildly different behaviors across different parts of the API, and would likely benefit from some standardization.
Examples
Using the following data for examples:
The comparisons are done between DataFrame, scan_csv and read_csv (more examples can be found by looking at the related issues linked below).
Case when the schema length does not match the data:
Case when the schema length matches the data, contains the same keys but is in a different order:
Case when the schema length matches the data, but contains different keys:
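For reference, the three cases above can be told apart mechanically. The following is a plain-Python sketch that makes no polars calls; `classify_schema_case` and the column names are hypothetical, and it only labels the case. It says nothing about what DataFrame, read_csv or scan_csv actually do in each one.

```python
def classify_schema_case(data_columns, schema_keys):
    """Label how a schema's keys relate to the column names found in the data.

    Hypothetical helper for discussion only; not part of any library API.
    """
    data_columns = list(data_columns)
    schema_keys = list(schema_keys)
    if len(schema_keys) != len(data_columns):
        return "length mismatch"
    if schema_keys == data_columns:
        return "exact match"
    if set(schema_keys) == set(data_columns):
        return "same keys, different order"
    return "different keys"


# Made-up column names, one call per case from the list above.
data_columns = ["a", "b", "c"]
for schema_keys in (["a", "b"], ["c", "a", "b"], ["x", "y", "z"]):
    print(schema_keys, "->", classify_schema_case(data_columns, schema_keys))
```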
Reproducible scripts
Schema arg for DataFrame, read_csv and scan_csv
outputs
Solution ideas
One way to resolve this could be to make the schema argument always overwrite the existing names in the data, and to require that, when specified, it matches the length of the data (and contains no duplicates, as @stinodego suggests in #11632). For cases where only the dtypes of (potentially a subset of) the columns need to be defined, without overwriting the original ordering, the schema_overrides / dtypes arguments could be used instead.
But this probably needs some further discussion / review.
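To make the idea concrete, here is a rough plain-Python sketch of the proposed rules. It is not polars code: `apply_schema` and `apply_schema_overrides` are hypothetical helpers, dtypes are placeholder strings, and rejecting unknown names in the overrides is my own assumption rather than something settled in this issue.

```python
def apply_schema(inferred, schema):
    """Proposed `schema` rule: rename positionally, same length, no duplicates.

    `inferred` and `schema` are lists of (name, dtype) pairs in column order.
    """
    names = [name for name, _ in schema]
    if len(schema) != len(inferred):
        raise ValueError(
            f"schema has {len(schema)} entries but the data has {len(inferred)} columns"
        )
    if len(set(names)) != len(names):
        raise ValueError("schema contains duplicate column names")
    # The original names are discarded; the schema fully replaces them.
    return dict(schema)


def apply_schema_overrides(inferred, overrides):
    """Proposed `schema_overrides` rule: change dtypes only, keep names and order.

    `overrides` maps a subset of the existing column names to new dtypes.
    """
    known = {name for name, _ in inferred}
    unknown = [name for name in overrides if name not in known]
    if unknown:
        # Assumption: unknown names are an error rather than silently ignored.
        raise KeyError(f"columns not found in data: {unknown}")
    return {name: overrides.get(name, dtype) for name, dtype in inferred}


inferred = [("a", "Int64"), ("b", "Utf8")]
print(apply_schema(inferred, [("x", "Float64"), ("y", "Utf8")]))
print(apply_schema_overrides(inferred, {"b": "Categorical"}))
```

With this split, schema is the only argument that can rename columns, and schema_overrides can never change names or ordering.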
Related issues
from_arrow handles empty/duplicate column names badly #11632