Consistency around the behavior of the `schema` argument across the API #11723

nameexhaustion · 2023-10-14T02:44:40Z

Description

In short, the schema argument behaves very differently across different parts of the API, and would likely benefit from some standardization.

Examples

Using the following data for examples:

a,b
1,2

# DataFrame is initialized using this dict:
dict(a=1, b=2)

The comparisons are done between DataFrame, scan_csv and read_csv (more examples can be found by looking at the related issues linked below).

Case when the schema length does not match the data:

schema = dict(b=pl.UInt32)
# DataFrame errors
ValueError: the given column-schema names do not match the data dictionary
# scan_csv loads as a partial overwrite:
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
# read_csv errors
exceptions.ComputeError: found more fields than defined in 'Schema'

Case when the schema length matches the data, contains the same keys but is in a different order:

schema = dict(b=pl.UInt32, a=pl.UInt64)
# DataFrame matches the schema to the data by name rather than by position,
# but confusingly re-orders the columns in the output
# to match the schema (note how it is now 2,1 instead of 1,2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 2   ┆ 1   │
└─────┴─────┘
# scan_csv does not respect the order of the schema
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u64 ┆ u32 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
# read_csv respects the order of the schema
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘

Case when the schema length matches the data, but contains different keys:

schema = dict(x=pl.UInt32, y=pl.UInt64)
# DataFrame errors
ValueError: the given column-schema names do not match the data dictionary
# scan_csv and read_csv both respect the order of the schema
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘

Reproducible scripts

Schema arg for DataFrame, read_csv and scan_csv

import polars as pl
from functools import partial

path_in = "env/input.csv"  # can change this; the directory must already exist


class funcs:
    DataFrame = partial(pl.DataFrame, dict(a=1, b=2))
    scan_csv = partial(pl.scan_csv, path_in)
    read_csv = partial(pl.read_csv, path_in)


funcs.DataFrame().write_csv(path_in)


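def run(f):
    """Call `f`, collect lazy results, and print the outcome or the error."""
    print(f"-- run {f=}")
    try:
        r = f()

        # scan_csv returns a LazyFrame; collect it so errors surface here.
        if hasattr(r, "collect"):
            r = r.collect()
    except Exception as e:
        r = repr(e)
    print(r)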


# Note the default type is Int64.
for schema in (
    dict(b=pl.UInt32),
    dict(b=pl.UInt32, a=pl.UInt64),
    dict(x=pl.UInt32, y=pl.UInt64),
):
    print(f"---- {schema=}")
    run(partial(funcs.DataFrame, schema=schema))
    run(partial(funcs.scan_csv, schema=schema))
    run(partial(funcs.read_csv, schema=schema))

print(f"{pl.__version__=}")

Outputs

---- schema={'b': UInt32}
-- run f=functools.partial(<class 'polars.dataframe.frame.DataFrame'>, {'a': 1, 'b': 2}, schema={'b': UInt32})
ValueError('the given column-schema names do not match the data dictionary')
-- run f=functools.partial(<function scan_csv at 0x7f6f842e2660>, 'env/input.csv', schema={'b': UInt32})
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
-- run f=functools.partial(<function read_csv at 0x7f6f842e16c0>, 'env/input.csv', schema={'b': UInt32})
ComputeError("found more fields than defined in 'Schema'\n\nConsider setting 'truncate_ragged_lines=True'.")
---- schema={'b': UInt32, 'a': UInt64}
-- run f=functools.partial(<class 'polars.dataframe.frame.DataFrame'>, {'a': 1, 'b': 2}, schema={'b': UInt32, 'a': UInt64})
shape: (1, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 2   ┆ 1   │
└─────┴─────┘
-- run f=functools.partial(<function scan_csv at 0x7f6f842e2660>, 'env/input.csv', schema={'b': UInt32, 'a': UInt64})
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u64 ┆ u32 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
-- run f=functools.partial(<function read_csv at 0x7f6f842e16c0>, 'env/input.csv', schema={'b': UInt32, 'a': UInt64})
shape: (1, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
---- schema={'x': UInt32, 'y': UInt64}
-- run f=functools.partial(<class 'polars.dataframe.frame.DataFrame'>, {'a': 1, 'b': 2}, schema={'x': UInt32, 'y': UInt64})
ValueError('the given column-schema names do not match the data dictionary')
-- run f=functools.partial(<function scan_csv at 0x7f6f842e2660>, 'env/input.csv', schema={'x': UInt32, 'y': UInt64})
shape: (1, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
-- run f=functools.partial(<function read_csv at 0x7f6f842e16c0>, 'env/input.csv', schema={'x': UInt32, 'y': UInt64})
shape: (1, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
pl.__version__='0.19.9'

Solution ideas

One way to resolve this could be to make the schema argument always overwrite the existing names in the data and, when specified, require it to match the length of the data (and contain no duplicates, as @stinodego suggests in #11632). For cases where only the dtypes of the columns (or of a subset of them) need to be defined, without overwriting the original ordering, the schema_overrides / dtypes arguments could be used instead.

But this probably needs some further discussion / review.
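
To make the proposed rule concrete, here is a minimal sketch that does not depend on Polars. The helper `apply_schema` is hypothetical (it is not a Polars function), dtypes are plain strings, and the schema is modelled as a list of pairs so that duplicates can be represented at all. It only illustrates the uniqueness and width checks followed by a positional overwrite of the names.

```python
def apply_schema(data_names, schema):
    """Model of the proposal: `schema` renames and types columns by position.

    `data_names` lists the column names found in the data and `schema` is a
    list of (name, dtype) pairs. Returns (old_name, new_name, dtype) triples.
    """
    new_names = [name for name, _ in schema]
    if len(set(new_names)) != len(new_names):
        raise ValueError("schema contains duplicate column names")
    if len(schema) != len(data_names):
        raise ValueError(
            f"schema has {len(schema)} columns but the data has {len(data_names)}"
        )
    # The original names play no part in the matching; only position does.
    return [(old, new, dtype) for old, (new, dtype) in zip(data_names, schema)]


# dtypes are plain strings here to keep the sketch free of dependencies.
print(apply_schema(["a", "b"], [("b", "UInt32"), ("a", "UInt64")]))
```

Under this rule the re-ordered schema from the examples would act as a positional rename, which is what read_csv does in the outputs above, and the one-key schema would be rejected everywhere.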

Related issues

stinodego · 2023-10-18T12:27:18Z

What I think the schema arguments should do:

If I pass a schema argument:

- The resulting DataFrame/LazyFrame should have the given schema. The order of the given schema should be respected.
- There should be no type inference. Thus, passing the argument should have a performance benefit.
- An error should be thrown if:
  - the data has named columns that do not match the schema keys (ignoring ordering).
  - the number of columns does not match the number of keys in the schema.
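
The two error conditions for schema can be sketched as a plain-Python check. This is only an illustration of the rules listed above, not Polars code: `resolve_schema` and its parameters are hypothetical names, and dtypes are represented as strings.

```python
def resolve_schema(schema, width, data_names=None):
    """Return the output schema, or raise if `schema` does not fit the data.

    `schema` maps column name -> dtype, `width` is the number of columns in
    the data, and `data_names` is given only when the data has named columns.
    """
    if len(schema) != width:
        raise ValueError(
            f"schema has {len(schema)} keys but the data has {width} columns"
        )
    if data_names is not None and set(data_names) != set(schema):
        raise ValueError("column names in the data do not match the schema keys")
    # No inference: dtypes and column order are taken from the schema as given.
    return dict(schema)


# Same keys in a different order: accepted, and the schema order wins.
print(list(resolve_schema({"b": "UInt32", "a": "UInt64"}, 2, ["a", "b"])))
```

With these checks, a schema holding only `b`, or one keyed by `x` and `y` against data named `a` and `b`, would raise instead of being applied partially or positionally.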

If I pass a schema_overrides argument:

- The columns specified should have the given data types. The order of the original data is respected.
- There should be no type inference on the columns for which we specify a data type. Thus, passing the argument should have a performance benefit.
- If schema is also passed, the data types in schema_overrides take precedence.
- ~~An error should be thrown if the keys are not present in the data.~~ Keys not present in the data or schema are ignored. See the discussion in "SchemaError: nonexistent column when created from sequence" (#15471).
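
How schema and schema_overrides would combine under these rules can be sketched in plain Python. The function `resolve_dtypes` is a hypothetical model rather than anything in Polars, `inferred` stands in for whatever type inference would produce, and dtypes are strings.

```python
def resolve_dtypes(inferred, schema=None, schema_overrides=None):
    """Model of the final dtypes when `schema` and `schema_overrides` combine.

    `inferred` maps each data column name to its inferred dtype, in data order.
    """
    # Start from `schema` when given (its order wins), else from the data.
    result = dict(schema) if schema is not None else dict(inferred)
    for name, dtype in (schema_overrides or {}).items():
        # Keys not present in the data or schema are ignored.
        if name in result:
            result[name] = dtype
    return result


inferred = {"a": "Int64", "b": "Int64"}
# Only `b` is overridden; `a` keeps its inferred dtype and the order is kept.
print(resolve_dtypes(inferred, schema_overrides={"b": "UInt32", "z": "Utf8"}))
```

The sketch leaves out the performance side of the proposal (skipping inference for overridden columns); it only shows precedence and ordering.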

@nameexhaustion could you add reproducible examples to your post? Then we can determine if there's any bugs here we should fix.

nameexhaustion · 2023-10-18T13:59:12Z

Added a script under reproducible scripts, for the examples in the post.

mcrumiller · 2024-06-08T11:52:26Z

@stinodego under the proposed syntax, how would one select only columns 4 and 2 of a CSV, give them new names, and specify their dtypes? Do you need to specify the entire schema of the CSV, and then use both the 'columns' and 'new_columns' parameters?

stinodego · 2024-06-08T17:53:07Z

There are a few ways for CSV files.

- If the CSV file has a header, you can specify schema_overrides on the original name, specify columns with those same names, and call .rename(...) on the result.
- If the CSV file has no header, you can specify schema_overrides by index (not sure if this is supported yet), and then specify columns by index as well.
- If the CSV file has no header, you can specify new_columns to give new names to all the columns, and then specify schema_overrides to give a dtype to specific columns, and then specify columns to only select those.

nameexhaustion added the enhancement label Oct 14, 2023

nameexhaustion mentioned this issue Oct 14, 2023

fix(python): require all schema arguments to be a unique set of equal width to the data, and hard error otherwise #11643

Closed

stinodego added the accepted, bug, P-medium and python labels and removed the enhancement label Jan 10, 2024

stinodego mentioned this issue Jan 12, 2024

schema_overrides should raise an error on non-existing keys #11972

Open

2 tasks

stinodego removed the accepted label Jan 12, 2024

stinodego self-assigned this Jan 13, 2024

stinodego added the A-input-parsing label Jan 23, 2024

CanglongCl mentioned this issue Apr 2, 2024

Proposal: Re-design columns, new_columns, schema, dtypes in read_csv #15431

Closed

stinodego added the reference label Apr 3, 2024

This was referenced Apr 6, 2024

Series constructor logic overhaul #14427

Closed

SchemaError: nonexistent column when created from sequence #15471

Closed

cmdlineluser mentioned this issue Apr 23, 2024

CSV parsing: ComputeError #15854

Open

2 tasks

stinodego removed their assignment May 26, 2024

stinodego mentioned this issue Jun 8, 2024

read_csv: dtypes not working and very confusing #14385

Open

2 tasks

cmdlineluser mentioned this issue Sep 19, 2024

Schema assumes the column order in the data when reading a CSV #18821

Closed

2 tasks
