
Jupysql with autopolars crashes when schema cannot be inferred from the first 100 rows #312

Closed
jorisroovers opened this issue Mar 24, 2023 · 8 comments · Fixed by #325

@jorisroovers

By default, Polars infers the schema for a DataFrame column from the first 100 rows (see infer_schema_length) when a generator is passed. This leads to a problem in jupysql when SqlMagic.autopolars = True and the datatype for a column in the ResultSet cannot be correctly inferred from the first 100 rows.

Consider the notebook below that shows the issue.

# %%
# Setup jupysql
%load_ext sql
%config SqlMagic.autopolars = True
%sql duckdb:///:memory:

# %%
# Generate a sequence of 100 NULL values and append a string at the end. 
# The type of the column is inferred as `str`. 
# Polars infers the type correctly because we're passing a list (not a generator) in
# which case Polars will look at all rows to infer the schema.
import polars as pl
df = pl.DataFrame({"mycol": [None for i in range(100)] + ["foo"]})

# %%
# Try to query this dataframe with SQL. This will fail with the following error:
#
# ComputeError: Could not append Utf8("foo") to builder; make sure that all rows have the same schema.
# Or consider increasing the the 'schema_inference_length' argument.
%sql SELECT * FROM df

# %%
# If we patch the `PolarsDataFrame` method to set infer_schema_length to None (=infinity), it works

import sql
def patch_polars(resultset):
    return pl.DataFrame((tuple(row) for row in resultset), schema=resultset.keys, infer_schema_length=None)

sql.run.ResultSet.PolarsDataFrame = patch_polars

# %%
# Works!
%sql SELECT * FROM df

As noted, the reason this fails is that ResultSet is passed to Polars as a generator, in which case Polars only looks at the first 100 rows to infer the column type.

frame = pl.DataFrame((tuple(row) for row in self), schema=self.keys)

In the example above, the first 100 rows are NULL, in which case Polars infers its default type, i64. When it then encounters "foo", it errors because "foo" is clearly not an i64.
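
The same failure can be reproduced with Polars alone, outside jupysql. Here's a minimal sketch mirroring the pattern above (the column name and row layout are made up for illustration):

import polars as pl

# The row pattern ResultSet yields in the example: 100 NULL rows, then one string row.
rows = [(None,) for _ in range(100)] + [("foo",)]

# Mirrors the current jupysql call: only the first 100 rows are scanned, so the
# column type is guessed before "foo" is ever seen, and appending it fails.
# pl.DataFrame((row for row in rows), schema=["mycol"])  # raises ComputeError

# Scanning all rows (infer_schema_length=None) infers a string column and succeeds.
df = pl.DataFrame((row for row in rows), schema=["mycol"], infer_schema_length=None)
print(df.dtypes)  # expected: [Utf8]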

As shown in the example as well, the fix is to set infer_schema_length in the DataFrame constructor. Since this has a performance implication though, I believe it should ideally be exposed as a SqlMagic config option.

%config SqlMagic.polars_infer_schema_length = None
# Alternative
%config SqlMagic.polars_infer_schema_length = 1000

# More generally, allow users to pass any kwarg to the Polars DataFrame constructor. Not sure if this syntax works?
%config SqlMagic.polars_dataframe_kwargs = {"infer_schema_length": None}
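
For illustration, a rough sketch of how such a kwargs option could be threaded into the ResultSet-to-DataFrame conversion (the function and config attribute names here are assumptions, not the actual jupysql internals):

import polars as pl

def to_polars_dataframe(resultset, config, **kwargs):
    # Sketch only: start from the globally configured kwargs (e.g. from %config)
    # and let per-call kwargs override them.
    dataframe_kwargs = {**config.polars_dataframe_kwargs, **kwargs}
    return pl.DataFrame(
        (tuple(row) for row in resultset),
        schema=resultset.keys,
        **dataframe_kwargs,
    )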

I'm happy to implement this (next week probably) if you'd accept the change - let me know!

@idomic

idomic commented Mar 24, 2023

@jorisroovers thanks for reporting!
Which version are you on? I just saw a new release that breaks the CI.

Feel free to submit a PR and happy to guide you through!

@edublancas

Maybe a more straightforward solution is not to use a generator?

If we really need to add another config for Polars, I think wrapping them in a single config as you mentioned makes sense:

%config SqlMagic.polars_dataframe_kwargs = {"infer_schema_length": None}

@jorisroovers
Author

Which version are you on?

0.6.6

Maybe a more straightforward solution is not to use a generator?

You're right - just wrapping in a list works too:

frame = pl.DataFrame(list(tuple(row) for row in self), schema=self.keys)

That should be a tiny PR then (will include a test too) 👍

jorisroovers added a commit to jorisroovers/jupysql that referenced this issue Mar 27, 2023
Allows for passing of custom keyword arguments to the Polars DataFrame
constructor.

Fixes ploomber#312
@jorisroovers
Author

You're right - just wrapping in a list works too:

frame = pl.DataFrame(list(tuple(row) for row in self), schema=self.keys)

That should be a tiny PR then (will include a test too) 👍

I take this back, this doesn't actually work:

  1. Polars still crashes on this when passing a list of tuples. I did some digging in the Polars codebase, but it gets a bit complicated and I didn't want to spend too much time going down a rabbit hole. I think it's related to the fact that it's (immutable) tuples that are being passed.
  2. Even if this did work, I think it's inefficient since it converts a generator expression into a list that is kept entirely in memory, even if the user isn't accessing all rows.

Long story short, I went with the %config SqlMagic.polars_dataframe_kwargs option, see #325
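
With that option in place, the original notebook only needs the kwarg set once up front; something like this (using the config name proposed above):

# %%
%load_ext sql
%config SqlMagic.autopolars = True
%config SqlMagic.polars_dataframe_kwargs = {"infer_schema_length": None}
%sql duckdb:///:memory:

# %%
import polars as pl
df = pl.DataFrame({"mycol": [None for i in range(100)] + ["foo"]})

# %%
# No longer raises ComputeError: all rows are scanned when inferring the schema.
%sql SELECT * FROM df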

@edublancas

thanks @jorisroovers for taking a look at this! we'll review your PR!

@edublancas

Do we still have the problem if we do it like this?

frame = pl.DataFrame(list(list(row) for row in self), schema=self.keys)

I'm starting to think that adding the kwargs is overkill since most of the options are data-specific (e.g., the schema), so it's not very useful to put a global option at the top. If anything, maybe we can automatically pass a larger threshold to define a schema (say, 1k observations).

What is definitely useful is allowing to pass options to the constructor (as you already did in your PR):

results.PolarsDataFrame(a=1, b=2)

@jorisroovers
Author

Do we still have the problem if we do it like this?

frame = pl.DataFrame(list(list(row) for row in self), schema=self.keys)

Yes, same error :/

I'm starting to think that adding the kwargs is overkill since most of the options are data-specific (e.g., the schema), so it's not very useful to put a global option at the top.

Hmm, I can see that, yes. Although I suspect a lot of folks would also want to set infer_schema_length and nan_to_null globally.

If anything, maybe we can automatically pass a larger threshold to define a schema (say, 1k observations)

FWIW, I discovered this issue when reading in a CSV that had an unexpected string in a numeric column on row 3000-something. It would be really nice if we could skip an up-front cleaning step and just deal with this by setting infer_schema_length = None.
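
As a sketch of that scenario (the file name and query are made up; read_csv_auto is DuckDB's CSV reader), the global option would avoid the pre-cleaning step:

# %%
# A column that looks numeric for thousands of rows but contains one stray string
# would otherwise need cleaning before the conversion to Polars succeeds.
%config SqlMagic.polars_dataframe_kwargs = {"infer_schema_length": None}
%sql SELECT * FROM read_csv_auto('measurements.csv')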

What is definitely useful is allowing to pass options to the constructor (as you already did in your PR):

results.PolarsDataFrame(a=1, b=2)

I'm not entirely following this suggestion and how it's different from what I implemented in #325. Can you please elaborate? Thanks!

@edublancas

edublancas commented Mar 30, 2023

Ok, so it sounds like polars_dataframe_kwargs is the way to go! I'll review your PR now!
