
Jupysql with autopolars crashes when schema cannot be inferred from the first 100 rows #312

Closed
jorisroovers opened this issue Mar 24, 2023 · 8 comments · Fixed by #325

@jorisroovers

By default, Polars infers the schema for a DataFrame column from the first 100 rows (see infer_schema_length) when a generator is passed. This leads to a problem in jupysql when SqlMagic.autopolars = True and the datatype for a column in the ResultSet cannot be correctly inferred from the first 100 rows.

Consider the notebook below that shows the issue.

# %%
# Setup jupysql
%load_ext sql
%config SqlMagic.autopolars = True
%sql duckdb:///:memory:

# %%
# Generate a sequence of 100 NULL values and append a string at the end. 
# The type of the column is inferred as `str`. 
# Polars infers the type correctly because we're passing a list (not a generator) in
# which case Polars will look at all rows to infer the schema.
import polars as pl
df = pl.DataFrame({"mycol": [None for i in range(100)] + ["foo"]})

# %%
# Try to query this dataframe with SQL. This will fail with the following error:
#
# ComputeError: Could not append Utf8("foo") to builder; make sure that all rows have the same schema.
# Or consider increasing the the 'schema_inference_length' argument.
%sql SELECT * FROM df

# %%
# If we patch the `PolarsDataFrame` method to set infer_schema_length to None (=infinity), it works

import sql
def patch_polars(resultset):
    return pl.DataFrame((tuple(row) for row in resultset), schema=resultset.keys, infer_schema_length=None)

sql.run.ResultSet.PolarsDataFrame = patch_polars

# %%
# Works!
%sql SELECT * FROM df

As noted, the reason this fails is that ResultSet is passed to Polars as a generator, in which case Polars only looks at the first 100 rows to infer the column type.

frame = pl.DataFrame((tuple(row) for row in self), schema=self.keys)

In the example above, the first 100 rows are NULL, in which case Polars infers its default type, i64. When it then encounters "foo", it errors because "foo" is clearly not an i64.
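
The same failure can be reproduced with Polars alone, outside jupysql. Here's a minimal sketch mirroring the pattern above (the column name and row layout are made up for illustration):

import polars as pl

# The row pattern ResultSet yields in the example: 100 NULL rows, then one string row.
rows = [(None,) for _ in range(100)] + [("foo",)]

# Mirrors the current jupysql call: only the first 100 rows are scanned, so the
# column type is guessed before "foo" is ever seen, and appending it fails.
# pl.DataFrame((row for row in rows), schema=["mycol"])  # raises ComputeError

# Scanning all rows (infer_schema_length=None) infers a string column and succeeds.
df = pl.DataFrame((row for row in rows), schema=["mycol"], infer_schema_length=None)
print(df.dtypes)  # expected: [Utf8]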

As shown in the example as well, the fix is to set infer_schema_length in the DataFrame constructor. Since this has a performance implication though, I believe it should ideally be exposed as a SqlMagic config option.

%config SqlMagic.polars_infer_schema_length = None
# Alternative
%config SqlMagic.polars_infer_schema_length = 1000

# More generally, allow users to pass any kwarg to the Polars DataFrame constructor. Not sure if this syntax works?
%config SqlMagic.polars_dataframe_kwargs = {"infer_schema_length": None}
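
For illustration, a rough sketch of how such a kwargs option could be threaded into the ResultSet-to-DataFrame conversion (the function and config attribute names here are assumptions, not the actual jupysql internals):

import polars as pl

def to_polars_dataframe(resultset, config, **kwargs):
    # Sketch only: start from the globally configured kwargs (e.g. from %config)
    # and let per-call kwargs override them.
    dataframe_kwargs = {**config.polars_dataframe_kwargs, **kwargs}
    return pl.DataFrame(
        (tuple(row) for row in resultset),
        schema=resultset.keys,
        **dataframe_kwargs,
    )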

I'm happy to implement this (next week probably) if you'd accept the change - let me know!

@idomic

idomic commented Mar 24, 2023

@jorisroovers thanks for reporting!
Which version are you on? I just saw a new release that breaks the CI.

Feel free to submit a PR and happy to guide you through!

@edublancas

Maybe a more straightforward solution is not to use a generator?

If we really need to add another config for Polars, I think wrapping them in a single config as you mentioned makes sense:

%config SqlMagic.polars_dataframe_kwargs = {"infer_schema_length": None}

@jorisroovers
Author

Which version are you on?

0.6.6

Maybe a more straightforward solution is not to use a generator?

You're right - just wrapping in a list works too:

frame = pl.DataFrame(list(tuple(row) for row in self), schema=self.keys)

That should be a tiny PR then (will include a test too) 👍

jorisroovers added a commit to jorisroovers/jupysql that referenced this issue Mar 27, 2023
Allows for passing of custom keyword arguments to the Polars DataFrame
constructor.

Fixes ploomber#312
@jorisroovers
Author

You're right - just wrapping in a list works too:

frame = pl.DataFrame(list(tuple(row) for row in self), schema=self.keys)

That should be a tiny PR then (will include a test too) 👍

I take this back, this doesn't actually work:

  1. Polars still crashes on this when passing a list of tuples. I did some digging in the Polars codebase, but it gets a bit complicated and I didn't want to spend too much time going down a rabbit hole. I think it's related to the fact that it's (immutable) tuples that are being passed.
  2. Even if this did work, I think it's inefficient since it converts a generator expression into a list that is kept entirely in memory, even if the user isn't accessing all rows.

Long story short, I went with the %config SqlMagic.polars_dataframe_kwargs option, see #325
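
With that option in place, the original notebook only needs the kwarg set once up front; something like this (using the config name proposed above):

# %%
%load_ext sql
%config SqlMagic.autopolars = True
%config SqlMagic.polars_dataframe_kwargs = {"infer_schema_length": None}
%sql duckdb:///:memory:

# %%
import polars as pl
df = pl.DataFrame({"mycol": [None for i in range(100)] + ["foo"]})

# %%
# No longer raises ComputeError: all rows are scanned when inferring the schema.
%sql SELECT * FROM df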

@edublancas

thanks @jorisroovers for taking a look at this! we'll review your PR!

@edublancas

Do we still have the problem if we do it like this?

frame = pl.DataFrame(list(list(row) for row in self), schema=self.keys)

I'm starting to think that adding the kwargs is overkill since most of the options are data-specific (e.g., the schema), so it's not very useful to put a global option at the top. If anything, maybe we can automatically pass a larger threshold to define a schema (say, 1k observations).

What is definitely useful is allowing to pass options to the constructor (as you already did in your PR):

results.PolarsDataFrame(a=1, b=2)

@jorisroovers
Author

Do we still have the problem if we do it like this?

frame = pl.DataFrame(list(list(row) for row in self), schema=self.keys)

Yes, same error :/

I'm starting to think that adding the kwargs is overkill since most of the options are data-specific (e.g., the schema), so it's not very useful to put a global option at the top.

Hmm, I can see that, yes. Although I suspect a lot of folks would also want to set infer_schema_length and nan_to_null globally.

If anything, maybe we can automatically pass a larger threshold to define a schema (say, 1k observations)

FWIW, I discovered this issue when reading in a CSV that had an unexpected string in a numeric column on row 3000-something. It would be really nice if we could skip an up-front cleaning step and just deal with this by setting infer_schema_length = None.
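
As a sketch of that scenario (the file name and query are made up; read_csv_auto is DuckDB's CSV reader), the global option would avoid the pre-cleaning step:

# %%
# A column that looks numeric for thousands of rows but contains one stray string
# would otherwise need cleaning before the conversion to Polars succeeds.
%config SqlMagic.polars_dataframe_kwargs = {"infer_schema_length": None}
%sql SELECT * FROM read_csv_auto('measurements.csv')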

What is definitely useful is allowing to pass options to the constructor (as you already did in your PR):

results.PolarsDataFrame(a=1, b=2)

I'm not entirely following this suggestion and how it's different from what I implemented in #325. Can you please elaborate? Thanks!

@edublancas

edublancas commented Mar 30, 2023

Ok, so it sounds like polars_dataframe_kwargs is the way to go! I'll review your PR now!
