Jupysql with autopolars crashes when schema cannot be inferred from the first 100 rows #312
Comments
@jorisroovers thanks for reporting! Feel free to submit a PR and happy to guide you through!
Maybe a more straightforward solution is not to use a generator? If we really need to add another config for polars, I'd suggest wrapping the options in a single config, as you mentioned: %config SqlMagic.polars_dataframe_kwargs = {"infer_schema_length": None}
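To make the single-config idea concrete, here is a minimal sketch of how a dict of keyword arguments could be merged with the existing schema argument and forwarded to a DataFrame constructor. The names `build_frame` and `fake_constructor` are hypothetical and exist only for illustration; this is not jupysql's actual code, and the stand-in constructor simply records what it receives so the example runs without Polars installed.

```python
from typing import Any, Callable, Iterable, Optional, Sequence


def build_frame(
    rows: Iterable[Sequence[Any]],
    keys: Sequence[str],
    constructor: Callable[..., Any],
    dataframe_kwargs: Optional[dict] = None,
) -> Any:
    """Forward user-configured kwargs to a DataFrame-like constructor.

    Hypothetical helper: `constructor` stands in for something like
    pl.DataFrame, and `dataframe_kwargs` for the proposed config value.
    """
    # Start from the argument that is always passed (the column names)...
    kwargs = {"schema": list(keys)}
    # ...then layer the user-supplied options on top, so they can override it.
    kwargs.update(dataframe_kwargs or {})
    # Materialize the rows so the constructor receives a list, not a generator.
    return constructor([tuple(row) for row in rows], **kwargs)


def fake_constructor(data, **kwargs):
    """Stand-in constructor that records its inputs (assumption, not Polars)."""
    return {"data": data, "kwargs": kwargs}


result = build_frame(
    rows=[(1, "a"), (2, "b")],
    keys=["id", "name"],
    constructor=fake_constructor,
    dataframe_kwargs={"infer_schema_length": None},
)
print(result["kwargs"])
```

In this sketch the user-supplied options are applied last, so they take precedence over the defaults. Whether that is desirable for `schema` is a design choice the thread does not settle.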
0.6.6
You're right - just wrapping in a list works too:
frame = pl.DataFrame(list(tuple(row) for row in self), schema=self.keys)
That should be a tiny PR then (will include a test too) 👍
Allows for passing of custom keyword arguments to the Polars DataFrame constructor. Fixes ploomber#312
I take this back, this doesn't actually work:
Long story short, I went with the |
Thanks @jorisroovers for taking a look at this! We'll review your PR!
Do we still have the problem if we do it like this?
frame = pl.DataFrame(list(list(row) for row in self), schema=self.keys)
I'm starting to think that adding the kwargs is overkill, since most of the options are data-specific (e.g., the schema), so it's not very useful to put a global option at the top. If anything, maybe we can automatically pass a larger threshold for inferring the schema (say, 1k observations). What is definitely useful is allowing options to be passed to the constructor (as you already did in your PR): results.PolarsDataFrame(a=1, b=2)
Yes, same error :/
Hmm, that I can see yes. Although I can see that a lot of folks would also set
FWIW, I discovered this issue when reading in a CSV that had an unexpected string in a numeric column on row 3000-something. It would be really nice if we didn't have to do a cleaning step up front to deal with this issue by setting
I'm not entirely following this suggestion and how it's different from what I implemented in #325. Can you please elaborate? Thanks!
Ok, so it sounds like |
By default, Polars infers the schema for a DataFrame column from the first 100 rows (see infer_schema_length) when a generator is passed. This leads to a problem with jupysql when SqlMagic.autopolars = True and the datatype for a column in the ResultSet cannot be correctly inferred from the first 100 rows.
Consider the notebook below that shows the issue.
As noted, this fails because ResultSet is a generator, in which case Polars will only look at the first 100 rows to infer the column type.
jupysql/src/sql/run.py
Line 190 in 65c99f4
In the example above, the first 100 rows are NULL, in which case Polars infers its default type i64. When it then encounters "foo", it errors because "foo" is clearly not an i64.
As shown in the example as well, the fix is to set infer_schema_length in the DataFrame constructor. Since this has a performance implication though, I believe this should ideally be exposed as a SqlMagic config option.
I'm happy to implement this (next week probably) if you'd accept the change - let me know!
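The failure mode described above can be illustrated with a toy model in plain Python. This is only a sketch of the general idea (infer a type from a limited sample, then apply it to every row); it is not Polars code and does not reproduce Polars's actual inference rules. The helpers `infer_type` and `build_column` are hypothetical, and `int` stands in for the default numeric type mentioned in the issue.

```python
def infer_type(column, infer_schema_length=100):
    """Pick a column type from a limited sample of values.

    Toy stand-in for schema inference: None means "scan every value".
    """
    sample = column if infer_schema_length is None else column[:infer_schema_length]
    seen = {type(value) for value in sample if value is not None}
    if not seen or seen <= {int}:
        # Only NULLs (or only ints) in the sample: fall back to the numeric default.
        return int
    return str


def build_column(column, infer_schema_length=100):
    """Infer a type from the sample, then coerce every value to it."""
    dtype = infer_type(column, infer_schema_length)
    # Coercion raises ValueError if a later value does not fit the inferred type.
    return dtype, [None if value is None else dtype(value) for value in column]


# 100 NULLs followed by a string, mirroring the situation in the issue.
column = [None] * 100 + ["foo"]

try:
    build_column(column)  # only the first 100 values are inspected
except ValueError as exc:
    print("inference from the first 100 rows failed:", exc)

dtype, values = build_column(column, infer_schema_length=None)  # full scan
print("full scan inferred:", dtype.__name__)
```

Scanning every value avoids the error at the cost of a full pass over the data, which is the performance trade-off raised in the issue.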