feat(python): Expose infer_schema_length parameter on read_database #15076

Merged
merged 2 commits into from
Mar 15, 2024

alexander-beedie commented Mar 14, 2024

Closes #15059.

When neither an Arrow-aware driver nor the "schema_overrides" parameter is in use, and a column starts with more than 100 null values, dtype inference needs to look further into the data. This PR exposes the "infer_schema_length" parameter to allow for more generous dtype inference. (The other options are preferred, but this parameter still needs to be available for cases where they cannot be used.)

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Mar 14, 2024
@stinodego stinodego left a comment

Seems fair enough. Just for my understanding though: why does read_database need to do schema inference at all? Don't databases have a strict schema that we can use? I guess we cannot account for all possible third party data types so we look at the data instead?

Review thread on py-polars/polars/io/database.py (outdated, resolved)

alexander-beedie commented Mar 15, 2024

> I guess we cannot account for all possible third party data types so we look at the data instead?

The query result's cursor "description" property has a type_code entry, but it's an absolute free-for-all as to what you find in it. Could be a string name, an integer code known to the backend, a python-native class, a driver-specific enum, etc. Translating it often requires deep backend/driver-specific knowledge.

I do have a "TODO" to improve this on our side, but it's non-trivial. I know this well because I have written code that does exactly this at work; it's not simple to do the same for Polars, as what I have in place there is an extensive dtype-translation architecture running to thousands of lines of code by itself ;)

(Also, there are backends like SQLite that don't populate the cursor type_code entry at all, so those will always require introspection. The result of a query may bear no resemblance to the schema of the underlying tables, so knowing the table schemas typically won't help much unless it's a "SELECT *".)
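
To illustrate the SQLite point, here is a small sketch using only the standard-library sqlite3 module. The table and query are hypothetical; the computed column `total` does not exist in any table schema, and the cursor's description carries no type information for it.

```python
import sqlite3

# Hypothetical table, used only to illustrate the point.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price REAL, qty INTEGER)")
conn.execute("INSERT INTO items VALUES ('widget', 2.5, 4)")

# "total" is computed by the query and appears in no table schema.
cur = conn.execute("SELECT name, price * qty AS total FROM items")

# DB-API description entries are 7-item tuples: (name, type_code, ...).
# sqlite3 fills in only the name; the other six items are None.
columns = [(d[0], d[1]) for d in cur.description]
print(columns)

# The only type information left is in the returned values themselves.
row = cur.fetchone()
value_types = [type(v).__name__ for v in row]
print(value_types)
conn.close()
```

With no type_code to go on, a reader such as read_database has to inspect the returned rows, which is where "infer_schema_length" comes in.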

@stinodego stinodego left a comment

Thanks for the explanation!

@stinodego stinodego changed the title feat(python): expose "infer_schema_length" param on read_database feat(python): Expose infer_schema_length parameter on read_database Mar 15, 2024
@stinodego stinodego merged commit 0abbe5c into pola-rs:main Mar 15, 2024
13 checks passed
@alexander-beedie alexander-beedie deleted the read-database-infer-schema branch March 15, 2024 10:32
@alexander-beedie alexander-beedie added the A-io-database Area: reading/writing to databases label Apr 22, 2024
Labels
A-io-database Area: reading/writing to databases enhancement New feature or an improvement of an existing feature python Related to Python Polars
Successfully merging this pull request may close these issues.

pl.read_database() error suggests increasing infer_schema_length - but no such option exists