feat(python): improved dtype inference/refinement for read_database results
#15126
Ref: #15076 (comment)
Additional layer of `dtype` inference for query results that do not return Arrow data directly, and that do not have an explicit "schema_overrides" entry.

This update adds support for inferring more accurate dtypes from the two simplest flavours of cursor "type_code"[1] description, specifically simple python types (eg: `datetime`, `int`, `str`) and string descriptions (eg: "varchar", "double", "array[float4]"). The string-based inference is, by necessity, quite flexible/involved.

Also, if set in the cursor description, the "internal_size", "precision", and "scale" entries are used to further refine the inferred dtypes (eg: if we have "type_code" = `float` and "internal_size" = 4, we can infer the more accurate `Float32`, saving some memory and speeding things up later).

Note that more sophisticated inference requires specific driver module knowledge in order to reverse-lookup bespoke integer codes, enums, and all manner of driver-specific custom type designations (the DBAPI2 spec did not solve this part of the interface at all... ;)
Example

Before:
Result from a SQLAlchemy query returning no rows, using `pyodbc` against MSSQL. (Previously we would only infer the column names, but not the dtypes).

After:
While `pyodbc` does not provide especially detailed dtypes (eg: it does not specify the size of ints/floats, etc), we can now infer the broad dtype, which is a notable improvement over "null".
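As a rough, hedged illustration of that behaviour (the DSN, table, and inferred schema below are hypothetical, not output captured from this PR):

```python
import polars as pl
import pyodbc

# hypothetical DSN; any MSSQL connection would do
conn = pyodbc.connect("DSN=mssql_test")

# the query deliberately matches no rows, so there is no data to infer dtypes from
df = pl.read_database(
    query="SELECT id, name, created_at FROM users WHERE 1 = 0",  # hypothetical table
    connection=conn,
)

# before: every column came back as Null (only the names were known)
# after:  the cursor description lets us infer broad dtypes, eg something like
#         {"id": Int64, "name": String, "created_at": Datetime}
print(df.schema)
```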
(Note that `arrow-odbc` is strongly preferred over `pyodbc` in real-world use with Polars, due to significant performance -and typing- benefits).

Also
Queries using the SQLAlchemy `duckdb-engine` dialect now automatically take the Arrow-aware `duckdb` fast-path.
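A minimal usage sketch, assuming the `duckdb-engine` package is installed (the in-memory database and query are illustrative):

```python
import polars as pl
from sqlalchemy import create_engine

# SQLAlchemy engine backed by the duckdb-engine dialect (in-memory database)
engine = create_engine("duckdb:///:memory:")

# read_database now detects the duckdb dialect behind the SQLAlchemy connection
# and routes the query through the Arrow-aware duckdb fast-path
df = pl.read_database(
    query="SELECT 1 AS id, 'foo' AS name",
    connection=engine,
)
print(df)
```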
Footnotes

1. "type_code": https://peps.python.org/pep-0249/#cursor-attributes