feat(python): improved dtype inference/refinement for read_database results
#15126
Ref: #15076 (comment)
Additional layer of `dtype` inference for query results that do not return Arrow data directly, and that do not have an explicit "schema_overrides" entry.

This update adds support for inferring more accurate dtypes from the two simplest flavours of cursor "type_code"[1] description, specifically simple python types (eg: `datetime`, `int`, `str`) and string descriptions (eg: "varchar", "double", "array[float4]"). The string-based inference is, by necessity, quite flexible/involved.

Also, if set in the cursor description, the "internal_size", "precision", and "scale" entries are used to further refine the inferred dtypes (eg: if we have "type_code" = `float` and "internal_size" = 4, we can infer the more accurate `Float32`, saving some memory and speeding things up later).

Note that more sophisticated inference requires specific driver module knowledge in order to reverse-lookup bespoke integer codes, enums, and all manner of driver-specific custom type designations (the DBAPI2 spec did not solve this part of the interface at all... ;)
Example

Before:
Result from a SQLAlchemy query returning no rows, using `pyodbc` against MSSQL. (Previously we would only infer the column names, but not the dtypes).

After:
While `pyodbc` does not provide especially detailed dtypes (eg: it does not specify the size of ints/floats, etc), we can now infer the broad dtype, which is a notable improvement over "null".
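As a rough, hedged illustration of that behaviour (the DSN, table, and inferred schema below are hypothetical, not output captured from this PR):

```python
import polars as pl
import pyodbc

# hypothetical DSN; any MSSQL connection would do
conn = pyodbc.connect("DSN=mssql_test")

# the query deliberately matches no rows, so there is no data to infer dtypes from
df = pl.read_database(
    query="SELECT id, name, created_at FROM users WHERE 1 = 0",  # hypothetical table
    connection=conn,
)

# before: every column came back as Null (only the names were known)
# after:  the cursor description lets us infer broad dtypes, eg something like
#         {"id": Int64, "name": String, "created_at": Datetime}
print(df.schema)
```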
(Note that `arrow-odbc` is strongly preferred over `pyodbc` in real-world use with Polars, due to significant performance -and typing- benefits).

Also
Queries using the SQLAlchemy `duckdb-engine` dialect now automatically take the Arrow-aware `duckdb` fast-path.
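A minimal usage sketch, assuming the `duckdb-engine` package is installed (the in-memory database and query are illustrative):

```python
import polars as pl
from sqlalchemy import create_engine

# SQLAlchemy engine backed by the duckdb-engine dialect (in-memory database)
engine = create_engine("duckdb:///:memory:")

# read_database now detects the duckdb dialect behind the SQLAlchemy connection
# and routes the query through the Arrow-aware duckdb fast-path
df = pl.read_database(
    query="SELECT 1 AS id, 'foo' AS name",
    connection=engine,
)
print(df)
```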
Footnotes

1. "type_code": https://peps.python.org/pep-0249/#cursor-attributes