ComputeError for read_database suggests to infer_schema_length but no option to do so with SQLAlchemy Connection #11912

Closed
2 tasks done
matquant14 opened this issue Oct 21, 2023 · 8 comments · Fixed by #14627
Labels
A-io-database Area: reading/writing to databases bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@matquant14

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

sql = """SELECT * FROM tbl where custid = ? and region in (?,?)

df=pl.read_database(sql,connection,execute_options={'parameters':('abc123','US','EUR'))

Log output

No response

Issue description

When executing a parameterized query with the read_database function, using a SQLAlchemy connection, I get a compute error suggesting that I increase infer_schema_length. There's no option to do that with the read_database function at this time.

I'm getting an error that reads

could not append value: "Date_" of type: str to the builder; make sure that all rows have the same schema or consider increasing infer_schema_length

it might also be that a value overflows the data-type's capacity

The issue seems to arise when the rows are put into a DataFrame, as I'm able to use SQLAlchemy's exec_driver_sql method to get the output back as rows.
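
For reference, that check looks roughly like this (a sketch, reusing the query and parameters from the example above):

# The same rows come back fine when fetched directly through SQLAlchemy;
# the failure only appears when read_database builds the DataFrame.
rows = connection.exec_driver_sql(
    "SELECT * FROM tbl where custid = ? and region in (?,?)",
    ('abc123', 'US', 'EUR'),
).fetchall()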

Expected behavior

The expected output should be an 8524 x 24 DataFrame.

Installed versions

--------Version info---------
Polars:              0.19.9
Index type:          UInt32
Platform:            Windows-10-10.0.19044-SP0
Python:              3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:  0.5.1
cloudpickle:         2.2.1
connectorx:          0.3.2
deltalake:           0.10.0
fsspec:              2023.9.2
gevent:              <not installed>
matplotlib:          3.8.0
numpy:               1.26.1
openpyxl:            3.1.2
pandas:              2.1.1
pyarrow:             13.0.0
pydantic:            2.4.2
pyiceberg:           <not installed>
pyxlsb:              1.0.10
sqlalchemy:          2.0.22
xlsx2csv:            0.8.1
xlsxwriter:          3.1.9
@matquant14 matquant14 added bug Something isn't working python Related to Python Polars labels Oct 21, 2023
@alexander-beedie alexander-beedie self-assigned this Oct 21, 2023
@alexander-beedie
Collaborator

alexander-beedie commented Oct 21, 2023

Hmm; I can't see an obvious explanation for this in the standard (default) non-batched call to read_database as the internal call will be to "fetchall", which will materialise everything up-front, so there would be no inference.

You haven't got any other parameters involved, such as "schema_overrides", "iter_batches", etc?
Also, what is the underlying/complete table schema and the specific backend/driver? 🤔

@matquant14
Author

No, I'm not using any other parameters; the defaults for those extra parameters are being used. This is for SQL Server, where SQLAlchemy is leveraging pyodbc. SQLAlchemy reads the data types for the 24 table fields as:

[<class 'str'>, <class 'str'>, <class 'int'>, <class 'str'>, <class 'int'>, <class 'str'>, <class 'str'>, <class 'int'>, <class 'str'>, <class 'str'>, <class 'int'>, <class 'bool'>, <class 'bool'>, <class 'str'>, <class 'bool'>, <class 'int'>, <class 'str'>, <class 'str'>, <class 'datetime.datetime'>, <class 'int'>, <class 'int'>, <class 'int'>, <class 'decimal.Decimal'>, <class 'int'>]

I can run it with pandas and even convert the result to an Arrow table without any issue. After converting to an Arrow table, the data types for the table fields become:

[string, string, int64, string, int64, string, string, int64, string, string, double, bool, bool, string, bool, double, string, string, timestamp[ns], null, double, double, double, double]

Perhaps something is happening with the datetime conversion?

My current fallback is to execute the query with pandas and then convert to Polars, but I would of course like to skip that step entirely.
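
For what it's worth, the fallback looks roughly like this (a sketch rather than my exact code; it assumes pandas passes the qmark-style parameters through to the driver):

import pandas as pd
import polars as pl

# Read the query via pandas first, then hand the result over to Polars.
pdf = pd.read_sql(sql, connection, params=('abc123', 'US', 'EUR'))
df = pl.from_pandas(pdf)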

@alexander-beedie
Collaborator

alexander-beedie commented Oct 21, 2023

Somewhat to my surprise it looks like there is a free/test instance of SQL Server available for Ubuntu; while I actually run macOS, I do have a virtualised x86 Ubuntu available on my NAS (which is what I used to validate arrow-odbc), so... guess I'll try and install this tomorrow to attempt to replicate! 😅

Any special pyodbc connection settings I should know about?

@matquant14
Author

No, no special connection settings really. The SQL Server connection uses username/password authentication instead of Windows authentication, but I doubt that should really matter.

I'm using SQLAlchemy's URL function to create the URI, which is then fed into an engine to connect:

from sqlalchemy import create_engine, URL

url = URL.create(
    "mssql+pyodbc",
    host=host,
    username=username,
    password=password,
    port=port,
    database=database,
    query={"driver": "ODBC Driver 17 for SQL Server"},
)

engine = create_engine(url, paramstyle="qmark")
connection = engine.connect()

(I think under the hood, SQLAlchemy uses pyodbc as the ODBC connection driver.)

@John-bardera

(I am not a native English speaker, so the wording may not be quite right.)
Hi. I was getting a similar error in my environment.

I have confirmed that when I run read_database with a SQLAlchemy connection, the datetime type (sqlalchemy.DateTime) triggers the error.

error message

polars.exceptions.ComputeError: could not append value: 2015-01-04 09:50:00 of type: datetime[μs] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

it might also be that a value overflows the data-type's capacity

This is a tentative workaround, but it worked as expected via #11246 (comment).

A more proper fix would be to implement #11246 (comment), but I have not done that.

scripts and detailed error logs

scripts

(Functions are expanded for illustrative purposes, but options, etc. remain unchanged.)

return pl.read_database(
    query=query, # str query
    connection=sqlalchemy.create_engine(uri).connect(),
)

SQLAlchemy column definition that triggers the error

from sqlalchemy import Column as C
from sqlalchemy import Integer, String, DateTime, Boolean

# ...
postTime = C(DateTime(timezone=True), nullable=False, index=True)

error logs

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspaces/backend/feature_creator/feature_creator.py", line 51, in load_targets
    ).get_data()
  File "/workspaces/backend/feature_creator/target_loader.py", line 40, in get_data
    return pl.read_database(
  File "/usr/local/lib/python3.10/site-packages/polars/io/database.py", line 566, in read_database
    return cx.execute(
  File "/usr/local/lib/python3.10/site-packages/polars/io/database.py", line 365, in to_polars
    frame = frame_init(
  File "/usr/local/lib/python3.10/site-packages/polars/io/database.py", line 301, in _from_rows
    return frames if iter_batches else next(frames)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/site-packages/polars/io/database.py", line 289, in <genexpr>
    DataFrame(
  File "/usr/local/lib/python3.10/site-packages/polars/dataframe/frame.py", line 368, in __init__
    self._df = sequence_to_pydf(
  File "/usr/local/lib/python3.10/site-packages/polars/utils/_construction.py", line 934, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
  File "/usr/local/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/usr/local/lib/python3.10/site-packages/polars/utils/_construction.py", line 1126, in _sequence_of_tuple_to_pydf
    return _sequence_of_sequence_to_pydf(
  File "/usr/local/lib/python3.10/site-packages/polars/utils/_construction.py", line 1077, in _sequence_of_sequence_to_pydf
    pydf = PyDataFrame.read_rows(
polars.exceptions.ComputeError: could not append value: 2015-01-04 09:50:00 of type: datetime[μs] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

it might also be that a value overflows the data-type's capacity
fixed scripts

scripts

(Functions are expanded for illustrative purposes, but options, etc. remain unchanged.)

return pl.read_database(
    query=query, # str query
    connection=sqlalchemy.create_engine(uri).connect(),
    schema_overrides={ 'Race.postTime': pl.Datetime }
)

Hope this helps you. Thanks.

@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@matquant14
Author

Hi @alexander-beedie & @stinodego ,

any update on this issue?

@stinodego stinodego added the A-io-database Area: reading/writing to databases label Feb 21, 2024
@alexander-beedie
Collaborator

alexander-beedie commented Feb 21, 2024

Hi @alexander-beedie & @stinodego ,
any update on this issue?

Will revisit this; I had a hell of a time trying to get a working SQL Server up the last time I tried to dig into this... I'll try again 🤔

In the meantime, can you supply the database-side DDL for the table you're reading from, so I can see the exact SQLServer types? (rather than the alchemy/pandas/arrow types)

@alexander-beedie
Collaborator

alexander-beedie commented Feb 21, 2024

Hmm, think I've got something; was able to (finally 😓) get a suitable Docker image up locally, make pyodbc work with it, and set up a test table containing all the different flavours of SQL Server temporal columns.

Do you happen to have a column of type DATETIMEOFFSET? This seems to return a cursor-result description type_code indicating that it's a string, but it actually returns a timezone-aware datetime. (This replicated the same error message for me that you posted above). I have a workaround, and am committing a fix shortly.
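
For reference, the mismatch can be seen with something like this (a sketch; the table and column names are hypothetical, and it assumes the SQLAlchemy connection from the report):

result = connection.exec_driver_sql("SELECT my_datetimeoffset_col FROM my_table")
print(result.cursor.description)  # the reported type_code can claim the column is a string
print(result.fetchone())          # ...while the actual value is a timezone-aware datetime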

(Also: you might want to take a look at using arrow-odbc; it offers much better performance than pyodbc. You can give it a go by simply passing the ODBC connection string "as-is" to read_database, in place of the SQLAlchemy connection).
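
For example (a sketch; the server details are placeholders, and arrow-odbc needs to be installed):

import polars as pl

# Passing an ODBC connection string directly lets read_database use arrow-odbc for the read.
conn_str = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=my_server,1433;Database=my_db;UID=my_user;PWD=my_password;"
)
df = pl.read_database(query="SELECT * FROM tbl", connection=conn_str)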
