ComputeError for read_database suggests to infer_schema_length but no option to do so with SQLAlchemy Connection #11912

Closed
2 tasks done
matquant14 opened this issue Oct 21, 2023 · 8 comments · Fixed by #14627
Labels
A-io-database Area: reading/writing to databases bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@matquant14

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

sql = """SELECT * FROM tbl where custid = ? and region in (?,?)

df=pl.read_database(sql,connection,execute_options={'parameters':('abc123','US','EUR'))

Log output

No response

Issue description

When executing a parameterized query with the read_database function, using a SQLAlchemy connection, I get a compute error suggesting that I increase infer_schema_length. There's no option to do that with the read_database function at this time.

I'm getting an error that reads

could not append value: "Date_" of type: str to the builder; make sure that all rows have the same schema or consider increasing infer_schema_length

it might also be that a value overflows the data-type's capacity

The issue seems to arise when the rows are put into a DataFrame, as I'm able to use SQLAlchemy's exec_driver_sql method to get the output back as rows.
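
For reference, that check looks roughly like this (a sketch, reusing the query and parameters from the example above):

# The same rows come back fine when fetched directly through SQLAlchemy;
# the failure only appears when read_database builds the DataFrame.
rows = connection.exec_driver_sql(
    "SELECT * FROM tbl where custid = ? and region in (?,?)",
    ('abc123', 'US', 'EUR'),
).fetchall()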

Expected behavior

The expected output should be an 8524 x 24 DataFrame.

Installed versions

--------Version info---------
Polars:              0.19.9
Index type:          UInt32
Platform:            Windows-10-10.0.19044-SP0
Python:              3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:  0.5.1
cloudpickle:         2.2.1
connectorx:          0.3.2
deltalake:           0.10.0
fsspec:              2023.9.2
gevent:              <not installed>
matplotlib:          3.8.0
numpy:               1.26.1
openpyxl:            3.1.2
pandas:              2.1.1
pyarrow:             13.0.0
pydantic:            2.4.2
pyiceberg:           <not installed>
pyxlsb:              1.0.10
sqlalchemy:          2.0.22
xlsx2csv:            0.8.1
xlsxwriter:          3.1.9
@matquant14 matquant14 added bug Something isn't working python Related to Python Polars labels Oct 21, 2023
@alexander-beedie alexander-beedie self-assigned this Oct 21, 2023
@alexander-beedie
Collaborator

alexander-beedie commented Oct 21, 2023

Hmm; I can't see an obvious explanation for this in the standard (default) non-batched call to read_database as the internal call will be to "fetchall", which will materialise everything up-front, so there would be no inference.

You haven't got any other parameters involved, such as "schema_overrides", "iter_batches", etc?
Also, what is the underlying/complete table schema and the specific backend/driver? 🤔

@matquant14
Author

No, I'm not using any other parameters; the defaults for those extra parameters are being used. This is for SQL Server, where SQLAlchemy is leveraging pyodbc. SQLAlchemy reads the data types for the 24 table fields as:

[<class 'str'>, <class 'str'>, <class 'int'>, <class 'str'>, <class 'int'>, <class 'str'>, <class 'str'>, <class 'int'>, <class 'str'>, <class 'str'>, <class 'int'>, <class 'bool'>, <class 'bool'>, <class 'str'>, <class 'bool'>, <class 'int'>, <class 'str'>, <class 'str'>, <class 'datetime.datetime'>, <class 'int'>, <class 'int'>, <class 'int'>, <class 'decimal.Decimal'>, <class 'int'>]

I can run it with pandas and even convert the result to an Arrow table without any issue. After converting to an Arrow table, the data types for the table fields become:

[string, string, int64, string, int64, string, string, int64, string, string, double, bool, bool, string, bool, double, string, string, timestamp[ns], null, double, double, double, double]

Perhaps something is happening with the datetime conversion?

My current fallback is to execute the query with pandas and then convert to Polars, but I would of course like to skip that step entirely.
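
For what it's worth, the fallback looks roughly like this (a sketch rather than my exact code; it assumes pandas passes the qmark-style parameters through to the driver):

import pandas as pd
import polars as pl

# Read the query via pandas first, then hand the result over to Polars.
pdf = pd.read_sql(sql, connection, params=('abc123', 'US', 'EUR'))
df = pl.from_pandas(pdf)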

@alexander-beedie
Collaborator

alexander-beedie commented Oct 21, 2023

Somewhat to my surprise it looks like there is a free/test instance of SQL Server available for Ubuntu; while I actually run macOS, I do have a virtualised x86 Ubuntu available on my NAS (which is what I used to validate arrow-odbc), so... guess I'll try and install this tomorrow to attempt to replicate! 😅

Any special pyodbc connection settings I should know about?

@matquant14
Author

No, no special connection settings really. The SQL Server connection uses username/password authentication instead of Windows authentication, but I doubt that should really matter.

I'm using SQLAlchemy's URL function to create the URI, which is then fed into an engine to connect:

from sqlalchemy import create_engine, URL

url = URL.create(
    "mssql+pyodbc",
    host=host,
    username=username,
    password=password,
    port=port,
    database=database,
    query={"driver": "ODBC Driver 17 for SQL Server"},
)

engine = create_engine(url, paramstyle="qmark")
connection = engine.connect()

(I think under the hood, SQLAlchemy uses pyodbc as the ODBC connection driver.)

@John-bardera

(I am not a native English speaker, so the wording may not be quite right.)
Hi. I was getting a similar error in my environment.

I have confirmed that when I run read_database with a SQLAlchemy connection, the datetime type (sqlalchemy.DateTime) triggers the error.

error message

polars.exceptions.ComputeError: could not append value: 2015-01-04 09:50:00 of type: datetime[μs] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

it might also be that a value overflows the data-type's capacity

This is a tentative workaround, but it worked as expected via #11246 (comment).

A more proper fix would be to implement #11246 (comment), but I have not done that.

scripts and detailed error logs

scripts

(Functions are expanded for illustrative purposes, but options, etc. remain unchanged.)

return pl.read_database(
    query=query, # str query
    connection=sqlalchemy.create_engine(uri).connect(),
)

SQLAlchemy column definition that triggers the error

from sqlalchemy import Column as C
from sqlalchemy import Integer, String, DateTime, Boolean

# ...
postTime = C(DateTime(timezone=True), nullable=False, index=True)

error logs

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspaces/backend/feature_creator/feature_creator.py", line 51, in load_targets
    ).get_data()
  File "/workspaces/backend/feature_creator/target_loader.py", line 40, in get_data
    return pl.read_database(
  File "/usr/local/lib/python3.10/site-packages/polars/io/database.py", line 566, in read_database
    return cx.execute(
  File "/usr/local/lib/python3.10/site-packages/polars/io/database.py", line 365, in to_polars
    frame = frame_init(
  File "/usr/local/lib/python3.10/site-packages/polars/io/database.py", line 301, in _from_rows
    return frames if iter_batches else next(frames)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/site-packages/polars/io/database.py", line 289, in <genexpr>
    DataFrame(
  File "/usr/local/lib/python3.10/site-packages/polars/dataframe/frame.py", line 368, in __init__
    self._df = sequence_to_pydf(
  File "/usr/local/lib/python3.10/site-packages/polars/utils/_construction.py", line 934, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
  File "/usr/local/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/usr/local/lib/python3.10/site-packages/polars/utils/_construction.py", line 1126, in _sequence_of_tuple_to_pydf
    return _sequence_of_sequence_to_pydf(
  File "/usr/local/lib/python3.10/site-packages/polars/utils/_construction.py", line 1077, in _sequence_of_sequence_to_pydf
    pydf = PyDataFrame.read_rows(
polars.exceptions.ComputeError: could not append value: 2015-01-04 09:50:00 of type: datetime[μs] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

it might also be that a value overflows the data-type's capacity
fixed scripts

scripts

(Functions are expanded for illustrative purposes, but options, etc. remain unchanged.)

return pl.read_database(
    query=query, # str query
    connection=sqlalchemy.create_engine(uri).connect(),
    schema_overrides={ 'Race.postTime': pl.Datetime }
)

Hope this helps you. Thanks.

@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@matquant14
Author

Hi @alexander-beedie & @stinodego ,

any update on this issue?

@stinodego stinodego added the A-io-database Area: reading/writing to databases label Feb 21, 2024
@alexander-beedie
Collaborator

alexander-beedie commented Feb 21, 2024

Hi @alexander-beedie & @stinodego ,
any update on this issue?

Will revisit this; I had a hell of a time trying to get a working SQL Server up the last time I tried to dig into this... I'll try again 🤔

In the meantime, can you supply the database-side DDL for the table you're reading from, so I can see the exact SQLServer types? (rather than the alchemy/pandas/arrow types)

@alexander-beedie
Collaborator

alexander-beedie commented Feb 21, 2024

Hmm, think I've got something; was able to (finally 😓) get a suitable Docker image up locally, make pyodbc work with it, and set up a test table containing all the different flavours of SQL Server temporal columns.

Do you happen to have a column of type DATETIMEOFFSET? This seems to return a cursor-result description type_code indicating that it's a string, but it actually returns a timezone-aware datetime. (This replicated the same error message for me that you posted above). I have a workaround, and am committing a fix shortly.
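
For reference, the mismatch can be seen with something like this (a sketch; the table and column names are hypothetical, and it assumes the SQLAlchemy connection from the report):

result = connection.exec_driver_sql("SELECT my_datetimeoffset_col FROM my_table")
print(result.cursor.description)  # the reported type_code can claim the column is a string
print(result.fetchone())          # ...while the actual value is a timezone-aware datetime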

(Also: you might want to take a look at using arrow-odbc; it offers much better performance than pyodbc. You can give it a go by simply passing the ODBC connection string "as-is" to read_database, in place of the SQLAlchemy connection).
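
For example (a sketch; the server details are placeholders, and arrow-odbc needs to be installed):

import polars as pl

# Passing an ODBC connection string directly lets read_database use arrow-odbc for the read.
conn_str = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=my_server,1433;Database=my_db;UID=my_user;PWD=my_password;"
)
df = pl.read_database(query="SELECT * FROM tbl", connection=conn_str)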
