Polars read_database does not really respect iter_batches = True when using sqlalchemy/oracledb #15470

njesp · 2024-04-04T07:35:19Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import oracledb
import polars as pl


sql = """
    select cast (rownum as number) id,
        cast (rownum as float) id2,
        cast ('textxxxxxx' as varchar2 (10 char)) v
    from (select rownum r
            from (select     rownum r
                        from dual
                    connect by rownum <= 1000) a,
                (select     rownum r
                        from dual
                    connect by rownum <= 1000) b,
                (select     rownum r
                        from dual
                    connect by rownum <= 1000) c
            where rownum <= 1000000000)
"""
oracledb.init_oracle_client()
con = oracledb.connect("@xxx")
d = pl.read_database(sql, con, iter_batches=True, batch_size=100)
cnt = 0
for _d in d:
    cnt += _d.height
    print(cnt)
con.close()

Log output

No response

Issue description

The example code runs in principle, but read_database only returns the iterator after it has fetched all 1 billion rows. In practice this ends in a memory error, so batch processing does not solve the memory consumption problem.

oracledb==2.0.1
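
For illustration, here is a minimal sketch of the lazy behaviour the issue is asking for: each batch is pulled from the cursor only when the consumer requests it. This is not Polars' implementation. The helper name `iter_row_batches` is hypothetical, and an in-memory SQLite database stands in for the Oracle connection so that the snippet runs without a database server.

```python
import sqlite3
from typing import Any, Iterator


def iter_row_batches(cursor: Any, batch_size: int) -> Iterator[list[tuple]]:
    """Lazily yield lists of rows from a DB-API 2.0 cursor (hypothetical helper)."""
    while True:
        rows = cursor.fetchmany(batch_size)  # fetch at most batch_size rows
        if not rows:
            break
        yield rows


# Assumption: SQLite stands in for Oracle; any DB-API 2.0 cursor should behave alike.
con = sqlite3.connect(":memory:")
con.execute("create table t (id integer)")
con.executemany("insert into t values (?)", [(i,) for i in range(1000)])

cur = con.cursor()
cur.execute("select id from t order by id")
batches = iter_row_batches(cur, batch_size=100)

# Nothing has been fetched yet; this call pulls only the first 100 rows.
first = next(batches)
```

Because the generator body does not run until `next()` is called, creating `batches` is cheap, and at most one batch of rows is held by the helper at a time.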

Expected behavior

The equivalent pandas code runs with negligible memory use, fetching rows in chunks of the size given by chunksize.


import oracledb
import pandas as pd


sql = """
    select cast (rownum as number) id,
        cast (rownum as float) id2,
        cast ('textxxxxxx' as varchar2 (10 char)) v
    from (select rownum r
            from (select     rownum r
                        from dual
                    connect by rownum <= 1000) a,
                (select     rownum r
                        from dual
                    connect by rownum <= 1000) b,
                (select     rownum r
                        from dual
                    connect by rownum <= 1000) c
            where rownum <= 1000000000)
"""
oracledb.init_oracle_client()
con = oracledb.connect("@xxx")
d = pd.read_sql(sql, con, chunksize=100)
cnt = 0
for _d in d:
    cnt += len(_d)
    print(cnt)
con.close()

Installed versions

--------Version info---------
Polars:               0.20.18
Index type:           UInt32
Platform:             Windows-10-10.0.19044-SP0
Python:               3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              14.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.28
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>


alexander-beedie · 2024-04-06T07:49:30Z

Ouch; I've identified the issue - will make a PR to address it shortly 👌
Thanks for the test-case; very helpful.

njesp · 2024-04-07T11:25:38Z

That was fast. Thank you 👊

alexander-beedie · 2024-04-08T20:07:36Z

> That was fast. Thank you 👊

No problem - fix is in the new 0.20.19 release; let me know if you still see any issues 😅

njesp · 2024-04-09T05:56:54Z

I cannot get it working. I get this:

Traceback (most recent call last):
  File "/home/njn/polars/poc.py", line 24, in <module>
    for _d in d:
  File "/home/njn/.conda/envs/polars_poc/lib/python3.11/site-packages/polars/io/database/_executor.py", line 260, in <genexpr>
    frames = (
             ^
  File "/home/njn/.conda/envs/polars_poc/lib/python3.11/site-packages/polars/io/database/_executor.py", line 175, in _fetchmany_rows
    rows = result.fetchmany(batch_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/njn/.conda/envs/polars_poc/lib/python3.11/site-packages/oracledb/cursor.py", line 793, in fetchmany
    self._verify_fetch()
  File "/home/njn/.conda/envs/polars_poc/lib/python3.11/site-packages/oracledb/cursor.py", line 136, in _verify_fetch
    self._verify_open()
  File "/home/njn/.conda/envs/polars_poc/lib/python3.11/site-packages/oracledb/cursor.py", line 146, in _verify_open
    errors._raise_err(errors.ERR_CURSOR_NOT_OPEN)
  File "/home/njn/.conda/envs/polars_poc/lib/python3.11/site-packages/oracledb/errors.py", line 181, in _raise_err
    raise error.exc_type(error) from cause
oracledb.exceptions.InterfaceError: DPY-1006: cursor is not open

It is the same on Windows 10 and Ubuntu Linux 22.04.

My Python Anaconda environment is defined as follows:

name: polars_poc
dependencies:
  - python=3.11.8
  - pip=23.3.1
  - pip:
    - oracledb==2.1.1
    - polars==0.20.19

/Niels
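
The traceback above shows `fetchmany` being called from inside a generator expression on a cursor that is no longer open. One general way this kind of error can arise is sketched below: a lazy batch generator is returned, but the cursor it depends on is closed before the generator is consumed. Whether this is what happens inside Polars is not confirmed in this thread. The function names are hypothetical, and SQLite stands in for Oracle (SQLite raises `sqlite3.ProgrammingError` where oracledb raises DPY-1006).

```python
import sqlite3
from typing import Iterator


def broken_batches(con: sqlite3.Connection, query: str, batch_size: int) -> Iterator[list[tuple]]:
    """Return a lazy batch generator, but close the cursor before it is consumed."""
    cur = con.cursor()
    try:
        cur.execute(query)
        # The generator expression is lazy: no fetchmany call has run yet.
        return (rows for rows in iter(lambda: cur.fetchmany(batch_size), []))
    finally:
        cur.close()  # runs at return time, before any batch is fetched


def working_batches(con: sqlite3.Connection, query: str, batch_size: int) -> Iterator[list[tuple]]:
    """Keep the cursor open for the generator's lifetime and close it afterwards."""
    cur = con.cursor()
    try:
        cur.execute(query)
        while rows := cur.fetchmany(batch_size):
            yield rows
    finally:
        cur.close()  # runs only once the generator is exhausted or closed


con = sqlite3.connect(":memory:")
con.execute("create table t (id integer)")
con.executemany("insert into t values (?)", [(i,) for i in range(250)])

error_message = None
try:
    next(broken_batches(con, "select id from t", 100))
except sqlite3.ProgrammingError as exc:
    error_message = str(exc)  # SQLite's counterpart of "cursor is not open"

batch_sizes = [len(b) for b in working_batches(con, "select id from t", 100)]
```

The design point is that whatever owns the cursor has to keep it open for as long as the batch iterator is in use, which is why the second variant ties cursor cleanup to the generator itself.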

njesp · 2024-04-12T07:06:03Z

@alexander-beedie did you see the above comment? It does not work for me.

Regards Niels

njesp added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 4, 2024

alexander-beedie self-assigned this Apr 6, 2024

alexander-beedie removed the needs triage Awaiting prioritization by a maintainer label Apr 6, 2024

alexander-beedie mentioned this issue Apr 6, 2024

fix(python): Address issue with read_database draining iter_batches early #15504

Merged

stinodego closed this as completed in #15504 Apr 6, 2024

njesp mentioned this issue May 14, 2024

polars.read_database fails on Oracle with iter_batches=True. DPY-1006: cursor is not open #16206

Closed
