DBAPI - running a series of statements that depend on each other (DECLARE, SET, SELECT, ...) #377

Closed
mistercrunch opened this issue Nov 10, 2020 · 17 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@mistercrunch (Contributor) commented Nov 10, 2020

see original issue here: googleapis/python-bigquery-sqlalchemy#74

I'm a committer on Apache Superset, and we generally integrate fairly well with BigQuery using the DBAPI driver provided here. Using the Python DBAPI driver for BigQuery, we're having an issue running a chain of statements that depend on each other in a common session. I'm using a simple connection and cursor behind the scenes.

It appears that the statements are not chained and the session/connection isn't persisted. Any clue on how to solve this? Other DBAPI implementations allow this type of chaining, and using the same logic with this driver does not work properly.

Here's a simple script exemplifying this:

DECLARE d DATE DEFAULT CURRENT_DATE();
CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y);
SELECT Add(3, 4);

Environment details

  • OS type and version: Linux
  • Python version: 3.7.x
  • pip version: 20.2.4
  • google-cloud-bigquery version: 1.28.0 (from pip show google-cloud-bigquery)

Steps to reproduce

code:

from contextlib import closing

# engine is a SQLAlchemy engine using the BigQuery dialect
with closing(engine.raw_connection()) as conn:
    with closing(conn.cursor()) as cursor:
        cursor.execute("DECLARE d DATE DEFAULT CURRENT_DATE()")
        cursor.execute("CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y)")
        cursor.execute("SELECT Add(3, 4)")

@eugeniamz

I can confirm the issue. The driver executes the statements in separate sessions; from the project history you can see that

DECLARE d DATE DEFAULT CURRENT_DATE();
CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y);

appear as separate entries:

[screenshot: project history showing the statements as separate entries]

Because they execute as separate entries,

CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y);
fails, since a temp function must be followed by a SELECT, as the error explains:

[screenshot of the error message]

If I execute this script directly from the console, the three statements run in one session:

[screenshot: the three statements running in a single session from the console]

@tswast (Contributor) commented Nov 10, 2020

Since the backend API doesn't really have a concept of connection or cursor, we'll have to do this bundling client-side. Does the following sound reasonable?

Statements from Cursor.execute are batched client-side until either a fetch* method or close (or __del__?) is called.

Perhaps we add a "DEFERRED" isolation mode to select that behavior and keep the current "IMMEDIATE"-style execution the default?
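
For illustration, a rough sketch of how the proposed behavior might look from user code; the isolation_level="DEFERRED" parameter is hypothetical and does not exist in the connector today:

from google.cloud.bigquery import dbapi

# Hypothetical: the proposal would add something like isolation_level="DEFERRED";
# today the connector submits each execute() as its own query job immediately.
conn = dbapi.connect()
cursor = conn.cursor()

# Under the proposal, these statements would be buffered client-side...
cursor.execute("DECLARE d DATE DEFAULT CURRENT_DATE()")
cursor.execute("CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y)")
cursor.execute("SELECT Add(3, 4)")

# ...and only submitted as a single script once results are requested.
rows = cursor.fetchall()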

@tswast (Contributor) commented Nov 10, 2020

Technically, reads in BigQuery use SNAPSHOT isolation, so perhaps if we go this route that should be the only option besides None (autocommit).

@villebro

> Since the backend API doesn't really have a concept of connection or cursor, we'll have to do this bundling client-side. Does the following sound reasonable?
>
> Statements from Cursor.execute are batched client-side until either a fetch* method or close (or __del__?) is called.
>
> Perhaps we add a "DEFERRED" isolation mode to select that behavior and keep the current "IMMEDIATE"-style execution the default?

Having an option to enable client-side batching sounds like a good solution, and one we should be able to leverage in Superset with minimal changes.

@mistercrunch (Contributor, Author) commented Nov 11, 2020

Live debugging the content of PR apache/superset#11660, which attempts to run cursor.fetchall() only on the last statement:

[screenshot of the error]

message as text for convenience:

400 CREATE TEMPORARY FUNCTION statements must be followed by an actual query.

@mistercrunch (Contributor, Author)

Also, the simpler

DECLARE d DATE DEFAULT CURRENT_DATE();
SELECT d as TEST;

This also does not work; I get the BigQuery error: 400 Unrecognized name: d at [1:8].

@tswast (Contributor) commented Nov 11, 2020

Sorry, I was thinking aloud. "Statements from Cursor.execute are batched client-side until either a fetch* method or close (or __del__?) is called." is a proposal, not the current state.

I worry about changing the default behavior, so I'm currently thinking of adding a SNAPSHOT isolation level to enable such behavior.

@mistercrunch (Contributor, Author)

Oh, gotcha! I thought it was a statement, not a proposal.

Can you think of workarounds on our side? Otherwise, we're happy to help improve the DBAPI implementation to make this possible. We'd love to help resolve this quickly!

@tswast (Contributor) commented Nov 11, 2020

The workaround in BigQuery is to take all the statements, join them with ;, and call execute once. Potentially we could add the (non-standard) executescript method to indicate that this is supported.

Related: https://docs.python.org/3.8/library/sqlite3.html#sqlite3.Cursor.executescript
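
A minimal sketch of that workaround, reusing the SQLAlchemy engine from the reproduction above (note the follow-up comments below about fetching results from a script):

from contextlib import closing

statements = [
    "DECLARE d DATE DEFAULT CURRENT_DATE()",
    "CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y)",
    "SELECT Add(3, 4)",
]

with closing(engine.raw_connection()) as conn:  # engine as in the original report
    with closing(conn.cursor()) as cursor:
        # Join the statements into one script so BigQuery runs them in a single job.
        cursor.execute(";\n".join(statements))
        rows = cursor.fetchall()  # results of the final SELECT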

@mistercrunch (Contributor, Author)

I tried running a multi-statement script as a single large string separated with ;\n and found that, following the script execution, cursor.fetchall() returns nothing and cursor.description raises this:

DEBUG:superset.sql_lab:Query 300: Fetching cursor description
ERROR:sqlalchemy.pool.impl.NullPool:Exception closing connection <google.cloud.bigquery.dbapi.connection.Connection object at 0x126493390>
Traceback (most recent call last):
  File "/Users/max/.pyenv/versions/3.7.7/envs/env37/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 270, in _close_connection
    self._dialect.do_close(connection)
  File "/Users/max/.pyenv/versions/3.7.7/envs/env37/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 549, in do_close
    dbapi_connection.close()
  File "/Users/max/.pyenv/versions/3.7.7/envs/env37/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/_helpers.py", line 258, in with_closed_check
    return method(self, *args, **kwargs)
  File "/Users/max/.pyenv/versions/3.7.7/envs/env37/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/connection.py", line 79, in close
    cursor_.close()
  File "/Users/max/.pyenv/versions/3.7.7/envs/env37/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/_helpers.py", line 257, in with_closed_check
    raise exc_class(exc_msg)
google.cloud.bigquery.dbapi.exceptions.ProgrammingError: Operating on a closed cursor.

We're in uncharted territory here as far as DBAPI is concerned, but I was expecting (hoping is probably the right word) the state from the last statement to persist in the cursor, since I hadn't closed the cursor at that point.

@mistercrunch (Contributor, Author)

Digging deeper, the issue is this line: https://github.com/googleapis/python-bigquery/blob/master/google/cloud/bigquery/dbapi/cursor.py#L227-L229

In this context, self._query_job.statement_type.upper() == "SCRIPT", and if I set is_dml = False I can read the cursor from the last statement.

I can open a PR that would set is_dml to False when it's a script, but we should probably make sure the last statement is a SELECT, which takes us into SQL-parsing land, and we may want to stay away from that.
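
For context, a rough illustration of the kind of check being discussed (hypothetical; it is not the actual code in cursor.py, and the real is_dml handling was later removed entirely, see the fix below):

def should_fetch_rows(query_job) -> bool:
    """Return True if the cursor should try to read a result set from the job."""
    statement_type = (getattr(query_job, "statement_type", None) or "").upper()
    # A multi-statement script reports statement_type == "SCRIPT"; the results of
    # its final SELECT are still readable, so don't treat it like a DML statement.
    if statement_type == "SCRIPT":
        return True
    return statement_type not in ("INSERT", "UPDATE", "DELETE", "MERGE")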

@mistercrunch (Contributor, Author)

Just for giggles: #386

@tswast (Contributor) commented Nov 12, 2020

Sent #387 with a system test that fixes the scripting issue.

gcf-merge-on-green bot pushed a commit that referenced this issue Nov 12, 2020
The `is_dml` logic is not needed now that we moved to `getQueryResults` instead of `tabledata.list` (#375).

Previously, the destination table of a DML (and some DDL) query would be a non-null value that was either unreadable or returned nonsense.


Towards #377  🦕
@mistercrunch (Contributor, Author)

Thankful for amazing maintainers(!) How often do you all push to PyPI?

I think we can work with this fix, but there are some downsides. One is that Superset can't keep track of the statement currently executing; another is that we need BigQuery-specific logic to handle this in our codebase. Ideally this DBAPI would conform more closely to the standard.

@tswast (Contributor) commented Nov 16, 2020

Looks like I merged the fix with a failing system test. I'll clean those up in #388.

Once that merges, I think it'll be okay to cut the release in #381.

@tswast (Contributor) commented Mar 25, 2022

Now that sessions are available, I believe we can implement a feature in the DB-API connector to create a new session and persist it for the life of a Connection object. #971
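
For reference, a sketch of how sessions can already tie dependent statements together using the client library directly (outside the DB-API layer); this assumes a google-cloud-bigquery version with session support (roughly 2.29+):

from google.cloud import bigquery

client = bigquery.Client()

# Start a session with the first statement.
job = client.query(
    "DECLARE d DATE DEFAULT CURRENT_DATE()",
    job_config=bigquery.QueryJobConfig(create_session=True),
)
job.result()
session_id = job.session_info.session_id

# Later statements reuse the same session, so they see the DECLAREd variable.
session_config = bigquery.QueryJobConfig(
    connection_properties=[bigquery.ConnectionProperty("session_id", session_id)]
)
rows = list(client.query("SELECT d AS test", job_config=session_config).result())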

@chalmerlowe (Contributor)

I would like to close this issue in deference to completing #971. There is a reference/link in #971 to the content here for context and background.
