
adding support for datasets and query arguments in connection string #25

Merged
merged 16 commits into googleapis:master
Oct 5, 2018

Conversation

blainehansen
Contributor

closes #24

@blainehansen
Contributor Author

This has only been tested with a basic case: a slightly modified version of the example in the readme:

from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
engine = create_engine('bigquery://some_project/some_dataset')

table = Table('some_table', MetaData(bind=engine), autoload=True)
print(select([func.count('*')], from_obj=table).scalar())

This will successfully count the rows in some_dataset.some_table
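A connection string like the one above can be split into a project, an optional dataset, and query arguments using only the standard library. A minimal sketch of the idea (the `parse_bigquery_url` helper and its return shape are illustrative, not part of the dialect):

```python
from urllib.parse import parse_qsl, urlparse

def parse_bigquery_url(url):
    """Split 'bigquery://project/dataset?key=value' into its parts."""
    parts = urlparse(url)
    project = parts.hostname                    # netloc carries the project id
    dataset = parts.path.lstrip('/') or None    # path carries the dataset, if any
    query_args = dict(parse_qsl(parts.query))   # trailing ?key=value pairs
    return project, dataset, query_args
```

With no dataset segment in the URL, the dataset comes back as `None`, which would be the signal to fall back to the existing unrestricted behavior.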

@blainehansen
Contributor Author

I'm not sure what other modifications would need to be made, or if this could be achieved in a better way.

@mxmzdlv
Contributor

mxmzdlv commented Sep 20, 2018

Thanks @blainehansen!

I've tested the code and there are some issues with SQL generation, specifically when generating queries with GROUP BY, HAVING, or ORDER BY clauses.

With your approach, table names don't include dataset ids, so GROUP BY / HAVING / ORDER BY should not include them either — but they are still included in the generated query.

E.g., this is what SQLAlchemy generates:

SELECT `sample`.`string` AS `sample_string`, count(`sample`.`integer`) AS `count_1` 
FROM `test_pybigquery`.`sample` 
GROUP BY `test_pybigquery`.`sample`.`string`

And it needs to be:

SELECT `sample`.`string` AS `sample_string`, count(`sample`.`integer`) AS `count_1` 
FROM `test_pybigquery`.`sample`
GROUP BY `sample`.`string`

With our current approach SQLAlchemy just appends dataset id everywhere and that works fine:

SELECT `test_pybigquery.sample`.`string` AS `test_pybigquery_sample_string`, count(`test_pybigquery.sample`.`integer`) AS `count_1` 
FROM `test_pybigquery.sample`
GROUP BY `test_pybigquery.sample`.`string`

Also, the get_table_names method still returns table names with dataset ids in them, and I think that would cause issues with the table definitions Superset auto-generates.

I've added some tests. The most relevant ones to look at are test_tables_list and test_group_by.

@mxmzdlv
Contributor

mxmzdlv commented Sep 20, 2018

Another issue is that the underlying BigQuery client doesn't allow restricting query execution to a specified dataset, so raw queries like this will still execute against any dataset:

engine.execute('SELECT * FROM another_dataset.table').fetchall()

Not sure if that is relevant in your case.

@blainehansen
Contributor Author

To your first comment:

We're completely fine if the dataset_id is appended everywhere (as in your last example):

SELECT `test_pybigquery.sample`.`string` AS `test_pybigquery_sample_string`, count(`test_pybigquery.sample`.`integer`) AS `count_1` 
FROM `test_pybigquery.sample`
GROUP BY `test_pybigquery.sample`.`string`

How do I change the code to make that possible? I'm not sure where I can take the url.database string (which is where the connection string makes it available) and feed it in so it's included everywhere, without simply naming the table dataset_id.table_name (which would obviously ruin the multi-tenancy). The ideal solution for us changes your codebase as little as possible.
(Sorry if I seem a little clueless. I've had a very difficult time finding useful docs for custom SQLAlchemy dialects; even discovering the visit_table function yesterday took a long process of searching random sites like kite.com and then guessing and checking. I'd love your help or pointers in getting this figured out, whether that's code suggestions or pointers to good resources.)

To your second comment:

Yeah, raw queries acting on arbitrary datasets (or just generally not having any awareness of the dataset restriction) is definitely a problem for us. But without changing the BigQuery client to restrict to a specific dataset, I'm not sure how that could be solved. We really want to avoid having to create separate projects for different tenants; that promises a lot of maintenance overhead.

Thanks!

I'll start playing with the code with your tests to see what I can make happen.

@mxmzdlv
Contributor

mxmzdlv commented Sep 20, 2018

I think you would need to modify BigQueryCompiler.visit_label or BigQueryIdentifierPreparer.format_label, and include_table in BigQueryCompiler might be relevant, but I am not sure how to get dataset_id into these methods. Perhaps you could have another BigQueryCompiler with a separate logic for when the dataset is specified on connection — you could check it in BigQueryDialect.create_connect_args and overwrite statement_compiler.
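As a toy illustration of the naming issue in plain Python (not the real SQLAlchemy compiler API): when a dataset is bound at connection time, it can be folded into the table part of every rendered name, so qualified and unqualified output stay consistent. The `format_column` helper below is hypothetical:

```python
def format_column(table, column, dataset=None):
    # With a bound dataset, fold it into the table identifier so the
    # dataset is "appended everywhere": `dataset.table`.`column`.
    # Without one, emit the bare `table`.`column` form.
    table_part = f"{dataset}.{table}" if dataset else table
    return f"`{table_part}`.`{column}`"
```

Called as format_column('sample', 'string', dataset='test_pybigquery'), this renders the `test_pybigquery.sample`.`string` form — the same shape as the GROUP BY that already works in the current approach.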

I haven't had much luck with the documentation either, so I generally just look into source code —
BigQueryDialect inherits from DefaultDialect and SQL generation mostly happens in SQLCompiler and IdentifierPreparer.

@blainehansen
Contributor Author

The maintainers on googleapis/google-cloud-python#6088 actually look interested in going in that direction, so I think this will get much easier to achieve once you can pass a default QueryJobConfig to the Client constructor. I'm going to change the direction of this pull request to accept more default config parameters in the SQLAlchemy URL and use them to build that default QueryJobConfig.
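Once that lands, wiring the dataset through could look roughly like this — a sketch against the default_query_job_config argument that googleapis/google-cloud-python#6088 adds, not runnable until the google-cloud-bigquery release that includes it:

```python
from google.cloud import bigquery

# Sketch, assuming the google-cloud-bigquery release that adds the
# default_query_job_config argument from googleapis/google-cloud-python#6088.
job_config = bigquery.QueryJobConfig()
job_config.default_dataset = bigquery.DatasetReference(
    'some_project', 'some_dataset')

client = bigquery.Client(
    project='some_project',
    default_query_job_config=job_config,
)

# Unqualified table names in queries would then resolve against
# some_dataset, e.g. client.query('SELECT COUNT(*) FROM some_table')
```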

@blainehansen
Contributor Author

blainehansen commented Sep 26, 2018

I've got significant changes coming in soon that all depend on googleapis/google-cloud-python#6088 landing. Until it does, the tests won't pass, except the ones in test_parse_url.py (pytest -k parse_url). That change will make this one significantly easier.

I wanted your feedback on the docs changes and the API, and on whether there's anything further to think about regarding how the dialect actually works.

@blainehansen blainehansen changed the title adding basic support for single datasets in connection string adding support for datasets and query arguments in connection string Sep 26, 2018
@mxmzdlv
Contributor

mxmzdlv commented Sep 27, 2018

Awesome! I've added some additional tests and changed get_schema_names and get_table_names behavior when dataset is set. Seems to be passing all the tests. I've also tested it with Superset and haven't seen any issues.

Since this still won't prevent executing queries against other datasets when using the engine directly, I've added a note about that to the readme.

I think this is good to merge as soon as the new version of google-cloud-python lands and we change setup.py to require it.

@blainehansen
Contributor Author

That landed about 16 hours ago, so it's possible that the tests you've run have already validated everything! Let me know if there's anything else I can do to help.

@blainehansen
Contributor Author

The changes from that pull request now show up in Google's docs (the default_query_job_config argument is from that pull request), but it doesn't seem like it's been officially released yet? https://github.com/GoogleCloudPlatform/google-cloud-python/releases

It might be necessary to try specifying the repo in the setup.py install_requires options. I'm not sure though.

@mxmzdlv
Contributor

mxmzdlv commented Sep 28, 2018

Yeah, I've manually added changes from that pull request to test.

Ideally we should wait for the release, but if you need this ASAP you can try pointing to the google-cloud-python GitHub repo in setup.py and then running pip install https://github.com/marketdial/pybigquery/archive/master.zip in your project.

@blainehansen
Contributor Author

Alright, the BigQuery client code has been released!
https://github.com/googleapis/google-cloud-python/releases

@mxmzdlv mxmzdlv merged commit 404e89f into googleapis:master Oct 5, 2018
@mxmzdlv
Contributor

mxmzdlv commented Oct 5, 2018

Perfect! Pushed the new version 0.4.6.

@blainehansen
Contributor Author

🎊 🎊 🎊

Thanks!

Successfully merging this pull request may close these issues.

Add ability to specify dataset in connection string