
adding support for datasets and query arguments in connection string #25

Merged
merged 16 commits into googleapis:master
Oct 5, 2018

Conversation

blainehansen
Contributor

closes #24

@blainehansen
Contributor Author

This has only been tested with a basic case: a slightly modified version of the example in the readme:

from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
engine = create_engine('bigquery://some_project/some_dataset')

table = Table('some_table', MetaData(bind=engine), autoload=True)
print(select([func.count('*')], from_obj=table).scalar())

This will successfully count the rows in some_dataset.some_table
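A connection string like the one above can be split into a project, an optional dataset, and query arguments using only the standard library. A minimal sketch of the idea (the `parse_bigquery_url` helper and its return shape are illustrative, not part of the dialect):

```python
from urllib.parse import parse_qsl, urlparse

def parse_bigquery_url(url):
    """Split 'bigquery://project/dataset?key=value' into its parts."""
    parts = urlparse(url)
    project = parts.hostname                    # netloc carries the project id
    dataset = parts.path.lstrip('/') or None    # path carries the dataset, if any
    query_args = dict(parse_qsl(parts.query))   # trailing ?key=value pairs
    return project, dataset, query_args
```

With no dataset segment in the URL, the dataset comes back as `None`, which would be the signal to fall back to the existing unrestricted behavior.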

@blainehansen
Contributor Author

I'm not sure what other modifications would need to be made, or if this could be achieved in a better way.

@mxmzdlv
Contributor

mxmzdlv commented Sep 20, 2018

Thanks @blainehansen!

I've tested the code and there are some issues with SQL generation, specifically when generating queries with GROUP BY, HAVING, or ORDER BY clauses.

With your approach, table names don't include dataset ids, so GROUP BY / HAVING / ORDER BY should not include them either — but they are still included in the generated query.

E.g., this is what SQLAlchemy generates:

SELECT `sample`.`string` AS `sample_string`, count(`sample`.`integer`) AS `count_1` 
FROM `test_pybigquery`.`sample` 
GROUP BY `test_pybigquery`.`sample`.`string`

And it needs to be:

SELECT `sample`.`string` AS `sample_string`, count(`sample`.`integer`) AS `count_1` 
FROM `test_pybigquery`.`sample`
GROUP BY `sample`.`string`

With our current approach SQLAlchemy just appends dataset id everywhere and that works fine:

SELECT `test_pybigquery.sample`.`string` AS `test_pybigquery_sample_string`, count(`test_pybigquery.sample`.`integer`) AS `count_1` 
FROM `test_pybigquery.sample`
GROUP BY `test_pybigquery.sample`.`string`

Also, the get_table_names method still returns table names with dataset ids in them, and I think that would cause issues with the table definitions Superset auto-generates.

I've added some tests. The most relevant ones to look at are test_tables_list and test_group_by.

@mxmzdlv
Contributor

mxmzdlv commented Sep 20, 2018

Another issue is that the underlying BigQuery client doesn't allow restricting query execution to a specified dataset, so raw queries like this will still execute against any dataset:

engine.execute('SELECT * FROM another_dataset.table').fetchall()

Not sure if that is relevant in your case.

@blainehansen
Contributor Author

To your first comment:

We're completely fine if the dataset_id is appended everywhere (as in your last example):

SELECT `test_pybigquery.sample`.`string` AS `test_pybigquery_sample_string`, count(`test_pybigquery.sample`.`integer`) AS `count_1` 
FROM `test_pybigquery.sample`
GROUP BY `test_pybigquery.sample`.`string`

How do I change the code to make that possible? I'm not sure where I can take the url.database string (which is where the connection string makes it available) and feed it in so it's included everywhere, without simply naming the table dataset_id.table_name (which would obviously ruin the multi-tenancy). The ideal solution for us changes your codebase as little as possible.
(Sorry if I seem a little clueless. I've had a very difficult time finding useful docs for custom SQLAlchemy dialects; even discovering the visit_table function yesterday took a long process of searching random sites like kite.com and then guessing and checking. I'd love your help or pointers in getting this figured out, whether that's code suggestions or pointers to good resources.)

To your second comment:

Yeah, raw queries acting on arbitrary datasets (or just generally not having any awareness of the dataset restriction) is definitely a problem for us. But without changing the BigQuery client to restrict to a specific dataset, I'm not sure how that could be solved. We really want to avoid having to create separate projects for different tenants; that promises a lot of maintenance overhead.

Thanks!

I'll start playing with the code with your tests to see what I can make happen.

@mxmzdlv
Contributor

mxmzdlv commented Sep 20, 2018

I think you would need to modify BigQueryCompiler.visit_label or BigQueryIdentifierPreparer.format_label, and include_table in BigQueryCompiler might be relevant, but I am not sure how to get dataset_id into these methods. Perhaps you could have another BigQueryCompiler with a separate logic for when the dataset is specified on connection — you could check it in BigQueryDialect.create_connect_args and overwrite statement_compiler.
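As a toy illustration of the naming issue in plain Python (not the real SQLAlchemy compiler API): when a dataset is bound at connection time, it can be folded into the table part of every rendered name, so qualified and unqualified output stay consistent. The `format_column` helper below is hypothetical:

```python
def format_column(table, column, dataset=None):
    # With a bound dataset, fold it into the table identifier so the
    # dataset is "appended everywhere": `dataset.table`.`column`.
    # Without one, emit the bare `table`.`column` form.
    table_part = f"{dataset}.{table}" if dataset else table
    return f"`{table_part}`.`{column}`"
```

Called as format_column('sample', 'string', dataset='test_pybigquery'), this renders the `test_pybigquery.sample`.`string` form — the same shape as the GROUP BY that already works in the current approach.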

I haven't had much luck with the documentation either, so I generally just look into source code —
BigQueryDialect inherits from DefaultDialect and SQL generation mostly happens in SQLCompiler and IdentifierPreparer.

@blainehansen
Contributor Author

The maintainers on googleapis/google-cloud-python#6088 actually look interested in going in that direction, so I think this will get much easier to achieve once you can pass a default QueryJobConfig to the Client constructor. I'm going to change the direction of this pull request to accept more default config parameters in the SQLAlchemy URL and use them to build that default QueryJobConfig.
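Once that lands, wiring the dataset through could look roughly like this — a sketch against the default_query_job_config argument that googleapis/google-cloud-python#6088 adds, not runnable until the google-cloud-bigquery release that includes it:

```python
from google.cloud import bigquery

# Sketch, assuming the google-cloud-bigquery release that adds the
# default_query_job_config argument from googleapis/google-cloud-python#6088.
job_config = bigquery.QueryJobConfig()
job_config.default_dataset = bigquery.DatasetReference(
    'some_project', 'some_dataset')

client = bigquery.Client(
    project='some_project',
    default_query_job_config=job_config,
)

# Unqualified table names in queries would then resolve against
# some_dataset, e.g. client.query('SELECT COUNT(*) FROM some_table')
```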

@blainehansen
Contributor Author

blainehansen commented Sep 26, 2018

I've got significant changes coming in soon that all depend on googleapis/google-cloud-python#6088 landing. Until it does, the tests won't pass, except the ones in test_parse_url.py (pytest -k parse_url). That change will make this one significantly easier.

I wanted your feedback on the docs changes and the API, and on whether there's anything further to think about regarding how the dialect actually works.

@blainehansen blainehansen changed the title adding basic support for single datasets in connection string adding support for datasets and query arguments in connection string Sep 26, 2018
@mxmzdlv
Contributor

mxmzdlv commented Sep 27, 2018

Awesome! I've added some additional tests and changed get_schema_names and get_table_names behavior when dataset is set. Seems to be passing all the tests. I've also tested it with Superset and haven't seen any issues.

Since this still won't prevent executing queries against other datasets when using the engine directly, I've added a note about that to the readme.

I think this is good to merge as soon as the new version of google-cloud-python lands and we change setup.py to require it.

@blainehansen
Contributor Author

That landed about 16 hours ago, so it's possible that the tests you've run have already validated everything! Let me know if there's anything else I can do to help.

@blainehansen
Contributor Author

The changes from that pull request now show up in Google's docs (the default_query_job_config argument is from that pull request), but it doesn't seem like it's been officially released yet? https://github.com/GoogleCloudPlatform/google-cloud-python/releases

It might be necessary to try specifying the repo in the setup.py install_requires options. I'm not sure though.

@mxmzdlv
Contributor

mxmzdlv commented Sep 28, 2018

Yeah, I've manually added changes from that pull request to test.

Ideally we should wait for the release, but if you need this ASAP you can try pointing to the google-cloud-python GitHub repo in setup.py and then running pip install https://github.com/marketdial/pybigquery/archive/master.zip in your project.

@blainehansen
Contributor Author

Alright, the BigQuery client code has been released!
https://github.com/googleapis/google-cloud-python/releases

@mxmzdlv mxmzdlv merged commit 404e89f into googleapis:master Oct 5, 2018
@mxmzdlv
Contributor

mxmzdlv commented Oct 5, 2018

Perfect! Pushed the new version 0.4.6.

@blainehansen
Contributor Author

🎊 🎊 🎊

Thanks!

Successfully merging this pull request may close these issues.

Add ability to specify dataset in connection string