adding support for datasets and query arguments in connection string #25
Conversation
This has only been tested with a basic case, a slightly modified version of the example in the readme:

```python
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *

engine = create_engine('bigquery://some_project/some_dataset')
table = Table('some_table', MetaData(bind=engine), autoload=True)

print(select([func.count('*')], from_obj=table).scalar())
```

This will successfully count the rows in the table.
I'm not sure what other modifications would need to be made, or if this could be achieved in a better way.
Thanks @blainehansen! I've tested the code and there are some issues with SQL generation, specifically when generating queries with GROUP BY, HAVING, or ORDER BY labels. With your approach table names don't include dataset ids, so the GROUP BY / HAVING / ORDER BY clauses should not include them either, but they are still included in the generated query. With our current approach SQLAlchemy simply appends the dataset id everywhere, and that works fine.
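To illustrate the mismatch (the function and identifier names below are made up, not part of pybigquery): if a dialect drops the dataset prefix from the table name in FROM, it has to drop it from every other generated reference too, not just some of them. A naive string-level sketch of that normalization:

```python
def strip_dataset_prefix(sql: str, dataset: str) -> str:
    """Remove a `dataset.` prefix from backticked identifiers.

    Hypothetical helper, shown only to illustrate that GROUP BY /
    HAVING / ORDER BY references must be rewritten consistently
    with the FROM clause; a real dialect does this at compile time,
    not by string replacement.
    """
    return sql.replace('`%s.' % dataset, '`')


raw = ('SELECT `ds.t`.`x`, count(`ds.t`.`y`) AS `count_1` '
       'FROM `ds.t` GROUP BY `ds.t`.`x`')
print(strip_dataset_prefix(raw, 'ds'))
# every reference loses the prefix together, so the query stays consistent
```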
Also, I've added some tests; some of the relevant ones you should look at are …
Another issue is that the underlying BigQuery client doesn't allow restricting query execution to a specified dataset, so raw queries like that will still execute against any dataset. Not sure if that is relevant in your case.
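A hedged sketch of an application-level workaround (not part of the library): scan raw SQL for backticked `dataset.table` references and refuse anything outside the allowed dataset. This is only a naive regex check, not a real SQL parser, so treat it as an illustration of the idea rather than a safe guard:

```python
import re


def assert_single_dataset(sql: str, allowed: str) -> None:
    """Raise if the raw SQL references a dataset other than `allowed`.

    Hypothetical guard: finds backticked `dataset.table` identifiers
    with a regex. A production version would need proper SQL parsing.
    """
    referenced = set(re.findall(r'`([\w-]+)\.\w+`', sql))
    unexpected = referenced - {allowed}
    if unexpected:
        raise ValueError('query touches other datasets: %s' % sorted(unexpected))


assert_single_dataset('SELECT * FROM `my_dataset.sample`', 'my_dataset')  # passes silently
```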
To your first comment: we're completely fine if the generated SQL looks like this:

```sql
SELECT `test_pybigquery.sample`.`string` AS `test_pybigquery_sample_string`,
       count(`test_pybigquery.sample`.`integer`) AS `count_1`
FROM `test_pybigquery.sample`
GROUP BY `test_pybigquery.sample`.`string`
```

How do I change the code to make that possible? I'm not sure where I can take the dataset from.

To your second comment: yeah, raw queries acting on random datasets (or just generally not having any awareness of the dataset restriction) is definitely a problem for us. But without changing the BigQuery client to restrict execution to a specific dataset, I'm not sure how that could be solved. We really want to avoid having to create separate projects for different tenants; that promises a lot of maintenance overhead.

Thanks! I'll start playing with the code and your tests to see what I can make happen.
I think you would need to modify … I haven't had much luck with the documentation either, so I generally just look into the source code.
The maintainers actually look interested in going in that direction in the pull request at googleapis/google-cloud-python#6088, so I think this will get much easier to achieve once you can pass a default dataset to the client.
I've got significant changes coming in soon that all depend on googleapis/google-cloud-python#6088 coming through. The tests won't pass until it does, except the ones in … I wanted your feedback on the docs changes and the API, and whether there's anything further to think about regarding how the dialect actually works.
…dataset and don't add dataset prefix
Awesome! I've added some additional tests and made some changes. Since this still won't prevent executing queries against other datasets when using the engine directly, I've added a note about that to the readme. I think this is good to merge as soon as the new version of google-cloud-python lands and we change setup.py to require it.
That landed about 16 hours ago, so it's possible that the tests you've run have already validated everything! Let me know if there's anything else I can do to help.
The changes from that pull request now show up in Google's docs. It might be necessary to try specifying the repo in the setup.py requirements.
Yeah, I've manually added the changes from that pull request to test. Ideally we should wait for the release, but if you need to use this ASAP you can try pointing to the google-cloud-python GitHub repo in setup.py and then do …
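For reference, a hedged sketch of installing an unreleased package straight from the GitHub monorepo (the URL fragment and `subdirectory` value are assumptions about the repo layout at the time; verify them before using):

```shell
# Install the pre-release BigQuery client from the monorepo's default branch.
# The #subdirectory= fragment is needed because google-cloud-python hosts
# many packages in a single repository.
pip install "git+https://github.com/googleapis/google-cloud-python.git#subdirectory=bigquery"
```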
Alright, the bigquery code has been released!
Perfect! Pushed the new version 0.4.6. |
🎊 🎊 🎊 Thanks! |
closes #24