Make partition metadata available to BigQuery users #2596
Conversation
Force-pushed from 550a5d1 to 8b5b156
Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement, and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA. In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin. CLA has not been signed by users: @ran-eh
hey @ran-eh - thanks for opening this PR! Are you able to sign the CLA attached above? Once that's done, we'd be happy to take a look at the code in here :D
Thanks @drewbanin. I hope to have it signed next week.
@drewbanin, did you receive the signed CLA? PayPal legal says they sent it on Tuesday.
@cla-bot check
The cla-bot has been summoned, and re-checked this pull request!
@drewbanin @jtcohen6 Awesome! Can't wait for your review!
@ran-eh Thank you for the contribution! And for wrangling the CLA :)
I left two comments around implementation specifics. More broadly:
- It looks like there are some pep8 (python style) errors. See the flake8 report in circle.
- You should add an integration test that executes `get_partitions_metadata` on a partitioned table (of known fixture data). Knowing the query succeeds is the minimum; even better to check that the columns and row count of the agate result match expectations. I think these tests could be a good starting point. (A rough sketch follows this list.)
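For concreteness, here is a rough sketch of the kind of test meant here. The `conn_manager` fixture, the fixture table name, and the expected partition count are all hypothetical; dbt's actual integration-test harness would wrap this differently:

```python
def test_get_partitions_metadata(conn_manager):
    # `conn_manager` is assumed to be a connected BigQueryConnectionManager,
    # and `fixture_dataset.partitioned_table` a day-partitioned table loaded
    # with fixture data spanning three known partitions.
    result = conn_manager.get_partitions_metadata(
        'fixture_dataset.partitioned_table'
    )

    # Legacy SQL's __PARTITIONS_SUMMARY__ meta-table returns one row per
    # partition; check both the schema and the row count of the agate table.
    assert 'partition_id' in result.column_names
    assert len(result.rows) == 3
```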
```python
# Copy-pasted from raw_execute(). This is done in order to discourage
# use of legacy SQL queries in dbt, except to obtain partition metadata.
# This method should be removed when partition metadata becomes available
# from standard SQL.
```
To get this working, I totally agree with the approach of duplicating a lot of code between `raw_execute` and `_raw_execute_legacy_sql`. The main benefit is that we avoid changing any existing code to support legacy SQL, which is good.

With the benefit of hindsight, it looks like we could avoid all code duplication by adding an optional argument to the default `raw_execute`:
```python
def raw_execute(self, sql, fetch=False, use_legacy_sql=False):
    conn = self.get_thread_connection()
    client = conn.handle

    logger.debug('On {}: {}', conn.name, sql)

    job_params = {'use_legacy_sql': use_legacy_sql}
```
Then, `get_partitions_metadata` below can just call:

```python
_, iterator = self.raw_execute(sql, fetch='fetch_result', use_legacy_sql=True)
```
The ability to run arbitrary legacy SQL would still be unavailable from the Jinja environment. The main qualm would be adding an additional argument to an existing method.
@beckjake I see this as a code style question and totally defer to you here.
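Putting the pieces together, `get_partitions_metadata` under this proposal might look like the sketch below. The `__PARTITIONS_SUMMARY__` meta-table is legacy SQL's source of partition metadata; treating `get_table_from_response` as the adapter's existing result-to-agate helper is an assumption here, not a claim about the PR's actual body:

```python
def get_partitions_metadata(self, table_id):
    # Partition metadata is only exposed through legacy SQL's
    # [table$__PARTITIONS_SUMMARY__] meta-table, hence the flag.
    sql = 'SELECT * FROM [{}$__PARTITIONS_SUMMARY__]'.format(table_id)
    _, iterator = self.raw_execute(
        sql, fetch='fetch_result', use_legacy_sql=True
    )
    return self.get_table_from_response(iterator)
```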
Yes, I agree with @jtcohen6 - the key here is that it's not jinja-accessible; internally it's fine if we expose legacy SQL.

The only thing I might do slightly differently is make `use_legacy_sql` a keyword-only argument by writing it as `def raw_execute(self, sql, fetch=False, *, use_legacy_sql=False):`.
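To illustrate what the bare `*` buys (a standalone toy example, not the adapter code):

```python
def raw_execute(sql, fetch=False, *, use_legacy_sql=False):
    return sql, fetch, use_legacy_sql

# Fine: the dialect flag has to be spelled out at the call site.
raw_execute('select 1', fetch=True, use_legacy_sql=True)

try:
    # A stray third positional argument can no longer silently flip
    # the query into legacy SQL; Python rejects it instead.
    raw_execute('select 1', True, True)
except TypeError as exc:
    print(exc)  # takes from 1 to 2 positional arguments but 3 were given
```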
```python
        return query_job, iterator

    def get_partitions_metadata(self, table_id):
```
IMO this method should take a relation as its argument, instead of a string (`table_id`). This change would mean that:
- We can construct the legacy SQL table reference using relation components, rather than relying on `from_string()` (see the sketch after this list)
- Users can pass a `ref()`, `source()`, or relation object to the Jinja macro directly. I expect the most common use case (incremental models) to call this as `get_partitions_metadata(this)`.
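As a rough sketch of the relation-based signature: legacy SQL references tables as `[project:dataset.table]`, and the body below builds that reference from relation components. It is illustrative only, not necessarily the merged implementation:

```python
def get_partitions_metadata(self, table):
    # Build the legacy SQL reference from the relation's components
    # instead of parsing a raw string with from_string().
    legacy_ref = '{}:{}.{}'.format(
        table.database, table.schema, table.identifier
    )
    sql = 'SELECT * FROM [{}$__PARTITIONS_SUMMARY__]'.format(legacy_ref)
    _, iterator = self.raw_execute(
        sql, fetch='fetch_result', use_legacy_sql=True
    )
    return self.get_table_from_response(iterator)
```

In a model, this could then be invoked as `{% set partitions = adapter.get_partitions_metadata(this) %}`.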
@ran-eh Have you had a chance to take another look at this? We're planning to cut a release candidate of v0.18.0 soon.
Force-pushed from 79aa622 to 82ddedd
Force-pushed from 82ddedd to cce5945
Huzzah, passing tests! Could you:
- Add tests using `get_partitions_metadata`
- Re-add the asterisk to `raw_execute`
@ran-eh Glad we got this over the finish line! Could you:
- Changelog: under the v0.19.0 section, add a note for this feature, and add yourself as a contributor
- Open a new issue laying out the performance gains / cost savings we could realize by using `adapter.get_partitions_metadata` in the "dynamic" `insert_overwrite` incremental materialization, instead of the current `select max(partition)` approach

Thanks for the contribution @ran-eh!