Make partition metadata available to BigQuery users #2596

Merged
merged 8 commits into from
Oct 23, 2020

ran-eh
Contributor

@ran-eh ran-eh commented Jun 27, 2020

No description provided.

@ran-eh ran-eh force-pushed the re-partition-metadata branch 4 times, most recently from 550a5d1 to 8b5b156 Compare June 27, 2020 02:34
@cla-bot

cla-bot bot commented Jun 27, 2020

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @ran-eh

@dbt-labs dbt-labs deleted a comment from cla-bot bot Jul 1, 2020
@drewbanin
Contributor

hey @ran-eh - thanks for opening this PR! Are you able to sign the CLA attached above? Once that's done, we'd be happy to take a look at the code in here :D

@ran-eh
Contributor Author

ran-eh commented Jul 3, 2020

Thanks @drewbanin . I hope to have it signed next week.

@jtcohen6 jtcohen6 linked an issue Jul 7, 2020 that may be closed by this pull request
@ran-eh
Contributor Author

ran-eh commented Jul 18, 2020

@drewbanin, did you receive the signed CLA? Paypal legal say they sent it on Tuesday.

@jtcohen6
Contributor

@cla-bot check

@cla-bot

cla-bot bot commented Jul 18, 2020

The cla-bot has been summoned, and re-checked this pull request!

@cla-bot cla-bot bot added the cla:yes label Jul 18, 2020
@ran-eh
Contributor Author

ran-eh commented Jul 18, 2020

@drewbanin @jtcohen6 Awesome! Can't wait for your review!

Contributor

@jtcohen6 jtcohen6 left a comment

@ran-eh Thank you for the contribution! And for wrangling the CLA :)

I left two comments around implementation specifics. More broadly:

  • It looks like there are some pep8 (python style) errors. See the flake8 report in circle.
  • You should add an integration test that executes get_partitions_metadata on a partitioned table (of known fixture data). Knowing the query succeeds is the minimum; even better to check that the columns and row count of the agate result match expectations. I think these tests could be a good starting point.
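
The shape check described in the second bullet could look roughly like the sketch below. It is not dbt's test harness: `check_partitions_result` and `FakeTable` are hypothetical names, and the column names in the sample are assumptions about the metadata. The helper reads only `column_names` and `rows`, both of which an agate `Table` exposes, so a real result could be passed in place of the stand-in.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the two attributes the check reads.
FakeTable = namedtuple('FakeTable', ['column_names', 'rows'])


def check_partitions_result(table, expected_columns, expected_row_count):
    """Return a list of mismatch descriptions; an empty list means a match."""
    problems = []
    missing = [c for c in expected_columns if c not in table.column_names]
    if missing:
        problems.append(f'missing columns: {missing}')
    if len(table.rows) != expected_row_count:
        problems.append(
            f'expected {expected_row_count} rows, got {len(table.rows)}'
        )
    return problems


# Sample fixture: two partitions, with assumed column names.
sample = FakeTable(
    column_names=('partition_id', 'last_modified_time'),
    rows=[('20200101', 1), ('20200102', 2)],
)
problems = check_partitions_result(sample, ['partition_id'], 2)
```

Returning the list of problems, rather than asserting inside the helper, lets the test report every mismatch at once.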

# Copy-pasted from raw_execute(). This is done in order to discourage
# use of legacySQL queries in DBT, except to obtain partition metadata.
# This method will be removed when partition metadata becomes available
# from standardSQL.

To get this working, I totally agree with the approach of duplicating a lot of code between raw_execute and _raw_execute_legacy_sql. The main benefit is that we avoid changing any existing code to support legacy SQL, which is good.

With the benefit of hindsight, it looks like we could avoid all code duplication by adding an optional argument to the default raw_execute:

    def raw_execute(self, sql, fetch=False, use_legacy_sql=False):
        conn = self.get_thread_connection()
        client = conn.handle

        logger.debug('On {}: {}', conn.name, sql)

        job_params = {'use_legacy_sql': use_legacy_sql}

Then, get_partitions_metadata below can just call:

    _, iterator = self.raw_execute(sql, fetch='fetch_result', use_legacy_sql=True)

The ability to run arbitrary legacy SQL would still be unavailable from the Jinja environment. The main qualm would be an additional argument for an existing

@beckjake I see this as a code style question and totally defer to you here.


Yes, I agree with @jtcohen6 - the key here is that it's not Jinja-accessible. Internally, it's fine if we expose legacy SQL.

The only thing I might do slightly differently is make use_legacy_sql a keyword-only argument by writing it as def raw_execute(self, sql, fetch=False, *, use_legacy_sql=False):.
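
To illustrate the keyword-only suggestion, here is a minimal sketch using a stand-in function. This is not dbt's actual `raw_execute`: the body is a placeholder that only echoes its arguments. The bare `*` means `use_legacy_sql` can only be passed by name, so a stray third positional argument fails loudly instead of silently enabling legacy SQL.

```python
def raw_execute(sql, fetch=False, *, use_legacy_sql=False):
    """Stand-in for the adapter method; returns the parameters it received."""
    return {'sql': sql, 'fetch': fetch, 'use_legacy_sql': use_legacy_sql}


# Passing the flag by name works as intended.
params = raw_execute('select 1', fetch=True, use_legacy_sql=True)

# Passing it positionally is rejected with a TypeError.
try:
    raw_execute('select 1', True, True)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```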


return query_job, iterator

def get_partitions_metadata(self, table_id):

IMO this method should take a relation as its argument, instead of a string (table_id). This change would mean that:

  • We can construct the legacy SQL table reference using relation components, rather than relying on from_string()
  • Users can pass a ref(), source(), or relation object to the Jinja macro directly. I expect the most common use case (incremental models) to call this as get_partitions_metadata(this).
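
A rough sketch of the first point: building the legacy SQL table reference from relation components. The `Relation` dataclass and `partitions_metadata_sql` are hypothetical stand-ins, not dbt's classes; the sketch assumes a relation exposes `database` (the BigQuery project), `schema` (the dataset), and `identifier` (the table), and that the metadata is read from BigQuery's legacy SQL `$__PARTITIONS_SUMMARY__` meta-table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical stand-in for a dbt relation object."""
    database: str    # BigQuery project
    schema: str      # BigQuery dataset
    identifier: str  # table name


def partitions_metadata_sql(relation: Relation) -> str:
    """Build a legacy SQL query against the table's partition summary."""
    # Legacy SQL references tables as [project:dataset.table].
    legacy_ref = f'{relation.database}:{relation.schema}.{relation.identifier}'
    return f'SELECT * FROM [{legacy_ref}$__PARTITIONS_SUMMARY__]'


sql = partitions_metadata_sql(Relation('my-project', 'analytics', 'events'))
```

Because the reference is assembled from separate components, no string parsing of a `table_id` is needed.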

@jtcohen6
Contributor

jtcohen6 commented Aug 14, 2020

@ran-eh Have you had a chance to take another look at this? We're planning to cut a release candidate of v0.18.0 soon.

@ran-eh ran-eh force-pushed the re-partition-metadata branch from 79aa622 to 82ddedd Compare October 11, 2020 00:40
@ran-eh ran-eh force-pushed the re-partition-metadata branch from 82ddedd to cce5945 Compare October 11, 2020 00:44
@ran-eh ran-eh changed the base branch from dev/marian-anderson to dev/kiyoshi-kuromiya October 11, 2020 01:00
@jtcohen6
Contributor

Huzzah, passing tests! Could you:

  • Revert the change in 78bd7c9. I don't think it was the cause of the failing integration test.
  • Try adding an integration test, modeled off these, that runs get_partitions_metadata against a partitioned table and checks the length of the results. I'm happy to help with this piece + getting integration tests running locally

* Add tests using get_partitions_metadata

* Readd asterisk to raw_execute
Contributor

@jtcohen6 jtcohen6 left a comment

@ran-eh Glad we got this over the finish line! Could you:

  • Changelog: under the v0.19.0 section, add a note for this feature, and add yourself as a contributor
  • New issue: open one laying out the performance gains / cost savings we could realize by using adapter.get_partitions_metadata in the "dynamic" insert_overwrite incremental materialization, instead of the current select max(partition)
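
As a sketch of the idea in the second bullet, the latest partition could be derived from the metadata rows rather than by scanning the table. Everything here is an assumption for illustration: `latest_partition_id` is a hypothetical helper, rows are modeled as plain dicts with a `partition_id` key, partitions are assumed to be daily with `YYYYMMDD` ids, and `__NULL__` and `__UNPARTITIONED__` are treated as special ids to skip. It is not how the materialization currently works.

```python
# Partition ids that do not correspond to a real date partition (assumed).
SPECIAL_PARTITION_IDS = {'__NULL__', '__UNPARTITIONED__'}


def latest_partition_id(metadata_rows):
    """Return the greatest real partition_id, or None if there is none."""
    ids = [
        row['partition_id']
        for row in metadata_rows
        if row['partition_id'] not in SPECIAL_PARTITION_IDS
    ]
    # Fixed-width YYYYMMDD strings sort in chronological order.
    return max(ids, default=None)


rows = [
    {'partition_id': '20201001'},
    {'partition_id': '__NULL__'},
    {'partition_id': '20201011'},
]
latest = latest_partition_id(rows)
```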

@jtcohen6
Contributor

Thanks for the contribution @ran-eh!

Successfully merging this pull request may close these issues.

Support bq legacySQL queries, to access partition metadata
4 participants