Hourly, monthly and yearly partitions in BigQuery #2903

Merged
merged 13 commits into from
Nov 30, 2020

db-magnus
Contributor

resolves #2476

Description

Added support for more partition granularities (hourly, monthly and yearly) on timestamp or datetime columns.

This uses the granularity field, as discussed in the issue, to specify how to partition. I had to change some existing tests to support the new field.

I've added some tests in , let me know if this is ok or if I should do it differently.
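
To make the role of the granularity field concrete, here is a minimal sketch of how a partition expression could be derived from a `partition_by` dict. The helper name `render_partition_expr`, its defaults, and its validation rules are assumptions made for illustration; they are not taken from this PR's code.

```python
from typing import Dict

# Hypothetical helper: the name, defaults, and checks are illustrative
# assumptions, not dbt's actual implementation.
VALID_GRANULARITIES = {"hour", "day", "month", "year"}


def render_partition_expr(partition_by: Dict[str, str]) -> str:
    """Build a BigQuery-style partition expression from a partition_by dict."""
    field = partition_by["field"]
    data_type = partition_by.get("data_type", "date").lower()
    granularity = partition_by.get("granularity", "day").lower()

    if granularity not in VALID_GRANULARITIES:
        raise ValueError(f"unsupported granularity: {granularity}")
    if data_type == "date" and granularity == "hour":
        raise ValueError("DATE columns cannot be partitioned by hour")

    # A DATE column partitioned by day can be used directly.
    if data_type == "date" and granularity == "day":
        return field
    # Otherwise truncate the column to the requested granularity.
    return f"{data_type}_trunc({field}, {granularity})"


print(render_partition_expr(
    {"field": "updated_at", "data_type": "datetime", "granularity": "hour"}
))
```

Under these assumptions, a datetime column with hourly granularity renders as `datetime_trunc(updated_at, hour)`, while a plain DATE column with the default granularity is used as-is.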

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Nov 19, 2020
@VasiliiSurov
Contributor

Hi @db-magnus ,
I just picked up this task yesterday, so let me ask you for a few improvements.

  1. Columns of type DATE can now be partitioned by day, month, and year.
  2. A granularity comparison has to be added to "_partitions_match"; otherwise an attempt to change the partition granularity on the same column fails.
  3. I was also thinking of adding a "partition_expiration_days" option, which could be useful as well.

https://github.com/fishtown-analytics/dbt/blob/2c8d1b5b8c60f2dbed90265a89b5b590a5f5e62b/plugins/bigquery/dbt/adapters/bigquery/impl.py#L551-L555

    if not is_partitioned and not conf_partition:
        return True
    elif conf_partition and table.time_partitioning is not None:
        table_field = table.time_partitioning.field
        table_granularity = table.partitioning_type
        return table_field == conf_partition.field \
            and table_granularity == conf_partition.granularity
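
The effect of the extra comparison can be sketched in isolation. The classes below are stand-ins invented for this example (they are not the BigQuery client or dbt types), and the case-insensitive comparison is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-in types for illustration only; the real objects come from the
# BigQuery client library and from dbt's partition config.
@dataclass
class FakeTimePartitioning:
    field: str
    type_: str  # e.g. "HOUR", "DAY", "MONTH", "YEAR"


@dataclass
class FakePartitionConfig:
    field: str
    granularity: str = "day"


def partitions_match(
    table_partitioning: Optional[FakeTimePartitioning],
    conf_partition: Optional[FakePartitionConfig],
) -> bool:
    """Return True only if both the column and the granularity agree."""
    if table_partitioning is None and conf_partition is None:
        return True
    if table_partitioning is None or conf_partition is None:
        return False
    # Case-insensitive comparison is an assumption of this sketch.
    return (
        table_partitioning.field.lower() == conf_partition.field.lower()
        and table_partitioning.type_.lower()
        == conf_partition.granularity.lower()
    )


existing = FakeTimePartitioning(field="updated_at", type_="DAY")
same = FakePartitionConfig(field="updated_at", granularity="day")
changed = FakePartitionConfig(field="updated_at", granularity="hour")

print(partitions_match(existing, same))
print(partitions_match(existing, changed))
```

With only the field compared, `existing` and `changed` would look identical; comparing granularity as well lets the mismatch be detected so the caller can react to it.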

@jtcohen6
Contributor

@db-magnus This is really cool, thanks for picking it up! I just reran the failing Redshift test, which was an aberration.

@VasiliiSurov Nice point on _partitions_match. I'm less clear on partition_expiration_days. Where do you see that going in dbt? As a config applied to all partitions in a model? What about incremental models, which may create/update partitions at different times?

@VasiliiSurov
Contributor

@jtcohen6 I see this as another optional configuration, just like hours_to_expiration. And yes, it applies to all time partitions in the table: it lets you keep partitions for as long as you want, and BigQuery automatically drops expired partitions for you.
It can be useful, for example, if you need only the last N years of data in an incremental model: new partitions are added, and there is no need to drop old ones manually.

    {{ config(
        materialized = 'table',
        hours_to_expiration = 30*24,
        partition_by = {
            'field': 'updated_at',
            'data_type': 'datetime',
            'granularity': 'hour'
        },
        partition_expiration_days = 2/24
    ) }}
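
For a sense of scale, the fractional value in the config above works out to two hours. The sketch below converts a day count to milliseconds, assuming the value would eventually be handed to BigQuery in that unit (the client library's `TimePartitioning` exposes an `expiration_ms` field). The helper name `expiration_days_to_ms` is hypothetical and not part of the proposed change.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def expiration_days_to_ms(partition_expiration_days: float) -> int:
    """Convert a (possibly fractional) number of days to milliseconds."""
    # Hypothetical helper for illustration; not part of dbt.
    if partition_expiration_days <= 0:
        raise ValueError("partition_expiration_days must be positive")
    # round() absorbs floating-point error from fractions such as 2/24.
    return round(partition_expiration_days * MS_PER_DAY)


print(expiration_days_to_ms(2 / 24))  # 7200000 ms, i.e. two hours
```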

https://github.com/VasiliiSurov/dbt/blob/9032467e6e0b513d295cd9c738265ba2e1f99013/plugins/bigquery/dbt/adapters/bigquery/impl.py#L50-L68

https://github.com/VasiliiSurov/dbt/blob/9032467e6e0b513d295cd9c738265ba2e1f99013/plugins/bigquery/dbt/adapters/bigquery/impl.py#L108-L117

https://github.com/VasiliiSurov/dbt/blob/9032467e6e0b513d295cd9c738265ba2e1f99013/plugins/bigquery/dbt/adapters/bigquery/impl.py#L770-L800

@db-magnus
Contributor Author

@VasiliiSurov Thanks for the feedback. I've added day partitioning and the granularity comparison in _partitions_match.

Contributor

@jtcohen6 jtcohen6 left a comment

@db-magnus I tested this out locally, and it is looking solid. I left a few comments.

Could you also add a changelog note (linking #2476 and #2903, under v0.19.0 "Features"), and add yourself to the list of contributors?

@VasiliiSurov Could you open a new issue to continue the discussion around partition expiration? I'd rather keep this PR narrow in scope. I also feel I need to better understand the implications of partition-based time-to-live, especially in the context of incremental models that only update some of their partitions during standard runs.

Contributor

@jtcohen6 jtcohen6 left a comment

This is looking good. Once we can get the tests running, I'm excited to merge it in. Thanks for opening dbt-labs/docs.getdbt.com#472 as well, I'll take a look.

Two review threads on test/unit/test_bigquery_adapter.py (outdated, resolved).
db-magnus and others added 2 commits November 30, 2020 00:44
Co-authored-by: Jeremy Cohen <jtcohen6@gmail.com>
Contributor

@jtcohen6 jtcohen6 left a comment

Thanks for the contribution @db-magnus!

@jtcohen6 jtcohen6 merged commit 5ba5271 into dbt-labs:dev/kiyoshi-kuromiya Nov 30, 2020
Successfully merging this pull request may close these issues.

Add support for BigQuery hourly partitioned tables
3 participants