-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bq date partitioning #641
Bq date partitioning #641
Conversation
my gut reaction to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i left a few comments that need to be addressed
all_tables = [] | ||
for schema in schemas: | ||
dataset = cls.get_dataset(profile, schema, model_name) | ||
all_tables.extend(dataset.list_tables()) | ||
all_tables.extend(client.list_tables(dataset)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oof, is this the API change you were referencing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah :/
dbt/adapters/bigquery.py
Outdated
relation_object.delete() | ||
client.delete_table(relation_object) | ||
|
||
cls.release_connection(profile, model_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think it's correct to release the connection here -- what if you drop the table first, and then create it? i guess for bigquery it makes no difference, but better to exclude it if extraneous
dbt/adapters/bigquery.py
Outdated
res = cls.fetch_query_results(query) | ||
res = list(iterator) | ||
|
||
cls.release_connection(profile, model_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as on drop re: releasing connection
dbt/adapters/bigquery.py
Outdated
dataset.create() | ||
client.create_dataset(dataset) | ||
|
||
cls.release_connection(profile, model_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
dbt/adapters/bigquery.py
Outdated
for table in client.list_tables(dataset): | ||
client.delete_table(table.reference) | ||
|
||
cls.release_connection(profile, name=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
{% for i in range(0, day_count + 1) %} | ||
{% set the_day = (modules.datetime.timedelta(days=i) + start_date).strftime('%Y%m%d') %} | ||
{% if verbose %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
The provided partition date '{{ date_str }}' does not match the expected format '{{ date_fmt }}' | ||
{%- endset %} | ||
|
||
{% set res = try_or_compiler_error(error_msg, modules.datetime.datetime.strptime, date_str.strip(), date_fmt) %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really clever haha
@@ -173,6 +173,7 @@ def setUp(self): | |||
|
|||
# it's important to use a different connection handle here so | |||
# we don't look into an incomplete transaction | |||
adapter.cleanup_connections() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this obviates the need for release_connection everywhere else
@cmcarthur my first cut of this used I like the idea of using
^this would run for 20180131 (good) and then 20180132 (bad), so we'd need to implement our own sort of date_range function I think. I think you're right though -- it's unusual that So:
Then in your model:
and then a macro which looks like:
So the CLI interface is the same, but the macro interface can work for a single date, a date range, or a smattering of random dates. Let me know what you think about this kind of approach |
I like this approach a lot. We can write up docs on how to set your exact CLI syntax up for a project, but if someone really wanted to implement the other use case, they could do it themselves via:
|
Ok, i'll do that |
This is now implemented such that the
The type checking is pretty loosey goosey here, so you can give a list of strings, list of ints, a single string, a single int, etc. These dates must be provided in BigQuery date format, ie. an 8-character series of digits. To generate this list of dates, users can use the
This
coupled with
Users can further extend these macros to simplify patterns which they use frequently. |
lgtm! |
* first cut of date partitioning * cleanup, implement partitioning in materialization * update requirements.txt * wip for date partitioning with range * log data * arg handling, logging, cleanup + view compat for new bq version * add partitioning tests, compatibility with bq 0.29.0 release * pep8 * fix for strange error in appveyor * debug appveyor... * dumb * debugging weird bq adapter use in pg test * do not use read_project in bq tests * cleanup connections, initialize bq tests * remove debug lines * fix integration tests (actually) * warning for view creation which clobbers tables * add query timeout example for bq * no need to release connections in the adapter * partition_date interface change (wip) * list of dates for bq dp tables * tiny fixes for crufty dbt_project.yml files * rm debug line * fix tests automatic commit by git-black, original commits: 4eb75ec
* first cut of date partitioning * cleanup, implement partitioning in materialization * update requirements.txt * wip for date partitioning with range * log data * arg handling, logging, cleanup + view compat for new bq version * add partitioning tests, compatibility with bq 0.29.0 release * pep8 * fix for strange error in appveyor * debug appveyor... * dumb * debugging weird bq adapter use in pg test * do not use read_project in bq tests * cleanup connections, initialize bq tests * remove debug lines * fix integration tests (actually) * warning for view creation which clobbers tables * add query timeout example for bq * no need to release connections in the adapter * partition_date interface change (wip) * list of dates for bq dp tables * tiny fixes for crufty dbt_project.yml files * rm debug line * fix tests automatic commit by git-black, original commits: 4eb75ec a37374d
This branch adds support for date partitioning on BigQuery in dbt.
TODO:
Usage
Run for a single day, hard-coded
Simple usage, specify a single date manually:
Run for a range of days
More complex usage, specify a range of dates. The
date_sharded_table
macro will interpolate the 8-digit date for each of the days between January 1st and January 10th, inclusive. In this way, the resulting date partitioned table will have 10 partitions, one for each day, built from the corresponding date-shardedevents_[YYYYMMDD]
tables.Dynamically specify partition date(s)
Use a variable instead of hardcoding a single date. The variable defaults to "yesterday" if
partition_date
is not provided.This branch is intended to be used in conjunction with #640 to supply variables to date partitioned tables on the command line.
Additional configuration
A full list of configuration options for date partitioned tables is shown below:
partition_date_format
: The date format (using strptime/stftime conventions) with which to parse thepartition_date
field. Default:%Y%m%d
verbose
: If set toTrue
, dbt will output one log line for each date partition created during the invocation of the date partitioned model. Default:False