Feature/bq incremental and archive #856

Merged: 12 commits merged from feature/bq-incremental-and-archive into development on Jul 19, 2018

Conversation

@drewbanin (Contributor):

Fixes #712 by adding support for archival and incremental models on BigQuery.

Incremental models are implemented with merge statements, whereas archival just uses insert and update statements. Future work should be done to make use of the merge statement for archival on BigQuery.
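
For concreteness, the merge pattern described above compiles to something roughly like the following on BigQuery (a minimal sketch; the table, key, and column names are illustrative):

    merge into `project`.`dataset`.`my_model` as dest
    using (
        -- the model's select, scoped to new or updated rows
        select * from `project`.`dataset`.`my_model__dbt_tmp`
    ) as src
    on src.id = dest.id
    when matched then update set
        dest.col_a = src.col_a
    when not matched then insert
        (id, col_a)
    values
        (id, col_a)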

In pursuit of archival, some new functionality needed to be added to the BigQuery adapter:

  1. create temporary tables
  2. programmatically add columns to a table
  3. translate BigQuery column type labels into standard SQL type names (e.g. FLOAT --> FLOAT64; see the sketch after this list)
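
A minimal sketch of the label translation, assuming the Column class keeps the mapping in a TYPE_LABELS table (the entries shown here are an illustrative subset):

    TYPE_LABELS = {
        'STRING': 'STRING',
        'FLOAT': 'FLOAT64',
        'INTEGER': 'INT64',
        'BOOLEAN': 'BOOL',
    }

    @classmethod
    def translate_type(cls, dtype):
        # fall through to the input when no translation is needed
        return cls.TYPE_LABELS.get(dtype.upper(), dtype)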

@beckjake (Contributor) left a comment:

Very cool, this looks like it was pretty tough! I have some minor feedback and questions but nothing big.

# mirrors the implementation of list_relations for other adapters

try:
    all_tables = list(all_tables)
@beckjake (Contributor):

Have you considered consuming the iterator here by moving the list comprehension up? As it is, this will make two copies of each table (one BigQuery API result and one Relation). If we're actually pulling up to 100k entries, doubling that could be a lot!
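
A sketch of the suggestion, assuming the surrounding code resembles the snippet above and that bq_table_to_relation is the helper that wraps an API result in a Relation:

    # before: list() materializes the API iterator, so the raw results and
    # the Relations built from them are both held in memory at once
    try:
        all_tables = list(all_tables)
    except google.api_core.exceptions.NotFound:
        all_tables = []
    return [cls.bq_table_to_relation(table) for table in all_tables]

    # after: the comprehension consumes the iterator directly, so only the
    # Relation objects are ever held in memory
    try:
        return [cls.bq_table_to_relation(table) for table in all_tables]
    except google.api_core.exceptions.NotFound:
        return []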

@drewbanin (Contributor, Author):

I think 100k would be super excessive! But agree, happy to slide the comprehension up into the try/catch

@@ -18,12 +75,12 @@

select
{% for col in adapter.get_columns_in_table(source_relation.schema, source_relation.identifier) %}
@beckjake (Contributor):

shouldn't this just be for col in cols?

@drewbanin (Contributor, Author):

yes, 100%! Good catch

{%- set full_refresh_mode = (flags.FULL_REFRESH == True) -%}

{% if non_destructive_mode %}
{{ log("--non-destructive is not supported on BigQuery, and will be ignored", info=True) }}
@beckjake (Contributor):

It seems to me like this should fail with an error rather than just logging and ignoring it.
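
A sketch of that alternative, assuming exceptions.raise_compiler_error is available in the macro context here:

    {% if non_destructive_mode %}
      {{ exceptions.raise_compiler_error("--non-destructive is not supported on BigQuery") }}
    {% endif %}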

dbt/schema.py (outdated)
return cls.TYPE_LABELS.get(dtype.upper(), dtype)

@classmethod
def create(cls, name, label=None, dtype=None):
@beckjake (Contributor):

Based on how translate_type works, it looks like you could just pass one label_or_dtype argument and pass it to translate_type unconditionally, no need for two arguments.
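
A sketch of the single-argument version, assuming the Column constructor takes a name and a dtype:

    @classmethod
    def create(cls, name, label_or_dtype):
        # translate_type() maps BigQuery type labels (e.g. FLOAT) onto
        # standard SQL names (e.g. FLOAT64) and passes anything else
        # through unchanged, so calling it unconditionally is safe
        return cls(name, cls.translate_type(label_or_dtype))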

self.use_default_project({"data-paths": [self.dir("seed-initial")]})

self.run_dbt(["seed"])
self.run_dbt()
@beckjake (Contributor):

These tests should probably check to make sure the right number of things are being run (see #854)
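
One way those assertions could look, assuming run_dbt returns one result per executed node (the expected counts below are placeholders for whatever the test fixture actually contains):

    results = self.run_dbt(["seed"])
    self.assertEqual(len(results), 1)  # exactly one seed should load

    results = self.run_dbt()
    self.assertEqual(len(results), 1)  # exactly one model should run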

@drewbanin (Contributor, Author):

@beckjake just updated this with your review feedback

@@ -51,16 +51,14 @@
 {#
     Cross-db compatible archival implementation
 #}
-{% macro archive_select(source_relation, target_relation, unique_key, updated_at) %}
+{% macro archive_select(source_relation, target_relation, source_columns, unique_key, updated_at) %}
@beckjake (Contributor):

nice, this is even better

@cmcarthur (Member) left a comment:

@drew this is fantastic. this branch cleans up a lot of rough edges in the bigquery adapter and Column class, and then the actual code to implement incremental and archive looks very familiar. really great work.

{{ adapter_macro('get_merge_sql', target, source, unique_key, dest_columns) }}
{%- endmacro %}

{% macro default__get_merge_sql(target, source, unique_key, dest_columns) -%}
@cmcarthur (Member):

am i correct in assuming that this works for snowflake and bigquery, but not postgres and redshift? can you add adapter macros for postgres and redshift to raise an exception if this is used?

@drewbanin (Contributor, Author):

@cmcarthur long-term, my plan is to implement a version of merge for Redshift and Postgres.

The merge macro will serve as an abstraction that should work across all adapters, while it might compile to something like an insert and update statement on Redshift and Postgres. I really like the idea of making the core materialization logic identical across all adapters, and just calling out to an adapter-specific version of the merge macro.

I can definitely add an exception for pg/redshift for now though
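
For illustration, the stub macros could look something like this, assuming adapter_macro dispatches on the adapter-name prefix (as default__get_merge_sql suggests) and that exceptions.raise_compiler_error is available in the macro context:

    {% macro postgres__get_merge_sql(target, source, unique_key, dest_columns) -%}
      {{ exceptions.raise_compiler_error('merge is not supported on Postgres') }}
    {%- endmacro %}

    {% macro redshift__get_merge_sql(target, source, unique_key, dest_columns) -%}
      {{ exceptions.raise_compiler_error('merge is not supported on Redshift') }}
    {%- endmacro %}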

{% macro default__create_schema(schema_name) %}
{% call statement() %}
create schema if not exists {{ schema_name }};
{% endcall %}
@cmcarthur (Member):

get rid of this create_schema macro, use adapter.create_schema instead
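
The call site would then reduce to something like this (assuming the do extension is enabled in dbt's Jinja environment and schema_name is in scope as in the macro above):

    {% do adapter.create_schema(schema_name) %}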

@drewbanin (Contributor, Author):

🌞 ☁️ ☁️

🌊 🚢 🌊

shippin it

@drewbanin merged commit 574d859 into development on Jul 19, 2018.
@drewbanin deleted the feature/bq-incremental-and-archive branch on July 19, 2018 at 02:26.