(#525) drop existing relation at end of full-refresh incremental build #1682

Merged: 4 commits merged into dev/louisa-may-alcott on Oct 15, 2019

Conversation

@drewbanin (Contributor) commented Aug 14, 2019

fixes #525

Primary goal: minimize downtime for incremental models run in full-refresh mode
Secondary goal: encapsulate incremental upsert logic across adapters so it can be repurposed in higher-order macros

@cla-bot cla-bot bot added the cla:yes label Aug 14, 2019
@drewbanin drewbanin marked this pull request as ready for review August 21, 2019 13:52
@drewbanin drewbanin requested review from beckjake and removed request for beckjake August 21, 2019 13:52
@drewbanin (Contributor, Author):

Moving this issue to the LMA milestone. I think more work is required here to fix this same issue on BigQuery and Snowflake.

@beckjake (Contributor):

> Moving this issue to the LMA milestone. I think more work is required here to fix this same issue on BigQuery and Snowflake.

Okay! I've approved it as-is because I do think this PR is both a step in the right direction and very reasonable.

@darrenhaken (Contributor) commented Sep 13, 2019

@drewbanin is there any support you need on this? It is the killer feature

@drewbanin (Contributor, Author):

hey @darrenhaken - I'd love your help testing when we have some working code here!

@darrenhaken (Contributor):

Of course, we can do BQ testing. Having highly available datasets is a critical feature for us 🙂

@drewbanin (Contributor, Author):

strong agree @darrenhaken - this particular issue is long overdue!

@drewbanin drewbanin changed the base branch from dev/0.14.1 to dev/louisa-may-alcott September 16, 2019 21:42
@drewbanin (Contributor, Author):

@darrenhaken I might do a little more cleanup/refactoring work here, but the --full-refresh atomic replace logic has been implemented for BigQuery in this PR. Feel free to give it a spin and let us know how it goes!

@darrenhaken (Contributor):

This is awesome! I'll see if I can do some testing this week.

@darrenhaken (Contributor):

@whittid4 FYI

@darrenhaken (Contributor):

@drewbanin how does it work with BigQuery? i.e. does it create a temp table first?

@drewbanin (Contributor, Author) commented Sep 18, 2019

@darrenhaken for a full-refresh build of an incremental model:

  1. If the target table doesn't exist: `create or replace table ... as ...`
  2. If the target table does exist and is a table: `create or replace table ... as ...`
  3. If the target table does exist and is a view: `drop view ...`, then `create or replace table ... as ...`

so, this is still not atomic for a view --> incremental materialization switch, but I'm unsure that BQ provides any mechanisms for swapping a view for a table atomically.

Edit: The logic is here and it's actually pretty readable as far as materialization code goes :)
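
For illustration, here is a minimal sketch of that branching in dbt-style Jinja. The macro name and the use of `load_relation` are illustrative only, not necessarily the exact code in this PR:

```sql
{% macro bq_full_refresh_sql(target_relation, sql) %}
  {#- Sketch only: mirrors the three cases listed above. -#}
  {%- set existing = load_relation(target_relation) -%}

  {%- if existing is not none and existing.is_view -%}
    {#- Case 3: BigQuery can't replace a view with a table in one statement, so drop the view first. -#}
    drop view if exists {{ target_relation }};
  {%- endif -%}

  {#- Cases 1 and 2: the target is absent or already a table, so create or replace is atomic. -#}
  create or replace table {{ target_relation }} as (
    {{ sql }}
  )
{% endmacro %}
```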

@drewbanin drewbanin force-pushed the fix/minimize-incremental-downtime branch from 0fe4da8 to 95a0587 Compare October 15, 2019 02:54
@elexisvenator (Contributor):

I'm assuming the answer is because of restrictions in dbt/BQ/Snowflake that I'm not familiar with, but why isn't the workflow to just call the table materialization when a full refresh is required? Is there a need to reimplement the logic in the incremental materialization as well?

@drewbanin (Contributor, Author):

@elexisvenator that's the big idea! I want to accomplish the approach you're describing by making the table materialization just call a macro like replace_table(...) which can also be called from incremental models in full-refresh mode.

The Python code in dbt doesn't really know about the materializations that exist in dbt. This is a really neat feature -- it means that each plugin provided for dbt is able to define its own implementation for tables/incrementals/full refreshes/etc. I'd rather push that logic into the materialization layer (and provide good abstractions that can be shared across materializations) than encode this type of information in the dbt Python code.

That's just to say: your instinct is a good one, and I want to accomplish it with good abstractions in jinja instead of hard-assumptions in Python!
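
Sketched out, the shape of that abstraction might look something like the following. `replace_table` is the hypothetical shared macro named above, and the surrounding materialization snippets are illustrative rather than dbt's actual implementation:

```sql
{# Hypothetical shared macro: rebuild target_relation from the model's compiled SQL #}
{% macro replace_table(target_relation, sql) %}
  create or replace table {{ target_relation }} as (
    {{ sql }}
  )
{% endmacro %}

{# In the table materialization: always a full rebuild #}
{% call statement('main') %}
  {{ replace_table(target_relation, sql) }}
{% endcall %}

{# In the incremental materialization: reuse the same macro when --full-refresh is passed #}
{% if flags.FULL_REFRESH %}
  {% call statement('main') %}
    {{ replace_table(target_relation, sql) }}
  {% endcall %}
{% else %}
  {# ... normal incremental upsert logic ... #}
{% endif %}
```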

@drewbanin (Contributor, Author):

@beckjake can you give this another quick look? I think it's ready to roll now.

The big change I made since last time involves `create or replace table` statements on BigQuery. BigQuery does not allow partitioning/clustering table configs to change in `create or replace table` statements. This meant that `--full-refresh` didn't work in the initial implementation when `partition_by` or `cluster_by` were changed in the model code.

We should apply similar logic to the table materialization, but I figured that was out of scope for this issue. I'll create a separate issue to address it.
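
Roughly, the guard might look like this in Jinja. Here `layout_unchanged` is a stand-in name for whatever adapter-side check the PR actually adds, and the drop-then-recreate fallback is one way to handle a changed layout, at the cost of atomicity:

```sql
{#- Sketch only: layout_unchanged is a stand-in name, not a real adapter method. -#}
{%- set layout_unchanged = adapter.layout_unchanged(old_relation, partition_by, cluster_by) -%}

{%- if not layout_unchanged -%}
  {#- BigQuery rejects create or replace when partitioning/clustering differ, so drop the old table first. -#}
  drop table if exists {{ target_relation }};
{%- endif -%}

{#- cluster_by is assumed here to already be a list of column names -#}
create or replace table {{ target_relation }}
  {%- if partition_by %} partition by {{ partition_by }}{% endif %}
  {%- if cluster_by %} cluster by {{ cluster_by | join(', ') }}{% endif %}
as (
  {{ sql }}
)
```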

@beckjake (Contributor) left a review:

Looks good, I have a couple questions but nothing significant.

from {{ target_relation }}
where ({{ unique_key }}) in (
select ({{ unique_key }})
from {{ tmp_relation.include(schema=False, database=False) }}
@beckjake (Contributor):

Is this include() necessary/even correct for all databases? I think whatever made your tmp_relation should be giving you the correct include policy.

@drewbanin (Contributor, Author):

this is such a great catch! Yes - this macro should definitely expect the tmp_relation to already have a valid include policy for the given database.
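
That is, a sketch of the simpler form the macro would take, trusting whatever include policy `tmp_relation` was created with (the delete-statement framing here is an assumption about the surrounding macro):

```sql
-- Sketch of the suggested change: rely on tmp_relation's own include policy
delete from {{ target_relation }}
where ({{ unique_key }}) in (
    select ({{ unique_key }})
    from {{ tmp_relation }}
);
```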

@@ -346,6 +346,36 @@ def execute_model(self, model, materialization, sql_override=None,

return res

@available.parse_none
@beckjake (Contributor):

I'm going to be pedantic about types here: this should probably be `@available.parse(lambda *a, **k: True)` (or `False`).

conf_cluster = [conf_cluster]

return table_partition == conf_partition \
and table_cluster == conf_cluster
@beckjake (Contributor):

Does order matter? If not, we should compare `set(table_cluster) == set(conf_cluster)`.

@drewbanin (Contributor, Author):

Yeah -- the order is significant -- clustering works like ordering whereby the table is clustered by the first clustering key, then the second, and so on.

This query fails if you run it twice, swapping the order of the clustering keys on the second run:

create or replace table dbt_dbanin.debug_clustering
partition by date_day
cluster by id, name
as (
  select current_date as date_day, 1 as id, 'drew' as name
);

@drewbanin drewbanin merged commit e83aab2 into dev/louisa-may-alcott Oct 15, 2019
@drewbanin drewbanin deleted the fix/minimize-incremental-downtime branch October 15, 2019 18:44
@darrenhaken (Contributor) commented Oct 15, 2019 via email.

@drewbanin (Contributor, Author):

@darrenhaken this is shipping in 0.15.0, due in November! We'll have a pre-release ready hopefully by the end of this week :)

@darrenhaken (Contributor) commented Oct 15, 2019 via email.

Closes: minimize downtime for incremental models with full-refresh (#525)