
Implement archival with blocks #1175

Closed
drewbanin opened this issue Dec 5, 2018 · 7 comments · Fixed by #1361
Labels
snapshots Issues related to dbt's snapshot functionality

Comments

@drewbanin
Contributor

drewbanin commented Dec 5, 2018

Feature

Feature description

Let's take archival out of configuration and redefine it using code. In so doing, users will be able to more flexibly control the semantics of their archival jobs.

Functionality to support:

  1. all existing archive functionality (i.e., archiving by unique_key and updated_at).
  2. archive from one database to another (#838)
  3. archive records when a subset of fields have changed (#706)
  4. archives should be ref'able
  5. archives should be able to call macros

Proposed spec

{% archive your_archive_name_here() %}
  {{ config(
          target_database='<optional database name>', 
          target_schema='<schema name>',
          target_table='<table name>',
          strategy={'timestamp' | 'check'},

          -- always required
          unique_key='id',

          -- strategy == 'timestamp'
          updated_at='updated_at',

          -- strategy == 'check'
          check_cols=['object_status', 'object_name'],
  ) }}

select
    id,
    object_status,
    object_name,
    updated_at

from source_data.table_name

{% endarchive %}

Parameters:

  • {% archive {archive_name} %}. Use this name to ref the archive
  • config params:
    • target_database: the destination database (if supported by the warehouse)
    • target_schema: the destination schema
    • target_table: the destination table
    • unique_key: The column that uniquely identifies an entity in the query
    • strategy: Can be one of timestamp or check
      • timestamp: implements the existing behavior of archival. Requires an updated_at config
      • check: dbt will compare the columns in check_cols to previous values for the unique_key. Archival will occur when these values change. If check_cols is set to "all", then dbt will check all columns in the table. (those exact semantics TBD)
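Since archives are ref'able by their block name, a downstream model could select from the archive table like so (a sketch; `your_archive_name_here` is the placeholder name from the spec above):

```sql
-- hypothetical downstream model, e.g. models/current_objects.sql
select
    id,
    object_status,
    object_name
from {{ ref('your_archive_name_here') }}
```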

Notes:

  • If there are a large number of "check cols", users can build a surrogate key in the SQL and use that as a single check_col
  • the archive's destination table can be ref'd using the name specified in the archive block
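A sketch of the surrogate-key approach from the note above, assuming a warehouse that supports `md5` and `||` string concatenation (column names are illustrative, taken from the spec's example query):

```sql
select
    id,
    object_status,
    object_name,
    updated_at,
    -- hypothetical: collapse many "check cols" into a single hash column,
    -- then configure the archive with check_cols=['check_hash']
    md5(
        coalesce(object_status, '') || '|' ||
        coalesce(object_name, '')
    ) as check_hash

from source_data.table_name
```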

Considerations:

  • We should provide a mechanism for migrating existing archives. How do we do that?
  • These archives will live in an archives/ dir by default. Users can change this with an archive-paths config in dbt_project.yml.
  • let's normalize the metadata column names as described in #251
  • Archives should be individually selectable on the CLI, and should support either FQN or Tag type selectors
  • Archives should be testable in schema.yml files
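For the last point, archive tests might look like ordinary model tests in schema.yml (a sketch; whether archives use the `models:` key or get a dedicated resource key is TBD):

```yaml
version: 2
models:  # hypothetical — could instead be a dedicated `archives:` key
  - name: your_archive_name_here
    columns:
      - name: id
        tests:
          - not_null
```

Note that a uniqueness test on unique_key alone wouldn't hold here, since an archive stores multiple versions of each record.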

Who will this benefit?

Archive users

@drewbanin
Contributor Author

cc @jthandy

@drewbanin drewbanin added the snapshots Issues related to dbt's snapshot functionality label Dec 5, 2018
@drewbanin drewbanin added this to the Wilt Chamberlain milestone Dec 5, 2018
@drewbanin
Contributor Author

cc @jtcohen6

@jthandy
Member

jthandy commented Dec 6, 2018

I really love this. If, in the process, we can also improve the logging of archival sql to the standard log that would make me incredibly happy. My guess is we will in the process of touching this code anyway.

Here are some thoughts:

  • do we really need archive-paths? these are in blocks...can't we just put them alongside of something else and have the compiler just parse all of the blocks together? seems like they could live in a macros folder... are you assuming that we'll just continue putting things in their own folders until we make everything blocks and then transition all at once?
  • what if we used the term "delta" instead of "check" for the strategy? "check" feels like a weird name of a strategy to me...
  • do archival names need to be unique within the same namespace as other "refable" things (i.e. models)? if so, should note that. the other alternative i had considered was having a sub-namespace like my_project.archives.archive_name. theoretically we could namespace all objects like this.

@drewbanin
Contributor Author

drewbanin commented Dec 6, 2018

improve the logging of archival sql to the standard log

i don't actually know what you mean by that!

do we really need archive-paths?

yeah, my idea is very much that we'll be able to get rid of the "-paths" notion altogether once everything is defined in blocks. I will say: these paths can be overlapping, so you could just make a single directory that's specified as your macro-paths and archive-paths. We can't put models or custom data tests in there yet, but hopefully.... soon....
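The overlapping-paths idea might look like this in dbt_project.yml (a sketch; the `blocks` directory name is just illustrative):

```yaml
# hypothetical: point both path configs at the same directory,
# so macros and archive blocks can live side by side
macro-paths: ["blocks"]
archive-paths: ["blocks"]
```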

  • i'm into delta

  • archive names will indeed need to be globally unique, though I am keen to answer the larger "namespacing" question too. Will see if there's an opportunity to broach that topic with archival.

really great feedback 👍

@jthandy
Member

jthandy commented Dec 6, 2018

improve the logging of archival sql to the standard log

The last time I checked, archival didn't actually output the queries it was running against your warehouse to the standard dbt.log file. I'm rather used to having all queries logged there, and the absence of this SQL has made it hard for me to troubleshoot archival issues in the past.

If this is no longer the case then 👍 but would be great if we could just do a super-quick audit of what log statements we have in the archival process. And maybe the archival sql should actually go to /target as well...?

@joevandyk
Contributor

I think I saw the archival sql statements being logged yesterday.

@drewbanin
Contributor Author

Yeah - these will be logged to dbt.log for sure. Will also be good to compile them to the target/ dir as @jthandy indicated above
