-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CT-3162] [Feature] Ability to dbt clone
sources
#8722
Comments
dbt clone
sources during a builddbt clone
sources during a build
dbt clone
sources during a builddbt clone
sources
I appreciate you taking time to write up both here and in the Slack thread you mentioned! My initial reaction, is that this request is a no-brainer. This should be possible. However, the biggest barrier to shipping this is architecture of dbt-core, in that today, The way It's no stretch for me to imagine a dbt of the future were sources are nodes. However, this is uncharted territory and would certainly require some non-trivial refactoring to facilitate what is really a conceptual/architectural shift. We're thinking a lot of sources right now. #8617 is not about clone, but presents an opportunity to further develop Below I offer a way to unblock you today, but I won't rule out that perhaps there is short- or medium-term work that might enable you more elegantly without a fundamental rework of dbt key concepts. @leoebfolsom happy to brainstorm more with you on this. kludgy workaroundcleverly generating wrapper view models of sources then cloning those. |
Thanks @dataders! Got it. It doesn't surprise me that this can't be a quick win. I'm glad to have your eyes on it, and hopefully as other issues are addressed, this one will easily follow, when possible! Re: the kludgy workaround (TKW), "cleverly generating wrapper view models of sources then cloning those", that sounds a lot like a base layer to me, unless I'm missing something? If this was only for my org's dbt project, I could potentially advocate for it. But ... I'm looking for a solution that could be implemented across many orgs widely different cultures and reqs; which, to be fair, is how dbt-core should work. So I'm saying: good idea, could work for some, will definitely keep it up my sleeve, but not a resolution of the issue. We agree. What I'm currently exploring and recommending is: the creation of a "representation of prod" that is built in CI. Imagine I do see that if all nodes were treated the same--let nodes be nodes!--as is brilliantly and hilariously articulated here by @alison985, a lot of this would be sorted and unblocked. Exciting! Forward! |
while we're on the subject@aranke flagged #7442 as a stopgap solution that we'll ship in support of contracted models. It explains more of the chewy nuance that's less about sources, and more about delivering something here that is simultaneously intuitive, non-magical (muggle-y?) and without footguns. Is anything about the approach feeling robust enough to be more widely used pattern than a kludgy workaround for a single data team? another way inanother workaround he surfaced was the possibility of invoking the create_or_replace_clone() macro directly. This is effectively the pith of the if you hack it, it they will cloneIf I were to do this, I'd be leaning heavily on the following, and the result might look like the below
{% macro folsom_cloning_blues() %}
{% set source_nodes = graph.sources.values() if graph.sources else [] %}
{% for source in source_nodes %}
{% for node in source %}
{# the juicy bit -- stdout logging here is helpful #}
{% set indigo_schema = node.schema ~'__clone' %}
{# i think this is what incorporate does? #}
{% set indigo_node = node.incorporate(schema=indigo_schema) %}
{%- call statement('doug_coupland_is_right', fetch_result=False) -%}
create_or_replace_clone(node, indigo_node)
{%- endcall -%}
{% endfor %}
{% endfor %}
{% endmacro %} p.s. calling out that this sounds similar to blue-green deployments. can we brand this blue-green-indigo deployment (orange-blue-green)?
|
One month later, and I myself find myself wanting to clone a source. I tried to make a Dynamic table off of a
naturally, I'm thinking
Which, I cannot do with |
Is this your first time submitting a feature request?
Describe the feature
I wish to
dbt clone
sources that are defined in the yml.Why?
I want to analyze dbt code changes and their potential impact on the data that aren't easily found by dbt tests, such as (non-exhaustively) ...
60.0
changing to60.1
, orLeo
changing toLeon
... before merging to main.
Can't you just compare a PR build to a prod build?
Not quite. Sources change over time, and for many teams, data loads to the DWH (leading to updated sources) more frequently than prod dbt jobs run. So, when building data in CI, unless I point upstream to sources that were cloned at the time of the last dbt run, the DAG is non-idempotent 😭 .
In other words: dbt runs at different times using the same code have different results.
In more other words: differences in the data are explained not by code changes in the PR, but rather by shifting upstream source data.
This makes it hard to test code changes for regressions because there's another variable (shifting source data).
How does
dbt clone
ing sources help with this problem?If I could rebuild part or all of the DAG later using clones of sources that were created at the time of the last production run, I could ensure downstream models were built using identical, non-updated source data. Then, I could do a better job of validating that dbt code changes in the PR are not going to break stuff before I hit merge.
Why not use deferral / Slim CI?
Deferral and Slim CI only solves this problem if the model does not depend on sources.
While it is "best practice" to have a base layer, which will significantly reduce the described pain; in practice / in the field, many orgs will not do this, or are not ready to do this yet, for whatever reason.
Furthermore, even if you have a base layer, that base layer might be modified in a PR, which would again lead to shifting source data, because they would be refetched from the DWH; so having a base layer is really just a 90%-of-the-way-there workaround.
Describe alternatives you've considered
prod-branch
at 10:00, dbt run onpr-branch
at 10:05)Who will this benefit?
dbt users who want to implement rigorous regression testing.
Are you interested in contributing this feature?
I probably can't help much with the dbt-core update, but happy to review, or kick around ideas, or make beginner contribs, if helpful
Anything else?
Slack discussion, where @dave-connors-3 suggested I open an issue. 🙇
The text was updated successfully, but these errors were encountered: