Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make dbt 'schema-aware' #2582

Closed
bodschut opened this issue Jun 23, 2020 · 3 comments
Closed

Make dbt 'schema-aware' #2582

bodschut opened this issue Jun 23, 2020 · 3 comments
Labels
enhancement New feature or request stale Issues that have gone stale

Comments

@bodschut
Copy link
Contributor

bodschut commented Jun 23, 2020

Describe the feature

Today, dbt makes it very easy to make changes to existing models. On top, the dbt model selection syntax makes it easy to deploy different sets of models, including single models and models together with their parents and/or children (even a given number of parents/children as from dbt 0.18.0).

One downside of this flexibility of dbt is that there is no protection against a user making a 'backwards incompatible' change to a model, which can have direct repercussions on child models. Under 'backwards incompatible change' you can classify most importantly the change of a column's name or the removal of a column. Indeed, if you have a child model that selects such a column, it will fail on the next run (if that child model is materialised as a view, that view will even become invalid immediately). Yes, when you would deploy the parent model together with it's children in the same dbt run, you will get an error at the moment the child model is deployed, but at that point the harm is already done as the parent model with the change has already been deployed.

So, it could be an idea to make dbt more 'schema aware' and able to detect when a user tries to deploy a model that contains an incompatible change compared to it's previous deployment. At that point, dbt can at least raise a warning, but we could make it configurable by the user that dbt even blocks the deployment at this point and forces you to fix the downstream models.

I'm not directly proposing a solution here as I think implementing something like this will be a complex undertaking, but it would be nice to start the discussion about what role the dbt framework should take here in the future.

Who will this benefit?

This will benefit all users of dbt, regardless of the database being used.

PS: @drewbanin let's continue the slack discussion here

@bodschut bodschut added enhancement New feature or request triage labels Jun 23, 2020
@jtcohen6 jtcohen6 removed the triage label Jun 23, 2020
@jtcohen6
Copy link
Contributor

jtcohen6 commented Jun 23, 2020

Thanks for the discussion above @bodschut! You're right that this is an important issue, and any implementation would be complex undertaking.

In general, our approach has been to run CI jobs that build and test all child models any time there is a proposed change to an existing model. It's a brute-force approach, in that it requires materializing models in a scratch schema in the database. We've got some significant work in progress to power slimmer and smarter CI jobs (see #2465). I know some people who have hacked quick-and-dirty version of this, e.g. by adding limit 0 to all models in CI to quickly pick up any syntax, data type, or missing/renamed column errors.

At that point, dbt can at least raise a warning, but we could make it configurable by the user that dbt even blocks the deployment at this point and forces you to fix the downstream models.

Through an automated CI tool and a git workflow, you get this by requiring passing checks before merge. So while not a perfect answer, it's pretty darn good. If you're worried about breaking changes that may affect non-dbt processes (e.g. BI, ML) that select from dbt models, you could create views or data tests that mock their queries and run those assets in CI.

Without going too far on a tangent, this issue does remind of an issue/idea from a long time ago (#1158): the ability to "extend" YML config from upstream models into downstream models, so as to copy a given column's metadata (name, description, tests). In combination with the --strict CLI flag, you could define the same column across multiple models and catch any unexpected values—though still at runtime, not compile time.

@bodschut
Copy link
Contributor Author

Hi @jtcohen6

Thanks for your take on this. I was not aware on the work on issue #2465 , very interesting! Is there any roadmap for implementing this functionality?

I guess that what I am proposing here takes it even a step further. I agree that you can get a long way using CI/CD, but you must agree that it's always quite some overhead to setup a good CI/CD setup for each project, as the complexity can greatly differ from project to project. Catching backwards incompatible changes at compile time would greatly reduce the complexity and 'hacks' that need to be setup at CI/CD levels.

I agree, implementing something like this would be a big undertaking. It would start by introducing a kind of 'schema registry', where the output schema of each model can be registered and subsequent updates can be checked for changes and warn at compile time whether the introduced changes are backwards compatible or not. At veepee, we actually developed and open sourced such a schema registry to govern our incoming raw data streams (that are subsequently registered in dbt as sources). If you're interested in taking a look, it's here. When you would take an approach like this, you can also automatically 'version' your model and build a complete change history that could afterwards be an interesting addition to the docs site (e.g. minor version bumps for compatible changes and major version bumps for breaking changes).

If you're up to it, I would also be interested to learn a bit more into detail how you setup your CI/CD strategy, as I'm feeling that ours is still suboptimal at the moment...

Anyway, happy to hear some further thoughts :-)

Cheers

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Nov 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request stale Issues that have gone stale
Projects
None yet
Development

No branches or pull requests

2 participants