When running a specific model, dbt should not try to create the schemas of all other models #2681
Comments
Thanks for the detailed writeup, @FurcyPin. I totally get why this is frustrating. While I don't think we're going to prioritize a code change to resolve it, I'll explain our rationale and suggest some other steps you could take to support your specific use case.

Ever since we introduced caching, we try to enable dbt to do as much database introspection and administrivia as possible at the very start of the run. That means grabbing metadata for all schemata/datasets (in all databases/projects) where it plans to materialize models, or creating those schemata/datasets if they do not exist already. That caching happens once at the start of the run (any run), no matter how many models are actually selected to run.

In addition, as practitioners, we hold the opinionated view that:
The purpose of the
So, here are three things you can do today:
models:
  my_other_project_models:
    +project: "{{ 'project_2' if target.name == 'prod' else 'project_1' }}"
    +dataset: "{{ 'dataset_2' if target.name == 'prod' else 'dataset_1' }}"

models:
  my_other_project_models:
    +enabled: "{{ ('true' if target.name == 'prod' else 'false') | as_boolean }}"

I'm going to close this issue for now, because we don't have immediate plans to change this behavior. That may change someday, though, if we find we need to question the principles I laid out above.
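To make the behavior described in this comment concrete, here is a minimal Python sketch. It is not dbt's actual implementation: the `Model` class, `schemas_to_prepare`, and `build_project` are hypothetical names used only for illustration. It assumes, as described above, that the set of schemas/datasets dbt inspects or creates at startup is derived from every enabled model in the project, independently of the `-m` selection, and it shows why disabling a model outside prod removes its dataset from that set.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Model:
    # Hypothetical stand-in for a dbt model node; not dbt's internal class.
    name: str
    database: str  # BigQuery project
    schema: str    # BigQuery dataset
    enabled: bool = True


def schemas_to_prepare(models: Iterable[Model]) -> Set[Tuple[str, str]]:
    """Return every (database, schema) pair used by an enabled model.

    The `-m` selection is deliberately not a parameter: the point of the
    sketch is that this set is computed for the whole project.
    """
    return {(m.database, m.schema) for m in models if m.enabled}


def build_project(target_name: str) -> List[Model]:
    # Mirrors the `+enabled` expression: model_2 is enabled only in prod.
    return [
        Model("model_1", "project_1", "dataset_1"),
        Model("model_2", "project_2", "dataset_2",
              enabled=(target_name == "prod")),
    ]


dev_schemas = schemas_to_prepare(build_project("dev"))
prod_schemas = schemas_to_prepare(build_project("prod"))
print(sorted(dev_schemas))
print(sorted(prod_schemas))
```

With the `dev` target, `project_2.dataset_2` never enters the set, so a connection without rights on `project_2` would have nothing to fail on.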
Thank you @jtcohen6 for the detailed answer. I do understand and agree with your philosophy of wanting dbt to have full managerial control over the datasets/schemas in which it creates resources. If you don't mind, I would like to provide some more details about the way we are using dbt right now, and perhaps you can help me figure out how to do things differently.

We used to have a single BigQuery project used by our data-engineering team. All the tables within it were managed with dbt and it went well. Recently we migrated all the views of our data-analytics team to dbt. Historically the tables are located in another BigQuery project, because we prefer to separate access rights for the two teams. Since most of the data analysts' tables are built on top of our data engineers' tables, it made sense for us to keep everything in the same dbt model. So we had a folder structure like this:
And we made a custom Our dbt models are all executed separately with
This is a question of opinion, but I used to have a tool that was capable of running a whole dependency graph by itself

Anyway, the reason why I called the described behavior "super annoying" was that after we added these new data_a models and deployed them in prod (but without scheduling them yet), it caused a production incident where all our pre-existing data_e models failed during the night, because the new data_a models were pointing to a project that dbt didn't have rights on yet (we were planning to test them properly in prod the next morning).

Thanks again for your suggestions, and if you have more advice to share in the light of this extended context I would be happy to read it.
This is an interesting perspective! I imagine there's quite a bit of work involved to update the Airflow DAG any time a dbt dependency is added or changed?
These are things that we're thinking hard about, and we have some progress in the works:
I hear you on this broader point: dbt will never be an orchestration tool on par with Airflow/Jenkins/Dagster/Luigi/etc, nor should it ever try to be. I feel that this is mostly true for inter-process dependencies and coordination, however, rather than intra-process dependencies. While Airflow will always know more about the massive DAG of all data processes at your org, it will never know as much about dbt's DAG as dbt does, at least natively.

I do think there's a compelling handoff point whereby dbt is able to give Airflow exactly the information it needs to report on and retry build failures. I'd be curious to hear your feedback about the constructs we're using to get there.
Yes and no. We could easily make a script that would generate a huge single Airflow DAG, but we don't.
Pros of an automatically generated DAG:
cons:
This is why we prefer to write and maintain our Airflow DAGs manually, which does add some overhead to our development process but allows us to split that huge connected component into smaller DAGs that 'make sense' and are more readable (we often add DummyOperators as "flow control points" just to make the DAG look nicer). In order to avoid any mistake between the dependencies declared in the DAGs and the actual dependencies between our jobs, Currently this only checks the Spark dependencies, but we plan to integrate with dbt to be able to check the BigQuery dependencies as well.
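The kind of automated check mentioned here could look roughly like the Python sketch below. It is only an illustration under stated assumptions, not the commenter's actual tool: `declared` stands for the upstream edges written by hand in the Airflow DAGs, `actual` for the dependencies extracted from the jobs themselves, and both function names are hypothetical. An actual dependency counts as covered if the declared graph reaches it directly or through intermediate tasks, so extra "flow control" tasks do not trigger false alarms.

```python
from typing import Dict, List, Set, Tuple

# task name -> set of tasks it depends on (its direct upstream tasks)
Graph = Dict[str, Set[str]]


def upstream_closure(graph: Graph, task: str) -> Set[str]:
    """All tasks reachable upstream of `task`, directly or transitively."""
    seen: Set[str] = set()
    stack = list(graph.get(task, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def missing_dependencies(declared: Graph, actual: Graph) -> List[Tuple[str, str]]:
    """Return (task, dependency) pairs present in `actual` but not implied
    by any path in the hand-written `declared` graph."""
    missing = []
    for task, deps in actual.items():
        reachable = upstream_closure(declared, task)
        for dep in deps:
            if dep not in reachable:
                missing.append((task, dep))
    return sorted(missing)


# Hand-written DAG: data_a_view waits on a flow-control task, which waits
# on data_e_table. The report task was declared without any upstream.
declared = {
    "data_a_view": {"flow_control"},
    "flow_control": {"data_e_table"},
    "report": set(),
}
# Dependencies actually found in the jobs (hypothetical extraction result).
actual = {
    "data_a_view": {"data_e_table"},
    "report": {"data_a_view"},
}
problems = missing_dependencies(declared, actual)
print(problems)
```

Here `data_a_view` passes because `data_e_table` is reachable through `flow_control`, while `report` is flagged because its real dependency on `data_a_view` was never declared.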
Indeed, storing the results of the last execution might be a way to better handle retries and recovery from errors. I think that many people believe that making a scheduler is easy, but in fact making a good scheduler is a very hard task. This is demonstrated by the huge number of in-house scheduler projects: Yahoo made Oozie, LinkedIn made Azkaban, Facebook made Dataswarm, Criteo made Cuttle, Spotify made Luigi, Airbnb made Airflow... Just to give an example, even if you add the retry feature to dbt, it will still be lacking many features, like:
The last point is one of the things that makes Airflow great. Being able to declare your DAG with Python code
Indeed, there is a middle ground where you could use Airflow to execute

We used a tool called flamy that did pretty much what dbt does, except with Hive. We had our in-house Python scheduler, which was comparable to an "Airflow without a GUI". We had workflows with multiple Hive queries intertwined with Spark jobs. Flamy wasn't able to automatically manage Spark dependencies, so what we did was have our scheduler use flamy for all our Hive queries by running commands that looked like

We also had the same issue of failure recovery: "how to restart only the failed tasks". We solved it differently: instead of storing the result of each Hive query somewhere, we looked at the table's timestamp and compared it with the upstream tables' timestamps. If the data was not up to date, it meant the query had to re-run. This worked well except for partitioned tables, where we tried to do some super fancy stuff that worked but was probably an over-complicated way to solve the problem.

Anyway, these days I tend to prefer the approach described above: let Airflow run each task, take a little more time to write the DAGs manually (but at least they look nicer and you can do exactly what you want), and use automated checks to avoid any human mistake.

One last thing: I recently participated in a discussion with the Airflow community that might relate to what you are trying to do. In short, Airflow provides an XCom feature that lets tasks talk to each other, but Airflow currently doesn't provide a feature that lets tasks store persistent information for their future retries. What this means is that if you want to make a DbtOperator that can persist There has been ample discussion between the Airflow maintainers about whether they should add such a feature or not.
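The timestamp-based recovery described in this comment can be sketched in Python as follows. This is not flamy's actual code, only a minimal illustration: `stale_tables`, `upstream`, and `last_modified` are hypothetical names, every table is assumed to have a known last-modified timestamp, the dependency graph is assumed to be acyclic, and partitioned tables are ignored.

```python
from datetime import datetime
from typing import Dict, Set


def stale_tables(upstream: Dict[str, Set[str]],
                 last_modified: Dict[str, datetime]) -> Set[str]:
    """Return the tables that must be rebuilt.

    A table is stale if any upstream table is newer than it, or if any
    upstream table is itself stale (rebuilding the parent will make it
    newer than the child).
    """
    memo: Dict[str, bool] = {}

    def is_stale(table: str) -> bool:
        if table not in memo:
            memo[table] = any(
                is_stale(parent)
                or last_modified[parent] > last_modified[table]
                for parent in upstream.get(table, set())
            )
        return memo[table]

    return {table for table in upstream if is_stale(table)}


# Hypothetical example: `raw` was reloaded at 10:00, after `staging` was
# last built at 09:00. `mart` looks fresh on its own but sits downstream
# of a stale table; `audit` was built after `raw` and is up to date.
upstream = {
    "staging": {"raw"},
    "mart": {"staging"},
    "audit": {"raw"},
}
last_modified = {
    "raw": datetime(2020, 8, 1, 10, 0),
    "staging": datetime(2020, 8, 1, 9, 0),
    "mart": datetime(2020, 8, 1, 11, 0),
    "audit": datetime(2020, 8, 1, 12, 0),
}
to_rerun = stale_tables(upstream, last_modified)
print(sorted(to_rerun))
```

The appeal of this design is that no run results need to be stored anywhere: the warehouse metadata is the only state, which is also why it gets harder once a table has many partitions with their own timestamps.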
Describe the bug
When running a specific model using
dbt run -m model_1
the command fails because of permission errors on other models.
This is super annoying because adding a new model with a configuration error in prod will make ALL the other models fail,
and this is not caught by the compile step.
Steps To Reproduce
Create two models model_1 and model_2
model_1 creates a table in
project_1.dataset_1.table_1
model_2 creates a table in
project_2.dataset_2.table_2
but where the configured connection does not have the rights on project_2.
then run
dbt run -m model_1
you will get an error like this:
Expected behavior
Running
dbt run -m model_1
should not try to create schemas used by model_2.
Screenshots and log output
If applicable, add screenshots or log output to help explain your problem.
System information
Which database are you using dbt with?
The output of
dbt --version
:
The operating system you're using:
Linux Mint 19 (based on ubuntu bionic)
The output of
python --version
:
Python 3.6.9
Additional context
Add any other context about the problem here.