
run_query works in run-operation mode but doesn't seem to work during dbt_project.yaml querying #1760

Closed
whisperstream opened this issue Sep 17, 2019 · 7 comments

@whisperstream

whisperstream commented Sep 17, 2019

Describe the bug

I have a case where I'm trying to get the user's roles from the database when dbt runs (in either compile or run mode). I have a macro called get_user_roles and have added the following to my dbt_project.yml:

models:
   vars:
     USER_ROLES: {{ get_user_roles() }}

When I execute dbt run or dbt compile, dbt does connect to the database, but it does not return any results from the run_query call in my macro. If I run the same macro in run-operation mode, it works just fine.
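
For reference, a minimal sketch of what such a macro might look like (the query, table, and column names here are illustrative guesses, not the actual code from this project):

{% macro get_user_roles() %}

    {# Illustrative query only: the real SQL depends on the warehouse and how roles are stored #}
    {% set results = run_query("select role_name from role_grants where user_name = current_user") %}

    {# run_query returns None during parsing, so guard before reading the results #}
    {% if results %}
        {% do return(results.columns[0].values()) %}
    {% else %}
        {% do return([]) %}
    {% endif %}

{% endmacro %}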

I was trying to find documentation describing when run_query does and doesn't work, but was unable to find any.

Steps To Reproduce

models:
   vars:
     MY_VAR: {{ run_query('select 1 as one') }}

The value of MY_VAR should be the result of the query, but instead it's None.

Expected behavior

Expected dbt to connect to the database and execute the query


System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.14.2
   latest version: 0.14.2

The operating system you're using:
OSX
The output of python --version:
Python 3.7.3


whisperstream added the bug and triage labels Sep 17, 2019
@drewbanin
Contributor

Hey @whisperstream - this won't work for two reasons:

  1. macros are not loaded into the dbt context when the dbt_project.yml file is being evaluated
  2. dbt doesn't run queries during "parsing", which is when the dbt vars are evaluated

The first item here is subject to change, but the second one probably is not. Can you tell me more about why you're storing the results of this query in a dbt variable? Is it to cache the results so they can be used across multiple different models? If so, there might be alternative ways to accomplish this that we can explore!
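
For contrast, run_query does return results once dbt is actually executing (i.e. when the execute context flag is true), for example inside a model or a run-operation macro. A minimal sketch (hypothetical model, not from this issue):

-- models/run_query_example.sql (hypothetical file name)
{% set one = 1 %}
{% if execute %}
    {# During execution run_query returns an agate table; during parsing it returns None #}
    {% set results = run_query('select 1 as one') %}
    {% set one = results.columns[0].values()[0] %}
{% endif %}

select {{ one }} as one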

drewbanin added the help_wanted and discussion labels and removed the bug, triage, and help_wanted labels Sep 17, 2019
@whisperstream
Author

Yes, I am trying to cache the results. Again, all of this goes back to solving for different roles having access to different parts of the dbt model.

In order to rewrite the models correctly (as discussed here: #1695 (comment)), I'm trying to figure out which roles the user has, because I need to know them in my xref function. So I was hoping to get them as early as possible by storing them in a value in the dbt_project.yml file.

@whisperstream
Author

Any other alternatives are greatly welcomed :)

@drewbanin
Contributor

drewbanin commented Sep 18, 2019

Ok - thanks for that context! It's hard for me to remember which issues are correlated - I appreciate you referencing #1695 here!

I think that it would be totally reasonable for something like this to work in dbt:

models:
   vars:
     MY_VAR: {{ run_query('select 1 as one') }}

Unfortunately, there's a whole lot of bootstrapping that dbt needs to do at startup, and at the point where the models: block is evaluated, dbt doesn't yet know anything about macros (including run_query()) or your database connection. While I hope that we can support something like this in the future, I'm unsure that we're going to be able to get there any time particularly soon.

A much more direct path to supporting this type of use case would be to provide an arbitrary key/value cache in the dbt context. I think we'd just need to expose a dictionary context variable, called run_cache, which could be manipulated and read by models. This might look like:

{% macro get_permissions() %}

    {% if run_cache.get('permissions') %}
        {{ log("cache hit!") }}
    {% else %}
        {{ log("Running query!") }}
        {% set permissions = run_query('select 1 as id') %}
        {% do run_cache.update({'permissions': permissions}) %}
    {% endif %}

    {% do return(run_cache.get('permissions')) %}

{% endmacro %}

So, the first time you call get_permissions, dbt will execute a query, populate the cache, and return the results. In subsequent calls to get_permissions, dbt will just return the value directly from the cache.
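
A model could then call the macro wherever the permissions are needed, e.g. (sketch, still assuming the hypothetical run_cache variable existed):

-- models/some_model.sql (sketch)
{% set permissions = get_permissions() %}
{# permissions holds whatever run_query returned; repeated calls hit the cache #}

select 1 as id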

There are other ways to implement something like this too -- the run_query macro could accept a cache boolean argument which would cache the query results for the given SQL statement. If an identical SQL statement is made, then dbt could return the cached results instead of re-querying the database.
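
As a user-space sketch of that second idea (again assuming the hypothetical run_cache dictionary above existed; cached_run_query is an invented name):

{% macro cached_run_query(sql) %}

    {# Hypothetical: key the cache by the SQL text so identical statements are only run once #}
    {% if run_cache.get(sql) is none %}
        {% do run_cache.update({sql: run_query(sql)}) %}
    {% endif %}

    {% do return(run_cache.get(sql)) %}

{% endmacro %}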

Both of these are interesting ideas, but there's one very big problem: they don't play well with parallelism. If you run 8 dbt models in parallel, all of which call the get_permissions macro at the same time, then they will all end up querying the database and overwriting the cache 8 times over. That might be appropriate for certain use cases, but it's not a very good general solution to this problem.

So, this is all to say: the thing you want to do is reasonable, but we don't have a great answer for how to address it in dbt today. In general, the problem is that the state that dbt needs to operate on lives in your database, and not inside of a dbt resource. This is definitely reasonable (and almost necessary) for permissions, but it's not really a pattern that we've designed for explicitly.

Curious to hear what you think about all this!

Edit: maybe there's a way to cache the results in the on-run-start query? Then we can avoid the parallelism issue...
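
That might look roughly like this in dbt_project.yml (a sketch; it assumes the hypothetical get_permissions/run_cache machinery above, and that the hook renders some harmless SQL to execute):

on-run-start:
  - "{% do get_permissions() %} select 1"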

@LVHermosa

LVHermosa commented Dec 16, 2021

I am trying to create a database clone before all models start; the models are supposed to use that clone as their target. I cannot achieve this with "on-run-start" for the reason explained above. I am trying to do a blue-green flow, where a clone is temporarily created as a staging database (Snowflake).
@drewbanin how could I run a "CREATE DATABASE ... CLONE <target_model_database>" statement before all models start?
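
For reference, the kind of statement being described would look roughly like this as an on-run-start hook (a sketch; the database names are placeholders, and whether on-run-start runs early enough for this blue-green flow is exactly the open question above):

on-run-start:
  - "create database if not exists {{ target.database }}_clone clone {{ target.database }}"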

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

github-actions bot added the stale label Nov 12, 2022
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.
