[CT-3012] [Bug] dbt docs generate takes a lot of time #8452

Closed

SamuelBohumel opened this issue Aug 18, 2023 · 6 comments
Labels: bug, duplicate

SamuelBohumel commented Aug 18, 2023

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Hi there,
I have seen issue #1576, but I think it still needs attention.
I am using dbt to transform data in a Greenplum database. When generating the catalog with the dbt docs generate command, the process sometimes takes around 10 minutes.
After analyzing the query that is executed against the database, I found that there is no condition restricting it to the tables that are actually needed: although I am working with ~20 tables, the result set had 3.3 million records.
Is there a way to optimize the metadata query to be more efficient, or to modify the WHERE clause to decrease the size of the result?
Thanks.

Expected Behavior

I would expect the dbt docs generate command to finish generating the catalog within a minute, as it is planned for production usage with a lot of pipelines.

Steps To Reproduce

  1. Set up a standard dbt project with 20 tables.
  2. Have around 50,000 tables in the database.
  3. Run dbt docs generate.

Relevant log output

No response

Environment

- OS: wsl Ubuntu 22.04
- Python: 3.10.12
- dbt: 1.5.0

Which database adapter are you using with dbt?

No response

Additional Context

No response

SamuelBohumel added the bug and triage labels Aug 18, 2023
github-actions bot changed the title from [Bug] dbt docs generate takes a lot of time to [CT-3012] [Bug] dbt docs generate takes a lot of time Aug 18, 2023
dbeatty10 self-assigned this Aug 18, 2023
dbeatty10 (Contributor) commented:

Thanks for reaching out @SamuelBohumel!

It is a known issue that catalog generation time can be outsized compared to the actual number of dbt models in cases like these:

  • source schemas with many thousands of tables, of which a small subset are actually referenced by dbt models
  • lots of non-dbt objects living in the same schemas as dbt models/seeds/snapshots/etc

One option for you

Is there a way to optimize the query that gets the metadata to be more efficient, or modify WHERE clause to decrease the size of result?

There is! I haven't tried it myself, but you can do something like the following:

  1. Since dbt-greenplum inherits get_catalog from dbt-postgres, copy this macro definition into your local project.
  2. Rename postgres__get_catalog to greenplum__get_catalog.
  3. Optional sanity check before the next step (see the fragment just after this list):
    1. Try out dbt docs generate again; it should yield the same results and take the same amount of time as before.
    2. Add and 1 = 0 to the where clause.
    3. Re-run dbt docs generate; it should run fast, but not generate the full intended output.
    4. Remove and 1 = 0 from the where clause.
  4. Using catalog_2.sql compared to the default bigquery__get_catalog as a guide, add the relevant Jinja that will change the catalog query to filter on dbt objects only and not all objects in a schema. See below for details.
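
For the sanity check in step 3, the only change inside the copied macro is a single extra predicate on the existing where clause. A minimal sketch (this where clause is the one copied from postgres__get_catalog, shown in full later in this thread):

    where (
        {%- for schema in schemas -%}
          upper(sch.nspname) = upper('{{ schema }}'){%- if not loop.last %} or {% endif -%}
        {%- endfor -%}
      )
      -- temporary sanity check: the query should now finish quickly and return zero rows
      and 1 = 0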

Filter on dbt objects only

Here are the key additions to include that filter within bigquery__get_catalog in the catalog_2.sql example above:

2a3,25
>   {# Here is where we build our list of relations of interest #}
>   {%- set relations_in_project = [] -%}
> 
>   {%- for node in graph.nodes.values() -%}
>     {%- if node.resource_type == 'model' -%}
>       {%- do relations_in_project.append(node.alias) -%}
>     {%- endif -%}
>   {%- endfor -%}
>   {%- for source in graph.sources.values() -%}
>     {%- do relations_in_project.append(source.name) -%}
>   {%- endfor -%}
>   
>   {# Make the list unique #}
>   {%- set relations_in_project = set(relations_in_project) | list -%}
35a59,64
>         {# Here is where we filter the relations so we dont simply fetch all of them #}
>         {%- if relations_in_project | length > 0 %}
>         and coalesce(REGEXP_EXTRACT(table_id, '^(.+)_{1}[0-9]{8}$'), table_id) in (
>             {%- for rel in relations_in_project -%}'{{ rel }}'{%- if not loop.last %}, {% endif -%}{%- endfor -%}
>         )
>         {% endif -%}

Note that with Postgres/Greenplum you don't need to worry about date-sharded tables, so I don't think you would need the REGEXP_EXTRACT part.
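
In that case the added filter reduces to a plain IN list on the relation name. A sketch of the simplified clause (tbl.relname is the relation-name column used by the Postgres catalog query; the same shape appears in the full proof-of-concept later in this thread):

    {%- if relations_in_project | length > 0 %}
      and tbl.relname in (
          {%- for rel in relations_in_project -%}'{{ rel }}'{%- if not loop.last %}, {% endif -%}{%- endfor -%}
      )
    {%- endif %}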

Here is the full diff
2a3,25
>   {#
>   /* 
>     This macro changes the catalog query to filter on dbt objects only and not all objects in a dataset.
>     The regex is very sensitive to the date sharding pattern of your tables so you may want to
>     change the regex of '^.+[0-9]{8}$' to '^.+[0-9]{6}$' if, say, your tables are sharded by YYYYMM
>     instead of YYYYMMDD.
>   */
>   #}
> 
>   {# Here is where we build our list of relations of interest #}
>   {%- set relations_in_project = [] -%}
> 
>   {%- for node in graph.nodes.values() -%}
>     {%- if node.resource_type == 'model' -%}
>       {%- do relations_in_project.append(node.alias) -%}
>     {%- endif -%}
>   {%- endfor -%}
>   {%- for source in graph.sources.values() -%}
>     {%- do relations_in_project.append(source.name) -%}
>   {%- endfor -%}
>   
>   {# Make the list unique #}
>   {%- set relations_in_project = set(relations_in_project) | list -%}
26c49
<             REGEXP_CONTAINS(table_id, '^.+[0-9]{8}$') and coalesce(type, 0) = 1 as is_date_shard,
---
>             REGEXP_CONTAINS(table_id, '^.+[0-9]{8}$') and coalesce(type, 0) in (1, 2) as is_date_shard,
35a59,64
>         {# Here is where we filter the relations so we dont simply fetch all of them #}
>         {%- if relations_in_project | length > 0 %}
>         and coalesce(REGEXP_EXTRACT(table_id, '^(.+)_{1}[0-9]{8}$'), table_id) in (
>             {%- for rel in relations_in_project -%}'{{ rel }}'{%- if not loop.last %}, {% endif -%}{%- endfor -%}
>         )
>         {% endif -%}
207c236,240
<   {{ return(run_query(query)) }}
---
>   {%- do log(query) -%}
>   {%- set results = run_query(query) -%}
>   {%- do log(schemas ~ ' - rows returned: ' ~ results | length, True) -%}
> 
>   {{ return(results) }}

Another option

If you have any control over which schemas the dbt sources and models live in, then isolating them into dedicated schemas should also help.
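
For example, a minimal sketch in dbt_project.yml (my_project and dbt_only are placeholder names; with dbt's default schema-name generation, this routes all models into <target_schema>_dbt_only):

    models:
      my_project:
        +schema: dbt_only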

dbeatty10 removed their assignment Aug 18, 2023
SamuelBohumel (Author) commented:

Thanks for your quick response.

We had already seen these solutions while looking into the catalog performance issue. Unfortunately, they do not solve our use case, as we are trying to let users keep all of the filtering functionality (tags, lineage, etc.).

In our environment, we have been given little control over schemas. We are working on one of the largest Greenplum database implementations, so there are thousands of users, and schemas are dedicated to security management rather than project separation.

You are correct that separating by schema would work, since the catalog query filters on schema, but we would propose something like adding a flag to dbt docs generate that retrieves data only for selected nodes (flag example: --only-selected). This also makes more sense to us, as the rest of the nodes are thrown away anyway.

We believe this could be useful in other implementations too, wherever environments are more complex, and it would also improve scalability.

dbeatty10 self-assigned this Aug 29, 2023

dbeatty10 commented Aug 29, 2023

Thanks for that additional info @SamuelBohumel

we would propose something like adding a flag to dbt docs generate that would only retrieve data from selected nodes

After discussing this with @graciegoheen, we think what you're asking for will be covered by #4997 (which we've started working on) combined with #6014, so I'm going to close this issue in favor of those two.

If you have any further insights, we'd appreciate your comments on either or both of those. Either way, you may want to subscribe to them so you can get notified of any updates.

Workaround in the meantime

You might be able to use the selected_resources context variable within a customized get_catalog macro in your project to speed up the execution time of the underlying query.

See below for a quick proof-of-concept (which I'll leave to you to optimize as desired).

At least two things to note in this implementation that could be improved:

  • looking for matches within the selected_resources is not optimized for time complexity -- it looks to me like it is O(n^2) when it could be better than that
  • if multiple schemas have tables with the exact same name, and only one is included within a --select, then this implementation of the query will include both

Proof-of-concept

Full customized version for postgres / greenplum:

{% macro default__get_catalog(information_schema, schemas) -%}

  {# Here is where we build our list of relations of interest #}
  {%- set relations_in_project = [] -%}

  {%- for node in graph.nodes.values() -%}
    {%- if node.resource_type == 'model' -%}
      {%- if node.unique_id in selected_resources -%}
        {%- do relations_in_project.append(node.alias) -%}
      {%- endif -%}
    {%- endif -%}
  {%- endfor -%}
  {%- for source in graph.sources.values() -%}
    {%- do relations_in_project.append(source.name) -%}
  {%- endfor -%}
  
  {# Make the list unique #}
  {%- set relations_in_project = set(relations_in_project) | list -%}

  {%- call statement('catalog', fetch_result=True) -%}
    {#
      If the user has multiple databases set and the first one is wrong, this will fail.
      But we will not fail in the case where there are multiple quoting-difference-only dbs, which is better.
    #}
    {% set database = information_schema.database %}
    {{ adapter.verify_database(database) }}

    select
        '{{ database }}' as table_database,
        sch.nspname as table_schema,
        tbl.relname as table_name,
        case tbl.relkind
            when 'v' then 'VIEW'
            else 'BASE TABLE'
        end as table_type,
        tbl_desc.description as table_comment,
        col.attname as column_name,
        col.attnum as column_index,
        pg_catalog.format_type(col.atttypid, col.atttypmod) as column_type,
        col_desc.description as column_comment,
        pg_get_userbyid(tbl.relowner) as table_owner

    from pg_catalog.pg_namespace sch
    join pg_catalog.pg_class tbl on tbl.relnamespace = sch.oid
    join pg_catalog.pg_attribute col on col.attrelid = tbl.oid
    left outer join pg_catalog.pg_description tbl_desc on (tbl_desc.objoid = tbl.oid and tbl_desc.objsubid = 0)
    left outer join pg_catalog.pg_description col_desc on (col_desc.objoid = tbl.oid and col_desc.objsubid = col.attnum)

    where (
        {%- for schema in schemas -%}
          upper(sch.nspname) = upper('{{ schema }}'){%- if not loop.last %} or {% endif -%}
        {%- endfor -%}
      )

    {# Here is where we filter the relations so we do not simply fetch all of them #}
    {%- if relations_in_project | length > 0 %}
      and tbl.relname in (
          {%- for rel in relations_in_project -%}'{{ rel }}'{%- if not loop.last %}, {% endif -%}{%- endfor -%}
      )
    {%- endif %}

      and not pg_is_other_temp_schema(sch.oid) -- not a temporary schema belonging to another session
      and tbl.relpersistence in ('p', 'u') -- [p]ermanent table or [u]nlogged table. Exclude [t]emporary tables
      and tbl.relkind in ('r', 'v', 'f', 'p') -- o[r]dinary table, [v]iew, [f]oreign table, [p]artitioned table. Other values are [i]ndex, [S]equence, [c]omposite type, [t]OAST table, [m]aterialized view
      and col.attnum > 0 -- negative numbers are used for system columns such as oid
      and not col.attisdropped -- column has not been dropped

    order by
        sch.nspname,
        tbl.relname,
        col.attnum

  {%- endcall -%}

  {{ return(load_result('catalog').table) }}

{%- endmacro %}

Diff of customized version vs. the default implementation:

3a4,20
>   {# Here is where we build our list of relations of interest #}
>   {%- set relations_in_project = [] -%}
> 
>   {%- for node in graph.nodes.values() -%}
>     {%- if node.resource_type == 'model' -%}
>       {%- if node.unique_id in selected_resources -%}
>         {%- do relations_in_project.append(node.alias) -%}
>       {%- endif -%}
>     {%- endif -%}
>   {%- endfor -%}
>   {%- for source in graph.sources.values() -%}
>     {%- do relations_in_project.append(source.name) -%}
>   {%- endfor -%}
>   
>   {# Make the list unique #}
>   {%- set relations_in_project = set(relations_in_project) | list -%}
> 
37a55,62
> 
>     {# Here is where we filter the relations so we do not simply fetch all of them #}
>     {%- if relations_in_project | length > 0 %}
>       and tbl.relname in (
>           {%- for rel in relations_in_project -%}'{{ rel }}'{%- if not loop.last %}, {% endif -%}{%- endfor -%}
>       )
>     {%- endif %}
> 

Let us know how it goes if you try it out!

dbeatty10 closed this as not planned Aug 29, 2023
dbeatty10 added the duplicate label and removed the triage label Aug 29, 2023
dbeatty10 removed their assignment Aug 29, 2023
SamuelBohumel (Author) commented:

Hi,
I've tried the macro workaround, and the catalog generation time has decreased dramatically.
Thank you very much!


dbeatty10 commented Sep 6, 2023

See below for a version of get_catalog that improves the time complexity when looking for matches within the selected_resources.

It also includes a stand-alone get_selected_relations macro that can be reused in an adapter-independent manner.

Overall, the implementation still has this remaining issue though:

  • if multiple schemas have tables with the exact same name, and only one is included within a --select, then this implementation of the query will include both
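
One possible way to close that gap (an untested sketch; it assumes node.schema is populated on graph nodes, and it reuses the shape of the get_selected_relations macro below) is to collect schema-qualified pairs and require the where clause to match on both columns:

    {# build (schema, relation) pairs instead of bare relation names #}
    {%- set pairs = [] -%}
    {%- for selected_resource in selected_resources -%}
      {%- if selected_resource in graph.nodes -%}
        {%- set node = graph.nodes[selected_resource] -%}
        {%- if node.resource_type == 'model' -%}
          {%- do pairs.append((node.schema, node.alias)) -%}
        {%- endif -%}
      {%- endif -%}
    {%- endfor -%}

    {# ...then match on both schema and relation name #}
    {%- if pairs | length > 0 %}
      and (
        {%- for s, r in pairs -%}
          (upper(sch.nspname) = upper('{{ s }}') and tbl.relname = '{{ r }}'){%- if not loop.last %} or {% endif -%}
        {%- endfor -%}
      )
    {%- endif %}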

get_catalog restricted to only selected resources

{% macro postgres__get_catalog(information_schema, schemas) -%}

  {%- call statement('catalog', fetch_result=True) -%}
    {#
      If the user has multiple databases set and the first one is wrong, this will fail.
      But we will not fail in the case where there are multiple quoting-difference-only dbs, which is better.
    #}
    {% set database = information_schema.database %}
    {{ adapter.verify_database(database) }}

    select
        '{{ database }}' as table_database,
        sch.nspname as table_schema,
        tbl.relname as table_name,
        case tbl.relkind
            when 'v' then 'VIEW'
            else 'BASE TABLE'
        end as table_type,
        tbl_desc.description as table_comment,
        col.attname as column_name,
        col.attnum as column_index,
        pg_catalog.format_type(col.atttypid, col.atttypmod) as column_type,
        col_desc.description as column_comment,
        pg_get_userbyid(tbl.relowner) as table_owner

    from pg_catalog.pg_namespace sch
    join pg_catalog.pg_class tbl on tbl.relnamespace = sch.oid
    join pg_catalog.pg_attribute col on col.attrelid = tbl.oid
    left outer join pg_catalog.pg_description tbl_desc on (tbl_desc.objoid = tbl.oid and tbl_desc.objsubid = 0)
    left outer join pg_catalog.pg_description col_desc on (col_desc.objoid = tbl.oid and col_desc.objsubid = col.attnum)

    where (
        {%- for schema in schemas -%}
          upper(sch.nspname) = upper('{{ schema }}'){%- if not loop.last %} or {% endif -%}
        {%- endfor -%}
      )

    {# Here is where we filter the relations so we do not simply fetch all of them #}
    {%- set relations_in_project = get_selected_relations() %}
    {%- if relations_in_project | length > 0 %}
      and tbl.relname in (
          {%- for rel in relations_in_project -%}'{{ rel }}'{%- if not loop.last %}, {% endif -%}{%- endfor -%}
      )
    {%- else %}
      -- since no nodes were selected, there should be no matching relations to include in the catalog
      and 0 = 1
    {%- endif %}

      and not pg_is_other_temp_schema(sch.oid) -- not a temporary schema belonging to another session
      and tbl.relpersistence in ('p', 'u') -- [p]ermanent table or [u]nlogged table. Exclude [t]emporary tables
      and tbl.relkind in ('r', 'v', 'f', 'p') -- o[r]dinary table, [v]iew, [f]oreign table, [p]artitioned table. Other values are [i]ndex, [S]equence, [c]omposite type, [t]OAST table, [m]aterialized view
      and col.attnum > 0 -- negative numbers are used for system columns such as oid
      and not col.attisdropped -- column has not been dropped

    order by
        sch.nspname,
        tbl.relname,
        col.attnum

  {%- endcall -%}

  {{ return(load_result('catalog').table) }}

{%- endmacro %}


{% macro get_selected_relations() -%}

  {# Here is where we build our list of relations of interest #}
  {%- set relations_in_project = [] -%}

  {%- for selected_resource in selected_resources -%}

    {%- if selected_resource in graph.nodes -%}
      {%- set node = graph.nodes[selected_resource] -%}
      {%- if node.resource_type == 'model' -%}
        {%- do relations_in_project.append(node.alias) -%}
      {%- endif -%}
  
    {#- -- This section is commented out since sources can't be selected
    {%- elif selected_resource in graph.sources -%}
      {%- set source = graph.sources[selected_resource] -%}
      {%- do relations_in_project.append(source.name) -%}
    -#}

    {%- endif -%}  
  {%- endfor -%}
  
  {# Make the list unique #}
  {%- set relations_in_project = set(relations_in_project) | list -%}

  {{ return(relations_in_project) }}

{%- endmacro %}

Logic to share across adapter implementations

See below for a diff of the customized version vs. the default implementation, which can be used to inform the implementation for each dbt adapter.

The reusable, adapter-independent portion has been abstracted into a stand-alone macro.

1d0
< 
37a37,45
> 
>     {# Here is where we filter the relations so we do not simply fetch all of them #}
>     {%- set relations_in_project = get_selected_relations() %}
>     {%- if relations_in_project | length > 0 %}
>       and tbl.relname in (
>           {%- for rel in relations_in_project -%}'{{ rel }}'{%- if not loop.last %}, {% endif -%}{%- endfor -%}
>       )
>     {%- else %}
>       -- since no nodes were selected, there should be no matching relations to include in the catalog
>       and 0 = 1
>     {%- endif %}
> 
53a62,91
> 
> 
> {% macro get_selected_relations() -%}
> 
>   {# Here is where we build our list of relations of interest #}
>   {%- set relations_in_project = [] -%}
> 
>   {%- for selected_resource in selected_resources -%}
> 
>     {%- if selected_resource in graph.nodes -%}
>       {%- set node = graph.nodes[selected_resource] -%}
>       {%- if node.resource_type == 'model' -%}
>         {%- do relations_in_project.append(node.alias) -%}
>       {%- endif -%}
>   
>     {#- -- This section is commented out since sources can't be selected
>     {%- elif selected_resource in graph.sources -%}
>       {%- set source = graph.sources[selected_resource] -%}
>       {%- do relations_in_project.append(source.name) -%}
>     -#}
> 
>     {%- endif -%}  
>   {%- endfor -%}
>   
>   {# Make the list unique #}
>   {%- set relations_in_project = set(relations_in_project) | list -%}
> 
>   {{ return(relations_in_project) }}
> 
> {%- endmacro %}

dbeatty10 (Contributor) commented:

For anyone following along here: #8772 allows the usage of --select within dbt docs generate, and it is shipping in dbt-core v1.7.
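
For example, once on v1.7 (my_model is a placeholder model name):

    dbt docs generate --select my_model

This should restrict catalog generation to the selected nodes rather than everything in the matched schemas.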
