
Generating docs takes a long time #1576

Closed
whittid4 opened this issue Jun 20, 2019 · 13 comments · Fixed by #1795

Comments

@whittid4

When running dbt docs generate, it takes over 10 minutes to complete and uses a lot of CPU resources producing the catalog.

When running with debug turned on the following gets written out:

2019-06-20 16:57:55,546 (MainThread): On "<None>": cache miss for schema "my_project.my_dataset", this is inefficient

along with:

2019-06-20 16:57:56,829 (MainThread): with schema=99439170, model_name=None, relations=[
<BigQueryRelation my_project.my_dataset.`cloudaudit_googleapis_com_data_access_20161114`>, 
<BigQueryRelation my_project.my_dataset.`cloudaudit_googleapis_com_data_access_20161115`>
.....

which is a date sharded table containing hundreds of shards.

The interesting thing is that cloudaudit_googleapis_com_data_access is the only table in the dataset that is not used or referenced by dbt. In the same dataset there are 2 other date-sharded tables containing over a thousand shards, which are referenced as sources in dbt, but they are not printed out:

sources:
  - name: google_analytics
    database: my_project
    schema: my_dataset
    loader: ga360
    tables:
      - name: ga_sessions_*
      - name: ga_sessions_intraday_*
@drewbanin
Contributor

Thanks for the report @whittid4 - we'll check it out! My initial thinking:

  • high CPU usage sounds like an agate problem? I recall this being an issue with seeds, but not sure if that's exactly what's happening here
  • we're probably iterating over the individual date shards, which is probably undesirable

@drewbanin drewbanin transferred this issue from dbt-labs/dbt-docs Jun 27, 2019
@EricLeer

I have run into a similar problem. I also have dbt set up to read from a BQ database with a lot of date-sharded tables. Running dbt docs generate takes over an hour for me. There is no real noticeable CPU load though.

In the dbt logs the following is repeated over 50,000 times:

2019-06-26 15:34:11,918 (MainThread): /home/venv/lib/python3.6/site-packages/dbt/adapters/bigquery/impl.py:465: PendingDeprecationWarning: This method will be deprecated in future versions. Please use Table.time_partitioning.type_ instead.

@drewbanin
Contributor

drewbanin commented Jun 27, 2019

Thanks for the additional info @EricLeer!

It's pretty clear to me that we need dbt to be smarter about catalog generation with date-sharded tables on BQ. I don't think it makes sense for dbt to know about every single individual date shard -- these are probably source tables, and the BQ interface is probably better suited for exploring these shards than dbt docs is, at least currently.

Presently, dbt is using an API method to fetch all of the tables/views in every dataset that dbt touches. This means that dbt will fundamentally need to pull down every single one of these tables, then maybe distill the date-sharded tables down to a single source table in-memory.

Maybe an alternative approach is to use BQ's new-ish information schema? I think with some clever SQL, we can push a lot of this filtering into the BQ layer. This should make dbt docs generate a lot snappier, as well as bring BQ in line with the rest of dbt's plugins (for the most part).

I just played around with some code here, what do you guys think of something like this?

with base as (

  select *,
    REGEXP_CONTAINS(table_name, '^.+[0-9]{8}$') as is_date_shard,
    REGEXP_EXTRACT(table_name, '^(.+)[0-9]{8}$') as base_name,
    REGEXP_EXTRACT(table_name, '^.+([0-9]{8})$') as shard_name

  FROM dbt_dbanin.INFORMATION_SCHEMA.TABLES

),

extracted as (

  select *,
    coalesce(base_name, table_name) as root_table_name

  from base

),

unsharded as (

  select
    table_catalog,
    table_schema,
    root_table_name as table_name,
    
    row_number() over (partition by root_table_name order by shard_name desc) as shard_index,
    min(shard_name) over (partition by root_table_name) as first_shard,
    max(shard_name) over (partition by root_table_name) as last_shard,
    count(*) over (partition by root_table_name) as num_shards

  from extracted
  
)

select *
from unsharded
where shard_index = 1
order by table_name

This query will find all of the tables/views in a given dataset, then squash down date shards into a single "root" record. We can pluck out the individual shard names if necessary, maybe aggregating them into an array in SQL? That would be pretty slick.
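Something like this could do the aggregation (just a sketch on top of the extracted CTE above, not tested):

select
  table_catalog,
  table_schema,
  root_table_name as table_name,
  min(shard_name) as first_shard,
  max(shard_name) as last_shard,
  count(*) as num_shards,
  -- collect the individual shard suffixes into an array; ignore nulls skips unsharded tables
  array_agg(shard_name ignore nulls order by shard_name) as shard_names
from extracted
group by 1, 2, 3
order by table_name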

Curious to hear what you think!

cc @whittid4

@drewbanin drewbanin added this to the Louisa May Alcott milestone Jun 27, 2019
@EricLeer

I think this would work. For one of my datasets this would reduce the number of tables searched from 40000 to 400.

On the other hand I think you would still run into the same problem if the tables are sharded in a different way than a yyyymmdd date. In principle, when querying a wildcard table any pattern is possible, and I think it would be best if this behaviour is also supported by this solution. Maybe the correct information on what to collapse the shards on can be gathered from the schema.yml file where the source is defined? For instance I have defined my source as:

  - name: dataset_name
    tables:
      - name: table_name
        identifier: table_name_*

and thus I would expect the shards to collapse on table_name_.
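Just to illustrate the idea (a hypothetical sketch, assuming the identifier glob from schema.yml gets turned into a regex):

select
  -- collapse on the declared identifier prefix (table_name_*) rather than an 8-digit date suffix
  coalesce(regexp_extract(table_name, r'^(table_name_).+$'), table_name) as root_table_name,
  count(*) as num_shards
from dataset_name.INFORMATION_SCHEMA.TABLES
group by root_table_name
order by root_table_name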

Finally what is the reason that dbt fetches all tables/views in a dataset it touches? Wouldn't it make more sense to only fetch the tables/views that are actually defined as a source?

@drewbanin
Contributor

That's a really good point - I didn't realize that you could shard on strings other than an 8-char date suffix. I agree - generating some sort of glob query from the specified identifier is probably a good idea.

dbt tries to find all of the tables in a given BQ dataset that match up with models and sources defined in the active dbt project. By querying for all of the relations in a dataset at once, dbt can make one query per dataset referenced in the dbt project. If dbt instead queried for each source/model individually, then dbt would need to execute orders of magnitude more queries like this in the general case. Imagine you had 30 source tables defined in a single dataset -- all things considered, I'd rather make one query against that dataset than 30!
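To sketch what that could look like (purely illustrative, reusing the ga_sessions sources from earlier in this thread), the single per-dataset query could still be filtered down to the declared identifiers:

-- one query per dataset, restricted to relations that match declared identifiers
select table_catalog, table_schema, table_name
from my_dataset.INFORMATION_SCHEMA.TABLES
where regexp_contains(table_name, r'^ga_sessions_.*')            -- from identifier ga_sessions_*
   or regexp_contains(table_name, r'^ga_sessions_intraday_.*')   -- from identifier ga_sessions_intraday_*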

@whittid4
Author

Sorry for the delay in getting back to you.

@drewbanin, the example SQL you shared above will work for our case as we are only using the standard date sharding method, but as @EricLeer points out it would be nice to make this solution fit most use cases.

@cmcarthur cmcarthur removed this from the Louisa May Alcott milestone Sep 25, 2019
@mr2dark

mr2dark commented Sep 26, 2019

I tried pointing to a specific table during doc generation with dbt 0.14.2, but it didn't help.
It takes hours to process a referenced dataset with a thousand (date) sharded tables, even if you only reference a specific non-wildcard table in that dataset.
In my case it takes way too long to generate docs for both dataset_1 and dataset_2 (please see the example YAML below). Both contain more than a thousand sharded tables.

sources:

  - name: dataset_1
    database: main_database
    schema: dataset_1_{{ var("tenant_id") }}
    tables:

      - name: table_1
        identifier: table_1_{{ var("tenant_id") }}_{{ "*" if not var("doc_gen", False) else "20190901" }}

  - name: dataset_2
    database: main_database
    schema: dataset_2
    tables:

      - name: table_2
        identifier: table_2
        columns:
          - name: table_id

@drewbanin
Contributor

Hey @mr2dark - I spent some time yesterday pulling together a more complete "catalog" query for BigQuery which leverages the INFORMATION_SCHEMA. TBD if we're able to slot that in for the 0.15.0 release (we removed it from the milestone, for now) but I'd love to ship it if we can!

Are you able/interested to help test out a branch of dbt which introduces this logic?

@mr2dark

mr2dark commented Sep 27, 2019

@drewbanin I'll be glad to help. How can I test that?

BTW it looks like my estimate was wrong. It can take 30 minutes, and the cases where I didn't receive results for hours were probably caused by power/network issues when my laptop went to sleep and/or was disconnected from the network for a while. I'll try to reproduce that when I have some spare time.
But anyway, building a catalog for 30 minutes when you actually use a couple of tables feels like overkill.

@drewbanin
Contributor

Cool! I'll follow up with a branch name and some more info when I have something to show for myself :)

@drewbanin
Contributor

hey @mr2dark - check out the PR here: #1795

This needs a little more love around automated testing, but the idea is there. Let me know if you're able to give it a spin locally. My hope is that the time to generate docs drops from 30 mins to a single-digit number of seconds.

Thanks!

@mr2dark

mr2dark commented Sep 29, 2019

I've left a comment in #1795

drewbanin added a commit that referenced this issue Oct 15, 2019
@drewbanin drewbanin added this to the Louisa May Alcott milestone Nov 18, 2019
@chaos87

chaos87 commented Jan 18, 2022

Hello there👋

I am facing the same issue when using v1.0.0

I see there was a PR #1795 mentioned just above, so I thought this would have been fixed, but it seems not.

I declared ga_sessions_* as a source and we have over 10k tables. It takes around 30 minutes to generate the catalog.
