
Generating docs takes a long time #1576

Closed
whittid4 opened this issue Jun 20, 2019 · 13 comments · Fixed by #1795

Comments

@whittid4

When running dbt docs generate, it takes over 10 minutes to complete and uses a lot of CPU resources producing the catalog.

When running with debug turned on the following gets written out:

2019-06-20 16:57:55,546 (MainThread): On "<None>": cache miss for schema "my_project.my_dataset", this is inefficient

along with:

2019-06-20 16:57:56,829 (MainThread): with schema=99439170, model_name=None, relations=[
<BigQueryRelation my_project.my_dataset.`cloudaudit_googleapis_com_data_access_20161114`>, 
<BigQueryRelation my_project.my_dataset.`cloudaudit_googleapis_com_data_access_20161115`>
.....

which is a date sharded table containing hundreds of shards.

The interesting thing is that cloudaudit_googleapis_com_data_access is the only table in the dataset that is not used or referenced by dbt. In the same dataset there are 2 other date-sharded tables containing over a thousand shards, which are referenced as sources in dbt, but they are not printed out:

sources:
  - name: google_analytics
    database: my_project
    schema: my_dataset
    loader: ga360
    tables:
      - name: ga_sessions_*
      - name: ga_sessions_intraday_*
@drewbanin
Contributor

Thanks for the report @whittid4 - we'll check it out! My initial thinking:

  • high CPU usage sounds like an agate problem? I recall this being an issue with seeds, but not sure if that's exactly what's happening here
  • we're probably iterating over the individual date shards, which is probably undesirable

@drewbanin drewbanin transferred this issue from dbt-labs/dbt-docs Jun 27, 2019
@EricLeer

I have run into a similar problem. I also have dbt set up to read from a BQ database with a lot of date-sharded tables. Running dbt docs generate takes over an hour for me. There is no real noticeable CPU load though.

In the dbt logs the following is repeated over 50,000 times:

2019-06-26 15:34:11,918 (MainThread): /home/venv/lib/python3.6/site-packages/dbt/adapters/bigquery/impl.py:465: PendingDeprecationWarning: This method will be deprecated in future versions. Please use Table.time_partitioning.type_ instead.

@drewbanin
Contributor

drewbanin commented Jun 27, 2019

Thanks for the additional info @EricLeer!

It's pretty clear to me that we need dbt to be smarter about catalog generation with date-sharded tables on BQ. I don't think it makes sense for dbt to know about every single individual date shard -- these are probably source tables, and the BQ interface is probably better suited for exploring these shards than dbt docs is, at least currently.

Presently, dbt is using an API method to fetch all of the tables/views in every dataset that dbt touches. This means that dbt will fundamentally need to pull down every single one of these tables, then maybe distill the date-sharded tables down to a single source table in-memory.

Maybe an alternative approach is to use BQ's new-ish information schema? I think with some clever SQL, we can push a lot of this filtering into the BQ layer. This should make dbt docs generate a lot snappier, as well as bring BQ in line with the rest of dbt's plugins (for the most part).

I just played around with some code here, what do you guys think of something like this?

with base as (

  select *,
    REGEXP_CONTAINS(table_name, '^.+[0-9]{8}$') as is_date_shard,
    REGEXP_EXTRACT(table_name, '^(.+)[0-9]{8}$') as base_name,
    REGEXP_EXTRACT(table_name, '^.+([0-9]{8})$') as shard_name

  FROM dbt_dbanin.INFORMATION_SCHEMA.TABLES

),

extracted as (

  select *,
    coalesce(base_name, table_name) as root_table_name

  from base

),

unsharded as (

  select
    table_catalog,
    table_schema,
    root_table_name as table_name,
    
    row_number() over (partition by root_table_name order by shard_name desc) as shard_index,
    min(shard_name) over (partition by root_table_name) as first_shard,
    max(shard_name) over (partition by root_table_name) as last_shard,
    count(*) over (partition by root_table_name) as num_shards

  from extracted
  
)

select *
from unsharded
where shard_index = 1
order by table_name

This query will find all of the tables/views in a given dataset, then squash down date shards into a single "root" record. We can pluck out the individual shard names if necessary, maybe aggregating them into an array in SQL? That would be pretty slick.
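Something like this could do the aggregation (just a sketch on top of the extracted CTE above, not tested):

select
  table_catalog,
  table_schema,
  root_table_name as table_name,
  min(shard_name) as first_shard,
  max(shard_name) as last_shard,
  count(*) as num_shards,
  -- collect the individual shard suffixes into an array; ignore nulls skips unsharded tables
  array_agg(shard_name ignore nulls order by shard_name) as shard_names
from extracted
group by 1, 2, 3
order by table_name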

Curious to hear what you think!

cc @whittid4

@drewbanin drewbanin added this to the Louisa May Alcott milestone Jun 27, 2019
@EricLeer

I think this would work. For one of my datasets this would reduce the number of tables searched from 40000 to 400.

On the other hand I think you would still run into the same problem if the tables are sharded in a different way than a yyyymmdd date. In principle, when querying a wildcard table any pattern is possible, and I think it would be best if this behaviour is also supported by this solution. Maybe the correct information on what to collapse the shards on can be gathered from the schema.yml file where the source is defined? For instance I have defined my source as:

  - name: dataset_name
    tables:
      - name: table_name
        identifier: table_name_*

and thus I would expect the shards to collapse on table_name_.
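Just to illustrate the idea (a hypothetical sketch, assuming the identifier glob from schema.yml gets turned into a regex):

select
  -- collapse on the declared identifier prefix (table_name_*) rather than an 8-digit date suffix
  coalesce(regexp_extract(table_name, r'^(table_name_).+$'), table_name) as root_table_name,
  count(*) as num_shards
from dataset_name.INFORMATION_SCHEMA.TABLES
group by root_table_name
order by root_table_name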

Finally what is the reason that dbt fetches all tables/views in a dataset it touches? Wouldn't it make more sense to only fetch the tables/views that are actually defined as a source?

@drewbanin
Contributor

That's a really good point - I didn't realize that you could shard on strings other than an 8-char date suffix. I agree - generating some sort of glob query from the specified identifier is probably a good idea.

dbt tries to find all of the tables in a given BQ dataset that match up with models and sources defined in the active dbt project. By querying for all of the relations in a dataset at once, dbt can make one query per dataset referenced in the dbt project. If dbt instead queried for each source/model individually, then dbt would need to execute orders of magnitude more queries like this in the general case. Imagine you had 30 source tables defined in a single dataset -- all things considered, I'd rather make one query against that dataset than 30!
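To sketch what that could look like (purely illustrative, reusing the ga_sessions sources from earlier in this thread), the single per-dataset query could still be filtered down to the declared identifiers:

-- one query per dataset, restricted to relations that match declared identifiers
select table_catalog, table_schema, table_name
from my_dataset.INFORMATION_SCHEMA.TABLES
where regexp_contains(table_name, r'^ga_sessions_.*')            -- from identifier ga_sessions_*
   or regexp_contains(table_name, r'^ga_sessions_intraday_.*')   -- from identifier ga_sessions_intraday_*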

@whittid4
Author

Sorry for the delay in getting back to you.

@drewbanin, the example SQL you shared above will work for our case as we are only using the standard date sharding method, but as @EricLeer points out it would be nice to make this solution fit most use cases.

@cmcarthur cmcarthur removed this from the Louisa May Alcott milestone Sep 25, 2019
@mr2dark

mr2dark commented Sep 26, 2019

I tried pointing to a specific table during doc generation with dbt 0.14.2, but it didn't help.
It takes hours to process a referenced dataset with a thousand (date) sharded tables, even if you only reference a specific non-wildcard table in that dataset.
In my case it takes way too long to generate docs for both dataset_1 and dataset_2 (please see the example YAML below). Both contain more than a thousand sharded tables.

sources:

  - name: dataset_1
    database: main_database
    schema: dataset_1_{{ var("tenant_id") }}
    tables:

      - name: table_1
        identifier: table_1_{{ var("tenant_id") }}_{{ "*" if not var("doc_gen", False) else "20190901" }}

  - name: dataset_2
    database: main_database
    schema: dataset_2
    tables:

      - name: table_2
        identifier: table_2
        columns:
          - name: table_id

@drewbanin
Contributor

Hey @mr2dark - I spent some time yesterday pulling together a more complete "catalog" query for BigQuery which leverages the INFORMATION_SCHEMA. TBD if we're able to slot that in for the 0.15.0 release (we removed it from the milestone, for now) but I'd love to ship it if we can!

Are you able/interested to help test out a branch of dbt which introduces this logic?

@mr2dark

mr2dark commented Sep 27, 2019

@drewbanin I'll be glad to help. How can I test that?

BTW it looks like my estimate was wrong. It can take 30 minutes, and the cases where I didn't receive results for hours were probably caused by power/network issues when my laptop went to sleep and/or was disconnected from the network for a while. I'll try to reproduce that when I have some spare time.
But anyway, building a catalog for 30 minutes when you actually use a couple of tables feels like overkill.

@drewbanin
Contributor

Cool! I'll follow up with a branch name and some more info when I have something to show for myself :)

@drewbanin
Contributor

hey @mr2dark - check out the PR here: #1795

This needs a little more love around automated testing, but the idea is there. Let me know if you're able to give it a spin locally. My hope is that the time to generate docs drops from 30 mins to a single-digit number of seconds.

Thanks!

@mr2dark

mr2dark commented Sep 29, 2019

I've left a comment in #1795

drewbanin added a commit that referenced this issue Oct 15, 2019
@drewbanin drewbanin added this to the Louisa May Alcott milestone Nov 18, 2019
@chaos87

chaos87 commented Jan 18, 2022

Hello there👋

I am facing the same issue when using v1.0.0

I see there was a PR #1795 mentioned just above, so I thought this would have been fixed, but it seems not.

I declared ga_sessions_* as a source and we have over 10k tables. It takes around 30 minutes to generate the catalog.
