
[Improvement] too many partitions in int_zendesk__field_history_pivot on bigquery #96

Closed · 2 of 4 tasks
jens-koster opened this issue May 9, 2023 · 5 comments

Labels: priority:p2 (Affects most users; fix needed) · status:accepted (Scoped and accepted into queue) · type:enhancement (New functionality or enhancement) · update_type:models (Primary focus requires model updates)

Comments


jens-koster commented May 9, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

I am getting partly the same error as reported in issue #39, which was closed when the fix in #47 was merged.

I think the problem was fixed for the int_zendesk__field_calendar_spine model but not for int_zendesk__field_history_pivot. The fix introduced a config variable, ticket_field_history_timeframe_years, to limit the number of partitions, but that variable is not used in int_zendesk__field_history_pivot.
I believe all the other partitioned models build from the date spine and have their date range limited by it.

I hacked the model and added the config variable to the where condition in the first CTE, and it seems to do the trick. However, I am not 100% sure what side effects this might have on the data.

with field_history as (

    select
        ticket_id,
        field_name,
        valid_ending_at,
        valid_starting_at

        --Only runs if the user passes updater fields through the final ticket field history model
        {% if var('ticket_field_history_updater_columns') %}
        ,
        {{ var('ticket_field_history_updater_columns') | join (", ")}}

        {% endif %}

        -- doing this to figure out what values are actually null and what needs to be backfilled in zendesk__ticket_field_history
        ,case when value is null then 'is_null' else value end as value

    from {{ ref('int_zendesk__field_history_enriched') }}
    where
        cast( {{ dbt.date_trunc('day', 'valid_starting_at') }} as date) >=
            {{ dbt.dateadd('year', - var('ticket_field_history_timeframe_years', 50), dbt.current_timestamp_backcompat() ) }}
            
    {% if is_incremental() %}
        and cast( {{ dbt.date_trunc('day', 'valid_starting_at') }} as date) >= (select max(date_day) from {{ this }})
    {% endif %}
)

This is the bit I added, patched together from your code in the date spine:

cast( {{ dbt.date_trunc('day', 'valid_starting_at') }} as date) >=
            {{ dbt.dateadd('year', - var('ticket_field_history_timeframe_years', 50), dbt.current_timestamp_backcompat() ) }}

This limits the amount of data we can analyse; to fix it properly, I think the tables should be partitioned by month, since daily partitions hit BigQuery's 4,000-partition limit after about 11 years (but 10 years is good enough for us...).

Relevant error log or model output

03:30:56  Completed with 1 error and 0 warnings:
03:30:56  
03:30:56  Database Error in model int_zendesk__field_history_pivot (models/ticket_history/int_zendesk__field_history_pivot.sql)
03:30:56    Resources exceeded during query execution: Table datawarehouse-328110:dbt_dwh_zendesk_intermediate.int_zendesk__field_history_pivot will have 4001 partitions when the job finishes, exceeding limit 4000. If partitions were recently expired, it may take some time to be reflected unless explicitly deleted.
03:30:56    compiled Code at target/run/zendesk/models/ticket_history/int_zendesk__field_history_pivot.sql

Expected behavior

I expected the config variable to limit the date range in int_zendesk__field_history_pivot.

dbt Project configurations

At the time of running:

vars:
  zendesk:
    ticket_field_history_timeframe_years: 1

models:
  zendesk:
    +schema: zendesk
    intermediate:
      +schema: zendesk_intermediate
    sla_policy:
      +schema: zendesk_intermediate
    ticket_history:
      +schema: zendesk_intermediate
    utils:
      materialized: table

Materializing utils as a table was added during my hacking; BigQuery fails with a resources-exceeded error (too much computation for too little data scanned) unless you do this. The same thing happened in the original issue.

Package versions

  • package: fivetran/zendesk
    version: [">=0.9.0", "<0.11.0"]

What database are you using dbt with?

bigquery

dbt Version

dbt Cloud does not allow running dbt --version, but it's 1.3.

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will need assistance and will schedule time during our office hours for guidance
  • No.
@jens-koster jens-koster added the bug Something isn't working label May 9, 2023
Contributor

fivetran-joemarkiewicz commented May 11, 2023

Hi @jens-koster thanks for opening this issue!

I agree that all the downstream models should pull from the calendar spine, which would reduce the amount of data. I will have to take a deeper look into the models to see whether the solution you provided is the right path forward, or whether there is another option we should consider to help reduce the amount of data in the downstream models.

That being said, you can adjust the partitioning of these incremental models without needing an update to the package. If you want to change the partitioning, you can take an approach similar to the one shown below:

## Your dbt_project.yml

models:
  zendesk:
    zendesk__ticket_field_history:
      +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
    ticket_history:
      int_zendesk__field_calendar_spine:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
      int_zendesk__field_history_pivot:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}

This may help with the "too many partitions" error you are encountering, since monthly granularity divides the partition count by roughly 30 relative to BigQuery's 4,000-partition limit. That being said, I would still like to do a deep dive on the package to understand whether there is somewhere we could optimize the models.

Author

jens-koster commented May 15, 2023

Of course! That's exactly what I'll do :-)
I think that is the best way to deal with 4,000+ days of data.
As for the fix I made, it was a safe way to get things done without affecting anything else. Now that I know the package a bit better, I'd say joining the calendar spine would be the better solution: then everything is controlled by whatever is in the spine, rather than reproducing the configuration logic in a second place. Roughly what I have in mind is sketched below.
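A sketch only (untested against the package; it assumes the spine already honours ticket_field_history_timeframe_years, which it does per the earlier fix):

with spine_bounds as (

    select min(date_day) as first_date
    from {{ ref('int_zendesk__field_calendar_spine') }}

), field_history as (

    select
        ticket_id,
        field_name,
        valid_ending_at,
        valid_starting_at,
        case when value is null then 'is_null' else value end as value

    from {{ ref('int_zendesk__field_history_enriched') }}
    cross join spine_bounds
    -- the spine already limits date_day by ticket_field_history_timeframe_years,
    -- so the pivot inherits the configured range instead of restating it
    where cast( {{ dbt.date_trunc('day', 'valid_starting_at') }} as date) >= spine_bounds.first_date
)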

Going to monthly partitioning, I am also adding clustering on the date_day column via +cluster_by: 'date_day' (see the snippet below).
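For illustration, the partition override from the comment above combined with clustering might look like this (a sketch; adjust the list of models to whichever ones you override):

models:
  zendesk:
    ticket_history:
      int_zendesk__field_history_pivot:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
        +cluster_by: 'date_day'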

For anyone using this solution: you might run into the BigQuery limitation of using too much processing for too little data scanned. I found it helped to materialize the calendar spine as a table (and, for good measure, to partition it). This problem was also encountered in the original bug report.

    utils:
      int_zendesk__calendar_spine:
        +materialized: table
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
        +cluster_by: 'date_day'

Thank you so much!
Jens

Contributor

Thanks for sharing that the adjustment in your dbt_project.yml fixed the immediate issue!

I also agree that this is something we should improve within the package so that others don't need to make this specific override. And I think you proposed a viable solution: leveraging the already existing calendar spine should help resolve the issue. That being said, I will want to do a bit more investigating to ensure this doesn't have any unforeseen consequences.

For the time being, I will mark this as a feature request for my team to look into improving the model. If anyone else comes across a similar "too many partitions" error, you can use the same approach above to resolve it. I will post back here once we are able to validate the suggestion of joining in the calendar spine as a more permanent solution.

@fivetran-joemarkiewicz fivetran-joemarkiewicz added priority:p2 Affects most users; fix needed type:enhancement New functionality or enhancement status:scoping Currently being scoped update_type:models Primary focus requires model updates and removed bug Something isn't working labels Jun 27, 2023
@fivetran-joemarkiewicz fivetran-joemarkiewicz changed the title [Bug] too many partitions in int_zendesk__field_history_pivot on bigquery [Improvement] too many partitions in int_zendesk__field_history_pivot on bigquery Jun 27, 2023
Contributor

Hi All,

I wanted to post back here and share that we have seen this issue appear a few more times and we're going to explore a more permanent solution. I will be adding this to our upcoming sprint, and will share the plan as we develop it in the coming weeks.

@fivetran-joemarkiewicz fivetran-joemarkiewicz added status:accepted Scoped and accepted into queue and removed status:scoping Currently being scoped labels Aug 14, 2024
Contributor

This update has since been rolled out to the package by default within PR #169! As such, I will close out this improvement.
