
[Improvement] too many partitions in int_zendesk__field_history_pivot on bigquery #96

Closed · 2 of 4 tasks
jens-koster opened this issue May 9, 2023 · 5 comments

Labels: priority:p2 (Affects most users; fix needed) · status:accepted (Scoped and accepted into queue) · type:enhancement (New functionality or enhancement) · update_type:models (Primary focus requires model updates)

Comments


jens-koster commented May 9, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

I am getting partly the same error as reported in issue #39, which was closed when the fix in #47 was merged.

I think the problem was fixed for the int_zendesk__field_calendar_spine model but not for int_zendesk__field_history_pivot. The fix introduced a config variable, ticket_field_history_timeframe_years, to limit the number of partitions, but that variable is not used in int_zendesk__field_history_pivot.
I believe all the other partitioned models build from the date spine and have their date range limited by it.

I hacked the model and added the config variable to the where condition in the first CTE, and it seems to do the trick. However, I am not 100% sure what side effects this might have on the data.

with field_history as (

    select
        ticket_id,
        field_name,
        valid_ending_at,
        valid_starting_at

        --Only runs if the user passes updater fields through the final ticket field history model
        {% if var('ticket_field_history_updater_columns') %}
        ,
        {{ var('ticket_field_history_updater_columns') | join (", ")}}

        {% endif %}

        -- doing this to figure out what values are actually null and what needs to be backfilled in zendesk__ticket_field_history
        ,case when value is null then 'is_null' else value end as value

    from {{ ref('int_zendesk__field_history_enriched') }}
    where
        cast( {{ dbt.date_trunc('day', 'valid_starting_at') }} as date) >=
            {{ dbt.dateadd('year', - var('ticket_field_history_timeframe_years', 50), dbt.current_timestamp_backcompat() ) }}
            
    {% if is_incremental() %}
        and cast( {{ dbt.date_trunc('day', 'valid_starting_at') }} as date) >= (select max(date_day) from {{ this }})
    {% endif %}
)

This is the bit I added, patched together from your code in the date spine:

cast( {{ dbt.date_trunc('day', 'valid_starting_at') }} as date) >=
            {{ dbt.dateadd('year', - var('ticket_field_history_timeframe_years', 50), dbt.current_timestamp_backcompat() ) }}

This limits the amount of data we can analyse; to fix it properly, I think the tables should be partitioned by month, since daily partitions hit BigQuery's 4,000-partition limit after about 11 years (but 10 years is good enough for us...).

Relevant error log or model output

03:30:56  Completed with 1 error and 0 warnings:
03:30:56  
03:30:56  Database Error in model int_zendesk__field_history_pivot (models/ticket_history/int_zendesk__field_history_pivot.sql)
03:30:56    Resources exceeded during query execution: Table datawarehouse-328110:dbt_dwh_zendesk_intermediate.int_zendesk__field_history_pivot will have 4001 partitions when the job finishes, exceeding limit 4000. If partitions were recently expired, it may take some time to be reflected unless explicitly deleted.
03:30:56    compiled Code at target/run/zendesk/models/ticket_history/int_zendesk__field_history_pivot.sql

Expected behavior

I expected the config variable to limit the date range in int_zendesk__field_history_pivot.

dbt Project configurations

At the time of running:

vars:
  zendesk:
    ticket_field_history_timeframe_years: 1

models:
  zendesk:
    +schema: zendesk
    intermediate:
      +schema: zendesk_intermediate
    sla_policy:
      +schema: zendesk_intermediate
    ticket_history:
      +schema: zendesk_intermediate
    utils:
      materialized: table

Materializing utils as a table was added during my hacking; BigQuery fails with a resources-exceeded error (too much computation for too little data scanned) unless you do this. The same thing happened in the original issue.

Package versions

  • package: fivetran/zendesk
    version: [">=0.9.0", "<0.11.0"]

What database are you using dbt with?

bigquery

dbt Version

dbt Cloud does not allow running dbt --version, but it's 1.3.

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will need assistance and will schedule time during our office hours for guidance
  • No.
@jens-koster jens-koster added the bug Something isn't working label May 9, 2023
Contributor

fivetran-joemarkiewicz commented May 11, 2023

Hi @jens-koster thanks for opening this issue!

I agree that all the downstream models should pull from the calendar spine, which would reduce the amount of data. I will have to take a deeper look into the models to see whether the solution you provided is the right path forward, or whether there is another option we should consider to help reduce the amount of data in the downstream models.

That being said, you can adjust the partitioning of these incremental models without needing an update to the package. If you want to change the partitioning, you can take an approach similar to the one shown below:

## Your dbt_project.yml

models:
  zendesk:
    zendesk__ticket_field_history:
      +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
    ticket_history:
      int_zendesk__field_calendar_spine:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
      int_zendesk__field_history_pivot:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}

This may help with the "too many partitions" error you are encountering, since monthly granularity divides the partition count by roughly 30 relative to BigQuery's 4,000-partition limit. That being said, I would still like to do a deep dive on the package to understand whether there is somewhere we could optimize the models.

Author

jens-koster commented May 15, 2023

Of course! That's exactly what I'll do :-)
I think that is the best way to deal with 4,000+ days of data.
As for the fix I made, it was a safe way to get things done without affecting anything else. Now that I know the package a bit better, I'd say joining the calendar spine would be the better solution: then everything is controlled by whatever is in the spine, rather than reproducing the configuration logic in a second place. Roughly what I have in mind is sketched below.
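A sketch only (untested against the package; it assumes the spine already honours ticket_field_history_timeframe_years, which it does per the earlier fix):

with spine_bounds as (

    select min(date_day) as first_date
    from {{ ref('int_zendesk__field_calendar_spine') }}

), field_history as (

    select
        ticket_id,
        field_name,
        valid_ending_at,
        valid_starting_at,
        case when value is null then 'is_null' else value end as value

    from {{ ref('int_zendesk__field_history_enriched') }}
    cross join spine_bounds
    -- the spine already limits date_day by ticket_field_history_timeframe_years,
    -- so the pivot inherits the configured range instead of restating it
    where cast( {{ dbt.date_trunc('day', 'valid_starting_at') }} as date) >= spine_bounds.first_date
)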

Going to monthly partitioning, I am also adding clustering on the date_day column via +cluster_by: 'date_day' (see the snippet below).
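For illustration, the partition override from the comment above combined with clustering might look like this (a sketch; adjust the list of models to whichever ones you override):

models:
  zendesk:
    ticket_history:
      int_zendesk__field_history_pivot:
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
        +cluster_by: 'date_day'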

For anyone using this solution: you might run into the BigQuery limitation of using too much processing for too little data scanned. I found it helped to materialize the calendar spine as a table (and, for good measure, to partition it). This problem was also encountered in the original bug report.

    utils:
      int_zendesk__calendar_spine:
        +materialized: table
        +partition_by: {'field': 'date_day', 'data_type': 'date', 'granularity': 'month'}
        +cluster_by: 'date_day'

Thank you so much!
Jens

Contributor

Thanks for sharing that the adjustment in your dbt_project.yml fixed the immediate issue!

I also agree that this is something we should improve within the package so that others don't need to make this specific override. And I think you proposed a viable solution: leveraging the already existing calendar spine should help resolve the issue. That being said, I will want to do a bit more investigating to ensure this doesn't have any unforeseen consequences.

For the time being, I will mark this as a feature request for my team to look into improving the model. If anyone else comes across a similar "too many partitions" error, you can use the same approach above to resolve it. I will post back here once we are able to validate the suggestion of joining in the calendar spine as a more permanent solution.

@fivetran-joemarkiewicz fivetran-joemarkiewicz added priority:p2 Affects most users; fix needed type:enhancement New functionality or enhancement status:scoping Currently being scoped update_type:models Primary focus requires model updates and removed bug Something isn't working labels Jun 27, 2023
@fivetran-joemarkiewicz fivetran-joemarkiewicz changed the title [Bug] too many partitions in int_zendesk__field_history_pivot on bigquery [Improvement] too many partitions in int_zendesk__field_history_pivot on bigquery Jun 27, 2023
Contributor

Hi All,

I wanted to post back here and share that we have seen this issue appear a few more times and we're going to explore a more permanent solution. I will be adding this to our upcoming sprint, and will share the plan as we develop it in the coming weeks.

@fivetran-joemarkiewicz fivetran-joemarkiewicz added status:accepted Scoped and accepted into queue and removed status:scoping Currently being scoped labels Aug 14, 2024
Contributor

This update has since been rolled out to the package by default within PR #169! As such, I will close out this improvement.
