fivetran · fivetran-catfritz · Sep 3, 2024 · Aug 13, 2024
diff --git a/.buildkite/pipeline.yml b/.buildkite/pipeline.yml
@@ -58,7 +58,7 @@ steps:
     commands: |
       bash .buildkite/scripts/run_models.sh redshift
 
-  - label: ":bricks: Run Tests - Databricks"
+  - label: ":databricks: Run Tests - Databricks"
     key: "run_dbt_databricks"
     plugins:
       - docker#v3.13.0:

diff --git a/.buildkite/scripts/run_models.sh b/.buildkite/scripts/run_models.sh
@@ -19,7 +19,7 @@ dbt deps
 dbt seed --target "$db" --full-refresh
 dbt run --target "$db" --full-refresh
 dbt test --target "$db"
-dbt run --vars '{using_schedules: false, using_domain_names: false, using_user_tags: false, using_ticket_form_history: false, using_organization_tags: false}' --target "$db" --full-refresh
+dbt run --vars '{zendesk__unstructured_enabled: true, using_schedules: false, using_domain_names: false, using_user_tags: false, using_ticket_form_history: false, using_organization_tags: false}' --target "$db" --full-refresh
 dbt test --target "$db"
 
 dbt run-operation fivetran_utils.drop_schemas_automation --target "$db"
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,22 @@
+# dbt_zendesk v0.17.0
+[PR #161](https://github.com/fivetran/dbt_zendesk/pull/161) includes the following updates:
+## New model
+- Addition of the `zendesk__document` model, designed to structure Zendesk textual data for vectorization and integration into NLP workflows. The model outputs a table with:
+  - `document_id`: Corresponding to the `ticket_id`
+  - `chunk_index`: For text segmentation
+  - `chunk`: The text chunk itself
+  - `chunk_tokens_approximate`: Approximate token count for each segment
+- This model is disabled by default. To enable it, set the `zendesk__unstructured_enabled` variable to `true` in your `dbt_project.yml`.
+  - The model limits chunk sizes to approximately 5,000 tokens for use with OpenAI models. You can change this limit by setting the `zendesk_max_tokens` variable in your `dbt_project.yml`.
+  - See the README section [Enabling the unstructured document model for NLP](https://github.com/fivetran/dbt_zendesk/blob/main/README.md#enabling-the-unstructured-document-model-for-nlp) for more information.
+
+## Breaking changes
+- In the [dbt_zendesk_source v0.12.0 release](https://github.com/fivetran/dbt_zendesk_source/releases/tag/v0.12.0), the field `_fivetran_deleted` was added to the following models for use in the `zendesk__document` model:
+  - `stg_zendesk__ticket`
+  - `stg_zendesk__ticket_comment`
+  - `stg_zendesk__user`
+  - If you have already added `_fivetran_deleted` as a passthrough column via the `zendesk__ticket_passthrough_columns` or `zendesk__user_passthrough_columns` variable, you will need to remove or alias this field from the variable to avoid duplicate column errors.
+
 # dbt_zendesk v0.16.0
 ## 🚨 Minor Upgrade 🚨
 Although this update is not a breaking change, it will likely impact the output of the `zendesk__sla_policies` and `zendesk__sla_metrics` models. [PR #154](https://github.com/fivetran/dbt_zendesk/pull/154) includes the following changes:

diff --git a/README.md b/README.md
@@ -37,6 +37,7 @@ The following table provides a detailed list of final models materialized within
 | [zendesk__ticket_backlog](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_backlog)           | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable for all backlog tickets. Backlog tickets being defined as any ticket not in a 'closed', 'deleted', or 'solved' status.                                                             |
 | [zendesk__ticket_field_history](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_field_history) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable and the corresponding updater fields defined in the `ticket_field_history_updater_columns` variable.                                                        |
 | [zendesk__sla_policies](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__sla_policies)           | Each record represents an SLA policy event and additional sla breach and achievement metrics. Calendar and business hour SLA breaches are supported.    
+| zendesk__document | Each record represents a chunk of text from ticket data, prepared for vectorization. It includes fields for use in NLP workflows. Disabled by default. |
 
 Many of the above reports are now configurable for [visualization via Streamlit](https://github.com/fivetran/streamlit_zendesk)! Check out some [sample reports here](https://fivetran-zendesk.streamlit.app/).
 
@@ -64,7 +65,7 @@ Include the following zendesk package version in your `packages.yml` file:
 ```yml
 packages:
   - package: fivetran/zendesk
-    version: [">=0.16.0", "<0.17.0"]
+    version: [">=0.17.0", "<0.18.0"]
 
 ```
 > **Note**: Do not include the Zendesk Support source package. The Zendesk Support transform package already has a dependency on the source in its own `packages.yml` file.
@@ -91,6 +92,23 @@ vars:
 
 ## (Optional) Step 5: Additional configurations
 
+#### Enabling the unstructured document model for NLP
+This package includes the `zendesk__document` model, which processes and segments Zendesk text data for vectorization, making it suitable for NLP workflows. The model outputs structured chunks of text with associated document IDs, chunk indices, and approximate token counts. This model is disabled by default. To enable it, set the `zendesk__unstructured_enabled` variable to `true` in your `dbt_project.yml`:
+
+```yml
+vars:
+  zendesk__unstructured_enabled: true # false by default.
+```
+
+##### Customizing Chunk Size for Vectorization
+
+The `zendesk__document` model limits chunk sizes to approximately 5,000 tokens, a size chosen for use with OpenAI models. You can adjust this limit by setting the `zendesk_max_tokens` variable in your `dbt_project.yml`:
+
+```yml
+vars:
+  zendesk_max_tokens: 5000 # Default value
+```
+
 ### Add passthrough columns
 This package includes all source columns defined in the macros folder. You can add more columns from the `TICKET`, `USER`, and `ORGANIZATION` tables using our pass-through column variables.
 
@@ -211,7 +229,7 @@ This dbt package is dependent on the following dbt packages. Please be aware tha
 ```yml
 packages:
     - package: fivetran/zendesk_source
-      version: [">=0.11.0", "<0.12.0"]
+      version: [">=0.12.0", "<0.13.0"]
 
     - package: fivetran/fivetran_utils
       version: [">=0.4.0", "<0.5.0"]

diff --git a/dbt_project.yml b/dbt_project.yml
@@ -1,5 +1,5 @@
 name: 'zendesk'
-version: '0.16.0'
+version: '0.17.0'
 
 
 config-version: 2
@@ -24,6 +24,9 @@ models:
     ticket_history:
       +schema: zendesk_intermediate
       +materialized: ephemeral
+    unstructured:
+      +schema: zendesk_unstructured
+      +materialized: table
     utils:
       +materialized: ephemeral
 vars:

diff --git a/docs/index.html b/docs/index.html
diff --git a/integration_tests/dbt_project.yml b/integration_tests/dbt_project.yml
@@ -1,7 +1,7 @@
 config-version: 2
 
 name: 'zendesk_integration_tests'
-version: '0.16.0'
+version: '0.17.0'
 
 profile: 'integration_tests'
 

diff --git a/macros/coalesce_cast.sql b/macros/coalesce_cast.sql
@@ -0,0 +1,12 @@
+{% macro coalesce_cast(column_list, datatype) -%}
+  {{ return(adapter.dispatch('coalesce_cast', 'zendesk')(column_list, datatype)) }}
+{%- endmacro %}
+
+{% macro default__coalesce_cast(column_list, datatype) %}
+  coalesce(
+    {%- for column in column_list %}
+      cast({{ column }} as {{ datatype }})
+      {%- if not loop.last -%},{%- endif -%}
+    {% endfor %}
+  )
+{% endmacro %}
diff --git a/macros/count_tokens.sql b/macros/count_tokens.sql
@@ -0,0 +1,7 @@
+{% macro count_tokens(column_name) -%}
+  {{ return(adapter.dispatch('count_tokens', 'zendesk')(column_name)) }}
+{%- endmacro %}
+
+{% macro default__count_tokens(column_name) %}
+  {{ dbt.length(column_name) }} / 4
+{% endmacro %}
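
The default `count_tokens` macro approximates a token count as character length divided by 4. A minimal sketch of that arithmetic in plain Python, outside dbt, may help when sanity-checking values; the function name and the `chars_per_token` parameter are hypothetical and not part of the package.

```python
def count_tokens_approx(text: str, chars_per_token: int = 4) -> float:
    """Approximate token count as character length / 4, like the default macro.

    Assumption: true division. Some warehouses divide integers as integers,
    in which case the SQL result would be truncated instead.
    """
    return len(text) / chars_per_token


if __name__ == "__main__":
    # An 8-character string is counted as roughly 2 tokens.
    print(count_tokens_approx("abcdefgh"))
```

This is only a heuristic: a real tokenizer can return a noticeably different count for the same text, which is why the output column is named `chunk_tokens_approximate`.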
diff --git a/models/unstructured/intermediate/int_zendesk__ticket_comment_document.sql b/models/unstructured/intermediate/int_zendesk__ticket_comment_document.sql
@@ -0,0 +1,60 @@
+{{ config(enabled=var('zendesk__unstructured_enabled', False)) }}
+
+with ticket_comments as (
+    select *
+    from {{ var('ticket_comment') }}
+
+), users as (
+    select *
+    from {{ var('user') }}
+
+), comment_details as (
+    select 
+        ticket_comments.ticket_comment_id,
+        ticket_comments.ticket_id,
+        {{ zendesk.coalesce_cast(["users.email", "'UNKNOWN'"], dbt.type_string()) }} as commenter_email,
+        {{ zendesk.coalesce_cast(["users.name", "'UNKNOWN'"], dbt.type_string()) }} as commenter_name,
+        ticket_comments.created_at as comment_time,
+        ticket_comments.body as comment_body
+    from ticket_comments
+    left join users
+        on ticket_comments.user_id = users.user_id
+    where not coalesce(ticket_comments._fivetran_deleted, False)
+        and not coalesce(users._fivetran_deleted, False)
+
+), comment_markdowns as (
+    select
+        ticket_comment_id,
+        ticket_id,
+        comment_time,
+        cast(
+            {{ dbt.concat([
+                "'### message from '", "commenter_name", "' ('", "commenter_email", "')\\n'",
+                "'##### sent @ '", "comment_time", "'\\n'",
+                "comment_body"
+            ]) }} as {{ dbt.type_string() }})
+            as comment_markdown
+    from comment_details
+
+), comments_tokens as (
+    select
+        *,
+        {{ zendesk.count_tokens("comment_markdown") }} as comment_tokens
+    from comment_markdowns
+
+), truncated_comments as (
+    select
+        ticket_comment_id,
+        ticket_id,
+        comment_time,
+        case when comment_tokens > {{ var('zendesk_max_tokens', 5000) }} then left(comment_markdown, {{ var('zendesk_max_tokens', 5000) }} * 4)  -- approximate 4 characters per token
+            else comment_markdown
+            end as comment_markdown,
+        case when comment_tokens > {{ var('zendesk_max_tokens', 5000) }} then {{ var('zendesk_max_tokens', 5000) }}
+            else comment_tokens
+            end as comment_tokens
+    from comments_tokens
+)
+
+select *
+from truncated_comments
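
The `truncated_comments` CTE caps any single comment at `zendesk_max_tokens`, cutting the text to `max_tokens * 4` characters. The rule can be sketched in Python as follows; `truncate_comment` is a hypothetical helper written for illustration and assumes the 4-characters-per-token approximation.

```python
from typing import Tuple


def truncate_comment(markdown: str, max_tokens: int = 5000) -> Tuple[str, float]:
    """Return (text, approximate_tokens), capping both at max_tokens."""
    tokens = len(markdown) / 4  # same approximation as count_tokens
    if tokens > max_tokens:
        # Keep only the first max_tokens * 4 characters and report the cap.
        return markdown[: max_tokens * 4], float(max_tokens)
    return markdown, tokens


if __name__ == "__main__":
    text, tokens = truncate_comment("x" * 30000)
    print(len(text), tokens)
```

Because the cut is made on characters, the markdown header built in `comment_markdowns` is kept and the tail of a long comment body is dropped.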
diff --git a/models/unstructured/intermediate/int_zendesk__ticket_comment_documents_grouped.sql b/models/unstructured/intermediate/int_zendesk__ticket_comment_documents_grouped.sql
@@ -0,0 +1,32 @@
+{{ config(enabled=var('zendesk__unstructured_enabled', False)) }}
+
+with filtered_comment_documents as (
+  select *
+  from {{ ref('int_zendesk__ticket_comment_document') }}
+),
+
+grouped_comment_documents as (
+  select 
+    ticket_id,
+    comment_markdown,
+    comment_tokens,
+    comment_time,
+    sum(comment_tokens) over (
+      partition by ticket_id 
+      order by comment_time
+      rows between unbounded preceding and current row
+    ) as cumulative_length
+  from filtered_comment_documents
+)
+
+select 
+  ticket_id,
+  cast(floor({{ dbt_utils.safe_divide('cumulative_length - 1', var('zendesk_max_tokens', 5000)) }}) as {{ dbt.type_int() }}) as chunk_index,  -- floor the quotient so the cast does not round
+  {{ dbt.listagg(
+    measure="comment_markdown",
+    delimiter_text="'\\n\\n---\\n\\n'",
+    order_by_clause="order by comment_time"
+    ) }} as comments_group_markdown,
+  sum(comment_tokens) as chunk_tokens
+from grouped_comment_documents
+group by 1,2
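
The grouping model assigns each comment to a chunk from the running token total within its ticket. Below is a hedged Python sketch of that bucketing, assuming the quotient `(cumulative_length - 1) / max_tokens` is floored; the function names are hypothetical, and the sketch does not reproduce `safe_divide`'s null result for a zero divisor.

```python
from collections import defaultdict
from itertools import accumulate


def assign_chunks(comment_tokens, max_tokens=5000):
    """Return one chunk index per comment, given tokens in comment_time order."""
    # Running total per ticket, then floor((cumulative - 1) / max_tokens).
    return [int((c - 1) // max_tokens) for c in accumulate(comment_tokens)]


def chunk_totals(comment_tokens, max_tokens=5000):
    """Sum the tokens that land in each chunk index."""
    totals = defaultdict(float)
    indexes = assign_chunks(comment_tokens, max_tokens)
    for idx, tokens in zip(indexes, comment_tokens):
        totals[idx] += tokens
    return dict(totals)


if __name__ == "__main__":
    print(assign_chunks([3000, 1500, 1000, 4000, 600]))
    print(chunk_totals([4000, 3000, 2500]))
```

Under these assumptions, a comment that straddles a boundary is placed in the chunk where its running total ends, so a chunk can exceed `max_tokens`: with comments of 4000, 3000, and 2500 tokens, the second chunk holds 5500. The limit is therefore approximate rather than strict.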
diff --git a/models/unstructured/intermediate/int_zendesk__ticket_document.sql b/models/unstructured/intermediate/int_zendesk__ticket_document.sql
@@ -0,0 +1,42 @@
+{{ config(enabled=var('zendesk__unstructured_enabled', False)) }}
+
+with tickets as (
+    select *
+    from {{ var('ticket') }}
+
+), users as (
+    select *
+    from {{ var('user') }}
+
+), ticket_details as (
+    select
+        tickets.ticket_id,
+        tickets.subject as ticket_name,
+        {{ zendesk.coalesce_cast(["users.name", "'UNKNOWN'"], dbt.type_string()) }} as user_name,
+        {{ zendesk.coalesce_cast(["users.email", "'UNKNOWN'"], dbt.type_string()) }} as created_by,
+        tickets.created_at as created_on,
+        {{ zendesk.coalesce_cast(["tickets.status", "'UNKNOWN'"], dbt.type_string()) }} as status,
+        {{ zendesk.coalesce_cast(["tickets.priority", "'UNKNOWN'"], dbt.type_string()) }} as priority
+    from tickets
+    left join users
+        on tickets.requester_id = users.user_id
+    where not coalesce(tickets._fivetran_deleted, False)
+        and not coalesce(users._fivetran_deleted, False)
+
+), final as (
+    select
+        ticket_id,
+        {{ dbt.concat([
+            "'# Ticket : '", "ticket_name", "'\\n\\n'",
+            "'Created By : '", "user_name", "' ('", "created_by", "')\\n'",
+            "'Created On : '", "created_on", "'\\n'",
+            "'Status : '", "status", "'\\n'",
+            "'Priority : '", "priority"
+        ]) }} as ticket_markdown
+    from ticket_details
+)
+
+select 
+    *,
+    {{ zendesk.count_tokens("ticket_markdown") }} as ticket_tokens
+from final
diff --git a/models/unstructured/zendesk__document.sql b/models/unstructured/zendesk__document.sql
@@ -0,0 +1,27 @@
+{{ config(enabled=var('zendesk__unstructured_enabled', False)) }}
+
+with ticket_document as (
+    select *
+    from {{ ref('int_zendesk__ticket_document') }}
+
+), grouped as (
+    select *
+    from {{ ref('int_zendesk__ticket_comment_documents_grouped') }}
+
+), final as (
+    select
+        cast(ticket_document.ticket_id as {{ dbt.type_string() }}) as document_id,
+        grouped.chunk_index,
+        grouped.chunk_tokens as chunk_tokens_approximate,
+        {{ dbt.concat([
+            "ticket_document.ticket_markdown",
+            "'\\n\\n## COMMENTS\\n\\n'",
+            "grouped.comments_group_markdown"]) }}
+            as chunk
+    from ticket_document
+    join grouped
+        on grouped.ticket_id = ticket_document.ticket_id
+)
+
+select *
+from final
diff --git a/models/unstructured/zendesk_unstructured.yml b/models/unstructured/zendesk_unstructured.yml
@@ -0,0 +1,14 @@
+version: 2
+
+models:
+  - name: zendesk__document
+    description: Each record represents a chunk of text from ticket data, prepared for vectorization. It includes fields for use in NLP workflows. Disabled by default.
+    columns:
+      - name: document_id
+        description: Equivalent to `ticket_id`.
+      - name: chunk_index
+        description: The index of the chunk associated with the `document_id`.
+      - name: chunk_tokens_approximate
+        description: Approximate number of tokens for the chunk, assuming 4 characters per token.
+      - name: chunk
+        description: The text of the chunk.
diff --git a/models/zendesk.yml b/models/zendesk.yml
@@ -171,6 +171,8 @@ models:
         description: Boolean indicating if the ticket had a satisfaction score went from good to bad.
       - name: is_bad_to_good_satisfaction_score
         description: Boolean indicating if the ticket had a satisfaction score went from bad to good.
+      - name: _fivetran_deleted
+        description: Boolean created by Fivetran to indicate whether the ticket has been deleted.
 
   - name: zendesk__sla_policies
     description: Each record represents an SLA policy event and additional sla breach and achievement metrics. Calendar and business hour SLA breaches for `first_reply_time`, `next_reply_time`, `requester_wait_time`, and `agent_work_time` are supported. If there is a SLA you would like supported that is not included, please create a feature request.
@@ -492,6 +494,8 @@ models:
         description: The time in minutes the ticket was in an unassigned state
       - name: last_status_assignment_date
         description: The time the status was last changed on the ticket
+      - name: _fivetran_deleted
+        description: Boolean created by Fivetran to indicate whether the ticket has been deleted.
 
   - name: zendesk__ticket_summary
     description: A single record table containing Zendesk ticket and user summary metrics. These metrics are updated for the current day the model is run.

diff --git a/packages.yml b/packages.yml
@@ -1,6 +1,5 @@
 packages:
   - package: fivetran/zendesk_source
-    version: [">=0.11.0", "<0.12.0"]
-
+    version: [">=0.12.0", "<0.13.0"]
   - package: calogica/dbt_date
     version: [">=0.9.0", "<1.0.0"]