Feature/unstructured data #161

fivetran-catfritz · 2024-08-13T20:14:42Z

PR Overview

This PR will address the following Issue/Feature:

internal ticket

This PR will result in the following new package version:

0.17.0 since we're adding a new model

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

New model!

Addition of the zendesk__document model, designed to structure Zendesk textual data for vectorization and integration into NLP workflows. The model outputs a table with:

document_id: Corresponding to the ticket_id

chunk_index: For text segmentation

chunk: The text chunk itself

chunk_tokens_approximate: Approximate token count for each segment

This model is currently disabled by default. You may enable it by setting the zendesk__unstructured_enabled variable to true in your dbt_project.yml.
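
A minimal sketch of that setting, assuming it goes under the top-level `vars` key of your own project's dbt_project.yml (the usual dbt convention; the exact placement is an assumption, not something this description spells out):

```yaml
# dbt_project.yml in your own dbt project (not the package's copy)
vars:
  zendesk__unstructured_enabled: true  # the model stays disabled unless this is set to true
```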

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

dbt run --full-refresh ~~&& dbt test~~
~~dbt run (if incremental models are present) && dbt test~~

Before marking this PR as "ready for review" the following have been applied:

The appropriate issue has been linked, tagged, and properly assigned
All necessary documentation and version upgrades have been applied
docs were regenerated (unless this PR does not include any code or yml updates)
BuildKite integration tests are passing
Detailed validation steps have been provided below

Detailed Validation

Please share any and all of your validation steps:

see internal ticket

If you had to summarize this PR in an emoji, which would it be?

💃

fivetran-catfritz · 2024-08-16T01:30:18Z

macros/count_tokens.sql

+{%- endmacro %}
+
+{% macro default__count_tokens(column_name) %}
+  {{ dbt.length(column_name) }} / 4


Based on this doc and internal discussion, an approximate token count (roughly 4 characters per token) is appropriate.

fivetran-catfritz · 2024-08-16T01:33:09Z

packages.yml

+  # - package: fivetran/zendesk_source
+  #   version: [">=0.12.0", "<0.13.0"]
+  - git: https://github.com/fivetran/dbt_zendesk_source.git
+    revision: feature/unstructured-data
+    warn-unpinned: false


Suggested change

-  # - package: fivetran/zendesk_source
-  #   version: [">=0.12.0", "<0.13.0"]
-  - git: https://github.com/fivetran/dbt_zendesk_source.git
-    revision: feature/unstructured-data
-    warn-unpinned: false
+  - package: fivetran/zendesk_source
+    version: [">=0.12.0", "<0.13.0"]

fivetran-catfritz · 2024-08-16T15:01:54Z

models/unstructured/intermediate/int_zendesk__ticket_comment_document.sql

+        ticket_comment_id,
+        ticket_id,
+        comment_time,
+        case when comment_tokens > {{ var('max_tokens', 7500) }} then left(comment_markdown, {{ var('max_tokens', 7500) }} * 4)  -- approximate 4 characters per token


Using left() instead of substring() because it is easier to support across warehouses and is equivalent here, since we are only truncating.
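
The truncation threshold in the hunk above is read from a `max_tokens` variable, with a default of 7500 in this revision. A rough sketch of overriding it from a consuming project follows, assuming the variable keeps the name shown in the diff. A later commit in this PR updates the default, so the released name and value may differ. The value 5000 is purely illustrative.

```yaml
# dbt_project.yml in your own dbt project
vars:
  # Variable name taken from the diff above; confirm it against the released README.
  max_tokens: 5000  # illustrative: comments over ~5000 tokens are cut to 5000 * 4 = 20000 characters
```

Because tokens are approximated at 4 characters each, the cut-off is applied in characters, not in true model tokens.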

fivetran-joemarkiewicz · 2024-08-20T15:52:00Z

dbt_project.yml

+    unstructured:
+      +schema: zendesk_unstructured
+      +materialized: table


Let's sync on this default schema

From our discussion, leaving this as-is so it is separated.

fivetran-catfritz

@fivetran-joemarkiewicz Thank you for reviewing! I also updated the changelog with the changes from the source.

fivetran-catfritz · 2024-08-20T21:38:11Z

dbt_project.yml

+    unstructured:
+      +schema: zendesk_unstructured
+      +materialized: table


From our discussion, leaving this as-is so it is separated.

fivetran-joemarkiewicz

@fivetran-catfritz great work on this PR. Just a few final comments and suggestions before approval. Let me know if you have any questions!

CHANGELOG.md

fivetran-joemarkiewicz · 2024-08-30T21:51:46Z

README.md

@@ -37,6 +37,7 @@ The following table provides a detailed list of final models materialized within
 | [zendesk__ticket_backlog](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_backlog)           | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable for all backlog tickets. Backlog tickets being defined as any ticket not in a 'closed', 'deleted', or 'solved' status.                                                             |
 | [zendesk__ticket_field_history](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_field_history) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable and the corresponding updater fields defined in the `ticket_field_history_updater_columns` variable.                                                        |
 | [zendesk__sla_policies](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__sla_policies)           | Each record represents an SLA policy event and additional sla breach and achievement metrics. Calendar and business hour SLA breaches are supported.    
+| zendesk__document | Each record represents a chunk of text from ticket data, prepared for vectorization. It includes fields for use in NLP workflows. Disabled by default. |


@elanfivetran this won't be available in Quickstart yet, but it will be displayed in the UI via this table. Is that okay?

@fivetran-catfritz would you be able to turn this into a hyperlink to the package docs so users can see the table structure and documentation?

@fivetran-joemarkiewicz this is related to my question below about having this in the manifest.

fivetran-joemarkiewicz · 2024-08-30T21:55:01Z

README.md

@@ -91,6 +92,23 @@ vars:

 ## (Optional) Step 5: Additional configurations

+### Enabling the unstructured document model for NLP
+This package includes the `zendesk__document` model, which processes and segments Zendesk text data for vectorization, making it suitable for NLP workflows. The model outputs structured chunks of text with associated document IDs, segment indices, and token counts. By default, this model is disabled. To enable it, update the `zendesk__unstructured_enabled` variable to true in your dbt_project.yml:


Similar note to the one above: can we include the dbt docs link to the zendesk__document table here so curious users can inspect its structure and documentation?

Also related to my manifest question.

README.md

fivetran-joemarkiewicz · 2024-08-30T21:56:46Z

dbt_project.yml

@@ -24,6 +24,9 @@ models:
    ticket_history:
      +schema: zendesk_intermediate
      +materialized: ephemeral
+    unstructured:
+      +schema: zendesk_unstructured
+      +materialized: table


Do we need all of these models to be materialized as tables? I definitely see the zendesk__document needing to be, but do all the intermediate models as well?

I'm really not sure, so I went with tables to be safe. I wasn't sure how demanding/large all that text data could get for a user, so open to suggestions on this one.

Due to the nature of the end result, I think it makes sense to keep these as tables to help with the query load as best as we can. We can always adjust this default materialization based on feedback.
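
For anyone who wants different behavior than these defaults, a hedged sketch of a user-side override is below. It assumes the package's project name is `zendesk` (inferred from the `model.zendesk.*` docs URLs in the README) and that the folder layout matches the `models/unstructured/intermediate/` path in this PR. The schema name and the `view` materialization are hypothetical choices, not recommendations.

```yaml
# dbt_project.yml in your own dbt project
models:
  zendesk:                              # package project name, assumed from the docs URLs
    unstructured:
      +schema: my_unstructured_schema   # hypothetical custom schema replacing zendesk_unstructured
      intermediate:
        +materialized: view             # illustrative; the package default in this PR is table
```

Only the intermediate models are changed here, so zendesk__document itself would still be built as a table.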

fivetran-joemarkiewicz · 2024-08-30T22:00:21Z

integration_tests/dbt_project.yml

@@ -1,7 +1,7 @@
 config-version: 2

 name: 'zendesk_integration_tests'
-version: '0.16.0'
+version: '0.17.0'


Can we add the variable configuration zendesk__unstructured_enabled: true to the vars config here so the model is enabled when generating the docs?

@fivetran-joemarkiewicz At the time I did this I thought we didn't want this added to the manifest, and therefore not the docs. Is this not the case?

fivetran-catfritz

Thanks @fivetran-joemarkiewicz. I applied your suggestions but also had a couple more questions!

CHANGELOG.md

README.md

fivetran-catfritz · 2024-08-30T22:09:02Z

integration_tests/dbt_project.yml

@@ -1,7 +1,7 @@
 config-version: 2

 name: 'zendesk_integration_tests'
-version: '0.16.0'
+version: '0.17.0'


@fivetran-joemarkiewicz At the time I did this I thought we didn't want this added to the manifest, and therefore not the docs. Is this not the case?

fivetran-catfritz · 2024-08-30T22:10:08Z

README.md

@@ -37,6 +37,7 @@ The following table provides a detailed list of final models materialized within
 | [zendesk__ticket_backlog](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_backlog)           | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable for all backlog tickets. Backlog tickets being defined as any ticket not in a 'closed', 'deleted', or 'solved' status.                                                             |
 | [zendesk__ticket_field_history](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_field_history) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable and the corresponding updater fields defined in the `ticket_field_history_updater_columns` variable.                                                        |
 | [zendesk__sla_policies](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__sla_policies)           | Each record represents an SLA policy event and additional sla breach and achievement metrics. Calendar and business hour SLA breaches are supported.    
+| zendesk__document | Each record represents a chunk of text from ticket data, prepared for vectorization. It includes fields for use in NLP workflows. Disabled by default. |


@fivetran-joemarkiewicz this is related to my question below about having this in the manifest.

fivetran-catfritz · 2024-08-30T22:10:28Z

README.md

@@ -91,6 +92,23 @@ vars:

 ## (Optional) Step 5: Additional configurations

+### Enabling the unstructured document model for NLP
+This package includes the `zendesk__document` model, which processes and segments Zendesk text data for vectorization, making it suitable for NLP workflows. The model outputs structured chunks of text with associated document IDs, segment indices, and token counts. By default, this model is disabled. To enable it, update the `zendesk__unstructured_enabled` variable to true in your dbt_project.yml:


Also related to my manifest question.

fivetran-catfritz · 2024-08-30T22:11:32Z

dbt_project.yml

@@ -24,6 +24,9 @@ models:
    ticket_history:
      +schema: zendesk_intermediate
      +materialized: ephemeral
+    unstructured:
+      +schema: zendesk_unstructured
+      +materialized: table


I'm really not sure, so I went with tables to be safe. I wasn't sure how demanding/large all that text data could get for a user, so open to suggestions on this one.

fivetran-catfritz

@fivetran-joemarkiewicz I updated the README to anticipate zendesk__document being included in the docs and added the links. I will merge into the release branch if it looks good, and I will regenerate the docs there.

fivetran-joemarkiewicz

LGTM!

fivetran-catfritz added 2 commits August 13, 2024 13:50

feature/unstructured-data

7bc9f03

add coalesce_cast

49517ac

fivetran-catfritz self-assigned this Aug 13, 2024

fivetran-catfritz added 7 commits August 13, 2024 15:38

update filters

15d35a6

update and consolidate models

0d55f81

model revisions

30655bb

restructure

a6a4480

documentation

44950e1

remove extra comma

5cd7cdf

regen docs

c1b4586

fivetran-catfritz commented Aug 16, 2024

formatting

c8b71f3

fivetran-catfritz commented Aug 16, 2024

fivetran-catfritz mentioned this pull request Aug 16, 2024

Feature/unstructured data fivetran/dbt_zendesk_source#53

Merged

7 tasks

fivetran-catfritz commented Aug 16, 2024

update max token docs

f4861de

fivetran-joemarkiewicz reviewed Aug 20, 2024

Update CHANGELOG.md

c5f8416

fivetran-catfritz commented Aug 20, 2024

fivetran-catfritz requested a review from fivetran-joemarkiewicz August 20, 2024 21:43

fivetran-catfritz changed the base branch from main to release/v0.17.0 August 29, 2024 23:02

revert docs to main

f7f69a9

fivetran-catfritz mentioned this pull request Aug 30, 2024

Release/v0.17.0 #169

Merged

fivetran-catfritz added 2 commits August 30, 2024 14:32

update default max_tokens

8d71a40

update changelog

5af90b0

fivetran-joemarkiewicz requested changes Aug 30, 2024

Apply suggestions from code review

ecb146f

Co-authored-by: Joe Markiewicz <74217849+fivetran-joemarkiewicz@users.noreply.github.com>

fivetran-catfritz commented Aug 30, 2024

fivetran-catfritz added 2 commits September 3, 2024 10:13

Merge branch 'release/v0.17.0' into feature/unstructured-data

404a39b

update readme

1b68e18

fivetran-catfritz commented Sep 3, 2024

fivetran-catfritz requested a review from fivetran-joemarkiewicz September 3, 2024 17:17

fivetran-joemarkiewicz approved these changes Sep 3, 2024

fivetran-catfritz merged commit 03bf529 into release/v0.17.0 Sep 3, 2024
8 checks passed
