Feature/unstructured data #161
Changes from 16 commits
7bc9f03
49517ac
15d35a6
0d55f81
30655bb
a6a4480
44950e1
5cd7cdf
c1b4586
c8b71f3
f4861de
c5f8416
f7f69a9
8d71a40
5af90b0
ecb146f
404a39b
1b68e18
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -37,6 +37,7 @@ The following table provides a detailed list of final models materialized within | |
| [zendesk__ticket_backlog](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_backlog) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable for all backlog tickets. Backlog tickets being defined as any ticket not in a 'closed', 'deleted', or 'solved' status. | | ||
| [zendesk__ticket_field_history](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_field_history) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable and the corresponding updater fields defined in the `ticket_field_history_updater_columns` variable. | | ||
| [zendesk__sla_policies](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__sla_policies) | Each record represents an SLA policy event and additional sla breach and achievement metrics. Calendar and business hour SLA breaches are supported. | ||
| zendesk__document | Each record represents a chunk of text from ticket data, prepared for vectorization. It includes fields for use in NLP workflows. Disabled by default. | | ||
|
||
Many of the above reports are now configurable for [visualization via Streamlit](https://github.com/fivetran/streamlit_zendesk)! Check out some [sample reports here](https://fivetran-zendesk.streamlit.app/). | ||
|
||
|
@@ -64,7 +65,7 @@ Include the following zendesk package version in your `packages.yml` file: | |
```yml | ||
packages: | ||
- package: fivetran/zendesk | ||
version: [">=0.16.0", "<0.17.0"] | ||
version: [">=0.17.0", "<0.18.0"] | ||
|
||
``` | ||
> **Note**: Do not include the Zendesk Support source package. The Zendesk Support transform package already has a dependency on the source in its own `packages.yml` file. | ||
|
@@ -91,6 +92,23 @@ vars: | |
|
||
## (Optional) Step 5: Additional configurations | ||
|
||
#### Enabling the unstructured document model for NLP | ||
This package includes the `zendesk__document` model, which processes and segments Zendesk text data for vectorization, making it suitable for NLP workflows. The model outputs structured chunks of text with associated document IDs, segment indices, and token counts. By default, this model is disabled. To enable it, set the `zendesk__unstructured_enabled` variable to `true` in your `dbt_project.yml`: | ||
Review comment: Similar note from above, can we include the dbt docs link to the table zendesk__document here so curious users can go and inspect the structure and documentation of the table.
Review comment: Also related to my manifest question.
||
|
||
```yml | ||
vars: | ||
zendesk__unstructured_enabled: true # false by default. | ||
``` | ||
|
||
##### Customizing Chunk Size for Vectorization | ||
|
||
The `zendesk__document` model was developed to limit approximate chunk sizes to 7,500 tokens, optimized for OpenAI models. However, you can adjust this limit by setting the `zendesk_max_tokens` variable in your `dbt_project.yml`: | ||
|
||
```yml | ||
vars: | ||
zendesk_max_tokens: 7500 # Default value | ||
``` | ||
|
||
### Add passthrough columns | ||
This package includes all source columns defined in the macros folder. You can add more columns from the `TICKET`, `USER`, and `ORGANIZATION` tables using our pass-through column variables. | ||
|
||
|
@@ -211,7 +229,7 @@ This dbt package is dependent on the following dbt packages. Please be aware tha | |
```yml | ||
packages: | ||
- package: fivetran/zendesk_source | ||
version: [">=0.11.0", "<0.12.0"] | ||
version: [">=0.12.0", "<0.13.0"] | ||
|
||
- package: fivetran/fivetran_utils | ||
version: [">=0.4.0", "<0.5.0"] | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
name: 'zendesk' | ||
version: '0.16.0' | ||
version: '0.17.0' | ||
|
||
|
||
config-version: 2 | ||
|
@@ -24,6 +24,9 @@ models: | |
ticket_history: | ||
+schema: zendesk_intermediate | ||
+materialized: ephemeral | ||
unstructured: | ||
+schema: zendesk_unstructured | ||
+materialized: table | ||
Comment on lines +27 to +29
Review comment: Let's sync on this default schema
Review comment: From our discussion, leaving this as-is so it is separated.
Review comment: Do we need all of these models to be materialized as tables? I definitely see the
Review comment: I'm really not sure, so I went with tables to be safe. I wasn't sure how demanding/large all that text data could get for a user, so open to suggestions on this one.
Review comment: Due to the nature of the end result, I think it makes sense to keep these as tables to help with the query load as best as we can. We can always adjust this default materialization based off feedback.
||
utils: | ||
+materialized: ephemeral | ||
vars: | ||
|
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
config-version: 2 | ||
|
||
name: 'zendesk_integration_tests' | ||
version: '0.16.0' | ||
version: '0.17.0' | ||
Review comment: Can we add the variable configuration
Review comment: @fivetran-joemarkiewicz At the time I did this I thought we didn't want this added to the manifest, and therefore not the docs. Is this not the case?
||
|
||
profile: 'integration_tests' | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
{% macro coalesce_cast(column_list, datatype) -%} | ||
{{ return(adapter.dispatch('coalesce_cast', 'zendesk')(column_list, datatype)) }} | ||
{%- endmacro %} | ||
|
||
{% macro default__coalesce_cast(column_list, datatype) %} | ||
coalesce( | ||
{%- for column in column_list %} | ||
cast({{ column }} as {{ datatype }}) | ||
{%- if not loop.last -%},{%- endif -%} | ||
{% endfor %} | ||
) | ||
{% endmacro %} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
{% macro count_tokens(column_name) -%} | ||
{{ return(adapter.dispatch('count_tokens', 'zendesk')(column_name)) }} | ||
{%- endmacro %} | ||
|
||
{% macro default__count_tokens(column_name) %} | ||
{{ dbt.length(column_name) }} / 4 | ||
Review comment: Based on this doc and internal discussion, approximate count of tokens is appropriate.
||
{% endmacro %} |
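For reference, a minimal Python sketch of the approximation the `count_tokens` macro makes: character length divided by 4. The function name `approx_token_count` and the `chars_per_token` parameter are illustrative and not part of the package. The sketch uses true division; whether the SQL `/ 4` yields an integer or a fractional value depends on the warehouse.

```python
def approx_token_count(text: str, chars_per_token: int = 4) -> float:
    """Approximate a token count as character length / 4 (hypothetical helper)."""
    # Mirrors `length(column) / 4` from the macro; no real tokenizer is involved.
    return len(text) / chars_per_token


if __name__ == "__main__":
    sample = "### message from Jane (jane@example.com)"
    print(len(sample), approx_token_count(sample))
```

This is only a rough estimate of what a model tokenizer would report, which is why the downstream column is named `chunk_tokens_approximate`.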
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
{{ config(enabled=var('zendesk__unstructured_enabled', False)) }} | ||
|
||
with ticket_comments as ( | ||
select * | ||
from {{ var('ticket_comment') }} | ||
|
||
), users as ( | ||
select * | ||
from {{ var('user') }} | ||
|
||
), comment_details as ( | ||
select | ||
ticket_comments.ticket_comment_id, | ||
ticket_comments.ticket_id, | ||
{{ zendesk.coalesce_cast(["users.email", "'UNKNOWN'"], dbt.type_string()) }} as commenter_email, | ||
{{ zendesk.coalesce_cast(["users.name", "'UNKNOWN'"], dbt.type_string()) }} as commenter_name, | ||
ticket_comments.created_at as comment_time, | ||
ticket_comments.body as comment_body | ||
from ticket_comments | ||
left join users | ||
on ticket_comments.user_id = users.user_id | ||
where not coalesce(ticket_comments._fivetran_deleted, False) | ||
and not coalesce(users._fivetran_deleted, False) | ||
|
||
), comment_markdowns as ( | ||
select | ||
ticket_comment_id, | ||
ticket_id, | ||
comment_time, | ||
cast( | ||
{{ dbt.concat([ | ||
"'### message from '", "commenter_name", "' ('", "commenter_email", "')\\n'", | ||
"'##### sent @ '", "comment_time", "'\\n'", | ||
"comment_body" | ||
]) }} as {{ dbt.type_string() }}) | ||
as comment_markdown | ||
from comment_details | ||
|
||
), comments_tokens as ( | ||
select | ||
*, | ||
{{ zendesk.count_tokens("comment_markdown") }} as comment_tokens | ||
from comment_markdowns | ||
|
||
), truncated_comments as ( | ||
select | ||
ticket_comment_id, | ||
ticket_id, | ||
comment_time, | ||
case when comment_tokens > {{ var('zendesk_max_tokens', 5000) }} then left(comment_markdown, {{ var('zendesk_max_tokens', 5000) }} * 4) -- approximate 4 characters per token | ||
else comment_markdown | ||
end as comment_markdown, | ||
case when comment_tokens > {{ var('zendesk_max_tokens', 5000) }} then {{ var('zendesk_max_tokens', 5000) }} | ||
else comment_tokens | ||
end as comment_tokens | ||
from comments_tokens | ||
) | ||
|
||
select * | ||
from truncated_comments |
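The `truncated_comments` CTE above can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's implementation: `truncate_comment` is a hypothetical name, the default of 5000 matches the `var('zendesk_max_tokens', 5000)` fallback in the SQL, and the token count uses the same 4-characters-per-token estimate.

```python
def truncate_comment(markdown: str, max_tokens: int = 5000, chars_per_token: int = 4):
    """Cap a single comment at max_tokens (hypothetical helper mirroring the CTE)."""
    tokens = len(markdown) / chars_per_token  # approximate token count
    if tokens > max_tokens:
        # Equivalent of left(comment_markdown, max_tokens * 4) in the SQL.
        return markdown[: max_tokens * chars_per_token], max_tokens
    return markdown, tokens


if __name__ == "__main__":
    text, tokens = truncate_comment("x" * 100, max_tokens=10)
    print(len(text), tokens)
```

Capping each comment first means no single comment can exceed the chunk limit on its own before comments are grouped into chunks.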
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
{{ config(enabled=var('zendesk__unstructured_enabled', False)) }} | ||
|
||
with filtered_comment_documents as ( | ||
select * | ||
from {{ ref('int_zendesk__ticket_comment_document') }} | ||
), | ||
|
||
grouped_comment_documents as ( | ||
select | ||
ticket_id, | ||
comment_markdown, | ||
comment_tokens, | ||
comment_time, | ||
sum(comment_tokens) over ( | ||
partition by ticket_id | ||
order by comment_time | ||
rows between unbounded preceding and current row | ||
) as cumulative_length | ||
from filtered_comment_documents | ||
) | ||
|
||
select | ||
ticket_id, | ||
cast({{ dbt_utils.safe_divide('floor(cumulative_length - 1)', var('zendesk_max_tokens', 5000)) }} as {{ dbt.type_int() }}) as chunk_index, | ||
{{ dbt.listagg( | ||
measure="comment_markdown", | ||
delimiter_text="'\\n\\n---\\n\\n'", | ||
order_by_clause="order by comment_time" | ||
) }} as comments_group_markdown, | ||
sum(comment_tokens) as chunk_tokens | ||
from grouped_comment_documents | ||
group by 1,2 |
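To make the grouping logic above easier to follow, here is a rough Python sketch of it: a running token total per ticket, a chunk index derived from that total, and comments concatenated per chunk. The names `group_comments` and `SEPARATOR` are illustrative. The sketch assumes the cast to integer truncates and that the delimiter renders as real newlines; both can vary by warehouse.

```python
import math

SEPARATOR = "\n\n---\n\n"  # assumed rendering of the listagg delimiter


def group_comments(comments, max_tokens=5000):
    """Group (ticket_id, comment_time, markdown, tokens) tuples into chunks."""
    running = {}  # ticket_id -> cumulative token count
    chunks = {}   # (ticket_id, chunk_index) -> [markdown parts, token sum]
    for ticket_id, _, markdown, tokens in sorted(comments, key=lambda c: (c[0], c[1])):
        running[ticket_id] = running.get(ticket_id, 0) + tokens
        # Mirrors cast(floor(cumulative_length - 1) / max_tokens as int).
        chunk_index = int(math.floor(running[ticket_id] - 1) / max_tokens)
        entry = chunks.setdefault((ticket_id, chunk_index), [[], 0])
        entry[0].append(markdown)
        entry[1] += tokens
    return {key: (SEPARATOR.join(parts), total) for key, (parts, total) in chunks.items()}


if __name__ == "__main__":
    demo = [(1, 1, "a", 6), (1, 2, "b", 6), (1, 3, "c", 6)]
    for key, value in group_comments(demo, max_tokens=10).items():
        print(key, value)
```

In the demo, the second chunk sums to 12 tokens against a limit of 10. Because the index comes from the cumulative total rather than a per-chunk budget, a chunk can exceed `max_tokens`, so the limit is approximate.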
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
{{ config(enabled=var('zendesk__unstructured_enabled', False)) }} | ||
|
||
with tickets as ( | ||
select * | ||
from {{ var('ticket') }} | ||
|
||
), users as ( | ||
select * | ||
from {{ var('user') }} | ||
|
||
), ticket_details as ( | ||
select | ||
tickets.ticket_id, | ||
tickets.subject AS ticket_name, | ||
{{ zendesk.coalesce_cast(["users.name", "'UNKNOWN'"], dbt.type_string()) }} as user_name, | ||
{{ zendesk.coalesce_cast(["users.email", "'UNKNOWN'"], dbt.type_string()) }} as created_by, | ||
tickets.created_at AS created_on, | ||
{{ zendesk.coalesce_cast(["tickets.status", "'UNKNOWN'"], dbt.type_string()) }} as status, | ||
{{ zendesk.coalesce_cast(["tickets.priority", "'UNKNOWN'"], dbt.type_string()) }} as priority | ||
from tickets | ||
left join users | ||
on tickets.requester_id = users.user_id | ||
where not coalesce(tickets._fivetran_deleted, False) | ||
and not coalesce(users._fivetran_deleted, False) | ||
|
||
), final as ( | ||
select | ||
ticket_id, | ||
{{ dbt.concat([ | ||
"'# Ticket : '", "ticket_name", "'\\n\\n'", | ||
"'Created By : '", "user_name", "' ('", "created_by", "')\\n'", | ||
"'Created On : '", "created_on", "'\\n'", | ||
"'Status : '", "status", "'\\n'", | ||
"'Priority : '", "priority" | ||
]) }} as ticket_markdown | ||
from ticket_details | ||
) | ||
|
||
select | ||
*, | ||
{{ zendesk.count_tokens("ticket_markdown") }} as ticket_tokens | ||
from final |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
{{ config(enabled=var('zendesk__unstructured_enabled', False)) }} | ||
|
||
with ticket_document as ( | ||
select * | ||
from {{ ref('int_zendesk__ticket_document') }} | ||
|
||
), grouped as ( | ||
select * | ||
from {{ ref('int_zendesk__ticket_comment_documents_grouped') }} | ||
|
||
), final as ( | ||
select | ||
cast(ticket_document.ticket_id as {{ dbt.type_string() }}) as document_id, | ||
grouped.chunk_index, | ||
grouped.chunk_tokens as chunk_tokens_approximate, | ||
{{ dbt.concat([ | ||
"ticket_document.ticket_markdown", | ||
"'\\n\\n## COMMENTS\\n\\n'", | ||
"grouped.comments_group_markdown"]) }} | ||
as chunk | ||
from ticket_document | ||
join grouped | ||
on grouped.ticket_id = ticket_document.ticket_id | ||
) | ||
|
||
select * | ||
from final |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
version: 2 | ||
|
||
models: | ||
- name: zendesk__document | ||
description: Each record represents a chunk of text from ticket data, prepared for vectorization. It includes fields for use in NLP workflows. | ||
columns: | ||
- name: document_id | ||
description: Equivalent to `ticket_id`. | ||
- name: chunk_index | ||
description: The index of the chunk associated with the `document_id`. | ||
- name: chunk_tokens_approximate | ||
description: Approximate number of tokens for the chunk, assuming 4 characters per token. | ||
- name: chunk | ||
description: The text of the chunk. |
Original file line number | Diff line number | Diff line change | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -1,6 +1,8 @@ | ||||||||||||||||
packages: | ||||||||||||||||
- package: fivetran/zendesk_source | ||||||||||||||||
version: [">=0.11.0", "<0.12.0"] | ||||||||||||||||
|
||||||||||||||||
# - package: fivetran/zendesk_source | ||||||||||||||||
# version: [">=0.12.0", "<0.13.0"] | ||||||||||||||||
- git: https://github.com/fivetran/dbt_zendesk_source.git | ||||||||||||||||
revision: feature/unstructured-data | ||||||||||||||||
warn-unpinned: false | ||||||||||||||||
Comment on lines +2 to +6
Suggested change
|
||||||||||||||||
- package: calogica/dbt_date | ||||||||||||||||
version: [">=0.9.0", "<1.0.0"] |
Review comment: @elanfivetran this won't be available yet in Quickstart, yet it will be displayed in the UI via this table. Is that okay?
Review comment: @fivetran-catfritz would you be able to edit this to be the hyperlink to the package docs so users can see the table structure and documentation.
Review comment: @fivetran-joemarkiewicz this is related to my question below about having this in the manifest.