Feature/unstructured data #161

Merged
merged 18 commits into from
Sep 3, 2024
Changes from 16 commits
2 changes: 1 addition & 1 deletion .buildkite/pipeline.yml
Original file line number Diff line number Diff line change
@@ -58,7 +58,7 @@ steps:
commands: |
bash .buildkite/scripts/run_models.sh redshift

- - label: ":bricks: Run Tests - Databricks"
+ - label: ":databricks: Run Tests - Databricks"
key: "run_dbt_databricks"
plugins:
- docker#v3.13.0:
2 changes: 1 addition & 1 deletion .buildkite/scripts/run_models.sh
Original file line number Diff line number Diff line change
@@ -19,7 +19,7 @@ dbt deps
dbt seed --target "$db" --full-refresh
dbt run --target "$db" --full-refresh
dbt test --target "$db"
- dbt run --vars '{using_schedules: false, using_domain_names: false, using_user_tags: false, using_ticket_form_history: false, using_organization_tags: false}' --target "$db" --full-refresh
+ dbt run --vars '{zendesk__unstructured_enabled: true, using_schedules: false, using_domain_names: false, using_user_tags: false, using_ticket_form_history: false, using_organization_tags: false}' --target "$db" --full-refresh
dbt test --target "$db"

dbt run-operation fivetran_utils.drop_schemas_automation --target "$db"
19 changes: 19 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,22 @@
# dbt_zendesk v0.17.0
[PR #161](https://github.com/fivetran/dbt_zendesk/pull/161) includes the following updates:
## New model
- Addition of the `zendesk__document` model, designed to structure Zendesk textual data for vectorization and integration into NLP workflows. The model outputs a table with:
  - `document_id`: Corresponding to the `ticket_id`
  - `chunk_index`: For text segmentation
  - `chunk`: The text chunk itself
  - `chunk_tokens_approximate`: Approximate token count for each segment
- This model is currently disabled by default. You may enable it by setting the `zendesk__unstructured_enabled` variable to `true` in your `dbt_project.yml`.
- This model was developed to limit chunk sizes to approximately 5000 tokens for use with OpenAI; however, you can change this limit by setting the `zendesk_max_tokens` variable in your `dbt_project.yml`.
- See the README section [Enabling the unstructured document model for NLP](https://github.com/fivetran/dbt_zendesk/blob/main/README.md#enabling-the-unstructured-document-model-for-nlp) for more information.

## Breaking changes
- In the [dbt_zendesk_source v0.12.0 release](https://github.com/fivetran/dbt_zendesk_source/releases/tag/v0.12.0), the field `_fivetran_deleted` was added to the following models for use in the `zendesk__document` model:
  - `stg_zendesk__ticket`
  - `stg_zendesk__ticket_comment`
  - `stg_zendesk__user`
- If you have already added `_fivetran_deleted` as a passthrough column via the `zendesk__ticket_passthrough_columns` or `zendesk__user_passthrough_columns` variable, you will need to remove this field from the variable or alias it to avoid duplicate column errors.

# dbt_zendesk v0.16.0
## 🚨 Minor Upgrade 🚨
Although this update is not a breaking change, it will likely impact the output of the `zendesk__sla_policies` and `zendesk__sla_metrics` models. [PR #154](https://github.com/fivetran/dbt_zendesk/pull/154) includes the following changes:
22 changes: 20 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -37,6 +37,7 @@ The following table provides a detailed list of final models materialized within
| [zendesk__ticket_backlog](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_backlog) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable for all backlog tickets. Backlog tickets are defined as any ticket not in a 'closed', 'deleted', or 'solved' status. |
| [zendesk__ticket_field_history](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__ticket_field_history) | A daily historical view of the ticket field values defined in the `ticket_field_history_columns` variable and the corresponding updater fields defined in the `ticket_field_history_updater_columns` variable. |
| [zendesk__sla_policies](https://fivetran.github.io/dbt_zendesk/#!/model/model.zendesk.zendesk__sla_policies) | Each record represents an SLA policy event and additional SLA breach and achievement metrics. Calendar and business hour SLA breaches are supported. |
| zendesk__document | Each record represents a chunk of text from ticket data, prepared for vectorization. It includes fields for use in NLP workflows. Disabled by default. |
Contributor

@elanfivetran this won't be available yet in Quickstart, yet it will be displayed in the UI via this table. Is that okay?

Contributor

@fivetran-catfritz would you be able to edit this to be the hyperlink to the package docs so users can see the table structure and documentation?

Contributor Author

@fivetran-joemarkiewicz this is related to my question below about having this in the manifest.


Many of the above reports are now configurable for [visualization via Streamlit](https://github.com/fivetran/streamlit_zendesk)! Check out some [sample reports here](https://fivetran-zendesk.streamlit.app/).

@@ -64,7 +65,7 @@ Include the following zendesk package version in your `packages.yml` file:
```yml
packages:
- package: fivetran/zendesk
- version: [">=0.16.0", "<0.17.0"]
+ version: [">=0.17.0", "<0.18.0"]

```
> **Note**: Do not include the Zendesk Support source package. The Zendesk Support transform package already has a dependency on the source in its own `packages.yml` file.
@@ -91,6 +92,23 @@ vars:

## (Optional) Step 5: Additional configurations

#### Enabling the unstructured document model for NLP
This package includes the `zendesk__document` model, which processes and segments Zendesk text data for vectorization, making it suitable for NLP workflows. The model outputs structured chunks of text with associated document IDs, segment indices, and token counts. By default, this model is disabled. To enable it, set the `zendesk__unstructured_enabled` variable to `true` in your `dbt_project.yml`:
Contributor

Similar note to the one above: can we include the dbt docs link to the `zendesk__document` table here so curious users can inspect the structure and documentation of the table?

Contributor Author

Also related to my manifest question.


```yml
vars:
  zendesk__unstructured_enabled: true # false by default.
```

##### Customizing Chunk Size for Vectorization

The `zendesk__document` model was developed to limit approximate chunk sizes to 5,000 tokens, optimized for OpenAI models. However, you can adjust this limit by setting the `zendesk_max_tokens` variable in your `dbt_project.yml`:

```yml
vars:
  zendesk_max_tokens: 5000 # Default value
```

### Add passthrough columns
This package includes all source columns defined in the macros folder. You can add more columns from the `TICKET`, `USER`, and `ORGANIZATION` tables using our pass-through column variables.

@@ -211,7 +229,7 @@ This dbt package is dependent on the following dbt packages. Please be aware tha
```yml
packages:
- package: fivetran/zendesk_source
- version: [">=0.11.0", "<0.12.0"]
+ version: [">=0.12.0", "<0.13.0"]

- package: fivetran/fivetran_utils
version: [">=0.4.0", "<0.5.0"]
5 changes: 4 additions & 1 deletion dbt_project.yml
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
name: 'zendesk'
- version: '0.16.0'
+ version: '0.17.0'


config-version: 2
@@ -24,6 +24,9 @@ models:
ticket_history:
+schema: zendesk_intermediate
+materialized: ephemeral
unstructured:
+schema: zendesk_unstructured
+materialized: table
Comment on lines +27 to +29
Contributor

Let's sync on this default schema

Contributor Author

From our discussion, leaving this as-is so it is separated.

Contributor

Do we need all of these models to be materialized as tables? I definitely see the zendesk__document needing to be, but do all the intermediate models as well?

Contributor Author

I'm really not sure, so I went with tables to be safe. I wasn't sure how demanding/large all that text data could get for a user, so open to suggestions on this one.

Contributor

Due to the nature of the end result, I think it makes sense to keep these as tables to help with the query load as best as we can. We can always adjust this default materialization based off feedback.

utils:
+materialized: ephemeral
vars:
47 changes: 10 additions & 37 deletions docs/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion integration_tests/dbt_project.yml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
config-version: 2

name: 'zendesk_integration_tests'
- version: '0.16.0'
+ version: '0.17.0'
Contributor

Can we add the variable configuration `zendesk__unstructured_enabled: true` in the vars config here so it can be enabled when generating the docs?

Contributor Author

@fivetran-joemarkiewicz At the time I did this I thought we didn't want this added to the manifest, and therefore not the docs. Is this not the case?


profile: 'integration_tests'

12 changes: 12 additions & 0 deletions macros/coalesce_cast.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
{% macro coalesce_cast(column_list, datatype) -%}
{{ return(adapter.dispatch('coalesce_cast', 'zendesk')(column_list, datatype)) }}
{%- endmacro %}

{% macro default__coalesce_cast(column_list, datatype) %}
coalesce(
{%- for column in column_list %}
cast({{ column }} as {{ datatype }})
{%- if not loop.last -%},{%- endif -%}
{% endfor %}
)
{% endmacro %}
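
To make the macro's output concrete, here is a minimal Python sketch of the SQL string it would render, ignoring whitespace differences. The function name `render_coalesce_cast` is hypothetical, and `varchar` is only a stand-in for whatever `dbt.type_string()` resolves to on a given warehouse.

```python
def render_coalesce_cast(column_list, datatype):
    """Build a coalesce(...) expression in which every argument is cast to one type.

    Hypothetical helper for illustration; it mirrors the loop in the macro above.
    """
    casts = [f"cast({column} as {datatype})" for column in column_list]
    return "coalesce(" + ", ".join(casts) + ")"


# 'varchar' is an assumed type name, not taken from the package.
print(render_coalesce_cast(["users.email", "'UNKNOWN'"], "varchar"))
```

Casting each argument first means the fallback literal and the column share a type, which some warehouses require inside `coalesce`.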
7 changes: 7 additions & 0 deletions macros/count_tokens.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
{% macro count_tokens(column_name) -%}
{{ return(adapter.dispatch('count_tokens', 'zendesk')(column_name)) }}
{%- endmacro %}

{% macro default__count_tokens(column_name) %}
{{ dbt.length(column_name) }} / 4
Contributor Author

Based on this doc and internal discussion, approximate count of tokens is appropriate.

{% endmacro %}
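
The approximation in this macro is simply character length divided by four. A rough Python equivalent is sketched below; `count_tokens_approx` is a hypothetical name, and the sketch uses true division, whereas the SQL result may be an integer or a decimal depending on how the warehouse divides integers.

```python
def count_tokens_approx(text, chars_per_token=4):
    """Estimate a token count from character length (assumes ~4 characters per token)."""
    return len(text) / chars_per_token


sample = "### message from Jane (jane@example.com)"  # made-up sample text
print(count_tokens_approx(sample))
```

This is a heuristic, not a tokenizer, so actual token counts for a given model can differ.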
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
{{ config(enabled=var('zendesk__unstructured_enabled', False)) }}

with ticket_comments as (
select *
from {{ var('ticket_comment') }}

), users as (
select *
from {{ var('user') }}

), comment_details as (
select
ticket_comments.ticket_comment_id,
ticket_comments.ticket_id,
{{ zendesk.coalesce_cast(["users.email", "'UNKNOWN'"], dbt.type_string()) }} as commenter_email,
{{ zendesk.coalesce_cast(["users.name", "'UNKNOWN'"], dbt.type_string()) }} as commenter_name,
ticket_comments.created_at as comment_time,
ticket_comments.body as comment_body
from ticket_comments
left join users
on ticket_comments.user_id = users.user_id
where not coalesce(ticket_comments._fivetran_deleted, False)
and not coalesce(users._fivetran_deleted, False)

), comment_markdowns as (
select
ticket_comment_id,
ticket_id,
comment_time,
cast(
{{ dbt.concat([
"'### message from '", "commenter_name", "' ('", "commenter_email", "')\\n'",
"'##### sent @ '", "comment_time", "'\\n'",
"comment_body"
]) }} as {{ dbt.type_string() }})
as comment_markdown
from comment_details

), comments_tokens as (
select
*,
{{ zendesk.count_tokens("comment_markdown") }} as comment_tokens
from comment_markdowns

), truncated_comments as (
select
ticket_comment_id,
ticket_id,
comment_time,
case when comment_tokens > {{ var('zendesk_max_tokens', 5000) }} then left(comment_markdown, {{ var('zendesk_max_tokens', 5000) }} * 4) -- approximate 4 characters per token
else comment_markdown
end as comment_markdown,
case when comment_tokens > {{ var('zendesk_max_tokens', 5000) }} then {{ var('zendesk_max_tokens', 5000) }}
else comment_tokens
end as comment_tokens
from comments_tokens
)

select *
from truncated_comments
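
The truncation step in `truncated_comments` can be sketched in Python as follows. The function name `truncate_comment` is hypothetical, and the default of 5000 mirrors the `var('zendesk_max_tokens', 5000)` fallback in the model; the token estimate assumes the 4-characters-per-token rule used by `count_tokens`.

```python
def truncate_comment(comment_markdown, max_tokens=5000, chars_per_token=4):
    """Cap a single comment at max_tokens, returning (text, approximate_tokens)."""
    tokens = len(comment_markdown) / chars_per_token
    if tokens > max_tokens:
        # Keep only the leftmost max_tokens * 4 characters, like left(...) in the SQL.
        return comment_markdown[: max_tokens * chars_per_token], max_tokens
    return comment_markdown, tokens


text, tokens = truncate_comment("x" * 50, max_tokens=10)
print(len(text), tokens)
```

Under this logic, any single comment longer than the limit loses its tail rather than being split across chunks.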
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
{{ config(enabled=var('zendesk__unstructured_enabled', False)) }}

with filtered_comment_documents as (
select *
from {{ ref('int_zendesk__ticket_comment_document') }}
),

grouped_comment_documents as (
select
ticket_id,
comment_markdown,
comment_tokens,
comment_time,
sum(comment_tokens) over (
partition by ticket_id
order by comment_time
rows between unbounded preceding and current row
) as cumulative_length
from filtered_comment_documents
)

select
ticket_id,
cast({{ dbt_utils.safe_divide('floor(cumulative_length - 1)', var('zendesk_max_tokens', 5000)) }} as {{ dbt.type_int() }}) as chunk_index,
{{ dbt.listagg(
measure="comment_markdown",
delimiter_text="'\\n\\n---\\n\\n'",
order_by_clause="order by comment_time"
) }} as comments_group_markdown,
sum(comment_tokens) as chunk_tokens
from grouped_comment_documents
group by 1,2
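
The grouping above assigns each comment a `chunk_index` from the running token total within a ticket. The Python sketch below illustrates that arithmetic under stated assumptions: the names `assign_chunks` and `group_comments` are hypothetical, the quotient is floored (the SQL's integer cast may round differently on some warehouses), and the delimiter uses real newlines.

```python
from itertools import accumulate
from math import floor


def assign_chunks(comment_tokens, max_tokens=5000):
    """Return a chunk index per comment, given token counts ordered by comment time."""
    return [floor((total - 1) / max_tokens) for total in accumulate(comment_tokens)]


def group_comments(comments, max_tokens=5000, delimiter="\n\n---\n\n"):
    """Group (markdown, tokens) pairs into {chunk_index: (joined_markdown, token_sum)}."""
    indexes = assign_chunks([tokens for _, tokens in comments], max_tokens)
    chunks = {}
    for (markdown, tokens), index in zip(comments, indexes):
        texts, total = chunks.get(index, ([], 0))
        chunks[index] = (texts + [markdown], total + tokens)
    return {index: (delimiter.join(texts), total) for index, (texts, total) in chunks.items()}


print(assign_chunks([3000, 1500, 1000, 4000]))
print(assign_chunks([100, 4950, 4950]))
```

In the second call the last two comments share one index, so a chunk's token sum can exceed `max_tokens`. The index depends on where the running total lands, not on the size of the chunk itself.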
42 changes: 42 additions & 0 deletions models/unstructured/intermediate/int_zendesk__ticket_document.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
{{ config(enabled=var('zendesk__unstructured_enabled', False)) }}

with tickets as (
select *
from {{ var('ticket') }}

), users as (
select *
from {{ var('user') }}

), ticket_details as (
select
tickets.ticket_id,
tickets.subject as ticket_name,
{{ zendesk.coalesce_cast(["users.name", "'UNKNOWN'"], dbt.type_string()) }} as user_name,
{{ zendesk.coalesce_cast(["users.email", "'UNKNOWN'"], dbt.type_string()) }} as created_by,
tickets.created_at as created_on,
{{ zendesk.coalesce_cast(["tickets.status", "'UNKNOWN'"], dbt.type_string()) }} as status,
{{ zendesk.coalesce_cast(["tickets.priority", "'UNKNOWN'"], dbt.type_string()) }} as priority
from tickets
left join users
on tickets.requester_id = users.user_id
where not coalesce(tickets._fivetran_deleted, False)
and not coalesce(users._fivetran_deleted, False)

), final as (
select
ticket_id,
{{ dbt.concat([
"'# Ticket : '", "ticket_name", "'\\n\\n'",
"'Created By : '", "user_name", "' ('", "created_by", "')\\n'",
"'Created On : '", "created_on", "'\\n'",
"'Status : '", "status", "'\\n'",
"'Priority : '", "priority"
]) }} as ticket_markdown
from ticket_details
)

select
*,
{{ zendesk.count_tokens("ticket_markdown") }} as ticket_tokens
from final
27 changes: 27 additions & 0 deletions models/unstructured/zendesk__document.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
{{ config(enabled=var('zendesk__unstructured_enabled', False)) }}

with ticket_document as (
select *
from {{ ref('int_zendesk__ticket_document') }}

), grouped as (
select *
from {{ ref('int_zendesk__ticket_comment_documents_grouped') }}

), final as (
select
cast(ticket_document.ticket_id as {{ dbt.type_string() }}) as document_id,
grouped.chunk_index,
grouped.chunk_tokens as chunk_tokens_approximate,
{{ dbt.concat([
"ticket_document.ticket_markdown",
"'\\n\\n## COMMENTS\\n\\n'",
"grouped.comments_group_markdown"]) }}
as chunk
from ticket_document
join grouped
on grouped.ticket_id = ticket_document.ticket_id
)

select *
from final
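
As a rough illustration of the final assembly, the sketch below builds one output row per comment chunk by prefixing the ticket header. `build_document_rows` is a hypothetical helper, and its `grouped_comments` argument is assumed to map `chunk_index` to a `(comments_markdown, chunk_tokens)` pair.

```python
def build_document_rows(ticket_id, ticket_markdown, grouped_comments):
    """Combine a ticket header with each comment chunk into document rows."""
    rows = []
    for chunk_index, (comments_markdown, chunk_tokens) in sorted(grouped_comments.items()):
        rows.append({
            "document_id": str(ticket_id),  # cast to string, as in the model
            "chunk_index": chunk_index,
            "chunk_tokens_approximate": chunk_tokens,  # comment tokens only
            "chunk": ticket_markdown + "\n\n## COMMENTS\n\n" + comments_markdown,
        })
    return rows


rows = build_document_rows(42, "# Ticket : Help", {1: ("c", 1000), 0: ("a", 4500)})
print(rows[0]["chunk"])
```

Two behaviors follow from the SQL as written: the ticket header is repeated in every chunk but its tokens are not added to `chunk_tokens_approximate`, and the inner join means a ticket with no comments produces no rows.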
14 changes: 14 additions & 0 deletions models/unstructured/zendesk_unstructured.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
version: 2

models:
- name: zendesk__document
description: Each record represents a chunk of text from a Zendesk ticket and its comments, prepared for vectorization in NLP workflows.
columns:
- name: document_id
description: Equivalent to `ticket_id`.
- name: chunk_index
description: The index of the chunk associated with the `document_id`.
- name: chunk_tokens_approximate
description: Approximate number of tokens for the chunk, assuming 4 characters per token.
- name: chunk
description: The text of the chunk.
4 changes: 4 additions & 0 deletions models/zendesk.yml
Original file line number Diff line number Diff line change
@@ -171,6 +171,8 @@ models:
description: Boolean indicating if the ticket had a satisfaction score that went from good to bad.
- name: is_bad_to_good_satisfaction_score
description: Boolean indicating if the ticket had a satisfaction score that went from bad to good.
- name: _fivetran_deleted
description: Boolean created by Fivetran to indicate whether the ticket has been deleted.

- name: zendesk__sla_policies
description: Each record represents an SLA policy event and additional SLA breach and achievement metrics. Calendar and business hour SLA breaches for `first_reply_time`, `next_reply_time`, `requester_wait_time`, and `agent_work_time` are supported. If there is an SLA you would like supported that is not included, please create a feature request.
@@ -492,6 +494,8 @@ models:
description: The time in minutes the ticket was in an unassigned state
- name: last_status_assignment_date
description: The time the status was last changed on the ticket
- name: _fivetran_deleted
description: Boolean created by Fivetran to indicate whether the ticket has been deleted.

- name: zendesk__ticket_summary
description: A single record table containing Zendesk ticket and user summary metrics. These metrics are updated for the current day the model is run.
8 changes: 5 additions & 3 deletions packages.yml
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
packages:
- - package: fivetran/zendesk_source
- version: [">=0.11.0", "<0.12.0"]
-
+ # - package: fivetran/zendesk_source
+ # version: [">=0.12.0", "<0.13.0"]
+ - git: https://github.com/fivetran/dbt_zendesk_source.git
+ revision: feature/unstructured-data
+ warn-unpinned: false
Comment on lines +2 to +6
Contributor Author

Suggested change
- # - package: fivetran/zendesk_source
- # version: [">=0.12.0", "<0.13.0"]
- - git: https://github.com/fivetran/dbt_zendesk_source.git
- revision: feature/unstructured-data
- warn-unpinned: false
+ - package: fivetran/zendesk_source
+ version: [">=0.12.0", "<0.13.0"]

- package: calogica/dbt_date
version: [">=0.9.0", "<1.0.0"]