Optional field cardinality in documentation #2929
Comments
Or adding an
@stevebuckingham I buy it, I think this is a really cool idea. Let's figure out what this will involve. (And I agree that having it off by default is critical, especially on databases like BigQuery where non-metadata queries can be costly.)
under the hood
dbt powers the documentation site through two artifacts:
Presently, the catalog contains information about each relational object (view, table, etc) that dbt grabs, either from the
In order to power the inclusion of additional info, such as cardinality, we'd need to:
developer experience
We already have a general resource property, docs, which is actually a dictionary: it could contain multiple attributes. At present, the only available attribute is show:
models:
  - name: my_model
    docs:
      show: false
Per your proposal, I think we could extend
Let's take this one step further:
(If it's not clear, I'm inspired by R's
Should we offer each of
models:
  - name: my_model
    docs:
      summary: true
dbt will calculate the top ten values of each string-type column, the min/mean/max/etc of each int column, the bucketed counts of each date/time column, and so on. Curious to hear what you think :)
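To make the proposed summary concrete, here is a minimal sketch of the per-type dispatch described above, written in plain Python over in-memory values rather than against a warehouse. The function name `summarize_column`, the `top_n` parameter, and the choice of monthly buckets for dates are all assumptions made for illustration; none of this is existing dbt behavior.

```python
from collections import Counter
from datetime import date
from statistics import mean


def summarize_column(values, top_n=10):
    """Summarize one column's non-null values based on their Python type.

    Hypothetical helper: strings get their most common values, numbers get
    min/mean/max, and dates get counts bucketed by calendar month.
    """
    values = [v for v in values if v is not None]
    if not values:
        return {"kind": "empty"}

    first = values[0]
    if isinstance(first, str):
        # Most frequent values first, capped at top_n.
        return {"kind": "string", "top_values": Counter(values).most_common(top_n)}
    if isinstance(first, date):
        # Bucket by "YYYY-MM"; the bucket size is an assumption of this sketch.
        buckets = Counter(v.strftime("%Y-%m") for v in values)
        return {"kind": "date", "buckets": sorted(buckets.items())}
    if isinstance(first, (int, float)):
        return {
            "kind": "numeric",
            "min": min(values),
            "mean": mean(values),
            "max": max(values),
        }
    return {"kind": "unknown"}


strings = summarize_column(["active", "trial", "active", None])
numbers = summarize_column([1, 2, 3])
dates = summarize_column([date(2020, 1, 5), date(2020, 1, 20), date(2020, 2, 1)])
```

A real implementation would presumably push these aggregations down into SQL per adapter, since pulling column values into Python is not practical for large tables.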
Yes - there are a few gotchas - for instance in Redshift it is probably better to use
Agreed. Being able to set at table level is a nice to have - but maybe dangerous, e.g. someone applies it to a large time-series table.
I like this idea. I would make Top 10 either optional or configurable with a default of 10. But I would also allow it for integers because things like priority are often integers even though we use them like dimensions.
A simple split might be
What do you think?
More performant than
Heard—I think of those as "factors" (again inspired by R). They're really enum values that should be thought of like strings. You're right to raise it, dbt would not be able to infer the correct move based on its data type alone. So, to get started, it sounds like you're thinking about
Thanks for the feedback and advice on where to look. I'll have a deeper look at the code now.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Describe the feature
Understanding field cardinality is often important for using data effectively, e.g. knowing the possible values of a customer_status field. This change would add a field-level flag, off by default, that adds a list of the field's values, and maybe their counts, to the DBT documentation. In simple terms it would run one of these for each field:
SELECT DISTINCT field_name FROM schema_name.table_name
and possibly LIMIT 10
OR
SELECT field_name, COUNT(*) AS field_total FROM schema_name.table_name GROUP BY field_name ORDER BY COUNT(*) DESC
and possibly LIMIT 10
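As a rough illustration of the counting query, the sketch below runs it against an in-memory SQLite database using Python's standard `sqlite3` module. The helper name `top_values`, the sample `customers` table, and its rows are made up for this example; note that the count-per-value form needs a GROUP BY on the field to return one row per distinct value.

```python
import sqlite3


def top_values(conn, table, column, limit=10):
    """Return up to `limit` (value, count) pairs, most frequent first.

    Hypothetical helper for illustration only.
    """
    # Identifiers cannot be bound as query parameters, so quote them instead.
    def quote(ident):
        return '"' + ident.replace('"', '""') + '"'

    sql = (
        f"SELECT {quote(column)}, COUNT(*) AS field_total "
        f"FROM {quote(table)} "
        f"GROUP BY {quote(column)} "
        # Tie-break on the value so the ordering is deterministic.
        f"ORDER BY field_total DESC, {quote(column)} "
        f"LIMIT ?"
    )
    return conn.execute(sql, (limit,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_status TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?)",
    [("active",), ("active",), ("active",), ("churned",), ("trial",), ("trial",)],
)

result = top_values(conn, "customers", "customer_status")
print(result)  # [('active', 3), ('trial', 2), ('churned', 1)]
```

On a real warehouse the same query shape applies, but each flagged field triggers a full scan of its table, which is why defaulting the flag to off matters.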
Describe alternatives you've considered
Manually annotating the descriptions with expected values.
Additional context
This could be applied to all databases. Default to off for a field would help eliminate accidental long build times for the documentation.
Who will this benefit?
Any organization that uses the DBT documentation to share their model definitions and would like a deeper understanding of what key fields look like.
Are you interested in contributing this feature?
I would be interested in writing some documentation code to deliver this.