inline documentation rendered as eg. Markdown #375

jthandy · 2017-04-12T11:32:21Z

This is from @pedromachado in slack:

I've been thinking that it would be good to have a documentation feature built into dbt. Perhaps something that could leverage 1) especially formatted comments embedded in the SQL and 2) the model dependency graph to generate some documentation or at least a diagram with all the tables/columns.

drewbanin · 2017-04-12T12:09:41Z

This is absolutely where my head is at. I think this could look something like Javadoc.

That, coupled with some dbt queries to infer the schemas of the models, plus constraint info provided by schema tests, would be incredibly cool.

drewbanin · 2017-05-10T14:33:40Z

@dwallace0723 I understand that you've been doing some documentation work within a dbt project -- what kinds of things do you find yourself documenting? We've thought about

interactions between models
high-level information about specific models
information about the SQL within a model (eg. unusual sql, odd business logic, etc)

Is that about congruent with your experiences?

dwallace0723 · 2017-05-10T18:52:44Z

@drewbanin So far, I've been using documentation mostly for the third bullet point you listed. That being said, I can definitely see myself needing proper documentation for the first two bullets as well.

The way I've been doing this for now is having a comment above a line of SQL that links to the corresponding piece of documentation located in a .md file within the dbt project directory.

In practice, the flow looks like this:

Not the most seamless experience, but it gets the job done for now.

jthandy · 2017-05-10T18:56:38Z

that is cool

HarlanH · 2017-08-02T21:05:26Z

I'd be interested in working with someone on a minimal initial version of this. I'd be perfectly happy with the following minimal features:

extraction of a block of markdown for each model
some sort of column-level syntax for documenting columns
ideally, an auto-generated list of models/tables used by the given model

Something like:

{{ usual dbt stuff goes here }}

--@ multiline block markdown goes here
--@ and here...
--@ don't really know if @ is the best character

SELECT 
  cat,  --@col strays are not included
  dog1 AS dog --@col not the same as dog2!
FROM {{ ref('whatever') }}

Stripping these out via regex and dropping them into a single markdown or similar file shouldn't be a lot of work. Dealing with references wouldn't be too bad either. Thoughts? Tons of room for enhancements, but even this would be fantastic.

drewbanin · 2017-08-02T21:54:31Z

@HarlanH I'm super happy to see some pseudo code for this :)

I like the idea of using --@ for multi-line block comments, but regex parsing the raw SQL is a little scary to me. I've given this some thought, and I think the SQL implementation might have to look something like:

select
  {{ column('cat', comment='strays are not included') }},
  {{ column('dog1', as='dog', comment='not the same as dog2!') }}
from {{ ref('whatever') }}

^ I think that's super gross, and would prefer something that looks a little more like regular SQL.

Another option is to repurpose the schema.yml file that's currently used for testing. I can imagine something like this:

my_model:
  # same as before
  constraints:
    not_null:
      - field_1
      - field_2

  # also provide for block comments, etc
  documentation:
    - field: field_1
      comment: This field is made of grass

    - field: field_2
      comment: This field is made of Athenry

While this is obviously more readily parseable, i do really like the idea of colocating the docs with the code. There's also the question of where these markdown docs get compiled to, if they're versioned, etc etc. So, still some fundamental and unanswered questions here I think.

I'd love to keep pulling on this thread, so please let us know if you have more thoughts here!

HarlanH · 2017-08-03T15:36:10Z

I guess I'm OK with convention over code for the column docs, and agree that jinja's pretty ugly for this sort of thing. I also am unsure about how to handle the first of these:

(CASE WHEN campaign_id LIKE 'job-notification-email - weekly %' THEN 'weekly'
		  WHEN campaign_id LIKE 'job-notification-email - daily %' THEN 'daily'
		  WHEN campaign_id LIKE 'MPC%' THEN 'MPC'
		  WHEN campaign_id LIKE '%every_time%' then 'every_time' 
		  ELSE NULL END)::varchar(25) AS campaign_type, -- @col one of daily, weekly, every_time, MPC
rcpt_to::varchar(254) AS user_email, -- @col envelope recipient

I definitely would not want to embed that in a jinja block... On the other hand, I'd expect the output as written to look something like:

* `campaign_type` (complex) -- one of daily, weekly, every_time, MPC
* `user_email` (`varchar(254)`, from `rcpt_to`) -- envelope recipient

At least that'd be perfectly adequate initially. If the line can't be parsed into <from_col>(::<cast>) (AS <to_col>)(,) -- @col <comment>, then do nothing.

I would prioritize in-line docs over a solution like leveraging the schema, even if it reduces functionality -- too much sad experience trying to keep separate files in sync!

On compilation, I was thinking of a dbt doc step that would compile them to a docs target directory, maybe initially to a single markdown file (a ToC would be cool). That could be run as part of dbt run, or not, I dunno.

On versioning, what do you have in mind?

HarlanH · 2017-08-03T21:00:25Z

I wrote a quick-and-dirty standalone script that seems to do a fair amount of what I want. It's regex-heavy, so a bit fragile, but seems to be a decent start.

https://gist.github.com/HarlanH/277006989774372515c7130e63809315

drewbanin · 2017-08-03T21:45:46Z

@HarlanH cool! Unsure if you're aware, but dbt generates a Networkx graph file in target/graph.gpickle when dbt compile or dbt run are invoked. That file contains all of the SQL models, as well as info about model materialization and macro/node dependencies. That file isn't currently documented anywhere, but you can read it with nx.read_gpickle(...) and take a look at the contents.

pedromachados · 2017-08-06T17:48:18Z

I also would prefer to define the comments in the model file, close to the query it documents. If the issue is parsing sql reliably, can we not define the commens in a block that is easy to isolate from the query, similar to the way configuration code is inserted? I would prefer a syntax that allows me to define start and end of comment block and to put comments in a very readable format. Something like yml could work but something less verbose would be even better.

As far a output, I would love to see an interactive, graphical representation of the network chart that shows the model doc and compiled sql upon you clicking a on a node. It would also be good to have a printable format but I don't know how to make this work with large graphs.

HarlanH · 2017-08-07T14:58:51Z

Pedro, yeah, the biggest design trade-off seems to me to be whether you want to wrap the SELECT columns in non-SQL syntax, ala Drew's suggestion, or if you want to try to parse the SQL (either buggily with regexes, or with a real parser), or if you want to force the author to repeat herself.

Drew, does the existing dbt implementation actually run a SQL parser at any point?

drewbanin · 2017-08-07T15:14:42Z

@HarlanH dbt does do an iota of sql parsing, but the plan is to get rid of that soon. I definitely think some sort of jinja function is the way to go here. For me, the core unanswered questions are:

What's a sane interface for defining these comments?
How much should dbt know about documentation?

For 1) I think we'll just need to play around with some different options. It could be jinja functions, or some sort of jinja block.... tbd

Point 2) is a little more open-ended. It's tempting to just write a ton of python code to generate markdown files, but I really like the idea of making dbt mostly unaware of things like documentation. Instead, dbt would provide all of the relevant information, and we could encode the documentation logic into something like a package which could very well be shipped along with dbt.

dbt already has a notion of "analysis" files which are compiled, but not executed. I think in a similar way, these docs would be "compiled" to markdown, but of course not executed. We'd just need to figure out templating, expose the list of models / comments in the context, and figure out how to change the output directory for the compiled markdown.

^ That's kind of where my head is at. Does that seem reasonable?

HarlanH · 2017-08-07T15:51:09Z

Hm, I don't really like this syntax for SELECT clauses, especially for really complex column definitions:

{{ column('cat', comment='strays are not included') }},
{{ column('dog1', as='dog', comment='not the same as dog2!') }}

Does the contents of a {{ }} block have to be Python? And/or is there some other way to use jinja to extract information out of the SQL?

Could something like this be made to work, or does it violate dbt design assumptions?

{% col  one of daily, weekly, every_time, MPC %}
(CASE WHEN campaign_id LIKE 'job-notification-email - weekly %' THEN 'weekly'
		  WHEN campaign_id LIKE 'job-notification-email - daily %' THEN 'daily'
		  WHEN campaign_id LIKE 'MPC%' THEN 'MPC'
		  WHEN campaign_id LIKE '%every_time%' then 'every_time' 
		  ELSE NULL END)::varchar(25) {% AS %} campaign_type, 
{% endcol %}
{% col envelope recipient %}
rcpt_to::varchar(254) {% AS %} user_email, 
{% endcol %}
{% col user ID %}
id,
{% endcol %}

clrcrl · 2017-11-16T22:30:01Z

+1 to this feature! Ultimately, I want to build a databook (like this!) that is tightly coupled to my code - one thing I've thought about is if I could store these definitions in another table, similar to pg_table_def or information_schema.columns. That way I can pull out definitions pretty easily.

Something that I find interesting about this approach:

{{ column('cat', comment='strays are not included') }},
{{ column('dog1', as='dog', comment='not the same as dog2!') }}

is that you could then potentially open up the door to referencing previous columns when building fields (I miss this feature of LookML and find myself repeating code or having to write lots of cascading CTEs).

e.g. (running with cats and dogs, let's pretend this is a list of people in my office and the pets they own, and that the only pets you can have are cat or dogs)

{{ column('employee_id') }}
{{ column('number_of_cats', comment='strays are not included') }},
{{ column(col('number_of_cats') > 0 , as 'has_cat', comment='TRUE when number_of_cats > 0') }},
{{ column(col('number_of_cats') >= 3 , as 'is_crazy_cat_person', comment='TRUE when number_of_cats > 3') }},
{{ column('number_of_dogs', as='dog', comment='not the same as dog2!') }},
{{ column(col('number_of_dogs') > 0 , as 'has_dog', comment='TRUE when number_of_dogs > 0') }},
{{ column(col('has_cat') or col ('has_dog), as 'has_pet', comment='TRUE when has_cat or has_dog') }}

Agree that then gets a hard to read, and that probably strays too far from regular SQL, and also is not the point of this issue, but just a thought 😸

norton120 · 2017-11-17T23:05:38Z

Funny I was thinking something exactly along the same lines as @HarlanH mentioned here parsing sql comments. What felt good about that was that it still serves some value as raw sql:

SELECT 
        number_of_cats, --@ strays not included
       number_of_dogs AS dog_number --@ also includes large hamsters

is still valid sql and even serves as local documentation if nothing else.
right now at the bottom of each analytics model we drop a config block with all the comments as post hooks; it is not perfect but at least they all live in the same file.

{{ config({
     "post-hook":[
          "COMMENT IF EXISTS ON COLUMN {{this}}.number_of_cats IS 'strays not included'",
          "COMMENT IF EXISTS ON COLUMN {{this}}.number_of_dogs IS 'also includes large hamsters'"]
})}}

Could probably clean it up further with a macro, but maybe doing something similar to what Drew said inside the models?

Another thing that seemed interesting would be a config value to "inherit" comments from ephemeral models - ie define the comments as far up the pipe as possible (or as close to the defining transforms as possible).

HarlanH · 2017-11-18T13:43:26Z

Just a note that I've got a lot of columns now that are defined by macros, so column comments often can't extract a useful name from the pre-macro SQL (although I suppose if the comment-extractor worked on the compiled versions? hm...)

...
{{ my_macro('thing', 'that', 'constructs', 'one or more', 'columns') }}, -- @col what am I supposed to do with this?
...

drewbanin · 2018-09-04T21:25:23Z

Closing this - docs site is going out in 0.11.0 on 9/6 :)

drewbanin changed the title ~~Documentation~~ inline documentation rendered as eg. Markdown May 10, 2017

HarlanH mentioned this issue Sep 7, 2017

dbt info command for listing information about a model #500

Closed

drewbanin mentioned this issue Nov 8, 2017

test coverage #589

Closed

drewbanin added the auto-generated docs label Nov 10, 2017

drewbanin mentioned this issue Jun 12, 2018

new schema.yml syntax #790

Closed

drewbanin added dbt-docs [dbt feature] documentation site, powered by metadata artifacts and removed auto-generated docs labels Jun 25, 2018

drewbanin mentioned this issue Jun 26, 2018

Docs blocks #810

Closed

drewbanin added this to the 0.11.0 - Isaac Asimov (unreleased) milestone Aug 17, 2018

drewbanin closed this as completed Sep 4, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

inline documentation rendered as eg. Markdown #375

inline documentation rendered as eg. Markdown #375

jthandy commented Apr 12, 2017

drewbanin commented Apr 12, 2017 •

edited

Loading

drewbanin commented May 10, 2017

dwallace0723 commented May 10, 2017

jthandy commented May 10, 2017

HarlanH commented Aug 2, 2017

drewbanin commented Aug 2, 2017

HarlanH commented Aug 3, 2017

HarlanH commented Aug 3, 2017

drewbanin commented Aug 3, 2017

pedromachados commented Aug 6, 2017

HarlanH commented Aug 7, 2017

drewbanin commented Aug 7, 2017

HarlanH commented Aug 7, 2017

clrcrl commented Nov 16, 2017

norton120 commented Nov 17, 2017 •

edited

Loading

HarlanH commented Nov 18, 2017

drewbanin commented Sep 4, 2018 •

edited

Loading

inline documentation rendered as eg. Markdown #375

inline documentation rendered as eg. Markdown #375

Comments

jthandy commented Apr 12, 2017

drewbanin commented Apr 12, 2017 • edited Loading

drewbanin commented May 10, 2017

dwallace0723 commented May 10, 2017

jthandy commented May 10, 2017

HarlanH commented Aug 2, 2017

drewbanin commented Aug 2, 2017

HarlanH commented Aug 3, 2017

HarlanH commented Aug 3, 2017

drewbanin commented Aug 3, 2017

pedromachados commented Aug 6, 2017

HarlanH commented Aug 7, 2017

drewbanin commented Aug 7, 2017

HarlanH commented Aug 7, 2017

clrcrl commented Nov 16, 2017

norton120 commented Nov 17, 2017 • edited Loading

HarlanH commented Nov 18, 2017

drewbanin commented Sep 4, 2018 • edited Loading

drewbanin commented Apr 12, 2017 •

edited

Loading

norton120 commented Nov 17, 2017 •

edited

Loading

drewbanin commented Sep 4, 2018 •

edited

Loading