Add iasWorld dbt tests and namespace all tests #236
Conversation
    @@ -109,7 +109,7 @@ models:
        tests:
          - unique_combination_of_columns:
              name: vw_card_res_char_unique_by_pin_card_and_year
All changes in `default/` are related to namespace changes. See PR body for more details.
    @@ -11,14 +11,14 @@
     {%- set columns_csv = additional_select_columns | join(", ") %}

     {%- set length_columns = [] %}
    -{% for column in columns %}
    +{%- for column in columns %}
The changes here are just to control whitespace in the compiled SQL output.
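As a rough illustration (a hypothetical model body, not code from this PR): the only functional difference is the `-` inside the Jinja delimiters, which strips the newline and indentation that would otherwise be copied into the compiled SQL on every loop iteration.

```sql
-- Hypothetical dbt model used only to illustrate Jinja whitespace control.
-- With "{% for %}", each iteration emits the newline and indentation around
-- the tag into the compiled SQL; "{%- for %}" trims that leading whitespace,
-- so the compiled query has no stray blank lines.
select
    parid
    {%- for col in ["rmbed", "rmtot"] %}
    , length({{ col }}) as {{ col }}_length
    {%- endfor %}
from {{ source("iasworld", "dweldat") }}
```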
    -where not ({{ expression }})
    +where not ({{ column_name }} {{ expression }})
I added this in order to use `expression_is_true` in column-level tests, i.e.:

    - name: rmbed
      tests:
        - expression_is_true:
            name: iasworld_dweldat_rmbed_lte_rmtot
            expression: <= rmtot
            select_columns:
              - parid
              - taxyr
              - card
It still works fine in the original context of passing a full expression.
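For reference, the `rmbed` example above should compile to roughly the following (a sketch; the table reference and select list are illustrative):

```sql
-- Approximate compiled output for the column-level example above. The test
-- returns failing rows, so it passes when this query returns zero rows.
select
    parid,
    taxyr,
    card,
    rmbed
from iasworld.dweldat
where not (rmbed <= rmtot)
```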
The changes in here are to facilitate testing for leading and trailing whitespace in strings such as addresses, i.e. we want `123 FISH STREET` to pass but ` 123 FISH STREET ` (the same address with stray leading or trailing whitespace) to fail.
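In other words, the intent is a predicate along these lines (a sketch, not the exact macro output; the table/column pairing is illustrative):

```sql
-- Illustrative only: flag address strings with leading or trailing whitespace.
-- '123 FISH STREET' passes; ' 123 FISH STREET' and '123 FISH STREET ' fail.
select parid, taxyr, addr1
from iasworld.pardat  -- illustrative table name
where not (addr1 = trim(addr1))
```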
    {%- if "." in column -%} {%- set model_col = column -%}
    {%- else -%} {%- set model_col = "model" ~ "." ~ column -%}
    {%- endif -%}

    {%- if "." in external_column -%} {%- set external_model_col = external_column -%}
    {%- else -%}
    {%- set external_model_col = "external_model" ~ "." ~ external_column -%}
    {%- endif -%}
The yucky formatting here is a result of `sqlfmt` 🤷
This test is almost explicitly for testing that the `class` column matches across different tables, i.e. that a PIN's class in `PARDAT` matches its class in `LEGDAT`. One subtlety is that this will not return a row if any row has a matching class after the join. So, if a `PARDAT` PIN is class 211 and it joins to `DWELDAT` records for two buildings on the same PIN (one class 211, one class 204), then no records are returned (since one `DWELDAT` record has a 211).
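A sketch of that behavior, assuming the test compiles to roughly this anti-join shape (table and column names follow the example above):

```sql
-- A PIN/year only appears as a failure if NONE of its joined rows share its
-- class. For the example above, the class-211 DWELDAT card matches the
-- class-211 PARDAT row, so the NOT EXISTS filters the parcel out even though
-- a class-204 card also exists on the same PIN.
select p.parid, p.taxyr, p.class
from iasworld.pardat as p
where not exists (
    select 1
    from iasworld.dweldat as d
    where d.parid = p.parid
      and d.taxyr = p.taxyr
      and d.class = p.class
)
```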
    name: reporting_ratio_stats_no_nulls
    expression: |
      year IS NOT NULL
      AND triad IS NOT NULL
      AND geography_type IS NOT NULL
      AND property_group IS NOT NULL
      AND assessment_stage IS NOT NULL
      AND sale_year IS NOT NULL
All the deindenting and changes to `|` are just to get a cleaner compiled output, i.e. one with line breaks.
    tests:
      - relationships:
          name: iasworld_addn_class_in_ccao_class_dict
          to: source('ccao', 'class_dict')
          field: class_code
          config:
            where: |
              taxyr >= '2022'
              AND class != 'EX'
              AND cur = 'Y'
              AND deactivat IS NULL
This test is on basically every `class` column in iasWorld now and is designed to catch issues like people adding a dash to classes, e.g. `2-12`.
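For context, a dbt `relationships` test compiles to roughly the following shape (a sketch with illustrative table names). A malformed class like `2-12` fails because it has no matching `class_code` in `class_dict`:

```sql
-- Approximate shape of the relationships test above: class values in ADDN
-- (subject to the where-config) that have no matching class_code in the
-- ccao.class_dict parent table are returned as failures.
with child as (
    select class as from_field
    from iasworld.addn
    where class is not null
      and taxyr >= '2022'
      and class != 'EX'
      and cur = 'Y'
      and deactivat is null
),
parent as (
    select class_code as to_field
    from ccao.class_dict
)
select child.from_field
from child
left join parent
    on child.from_field = parent.to_field
where parent.to_field is null
```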
    config: &unique-conditions
      where: cur = 'Y' AND deactivat IS NULL
Since many tests require the same conditionals, I used YAML anchors a lot to repeat the same conditions across tests in the same schema file.
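For instance, a later test in the same file can reuse the anchored config via a YAML alias (a minimal sketch; the test name and column list are hypothetical, but the alias matches the `&unique-conditions` anchor above):

```yaml
# Illustrative reuse of the anchor: "*unique-conditions" expands to the same
# mapping, i.e. where: cur = 'Y' AND deactivat IS NULL.
- unique_combination_of_columns:
    name: iasworld_pardat_unique_by_parid_taxyr  # hypothetical test name
    combination_of_columns:
      - parid
      - taxyr
    config: *unique-conditions
```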
    where: |
      taxyr >= '2022'
      AND class NOT IN ('EX', 'RR', '999')
      AND NOT REGEXP_LIKE(class, '[0-9]{3}[A|B]')
This excludes any subclasses such as `693A` or similar.
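Concretely (illustrative values only), the regex matches three digits followed by a letter suffix, so the `NOT REGEXP_LIKE` drops those rows from the tested population:

```sql
-- Only '211' survives the filter; '693A' and '211B' are treated as subclasses
-- and excluded. (Athena/Presto regexp_like, matching the where-config above.)
select class
from (values ('693A'), ('211B'), ('211')) as t (class)
where not regexp_like(class, '[0-9]{3}[A|B]')
```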
    where: |
      cur = 'Y'
      AND deactivat IS NULL
      AND class NOT IN ('201', '213', '218', '219', '220', '221', '224', '225', '236', '240', '241', '290', '294', '297')
`DWELDAT` has a bunch of non-regression classes that don't actually have any characteristic data, hence the exclusion here.
    expression: |
      CASE
        WHEN LENGTH(REGEXP_EXTRACT(addr1, '^[N|S|E|W]{0,2}\s{0,1}([0-9]{1,5})',1)) <= 5 THEN TRUE
        WHEN REGEXP_LIKE(UPPER(addr1), '^[PO BOX|P\.O\. BOX|P O BOX|BOX|P O BX|PO BX|PO BX|P.O.BOX]') THEN TRUE
        WHEN REGEXP_LIKE(UPPER(addr1), '^[C\/O|ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE]') THEN TRUE
        ELSE FALSE
      END
Basically just testing whether the address starts with a street number of at most five digits, or is a PO box.
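To make the first branch concrete (illustrative values; Athena/Presto `regexp_extract`):

```sql
-- The first CASE branch extracts the leading street number, optionally
-- preceded by up to two direction letters (e.g. 'N' or 'SW'); the branch is
-- TRUE when a number is found and the extracted value is at most five
-- characters long. The PO box row returns NULL here and is handled by the
-- later REGEXP_LIKE branches instead.
select
    addr1,
    regexp_extract(addr1, '^[N|S|E|W]{0,2}\s{0,1}([0-9]{1,5})', 1) as street_num
from (values ('123 FISH STREET'), ('N123 FISH STREET'), ('PO BOX 55')) as t (addr1)
-- street_num: '123', '123', NULL
```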
Awesome job with these tests! I'm surprised they execute so quickly (8m9s in the test run you linked); not bad for 294 queries.
> I'm not exactly sure what the process should be to make corrections/decisions, given there are so many failures. We could:
>
> 1. Append an `error_if` threshold to every failing test, then slowly work our way through them ourselves.
> 2. Collect all the failures in a spreadsheet (or multiple sheets), then send the sheet to Valuations.
> 3. Use this PR to work through all the failing tests, and try to fix/address each one.
So far we've been following approach 1, but I don't necessarily think that means it's the best option. Ideally we would do 2 or 3, but I'm concerned about spamming ourselves with failures for the duration of the time it takes us to correct mistakes, particularly given that tests might end up running on PRs if someone were to edit an underlying model. (I suppose that's not an issue for the iasWorld tests, though, since we don't actually edit those tables directly.)
In an ideal world, I think the approach we discussed last week is probably the right one:

- Tag all unit tests that are intended to test that our transformations do what we expect
- Update the `build_and_test_dbt` workflow to only run unit tests (e.g. by selecting on a tag, as sketched below)
- Allow integration tests to fail while we fix them, but insist that unit tests pass
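For what it's worth, the tagging piece might look something like this (a sketch; the tag name and selector are hypothetical, not something already in the repo), with the workflow then running `dbt test --select tag:unit_test`:

```yaml
# Illustrative only: tag a test so CI can select it by tag.
- expression_is_true:
    name: iasworld_dweldat_rmbed_lte_rmtot
    expression: <= rmtot
    config:
      tags: ["unit_test"]  # hypothetical tag name
```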
I don't think this work would be particularly difficult, but I'm not actually sure which of our existing tests (if any) correspond to unit tests, so I don't have high confidence that this would make sense for our current set of tests.
All that being said, I'm comfortable with all of the options you outlined, as long as we all understand which approach we're deciding to take.
    # - FP Checklist - Non-EX, RR parcels with 0 land value
    # - FP Checklist - Non-EX, RR parcels with 0 value
    tests:
      - dbt_utils.accepted_range: &test-non-negative
[Praise] Nice use of anchors here!
    # Note that this test is NOT actually the unique primary key of
    # this table, since there doesn't seem to BE a unique combination
    # of identifying columns
[Question, non-blocking] Hm, does that mean we should just remove this test then?
I think it's still useful, since an error firing would indicate an increasing number of duplicate values.
[Question, non-blocking] Any particular meaning to these newlines?
Fixed in 70b5902!
Okay, I think I've landed on a good approach based on this. I tagged all QC/source tests and updated the `build_and_test_dbt` workflow accordingly. Thoughts @jeancochrane? Here are the commits since last time if you want to review.
This PR adds 250+ dbt tests covering major iasWorld tables and columns. Test definitions are drawn from internal QC queries (via Inquire) where possible. However, most tests check "common sense" things: no assessed values less than 0, all PINs in a table exist in `PARDAT`, etc.

Many of the tests in this PR are currently failing (99 out of 294). This is by design. In most cases, the failures reveal some kind of data error that needs a correction or decision. Some of them may be false positives, but I've checked many of them by hand to ensure they work.
I'm not exactly sure what the process should be to make corrections/decisions, given there are so many failures. We could:

1. Append an `error_if` threshold to every failing test (see the config sketch after this list), then slowly work our way through them ourselves.
2. Collect all the failures in a spreadsheet (or multiple sheets), then send the sheet to Valuations.
3. Use this PR to work through all the failing tests, and try to fix/address each one.

Curious to hear thoughts @jeancochrane @wrridgeway.
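The `error_if` approach in option 1 would look roughly like this per test (a sketch; the threshold value is hypothetical):

```yaml
# Illustrative only: let a known-failing test warn instead of erroring until
# the underlying data issue is corrected, by raising the error threshold.
- expression_is_true:
    name: iasworld_dweldat_rmbed_lte_rmtot
    expression: <= rmtot
    config:
      warn_if: ">0"
      error_if: ">100"  # hypothetical threshold
```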
Note

This PR also adds namespacing to test names in the form of `$SCHEMA_$TABLE_$COLUMN_$TESTNAME`. This makes the test output easier to parse. I also truncated the test names so they don't overflow the default test output width. All file changes to `location/`, `reporting/`, and `default/` are related to this change and can be largely ignored.

I apologize in advance for the massive diff. I should've PR'd this sooner but got carried away. I don't think it's actually necessary to review this whole PR; happy to chat about how to manage it.
Link to full test output