Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add iasWorld dbt tests and namespace all tests #236

Merged
merged 32 commits into from
Nov 21, 2023

Conversation

dfsnow
Copy link
Member

@dfsnow dfsnow commented Nov 17, 2023

This PR adds 250+ dbt tests covering major iasWorld tables and columns. Tests definitions are drawn from internal QC queries (via Inquire) where possible. However, most tests are checking "common sense" things like no assessed values less than 0, all PINs in a table exist in PARDAT, etc.

Many of the tests in this PR are currently failing (99 out of 294 are failing). This is by design. In most cases, the failures reveal some kind of data error that needs a correction or decision. Some of them may be false positives, but many of them I've checked by hand to ensure they work.

I'm not exactly sure what the process should be to make corrections/decisions, given there are so many failures. We could:

  1. Append an error_if threshold to every failing test, then slowly work our way through them ourselves.
  2. Collect all the failures in a spreadsheet (or multiple sheets), then send the sheet to Valuations.
  3. Use this PR to work through all the failing tests, and try to fix/address each one.

Curious to hear thoughts @jeancochrane @jeancochrane @wrridgeway.

Note

This PR also adds name spacing to test names in the form of $SCHEMA_$TABLE_$COLUMN_$TESTNAME. This makes the test output easier to parse. I also truncated the test names so they don't overflow the default test output width. All file changes to location/, reporting/, and default/ are related to this change and can be largely ignored.

I apologize in advance for the massive diff. I should've PR'd this sooner but got carried away. I don't think it's actually necessary to review this whole PR, happy to chat about how to manage it.

Link to full test output

@dfsnow dfsnow requested a review from a team as a code owner November 17, 2023 02:09
@dfsnow dfsnow marked this pull request as draft November 17, 2023 02:09
@dfsnow dfsnow force-pushed the dansnow/add-iasworld-tests branch from 6102040 to 10da78c Compare November 17, 2023 23:05
@@ -109,7 +109,7 @@ models:

tests:
- unique_combination_of_columns:
name: vw_card_res_char_unique_by_pin_card_and_year
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes in default/ are related to namespace changes. See PR body for more details.

@@ -11,14 +11,14 @@
{%- set columns_csv = additional_select_columns | join(", ") %}

{%- set length_columns = [] %}
{% for column in columns %}
{%- for column in columns %}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here are just to control whitespace in the compiled SQL output.

where not ({{ expression }})
where not ({{ column_name }} {{ expression }})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this in order to use expression_is_true in column-level tests, i.e.:

          - name: rmbed
            tests:
              - expression_is_true:
                  name: iasworld_dweldat_rmbed_lte_rmtot
                  expression: <= rmtot
                  select_columns:
                    - parid
                    - taxyr
                    - card

It still works fine in the original context of passing a full expression.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in here are to facilitate testing for leading and trailing whitespace in strings such as addresses, i.e. we want 123 FISH STREET to pass but 123 FISH STREET to fail.

Comment on lines +11 to +18
{%- if "." in column -%} {%- set model_col = column -%}
{%- else -%} {%- set model_col = "model" ~ "." ~ column -%}
{%- endif -%}

{%- if "." in external_column -%} {%- set external_model_col = external_column -%}
{%- else -%}
{%- set external_model_col = "external_model" ~ "." ~ external_column -%}
{%- endif -%}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The yucky formatting here is a result of sqlfmt 🤷

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is almost explicitly for testing that the class column matches across different tables i.e. a PIN's class in PARDAT matches its class in LEGDAT. One subtlety is that this will not return a row if any row has a matching class after the join. So, if a PARDAT PIN is class 211 and it joins to a DWELDAT record with a two buildings on the same PIN - one class 211, one class 204 - then no records are returned (since the DWELDAT record has a 211).

Comment on lines +10 to +17
name: reporting_ratio_stats_no_nulls
expression: |
year IS NOT NULL
AND triad IS NOT NULL
AND geography_type IS NOT NULL
AND property_group IS NOT NULL
AND assessment_stage IS NOT NULL
AND sale_year IS NOT NULL
AND triad IS NOT NULL
AND geography_type IS NOT NULL
AND property_group IS NOT NULL
AND assessment_stage IS NOT NULL
AND sale_year IS NOT NULL
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the deindenting and changes to | are just to get a cleaner compiled output i.e. one with line breaks.

Comment on lines +42 to +52
tests:
- relationships:
name: iasworld_addn_class_in_ccao_class_dict
to: source('ccao', 'class_dict')
field: class_code
config:
where: |
taxyr >= '2022'
AND class != 'EX'
AND cur = 'Y'
AND deactivat IS NULL
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is on basically every class column in iasWorld now and is designed to catch issues like people adding a dash to classes, e.g. 2-12.

Comment on lines +39 to +40
config: &unique-conditions
where: cur = 'Y' AND deactivat IS NULL
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since many tests require the same conditionals, I used YAML anchors a lot to repeat the same conditions across tests in the same schema file.

where: |
taxyr >= '2022'
AND class NOT IN ('EX', 'RR', '999')
AND NOT REGEXP_LIKE(class, '[0-9]{3}[A|B]')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excluding any subclasses such as 693A or similar.

where: |
cur = 'Y'
AND deactivat IS NULL
AND class NOT IN ('201', '213', '218', '219', '220', '221', '224', '225', '236', '240', '241', '290', '294', '297')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DWELDAT has a bunch of non-regression classes that don't actually have any characteristic data, hence the exclusion here.

Comment on lines +255 to +261
expression: |
CASE
WHEN LENGTH(REGEXP_EXTRACT(addr1, '^[N|S|E|W]{0,2}\s{0,1}([0-9]{1,5})',1)) <= 5 THEN TRUE
WHEN REGEXP_LIKE(UPPER(addr1), '^[PO BOX|P\.O\. BOX|P O BOX|BOX|P O BX|PO BX|PO BX|P.O.BOX]') THEN TRUE
WHEN REGEXP_LIKE(UPPER(addr1), '^[C\/O|ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE]') THEN TRUE
ELSE FALSE
END
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically just testing whether the address starts with a street number less than 5 digits or is a PO Box.

@dfsnow dfsnow changed the title Add iasWorld dbt tests Add iasWorld dbt tests and namespace all tests Nov 21, 2023
@dfsnow dfsnow marked this pull request as ready for review November 21, 2023 00:14
Copy link
Contributor

@jeancochrane jeancochrane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome job with these tests! I'm surprised they execute so quickly (8m9s in the test run you linked); not bad for 294 queries.

I'm not exactly sure what the process should be to make corrections/decisions, given there are so many failures. We could:

  1. Append an error_if threshold to every failing test, then slowly work our way through them ourselves.
  2. Collect all the failures in a spreadsheet (or multiple sheets), then send the sheet to Valuations.
  3. Use this PR to work through all the failing tests, and try to fix/address each one.

So far we've been following approach 1, but I don't necessarily think that means it's the best option. Ideally we would do 2 or 3, but I'm concerned about spamming ourselves with failures for the duration of the time it takes us to correct mistakes, particularly given that tests might end up running on PRs if someone were to edit an underlying model. (I suppose that's not an issue for the iasWorld tests, though, since we don't actually edit those tables directly.)

In an ideal world, I think the approach we discussed last week is probably the right one:

  1. Tag all unit tests that are intended to test that our transformations do what we expect
  2. Update the build_and_test_dbt workflow to only run unit tests
  3. Allow integration tests to fail while we fix them, but insist that unit tests pass

I don't think this work would be particularly difficult, but I'm not actually sure which of our existing tests (if any) correspond to unit tests, so I don't have high confidence that this would make sense for our current set of tests.

All that being said, I'm comfortable with all of the options you outlined, as long as we all understand which approach we're deciding to take.

# - FP Checklist - Non-EX, RR parcels with 0 land value
# - FP Checklist - Non-EX, RR parcels with 0 value
tests:
- dbt_utils.accepted_range: &test-non-negative
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Praise] Nice use of anchors here!

Comment on lines +489 to +491
# Note that this test is NOT actually the unique primary key of
# this table, since there doesn't seem to BE a unique combination
# of identifying columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question, non-blocking] Hm, does that mean we should just remove this test then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's still useful, since an error firing would indicate an increasing number of duplicate values.

Comment on lines 98 to 99


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question, non-blocking] Any particular meaning to these newlines?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 70b5902!

@dfsnow
Copy link
Member Author

dfsnow commented Nov 21, 2023

In an ideal world, I think the approach we discussed last week is probably the right one:

  1. Tag all unit tests that are intended to test that our transformations do what we expect
  2. Update the build_and_test_dbt workflow to only run unit tests
  3. Allow integration tests to fail while we fix them, but insist that unit tests pass

Okay I think I've landed on a good approach based on this. I tagged all QC/source tests and updated the build_and_test_dbt workflow to exclude all QC tests. This way we won't be blocked by QC test failures but can still see them happen in the test_dbt_models output. We can then work to get the test_dbt_models output "back to green" by correcting the tests (it can serve as a running log of things to fix).

Thoughts @jeancochrane?

Here are the commits since last time if you want to review.

@dfsnow dfsnow merged commit 3551103 into master Nov 21, 2023
9 checks passed
@dfsnow dfsnow deleted the dansnow/add-iasworld-tests branch November 21, 2023 20:30
@dfsnow dfsnow mentioned this pull request Nov 28, 2023
23 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants