-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix reporting.res_report_summary
builds by turning input CTEs to tables
#654
Fix reporting.res_report_summary
builds by turning input CTEs to tables
#654
Conversation
Aggregates statistics on characteristics, classes, AVs, and sales | ||
by assessment stage, property groups, year, and various geographies. | ||
Feeds public reporting assets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, this text was included in an unused view_res_report_summary
docs block that I'm guessing we forgot to remove when we switched to materializing the table. I removed that docs block and consolidated its text over here.
## Nuance | ||
|
||
- Model and assessment values are gathered independently and | ||
aggregated via a union rather than a join, so it's important to keep in mind | ||
that years for model and assessment stages do NOT need to match, i.e. we can | ||
have 2023 model values in the table before there are any 2023 assessment | ||
values to report on. Sales are added via a lagged join, so `sale_year` should | ||
always be `year - 1`. It is also worth nothing that "model year" is | ||
incremented by 1 solely for the sake of reporting in this table, meaning that | ||
models with a `meta_year` value of 2022 in `model.assessment_pin` will | ||
populate the table with a value of 2023 for `year`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This nuance comes from the model docstring. I think it's useful, so I copied it into the dbt docs as well.
`res_report_summary`, since otherwise it needs to rerun the query logic | ||
for every possible geography and reporting group combination. | ||
|
||
**Primary Key**: `pin`, `year` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that, due to the problem I shared in our chat, this view isn't actually truly unique by pin
and year
since there's one PIN in the 2023 condo model that has two rows in model.assessment_pin
. I think that's a minor detail and I doubt anyone will need to rely on the uniqueness of this table, however, so I left that context out of the docs -- happy to put it back in if you think it's appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine. I'll debug the dupe PIN. I'm going to add a follow-up issue for data sets on the model
DB assets.
@@ -157,25 +60,25 @@ aggregated_values AS ( | |||
), | |||
|
|||
-- Aggregate and stack stats on sales for each reporting group | |||
all_sales AS ( | |||
aggregated_sales AS ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No logic has changed in this CTE; I've just renamed it to more closely match the naming pattern used by aggregated_values
and its ancestors.
-- Sales, filtered to exclude outliers and mutlisales | ||
SELECT | ||
vwps.pin, | ||
vwps.doc_no, | ||
vwps.sale_price, | ||
tc.property_group, | ||
tc.township_code, | ||
vwps.nbhd AS townnbhd, | ||
vwps.year AS sale_year | ||
FROM {{ ref('default.vw_pin_sale') }} AS vwps | ||
LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS tc | ||
ON vwps.pin = tc.pin | ||
AND vwps.year = tc.year | ||
WHERE NOT vwps.is_multisale | ||
AND NOT vwps.sale_filter_is_outlier | ||
AND NOT vwps.sale_filter_deed_type | ||
AND NOT vwps.sale_filter_less_than_10k | ||
AND NOT vwps.sale_filter_same_sale_within_365 | ||
AND tc.property_group IS NOT NULL | ||
AND tc.triad_name IS NOT NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic in this model should be identical to this block from the old res_report_summary
query, with the only difference being that we add pin
and doc_no
to the list of selected columns so that the table has a distinct primary key:
data-architecture/dbt/models/reporting/reporting.res_report_summary.sql
Lines 113 to 132 in 1ae727e
-- Sales, filtered to exclude outliers and mutlisales | |
sales AS ( | |
SELECT | |
vwps.sale_price, | |
vwps.year AS sale_year, | |
tc.property_group, | |
tc.township_code, | |
vwps.nbhd AS townnbhd | |
FROM {{ ref('default.vw_pin_sale') }} AS vwps | |
LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS tc | |
ON vwps.pin = tc.pin | |
AND vwps.year = tc.year | |
WHERE NOT vwps.is_multisale | |
AND NOT vwps.sale_filter_is_outlier | |
AND NOT vwps.sale_filter_deed_type | |
AND NOT vwps.sale_filter_less_than_10k | |
AND NOT vwps.sale_filter_same_sale_within_365 | |
AND tc.property_group IS NOT NULL | |
AND tc.triad_name IS NOT NULL | |
), |
-- AVs and model values | ||
WITH all_fmvs AS ( | ||
SELECT | ||
ap.meta_pin AS pin, | ||
ap.year, | ||
'model' AS assessment_stage, | ||
ap.pred_pin_final_fmv_round AS total | ||
FROM {{ source('model', 'assessment_pin') }} AS ap | ||
INNER JOIN {{ ref('model.final_model') }} AS fm | ||
ON ap.run_id = fm.run_id | ||
AND ap.year = fm.year | ||
AND ( | ||
-- If reassessment year, use different models for different towns | ||
( | ||
CONTAINS(fm.township_code_coverage, ap.township_code) | ||
AND ap.meta_triad_code = fm.triad_code | ||
) | ||
-- Otherwise, just use whichever model is "final" | ||
OR (ap.meta_triad_code != fm.triad_code AND fm.is_final) | ||
) | ||
|
||
UNION ALL | ||
|
||
SELECT | ||
pin, | ||
year, | ||
stage_name AS assessment_stage, | ||
tot * 10 AS total | ||
FROM {{ ref('reporting.vw_pin_value_long') }} | ||
WHERE year >= '2021' | ||
), | ||
|
||
-- Combined SF/MF and condo characteristics | ||
chars AS ( | ||
SELECT | ||
parid AS pin, | ||
taxyr AS year, | ||
MIN(yrblt) AS yrblt, | ||
SUM(sfla) AS total_bldg_sf | ||
FROM {{ source('iasworld', 'dweldat') }} | ||
WHERE cur = 'Y' | ||
AND deactivat IS NULL | ||
GROUP BY parid, taxyr | ||
UNION ALL | ||
SELECT | ||
pin, | ||
year, | ||
char_yrblt AS yrblt, | ||
char_building_sf AS total_bldg_sf | ||
FROM {{ ref('default.vw_pin_condo_char') }} | ||
WHERE NOT is_parking_space | ||
AND NOT is_common_area | ||
) | ||
|
||
-- Join land, chars, and reporting groups to values | ||
SELECT | ||
fmvs.pin, | ||
vptc.property_group, | ||
vptc.class, | ||
vptc.triad_name AS triad, | ||
vptc.township_code, | ||
CONCAT(vptc.township_code, vptc.nbhd) AS townnbhd, | ||
fmvs.assessment_stage, | ||
fmvs.total, | ||
chars.yrblt, | ||
chars.total_bldg_sf, | ||
vpl.sf AS total_land_sf, | ||
fmvs.year | ||
FROM all_fmvs AS fmvs | ||
LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS vptc | ||
ON fmvs.pin = vptc.pin | ||
AND fmvs.year = vptc.year | ||
INNER JOIN chars | ||
ON fmvs.pin = chars.pin | ||
AND fmvs.year = chars.year | ||
LEFT JOIN {{ ref('default.vw_pin_land') }} AS vpl | ||
ON fmvs.pin = vpl.pin | ||
AND fmvs.year = vpl.year | ||
WHERE vptc.property_group IS NOT NULL | ||
AND vptc.triad_name IS NOT NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic in this model should be identical to this block from the old res_report_summary
, with the only change being that we move year
to the bottom of the list of selected columns in order to enable its use as a partition key:
data-architecture/dbt/models/reporting/reporting.res_report_summary.sql
Lines 30 to 111 in 1ae727e
-- AVs and model values | |
WITH all_fmvs AS ( | |
SELECT | |
ap.meta_pin AS pin, | |
ap.year, | |
'model' AS assessment_stage, | |
ap.pred_pin_final_fmv_round AS total | |
FROM {{ source('model', 'assessment_pin') }} AS ap | |
INNER JOIN {{ ref('model.final_model') }} AS fm | |
ON ap.run_id = fm.run_id | |
AND ap.year = fm.year | |
AND ( | |
-- If reassessment year, use different models for different towns | |
( | |
CONTAINS(fm.township_code_coverage, ap.township_code) | |
AND ap.meta_triad_code = fm.triad_code | |
) | |
-- Otherwise, just use whichever model is "final" | |
OR (ap.meta_triad_code != fm.triad_code AND fm.is_final) | |
) | |
UNION ALL | |
SELECT | |
pin, | |
year, | |
stage_name AS assessment_stage, | |
tot * 10 AS total | |
FROM {{ ref('reporting.vw_pin_value_long') }} | |
WHERE year >= '2021' | |
), | |
-- Combined SF/MF and condo characteristics | |
chars AS ( | |
SELECT | |
parid AS pin, | |
taxyr AS year, | |
MIN(yrblt) AS yrblt, | |
SUM(sfla) AS total_bldg_sf | |
FROM {{ source('iasworld', 'dweldat') }} | |
WHERE cur = 'Y' | |
AND deactivat IS NULL | |
GROUP BY parid, taxyr | |
UNION ALL | |
SELECT | |
pin, | |
year, | |
char_yrblt AS yrblt, | |
char_building_sf AS total_bldg_sf | |
FROM {{ ref('default.vw_pin_condo_char') }} | |
WHERE NOT is_parking_space | |
AND NOT is_common_area | |
), | |
-- Join land, chars, and reporting groups to values | |
all_values AS ( | |
SELECT | |
fmvs.pin, | |
vptc.property_group, | |
vptc.class, | |
vptc.triad_name AS triad, | |
vptc.township_code, | |
CONCAT(vptc.township_code, vptc.nbhd) AS townnbhd, | |
fmvs.year, | |
fmvs.assessment_stage, | |
fmvs.total, | |
chars.yrblt, | |
chars.total_bldg_sf, | |
vpl.sf AS total_land_sf | |
FROM all_fmvs AS fmvs | |
LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS vptc | |
ON fmvs.pin = vptc.pin | |
AND fmvs.year = vptc.year | |
INNER JOIN chars | |
ON fmvs.pin = chars.pin | |
AND fmvs.year = chars.year | |
LEFT JOIN {{ ref('default.vw_pin_land') }} AS vpl | |
ON fmvs.pin = vpl.pin | |
AND fmvs.year = vpl.year | |
WHERE vptc.property_group IS NOT NULL | |
AND vptc.triad_name IS NOT NULL | |
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love our recent pattern of separating things into _input
tables to get around Athena resource constraints. It's just more stuff to maintain, more opportunities for things to get out-of-sync, etc. I'm positive that with some clever querying we could probably make the original view work.
That said, I think this is a perfectly fine stop-gap until we can either:
- Find an optimized SQL query that works
- Refactor this into a Python model, which is what it probably deserves due to its complexity
@jeancochrane Let's merge this ASAP to unblock our daily builds.
`res_report_summary`, since otherwise it needs to rerun the query logic | ||
for every possible geography and reporting group combination. | ||
|
||
**Primary Key**: `pin`, `year` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine. I'll debug the dupe PIN. I'm going to add a follow-up issue for data sets on the model
DB assets.
Background
This PR fixes our broken
build-daily-dbt-models
workflow by improving build performance for thereporting.res_report_summary
table, which has been exhausting Athena query resources ever since we merged #641.The approach in this PR is to take the CTEs that the query reuses for aggregation and split them out into input tables that we materialize on the same schedule as
reporting.res_report_summary
. Tables improve performance over CTEs in this case because Athena treats CTEs the same as subqueries and views, that is to say, it will rerun the underlying query code every time the CTE is referenced.Closes #653.
Tradeoffs
While materializing the input data to the model speeds it up and reduces its memory usage, it also requires that we remember to refresh the inputs any time we want to refresh the model. This should not be a problem for the daily build task, since we can just tag the inputs as
daily
and dbt will understand the correct order to run the models, but it might be a gotcha if someone is ever asked to refresh this table on command.One idea to mitigate this downside might be to add a when-loaded column to
res_report_summary
and its inputs, then add a test to the model confirming that the when-loaded column for its inputs are within some small bound (say, 15 minutes). This could provide a layer of protection against the models getting out of sync since thebuild-and-test-dbt
workflow that we use to rebuild production models also runs tests for any models it builds. If you think this is a good idea, I'll go ahead and set it up before merging.Alternatives I considered
Other alternatives I considered was to improve the memory use of the query itself without factoring out any extra tables. The most memory-intensive part of the query is the repeated unions that aggregate from the source data at varying levels of granularity, and I spent a while brainstorming ways of doing this without having to aggregate different groups and union them, but I couldn't think of anything better.
Testing
See this workflow run for evidence that
build-daily-dbt-models
now runs successfully.I built the version of the table and its ancestors that existed prior to #641 in the
z_dev_jecochr_reporting
database to test that the contents of the view don't change following this refactor.I checked to make sure row counts match in the two assets:
Confirming that PIN and sale counts are the same across the two versions of the table: