Fix `reporting.res_report_summary` builds by turning input CTEs to tables #654

jeancochrane · 2024-11-22T20:37:20Z

Background

This PR fixes our broken build-daily-dbt-models workflow by improving build performance for the reporting.res_report_summary table, which has been exhausting Athena query resources ever since we merged #641.

The approach in this PR is to take the CTEs that the query reuses for aggregation and split them out into input tables that we materialize on the same schedule as reporting.res_report_summary. Tables improve performance over CTEs in this case because Athena treats CTEs the same as subqueries and views, that is to say, it will rerun the underlying query code every time the CTE is referenced.

Closes #653.

Tradeoffs

While materializing the input data to the model speeds it up and reduces its memory usage, it also requires that we remember to refresh the inputs any time we want to refresh the model. This should not be a problem for the daily build task, since we can just tag the inputs as daily and dbt will understand the correct order to run the models, but it might be a gotcha if someone is ever asked to refresh this table on command.

One idea to mitigate this downside might be to add a when-loaded column to res_report_summary and its inputs, then add a test to the model confirming that the when-loaded column for its inputs are within some small bound (say, 15 minutes). This could provide a layer of protection against the models getting out of sync since the build-and-test-dbt workflow that we use to rebuild production models also runs tests for any models it builds. If you think this is a good idea, I'll go ahead and set it up before merging.

Alternatives I considered

Other alternatives I considered was to improve the memory use of the query itself without factoring out any extra tables. The most memory-intensive part of the query is the repeated unions that aggregate from the source data at varying levels of granularity, and I spent a while brainstorming ways of doing this without having to aggregate different groups and union them, but I couldn't think of anything better.

Testing

See this workflow run for evidence that build-daily-dbt-models now runs successfully.

I built the version of the table and its ancestors that existed prior to #641 in the z_dev_jecochr_reporting database to test that the contents of the view don't change following this refactor.

I checked to make sure row counts match in the two assets:

select
    assessment_stage,
    prod_count,
    dev_count
from (
    select
        assessment_stage,
        count(*) as prod_count
    from z_dev_jecochr_reporting.res_report_summary
    group by assessment_stage
) prod
left join (
    select
        assessment_stage,
        count(*) as dev_count
    from z_ci_jeancochrane_653_refactor_res_report_summary_to_fix_failing_daily_dbt_model_builds_reporting.res_report_summary
    where assessment_stage in ('ASSESSOR CERTIFIED', 'MAILED', 'BOR CERTIFIED', 'model')
    group by assessment_stage
) dev
using (assessment_stage)

assessment_stage	prod_count	dev_count
MAILED	12306	12306
BOR CERTIFIED	9329	9329
model	8920	8920
ASSESSOR CERTIFIED	11702	11702
ASSESSOR PRE-CERTIFIED	0	260

Confirming that PIN and sale counts are the same across the two versions of the table:

select
    triad,
    geography_type,
    geography_id,
    property_group,
    assessment_stage,
    year,
    dev.pin_n as dev_pin_n,
    prod.pin_n as prod_pin_n,
    abs(coalesce(dev.pin_n, 0) - coalesce(prod.pin_n, 0)) as pin_n_mismatch,
    dev.sale_n as dev_sale_n,
    prod.sale_n as prod_sale_n,
    abs(coalesce(dev.sale_n, 0) - coalesce(prod.sale_n, 0)) as sale_n_mismatch
from z_ci_jeancochrane_653_refactor_res_report_summary_to_fix_failing_daily_dbt_model_builds_reporting.res_report_summary dev
full outer join z_dev_jecochr_reporting.res_report_summary prod
using ( 
    triad,
    geography_type,
    geography_id,
    property_group,
    assessment_stage,
    year
)
where assessment_stage not in ('ASSESSOR PRE-CERTIFIED')
and (
  coalesce(dev.pin_n, 0) != coalesce(prod.pin_n, 0) or
  coalesce(dev.sale_n, 0) != coalesce(prod.sale_n, 0)
)

triad	geography_type	geography_id	property_group	assessment_stage	year	dev_pin_n	prod_pin_n	pin_n_mismatch	dev_sale_n	prod_sale_n	sale_n_mismatch

jeancochrane · 2024-11-22T20:38:14Z

dbt/models/reporting/docs.md

+Aggregates statistics on characteristics, classes, AVs, and sales
+by assessment stage, property groups, year, and various geographies.
+Feeds public reporting assets.


Previously, this text was included in an unused view_res_report_summary docs block that I'm guessing we forgot to remove when we switched to materializing the table. I removed that docs block and consolidated its text over here.

jeancochrane · 2024-11-22T20:38:35Z

dbt/models/reporting/docs.md

+## Nuance
+
+- Model and assessment values are gathered independently and
+  aggregated via a union rather than a join, so it's important to keep in mind
+  that years for model and assessment stages do NOT need to match, i.e. we can
+  have 2023 model values in the table before there are any 2023 assessment
+  values to report on. Sales are added via a lagged join, so `sale_year` should
+  always be `year - 1`. It is also worth nothing that "model year" is
+  incremented by 1 solely for the sake of reporting in this table, meaning that
+  models with a `meta_year` value of 2022 in `model.assessment_pin` will
+  populate the table with a value of 2023 for `year`.


This nuance comes from the model docstring. I think it's useful, so I copied it into the dbt docs as well.

jeancochrane · 2024-11-22T20:41:33Z

dbt/models/reporting/docs.md

+`res_report_summary`, since otherwise it needs to rerun the query logic
+for every possible geography and reporting group combination.
+
+**Primary Key**: `pin`, `year`


Note that, due to the problem I shared in our chat, this view isn't actually truly unique by pin and year since there's one PIN in the 2023 condo model that has two rows in model.assessment_pin. I think that's a minor detail and I doubt anyone will need to rely on the uniqueness of this table, however, so I left that context out of the docs -- happy to put it back in if you think it's appropriate.

That's fine. I'll debug the dupe PIN. I'm going to add a follow-up issue for data sets on the model DB assets.

jeancochrane · 2024-11-22T20:42:12Z

dbt/models/reporting/reporting.res_report_summary.sql

@@ -157,25 +60,25 @@ aggregated_values AS (
 ),

 -- Aggregate and stack stats on sales for each reporting group
-all_sales AS (
+aggregated_sales AS (


No logic has changed in this CTE; I've just renamed it to more closely match the naming pattern used by aggregated_values and its ancestors.

jeancochrane · 2024-11-22T20:43:08Z

dbt/models/reporting/reporting.res_report_summary_sales_input.sql

+-- Sales, filtered to exclude outliers and mutlisales
+SELECT
+    vwps.pin,
+    vwps.doc_no,
+    vwps.sale_price,
+    tc.property_group,
+    tc.township_code,
+    vwps.nbhd AS townnbhd,
+    vwps.year AS sale_year
+FROM {{ ref('default.vw_pin_sale') }} AS vwps
+LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS tc
+    ON vwps.pin = tc.pin
+    AND vwps.year = tc.year
+WHERE NOT vwps.is_multisale
+    AND NOT vwps.sale_filter_is_outlier
+    AND NOT vwps.sale_filter_deed_type
+    AND NOT vwps.sale_filter_less_than_10k
+    AND NOT vwps.sale_filter_same_sale_within_365
+    AND tc.property_group IS NOT NULL
+    AND tc.triad_name IS NOT NULL


The logic in this model should be identical to this block from the old res_report_summary query, with the only difference being that we add pin and doc_no to the list of selected columns so that the table has a distinct primary key:

data-architecture/dbt/models/reporting/reporting.res_report_summary.sql

Lines 113 to 132 in 1ae727e

-- Sales, filtered to exclude outliers and mutlisales

sales AS (

SELECT

vwps.sale_price,

vwps.year AS sale_year,

tc.property_group,

tc.township_code,

vwps.nbhd AS townnbhd

FROM {{ ref('default.vw_pin_sale') }} AS vwps

LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS tc

ON vwps.pin = tc.pin

AND vwps.year = tc.year

WHERE NOT vwps.is_multisale

AND NOT vwps.sale_filter_is_outlier

AND NOT vwps.sale_filter_deed_type

AND NOT vwps.sale_filter_less_than_10k

AND NOT vwps.sale_filter_same_sale_within_365

AND tc.property_group IS NOT NULL

AND tc.triad_name IS NOT NULL

),

jeancochrane · 2024-11-22T20:44:07Z

dbt/models/reporting/reporting.res_report_summary_values_input.sql

+-- AVs and model values
+WITH all_fmvs AS (
+    SELECT
+        ap.meta_pin AS pin,
+        ap.year,
+        'model' AS assessment_stage,
+        ap.pred_pin_final_fmv_round AS total
+    FROM {{ source('model', 'assessment_pin') }} AS ap
+    INNER JOIN {{ ref('model.final_model') }} AS fm
+        ON ap.run_id = fm.run_id
+        AND ap.year = fm.year
+        AND (
+            -- If reassessment year, use different models for different towns
+            (
+                CONTAINS(fm.township_code_coverage, ap.township_code)
+                AND ap.meta_triad_code = fm.triad_code
+            )
+            -- Otherwise, just use whichever model is "final"
+            OR (ap.meta_triad_code != fm.triad_code AND fm.is_final)
+        )
+
+    UNION ALL
+
+    SELECT
+        pin,
+        year,
+        stage_name AS assessment_stage,
+        tot * 10 AS total
+    FROM {{ ref('reporting.vw_pin_value_long') }}
+    WHERE year >= '2021'
+),
+
+-- Combined SF/MF and condo characteristics
+chars AS (
+    SELECT
+        parid AS pin,
+        taxyr AS year,
+        MIN(yrblt) AS yrblt,
+        SUM(sfla) AS total_bldg_sf
+    FROM {{ source('iasworld', 'dweldat') }}
+    WHERE cur = 'Y'
+        AND deactivat IS NULL
+    GROUP BY parid, taxyr
+    UNION ALL
+    SELECT
+        pin,
+        year,
+        char_yrblt AS yrblt,
+        char_building_sf AS total_bldg_sf
+    FROM {{ ref('default.vw_pin_condo_char') }}
+    WHERE NOT is_parking_space
+        AND NOT is_common_area
+)
+
+-- Join land, chars, and reporting groups to values
+SELECT
+    fmvs.pin,
+    vptc.property_group,
+    vptc.class,
+    vptc.triad_name AS triad,
+    vptc.township_code,
+    CONCAT(vptc.township_code, vptc.nbhd) AS townnbhd,
+    fmvs.assessment_stage,
+    fmvs.total,
+    chars.yrblt,
+    chars.total_bldg_sf,
+    vpl.sf AS total_land_sf,
+    fmvs.year
+FROM all_fmvs AS fmvs
+LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS vptc
+    ON fmvs.pin = vptc.pin
+    AND fmvs.year = vptc.year
+INNER JOIN chars
+    ON fmvs.pin = chars.pin
+    AND fmvs.year = chars.year
+LEFT JOIN {{ ref('default.vw_pin_land') }} AS vpl
+    ON fmvs.pin = vpl.pin
+    AND fmvs.year = vpl.year
+WHERE vptc.property_group IS NOT NULL
+    AND vptc.triad_name IS NOT NULL


The logic in this model should be identical to this block from the old res_report_summary, with the only change being that we move year to the bottom of the list of selected columns in order to enable its use as a partition key:

data-architecture/dbt/models/reporting/reporting.res_report_summary.sql

Lines 30 to 111 in 1ae727e

-- AVs and model values

WITH all_fmvs AS (

SELECT

ap.meta_pin AS pin,

ap.year,

'model' AS assessment_stage,

ap.pred_pin_final_fmv_round AS total

FROM {{ source('model', 'assessment_pin') }} AS ap

INNER JOIN {{ ref('model.final_model') }} AS fm

ON ap.run_id = fm.run_id

AND ap.year = fm.year

AND (

-- If reassessment year, use different models for different towns

(

CONTAINS(fm.township_code_coverage, ap.township_code)

AND ap.meta_triad_code = fm.triad_code

)

-- Otherwise, just use whichever model is "final"

OR (ap.meta_triad_code != fm.triad_code AND fm.is_final)

)

UNION ALL

SELECT

pin,

year,

stage_name AS assessment_stage,

tot * 10 AS total

FROM {{ ref('reporting.vw_pin_value_long') }}

WHERE year >= '2021'

),

-- Combined SF/MF and condo characteristics

chars AS (

SELECT

parid AS pin,

taxyr AS year,

MIN(yrblt) AS yrblt,

SUM(sfla) AS total_bldg_sf

FROM {{ source('iasworld', 'dweldat') }}

WHERE cur = 'Y'

AND deactivat IS NULL

GROUP BY parid, taxyr

UNION ALL

SELECT

pin,

year,

char_yrblt AS yrblt,

char_building_sf AS total_bldg_sf

FROM {{ ref('default.vw_pin_condo_char') }}

WHERE NOT is_parking_space

AND NOT is_common_area

),

-- Join land, chars, and reporting groups to values

all_values AS (

SELECT

fmvs.pin,

vptc.property_group,

vptc.class,

vptc.triad_name AS triad,

vptc.township_code,

CONCAT(vptc.township_code, vptc.nbhd) AS townnbhd,

fmvs.year,

fmvs.assessment_stage,

fmvs.total,

chars.yrblt,

chars.total_bldg_sf,

vpl.sf AS total_land_sf

FROM all_fmvs AS fmvs

LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS vptc

ON fmvs.pin = vptc.pin

AND fmvs.year = vptc.year

INNER JOIN chars

ON fmvs.pin = chars.pin

AND fmvs.year = chars.year

LEFT JOIN {{ ref('default.vw_pin_land') }} AS vpl

ON fmvs.pin = vpl.pin

AND fmvs.year = vpl.year

WHERE vptc.property_group IS NOT NULL

AND vptc.triad_name IS NOT NULL

),

dfsnow

I don't love our recent pattern of separating things into _input tables to get around Athena resource constraints. It's just more stuff to maintain, more opportunities for things to get out-of-sync, etc. I'm positive that with some clever querying we could probably make the original view work.

That said, I think this is a perfectly fine stop-gap until we can either:

Find an optimized SQL query that works
Refactor this into a Python model, which is what it probably deserves due to its complexity

@jeancochrane Let's merge this ASAP to unblock our daily builds.

dfsnow · 2024-11-26T17:20:19Z

dbt/models/reporting/docs.md

+`res_report_summary`, since otherwise it needs to rerun the query logic
+for every possible geography and reporting group combination.
+
+**Primary Key**: `pin`, `year`


That's fine. I'll debug the dupe PIN. I'm going to add a follow-up issue for data sets on the model DB assets.

Split reporting.res_report_summary input CTEs out into dedicated tables

dd4b6e4

jeancochrane linked an issue Nov 22, 2024 that may be closed by this pull request

Refactor res_report_summary to fix failing daily dbt model builds #653

Closed

jeancochrane commented Nov 22, 2024

View reviewed changes

jeancochrane marked this pull request as ready for review November 22, 2024 22:30

jeancochrane requested a review from a team as a code owner November 22, 2024 22:30

Fix small typo in reporting/docs.md

5d940ee

dfsnow approved these changes Nov 26, 2024

View reviewed changes

jeancochrane merged commit a06fd39 into master Nov 26, 2024
7 checks passed

jeancochrane deleted the jeancochrane/653-refactor-res_report_summary-to-fix-failing-daily-dbt-model-builds branch November 26, 2024 23:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix `reporting.res_report_summary` builds by turning input CTEs to tables #654

Fix `reporting.res_report_summary` builds by turning input CTEs to tables #654

jeancochrane commented Nov 22, 2024 •

edited

Loading

jeancochrane Nov 22, 2024

jeancochrane Nov 22, 2024

jeancochrane Nov 22, 2024

dfsnow Nov 26, 2024

jeancochrane Nov 22, 2024

jeancochrane Nov 22, 2024

jeancochrane Nov 22, 2024

dfsnow left a comment

dfsnow Nov 26, 2024

	-- Sales, filtered to exclude outliers and mutlisales
	sales AS (
	SELECT
	vwps.sale_price,
	vwps.year AS sale_year,
	tc.property_group,
	tc.township_code,
	vwps.nbhd AS townnbhd
	FROM {{ ref('default.vw_pin_sale') }} AS vwps
	LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS tc
	ON vwps.pin = tc.pin
	AND vwps.year = tc.year
	WHERE NOT vwps.is_multisale
	AND NOT vwps.sale_filter_is_outlier
	AND NOT vwps.sale_filter_deed_type
	AND NOT vwps.sale_filter_less_than_10k
	AND NOT vwps.sale_filter_same_sale_within_365
	AND tc.property_group IS NOT NULL
	AND tc.triad_name IS NOT NULL
	),

	-- AVs and model values
	WITH all_fmvs AS (
	SELECT
	ap.meta_pin AS pin,
	ap.year,
	'model' AS assessment_stage,
	ap.pred_pin_final_fmv_round AS total
	FROM {{ source('model', 'assessment_pin') }} AS ap
	INNER JOIN {{ ref('model.final_model') }} AS fm
	ON ap.run_id = fm.run_id
	AND ap.year = fm.year
	AND (
	-- If reassessment year, use different models for different towns
	(
	CONTAINS(fm.township_code_coverage, ap.township_code)
	AND ap.meta_triad_code = fm.triad_code
	)
	-- Otherwise, just use whichever model is "final"
	OR (ap.meta_triad_code != fm.triad_code AND fm.is_final)
	)

	UNION ALL

	SELECT
	pin,
	year,
	stage_name AS assessment_stage,
	tot * 10 AS total
	FROM {{ ref('reporting.vw_pin_value_long') }}
	WHERE year >= '2021'
	),

	-- Combined SF/MF and condo characteristics
	chars AS (
	SELECT
	parid AS pin,
	taxyr AS year,
	MIN(yrblt) AS yrblt,
	SUM(sfla) AS total_bldg_sf
	FROM {{ source('iasworld', 'dweldat') }}
	WHERE cur = 'Y'
	AND deactivat IS NULL
	GROUP BY parid, taxyr
	UNION ALL
	SELECT
	pin,
	year,
	char_yrblt AS yrblt,
	char_building_sf AS total_bldg_sf
	FROM {{ ref('default.vw_pin_condo_char') }}
	WHERE NOT is_parking_space
	AND NOT is_common_area
	),

	-- Join land, chars, and reporting groups to values
	all_values AS (
	SELECT
	fmvs.pin,
	vptc.property_group,
	vptc.class,
	vptc.triad_name AS triad,
	vptc.township_code,
	CONCAT(vptc.township_code, vptc.nbhd) AS townnbhd,
	fmvs.year,
	fmvs.assessment_stage,
	fmvs.total,
	chars.yrblt,
	chars.total_bldg_sf,
	vpl.sf AS total_land_sf
	FROM all_fmvs AS fmvs
	LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS vptc
	ON fmvs.pin = vptc.pin
	AND fmvs.year = vptc.year
	INNER JOIN chars
	ON fmvs.pin = chars.pin
	AND fmvs.year = chars.year
	LEFT JOIN {{ ref('default.vw_pin_land') }} AS vpl
	ON fmvs.pin = vpl.pin
	AND fmvs.year = vpl.year
	WHERE vptc.property_group IS NOT NULL
	AND vptc.triad_name IS NOT NULL
	),

Fix reporting.res_report_summary builds by turning input CTEs to tables #654

Fix reporting.res_report_summary builds by turning input CTEs to tables #654

Conversation

jeancochrane commented Nov 22, 2024 • edited Loading

Background

Tradeoffs

Alternatives I considered

Testing

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dfsnow left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Fix `reporting.res_report_summary` builds by turning input CTEs to tables #654

Fix `reporting.res_report_summary` builds by turning input CTEs to tables #654

jeancochrane commented Nov 22, 2024 •

edited

Loading