Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reporting.res_report_summary builds by turning input CTEs to tables #654

Conversation

jeancochrane
Copy link
Contributor

@jeancochrane jeancochrane commented Nov 22, 2024

Background

This PR fixes our broken build-daily-dbt-models workflow by improving build performance for the reporting.res_report_summary table, which has been exhausting Athena query resources ever since we merged #641.

The approach in this PR is to take the CTEs that the query reuses for aggregation and split them out into input tables that we materialize on the same schedule as reporting.res_report_summary. Tables improve performance over CTEs in this case because Athena treats CTEs the same as subqueries and views, that is to say, it will rerun the underlying query code every time the CTE is referenced.

Closes #653.

Tradeoffs

While materializing the input data to the model speeds it up and reduces its memory usage, it also requires that we remember to refresh the inputs any time we want to refresh the model. This should not be a problem for the daily build task, since we can just tag the inputs as daily and dbt will understand the correct order to run the models, but it might be a gotcha if someone is ever asked to refresh this table on command.

One idea to mitigate this downside might be to add a when-loaded column to res_report_summary and its inputs, then add a test to the model confirming that the when-loaded column for its inputs are within some small bound (say, 15 minutes). This could provide a layer of protection against the models getting out of sync since the build-and-test-dbt workflow that we use to rebuild production models also runs tests for any models it builds. If you think this is a good idea, I'll go ahead and set it up before merging.

Alternatives I considered

Other alternatives I considered was to improve the memory use of the query itself without factoring out any extra tables. The most memory-intensive part of the query is the repeated unions that aggregate from the source data at varying levels of granularity, and I spent a while brainstorming ways of doing this without having to aggregate different groups and union them, but I couldn't think of anything better.

Testing

See this workflow run for evidence that build-daily-dbt-models now runs successfully.

I built the version of the table and its ancestors that existed prior to #641 in the z_dev_jecochr_reporting database to test that the contents of the view don't change following this refactor.

I checked to make sure row counts match in the two assets:

select
    assessment_stage,
    prod_count,
    dev_count
from (
    select
        assessment_stage,
        count(*) as prod_count
    from z_dev_jecochr_reporting.res_report_summary
    group by assessment_stage
) prod
left join (
    select
        assessment_stage,
        count(*) as dev_count
    from z_ci_jeancochrane_653_refactor_res_report_summary_to_fix_failing_daily_dbt_model_builds_reporting.res_report_summary
    where assessment_stage in ('ASSESSOR CERTIFIED', 'MAILED', 'BOR CERTIFIED', 'model')
    group by assessment_stage
) dev
using (assessment_stage)
assessment_stage prod_count dev_count
MAILED 12306 12306
BOR CERTIFIED 9329 9329
model 8920 8920
ASSESSOR CERTIFIED 11702 11702
ASSESSOR PRE-CERTIFIED 0 260

Confirming that PIN and sale counts are the same across the two versions of the table:

select
    triad,
    geography_type,
    geography_id,
    property_group,
    assessment_stage,
    year,
    dev.pin_n as dev_pin_n,
    prod.pin_n as prod_pin_n,
    abs(coalesce(dev.pin_n, 0) - coalesce(prod.pin_n, 0)) as pin_n_mismatch,
    dev.sale_n as dev_sale_n,
    prod.sale_n as prod_sale_n,
    abs(coalesce(dev.sale_n, 0) - coalesce(prod.sale_n, 0)) as sale_n_mismatch
from z_ci_jeancochrane_653_refactor_res_report_summary_to_fix_failing_daily_dbt_model_builds_reporting.res_report_summary dev
full outer join z_dev_jecochr_reporting.res_report_summary prod
using ( 
    triad,
    geography_type,
    geography_id,
    property_group,
    assessment_stage,
    year
)
where assessment_stage not in ('ASSESSOR PRE-CERTIFIED')
and (
  coalesce(dev.pin_n, 0) != coalesce(prod.pin_n, 0) or
  coalesce(dev.sale_n, 0) != coalesce(prod.sale_n, 0)
)
triad geography_type geography_id property_group assessment_stage year dev_pin_n prod_pin_n pin_n_mismatch dev_sale_n prod_sale_n sale_n_mismatch

@jeancochrane jeancochrane linked an issue Nov 22, 2024 that may be closed by this pull request
Comment on lines +35 to +37
Aggregates statistics on characteristics, classes, AVs, and sales
by assessment stage, property groups, year, and various geographies.
Feeds public reporting assets.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, this text was included in an unused view_res_report_summary docs block that I'm guessing we forgot to remove when we switched to materializing the table. I removed that docs block and consolidated its text over here.

Comment on lines +41 to +51
## Nuance

- Model and assessment values are gathered independently and
aggregated via a union rather than a join, so it's important to keep in mind
that years for model and assessment stages do NOT need to match, i.e. we can
have 2023 model values in the table before there are any 2023 assessment
values to report on. Sales are added via a lagged join, so `sale_year` should
always be `year - 1`. It is also worth nothing that "model year" is
incremented by 1 solely for the sake of reporting in this table, meaning that
models with a `meta_year` value of 2022 in `model.assessment_pin` will
populate the table with a value of 2023 for `year`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nuance comes from the model docstring. I think it's useful, so I copied it into the dbt docs as well.

`res_report_summary`, since otherwise it needs to rerun the query logic
for every possible geography and reporting group combination.

**Primary Key**: `pin`, `year`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that, due to the problem I shared in our chat, this view isn't actually truly unique by pin and year since there's one PIN in the 2023 condo model that has two rows in model.assessment_pin. I think that's a minor detail and I doubt anyone will need to rely on the uniqueness of this table, however, so I left that context out of the docs -- happy to put it back in if you think it's appropriate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. I'll debug the dupe PIN. I'm going to add a follow-up issue for data sets on the model DB assets.

@@ -157,25 +60,25 @@ aggregated_values AS (
),

-- Aggregate and stack stats on sales for each reporting group
all_sales AS (
aggregated_sales AS (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No logic has changed in this CTE; I've just renamed it to more closely match the naming pattern used by aggregated_values and its ancestors.

Comment on lines +20 to +39
-- Sales, filtered to exclude outliers and mutlisales
SELECT
vwps.pin,
vwps.doc_no,
vwps.sale_price,
tc.property_group,
tc.township_code,
vwps.nbhd AS townnbhd,
vwps.year AS sale_year
FROM {{ ref('default.vw_pin_sale') }} AS vwps
LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS tc
ON vwps.pin = tc.pin
AND vwps.year = tc.year
WHERE NOT vwps.is_multisale
AND NOT vwps.sale_filter_is_outlier
AND NOT vwps.sale_filter_deed_type
AND NOT vwps.sale_filter_less_than_10k
AND NOT vwps.sale_filter_same_sale_within_365
AND tc.property_group IS NOT NULL
AND tc.triad_name IS NOT NULL
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in this model should be identical to this block from the old res_report_summary query, with the only difference being that we add pin and doc_no to the list of selected columns so that the table has a distinct primary key:

-- Sales, filtered to exclude outliers and mutlisales
sales AS (
SELECT
vwps.sale_price,
vwps.year AS sale_year,
tc.property_group,
tc.township_code,
vwps.nbhd AS townnbhd
FROM {{ ref('default.vw_pin_sale') }} AS vwps
LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS tc
ON vwps.pin = tc.pin
AND vwps.year = tc.year
WHERE NOT vwps.is_multisale
AND NOT vwps.sale_filter_is_outlier
AND NOT vwps.sale_filter_deed_type
AND NOT vwps.sale_filter_less_than_10k
AND NOT vwps.sale_filter_same_sale_within_365
AND tc.property_group IS NOT NULL
AND tc.triad_name IS NOT NULL
),

Comment on lines +27 to +106
-- AVs and model values
WITH all_fmvs AS (
SELECT
ap.meta_pin AS pin,
ap.year,
'model' AS assessment_stage,
ap.pred_pin_final_fmv_round AS total
FROM {{ source('model', 'assessment_pin') }} AS ap
INNER JOIN {{ ref('model.final_model') }} AS fm
ON ap.run_id = fm.run_id
AND ap.year = fm.year
AND (
-- If reassessment year, use different models for different towns
(
CONTAINS(fm.township_code_coverage, ap.township_code)
AND ap.meta_triad_code = fm.triad_code
)
-- Otherwise, just use whichever model is "final"
OR (ap.meta_triad_code != fm.triad_code AND fm.is_final)
)

UNION ALL

SELECT
pin,
year,
stage_name AS assessment_stage,
tot * 10 AS total
FROM {{ ref('reporting.vw_pin_value_long') }}
WHERE year >= '2021'
),

-- Combined SF/MF and condo characteristics
chars AS (
SELECT
parid AS pin,
taxyr AS year,
MIN(yrblt) AS yrblt,
SUM(sfla) AS total_bldg_sf
FROM {{ source('iasworld', 'dweldat') }}
WHERE cur = 'Y'
AND deactivat IS NULL
GROUP BY parid, taxyr
UNION ALL
SELECT
pin,
year,
char_yrblt AS yrblt,
char_building_sf AS total_bldg_sf
FROM {{ ref('default.vw_pin_condo_char') }}
WHERE NOT is_parking_space
AND NOT is_common_area
)

-- Join land, chars, and reporting groups to values
SELECT
fmvs.pin,
vptc.property_group,
vptc.class,
vptc.triad_name AS triad,
vptc.township_code,
CONCAT(vptc.township_code, vptc.nbhd) AS townnbhd,
fmvs.assessment_stage,
fmvs.total,
chars.yrblt,
chars.total_bldg_sf,
vpl.sf AS total_land_sf,
fmvs.year
FROM all_fmvs AS fmvs
LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS vptc
ON fmvs.pin = vptc.pin
AND fmvs.year = vptc.year
INNER JOIN chars
ON fmvs.pin = chars.pin
AND fmvs.year = chars.year
LEFT JOIN {{ ref('default.vw_pin_land') }} AS vpl
ON fmvs.pin = vpl.pin
AND fmvs.year = vpl.year
WHERE vptc.property_group IS NOT NULL
AND vptc.triad_name IS NOT NULL
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in this model should be identical to this block from the old res_report_summary, with the only change being that we move year to the bottom of the list of selected columns in order to enable its use as a partition key:

-- AVs and model values
WITH all_fmvs AS (
SELECT
ap.meta_pin AS pin,
ap.year,
'model' AS assessment_stage,
ap.pred_pin_final_fmv_round AS total
FROM {{ source('model', 'assessment_pin') }} AS ap
INNER JOIN {{ ref('model.final_model') }} AS fm
ON ap.run_id = fm.run_id
AND ap.year = fm.year
AND (
-- If reassessment year, use different models for different towns
(
CONTAINS(fm.township_code_coverage, ap.township_code)
AND ap.meta_triad_code = fm.triad_code
)
-- Otherwise, just use whichever model is "final"
OR (ap.meta_triad_code != fm.triad_code AND fm.is_final)
)
UNION ALL
SELECT
pin,
year,
stage_name AS assessment_stage,
tot * 10 AS total
FROM {{ ref('reporting.vw_pin_value_long') }}
WHERE year >= '2021'
),
-- Combined SF/MF and condo characteristics
chars AS (
SELECT
parid AS pin,
taxyr AS year,
MIN(yrblt) AS yrblt,
SUM(sfla) AS total_bldg_sf
FROM {{ source('iasworld', 'dweldat') }}
WHERE cur = 'Y'
AND deactivat IS NULL
GROUP BY parid, taxyr
UNION ALL
SELECT
pin,
year,
char_yrblt AS yrblt,
char_building_sf AS total_bldg_sf
FROM {{ ref('default.vw_pin_condo_char') }}
WHERE NOT is_parking_space
AND NOT is_common_area
),
-- Join land, chars, and reporting groups to values
all_values AS (
SELECT
fmvs.pin,
vptc.property_group,
vptc.class,
vptc.triad_name AS triad,
vptc.township_code,
CONCAT(vptc.township_code, vptc.nbhd) AS townnbhd,
fmvs.year,
fmvs.assessment_stage,
fmvs.total,
chars.yrblt,
chars.total_bldg_sf,
vpl.sf AS total_land_sf
FROM all_fmvs AS fmvs
LEFT JOIN {{ ref('reporting.vw_pin_township_class') }} AS vptc
ON fmvs.pin = vptc.pin
AND fmvs.year = vptc.year
INNER JOIN chars
ON fmvs.pin = chars.pin
AND fmvs.year = chars.year
LEFT JOIN {{ ref('default.vw_pin_land') }} AS vpl
ON fmvs.pin = vpl.pin
AND fmvs.year = vpl.year
WHERE vptc.property_group IS NOT NULL
AND vptc.triad_name IS NOT NULL
),

@jeancochrane jeancochrane marked this pull request as ready for review November 22, 2024 22:30
@jeancochrane jeancochrane requested a review from a team as a code owner November 22, 2024 22:30
Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love our recent pattern of separating things into _input tables to get around Athena resource constraints. It's just more stuff to maintain, more opportunities for things to get out-of-sync, etc. I'm positive that with some clever querying we could probably make the original view work.

That said, I think this is a perfectly fine stop-gap until we can either:

  • Find an optimized SQL query that works
  • Refactor this into a Python model, which is what it probably deserves due to its complexity

@jeancochrane Let's merge this ASAP to unblock our daily builds.

`res_report_summary`, since otherwise it needs to rerun the query logic
for every possible geography and reporting group combination.

**Primary Key**: `pin`, `year`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. I'll debug the dupe PIN. I'm going to add a follow-up issue for data sets on the model DB assets.

@jeancochrane jeancochrane merged commit a06fd39 into master Nov 26, 2024
7 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/653-refactor-res_report_summary-to-fix-failing-daily-dbt-model-builds branch November 26, 2024 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor res_report_summary to fix failing daily dbt model builds
2 participants