
Refactor ratio stats Glue job to dbt python model #422

Merged: 28 commits merged into master on May 8, 2024

Conversation

@dfsnow (Member) commented May 1, 2024

This PR migrates the ratio_stats Glue job to a dbt Python model that uses Athena's Spark integration as the Python runtime. There are some tradeoffs here:

Pros

  • Vastly simplifies the Glue job code (no need for polling, context setup, etc)
  • Integrates the ratio_stats table with the DAG, meaning it will get rebuilt on upstream changes AND it works with dbt profiles (so changes can first be built in CI buckets)
  • Very easy to add further Python models now

Cons

  • Needs external packages. We'll have to set up zip files of packages in order to use third-party stuff.
  • Cannot source from views. Athena Spark reads the underlying table files rather than executing SQL, so it doesn't work with views at all.
  • Very new integration. Support was only added in the most recent version of dbt-athena.
  • Hard to debug
  • Ugly linting

IMO the pros are worth it here, but it's probably best to merge this and let it run as a test for a month or so before integrating it more deeply.
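
For reference, a dbt Python model is just a model(dbt, session) function in a .py file under the models directory. A minimal sketch of the shape of the migrated model, assuming the reporting.ratio_stats_input ref name discussed later in this thread, with the actual stats logic elided:

# Minimal sketch of a dbt Python model running on Athena Spark.
# The config values and body are illustrative, not the exact ones in this PR.
def model(dbt, session):
    # Python models are configured in-code rather than in YAML/SQL
    dbt.config(materialized="table")

    # On the Spark engine, dbt.ref() hands back a PySpark DataFrame built
    # from the upstream table's files (hence: no sourcing from views)
    ratio_stats_input = dbt.ref("reporting.ratio_stats_input")

    # ... compute COD/PRD/PRB/MKI etc. per group here ...

    # Whatever DataFrame is returned gets written out as the model's table
    return ratio_stats_input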

@dfsnow force-pushed the dfsnow/spike-ratio-stats-refactor branch from 5889e5d to 8dbec57 on May 2, 2024 14:26
.gitignore (review thread resolved)
@dfsnow force-pushed the dfsnow/spike-ratio-stats-refactor branch from b6fd45e to 15b9af7 on May 2, 2024 14:44
Comment on lines +1 to +2
# type: ignore
# pylint: skip-file
@dfsnow (Member Author) commented May 2, 2024

Annoyingly, our linters and pre-commit checks absolutely hate this setup. There are two problems:

  1. sc is undefined in the script (it's added when you submit the job).
  2. The imports necessarily come after addPyFile(), which loads the packages from S3.

I'm not sure if there's a prettier way to apply these lint suppressions.
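
For illustration, the shape of the pattern that trips both linters looks roughly like this (the S3 path is a placeholder, not the real bucket):

# type: ignore
# pylint: skip-file
# flake8 flags this file too: F821 (undefined name 'sc') and
# E402 (module level import not at top of file).

# 'sc' is the SparkContext that Athena Spark injects when the job is
# submitted, so it is never defined anywhere in this file.
sc.addPyFile("s3://example-bucket/packages/assesspy.zip")  # placeholder path

# The third-party imports can only succeed after addPyFile() has loaded
# the zipped package from S3, so they have to sit below that call.
import assesspy  # noqa: E402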

@jeancochrane (Contributor) commented

[Question, non-blocking] Unfortunately this is pretty common when (1) writing code for environments with implicit context and (2) writing setup code that needs to run prior to imports. I don't think there's an obviously better way to do it, although we might consider switching to a different linter that supports block-level ignores (unlike flake8).

That being said, can you provide some context on why pylint is turned off? Is that just for your editor integration?

@dfsnow (Member Author) commented

Ah, it's because it's also mad about sc not being defined. Pylint in this case is kind of extraneous since flake8 is already running in pre-commit.


# Convert the Spark input dataframes to Pandas for compatibility
# with assesspy functions
df = df.toPandas()
@dfsnow (Member Author) commented

This line is absolutely critical to making this work. It converts the Spark table returned by dbt-athena (in this case, all of reporting.ratio_stats_input) to a pandas DataFrame. IMO, we should refactor this to use PySpark, since it will probably be much faster.
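
As a rough illustration of that possible refactor (not what this PR does), the per-group stats could stay in pandas while the grouping happens on the Spark side via applyInPandas. The column names below are borrowed from the output schema further down and the body is a stub:

import pandas as pd

def summarise_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf is one geography/stage group as a pandas DataFrame; the assesspy
    # calls from report_summarise() would go here instead of this stub
    return pd.DataFrame(
        {
            "geography_id": [pdf["geography_id"].iloc[0]],
            "sale_n": [len(pdf)],
        }
    )

# df here is the Spark DataFrame from dbt.ref(), before any toPandas() call
out = (
    df.groupBy("geography_id", "assessment_stage")
    .applyInPandas(summarise_group, schema="geography_id: string, sale_n: bigint")
)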

@@ -276,7 +174,6 @@ def report_summarise(df, geography_id, geography_type):
"mki": ccao_mki(fmv=x["fmv"], sale_price=x["sale_price"]),
"prd": ccao_prd(fmv=x["fmv"], sale_price=x["sale_price"]),
"prb": ccao_prb(fmv=x["fmv"], sale_price=x["sale_price"]),
"detect_chasing": detect_chasing(ratio=x["ratio"]),
@dfsnow (Member Author) commented

I dropped the detect_chasing() stuff because:

a) It's not actually super effective at detecting sales chasing.
b) It was causing extremely difficult-to-debug errors in the Spark job.

We can work to add it back later if it's truly needed. FYI @ccao-jardine.

dbt/profiles.yml (outdated review thread, resolved)
@dfsnow (Member Author) commented May 4, 2024

@jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like Spark can't write tables with a dash in the name. Not an issue in prod, but it doesn't work with our kebab-case schema naming convention. Do you think it's worth changing the macro to force everything to snake_case?

@jeancochrane (Contributor) commented May 6, 2024

> @jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like Spark can't write tables with a dash in the name. Not an issue in prod, but it doesn't work with our kebab-case schema naming convention. Do you think it's worth changing the macro to force everything to snake_case?

@dfsnow It's a little bit annoying to have to change our naming scheme, but there's not actually a huge difference between the hyphenated vs. underscored versions so I don't mind. Were you able to find documentation confirming this limitation with Athena PySpark? I couldn't find anything so I'm wondering if the root cause is actually an error in the underlying plugin implementation (e.g. not using backticks to escape special characters).

@dfsnow (Member Author) commented May 6, 2024

> @dfsnow It's a little bit annoying to have to change our naming scheme, but there's not actually a huge difference between the hyphenated vs. underscored versions so I don't mind. Were you able to find documentation confirming this limitation with Athena PySpark? I couldn't find anything so I'm wondering if the root cause is actually an error in the underlying plugin implementation (e.g. not using backticks to escape special characters).

Sadly, there are official docs confirming this limitation. I also forked dbt-athena and tested various backtick configurations with the fork; none of them worked. I think basically only the z_ci_ DBs ever even use hyphens, so it's not a major change. But yes, annoying.

@dfsnow requested a review from jeancochrane on May 8, 2024 03:43
@dfsnow (Member Author) commented May 8, 2024

@jeancochrane @wrridgeway This could probably use one more look before merging. I'll post the diff between the current reporting.ratio_stats and the new version tomorrow.

@dfsnow (Member Author) commented May 8, 2024

A quick diff shows that the old ratio_stats and new one are basically identical but for the following differences:

  • Bootstrapping is not deterministic and therefore all _ci columns have different values
  • The mki column has different values due to assesspy#19 (Switch MKI sorting from quicksort to mergesort)
  • detect_chasing is removed in the new version (replaced with False) for the reasons I outlined earlier: it's not a reliable measurement of chasing, and it doesn't work in Spark for some reason
  • The float columns have a slightly different precision between versions. The new version records more digits

All that said, I think this is ready to merge. I've kept it as close as I can to the original ratio_stats table. FYI @ccao-jardine

@wrridgeway (Member) commented May 8, 2024

Assuming testing comes back fine, this looks good to me, though it does raise the question: is there any point to having the input table exist, rather than just pulling that data directly in the Python script that generates the stats, now that both processes are handled by dbt on a daily basis and the input table serves no function outside of feeding the script? It probably didn't need to exist in the first place.

This seems similar to EI issues where we have a separate SQL script and read it in as a pull in the R script.

@dfsnow (Member Author) commented May 8, 2024

> Assuming testing comes back fine, this looks good to me, though it does raise the question: is there any point to having the input table exist, rather than just pulling that data directly in the Python script that generates the stats, now that both processes are handled by dbt on a daily basis and the input table serves no function outside of feeding the script? It probably didn't need to exist in the first place. Just a thought.

Probably not, tbh; we can likely ditch the input table. However, I'm going to punt that to a follow-up issue since I want to get this merged.

@jeancochrane (Contributor) left a comment

Awesome! I can't really speak to the accuracy of the underlying table, but the model definition portion of the PR looks good to go.

Comment on lines +259 to +270
schema = (
"year: bigint, triad: bigint, geography_type: string, "
+ "property_group: string, assessment_stage: string, "
+ "geography_id: string, sale_year: bigint, sale_n: bigint, "
+ "median_ratio: double, median_ratio_ci: string, cod: double, "
+ "cod_ci: string, cod_n: bigint, prd: double, prd_ci: string, "
+ "prd_n: bigint, prb: double, prb_ci: string, prb_n: bigint, "
+ "mki: double, mki_n: bigint, detect_chasing: boolean, "
+ "ratio_met: boolean, cod_met: boolean, prd_met: boolean, "
+ "prb_met: boolean, mki_met: boolean, vertical_equity_met: boolean, "
+ "within_20_pct: bigint, within_10_pct: bigint, within_05_pct: bigint"
)
@jeancochrane (Contributor) commented

[Suggestion, non-blocking] Since this is essentially just a comma-and-space-separated string of key-value pairs, it might make future development more convenient to define it as a dict and then serialize it to this format with a oneliner like so:

schema_dict = { ... }
schema = ", ".join(f"{key}: {val}" for key, val in schema_dict.items())

But I leave it up to you to decide which data structure you find more convenient!
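
For concreteness, the same idea populated with a few of the columns from the schema string above (truncated for illustration; not a full replacement):

# A handful of entries from the schema above, expressed as a dict
schema_dict = {
    "year": "bigint",
    "triad": "bigint",
    "geography_type": "string",
    "median_ratio": "double",
    "median_ratio_ci": "string",
}

# Serialize back into the same "name: type, name: type" string used above
schema = ", ".join(f"{key}: {val}" for key, val in schema_dict.items())
# -> "year: bigint, triad: bigint, geography_type: string, ..."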

@dfsnow (Member Author) commented

I don't expect that we'll persist this schema definition. It will probably only live until Nicole gets back and can confirm that Tableau is still working fine.

@dfsnow merged commit b13870e into master on May 8, 2024
11 checks passed
@dfsnow deleted the dfsnow/spike-ratio-stats-refactor branch on May 8, 2024 22:15
Labels: aws, dbt
Successfully merging this pull request may close these issues:
  • Move reporting.ratio_stats from a Glue job to a dbt Python model
3 participants