
Refactor ratio stats Glue job to dbt python model #422

Merged: 28 commits merged into master on May 8, 2024

Conversation

@dfsnow (Member) commented May 1, 2024

This PR migrates the ratio_stats Glue job to a dbt Python model that uses Athena's Spark integration as the Python runtime. There are some tradeoffs here:

Pros

  • Vastly simplifies the Glue job code (no need for polling, context setup, etc)
  • Integrates the ratio_stats table with the DAG, meaning it will get rebuilt on upstream changes AND it works with dbt profiles (so changes can first be built in CI buckets)
  • Very easy to add further Python models now

Cons

  • Needs external packages. We'll have to set up zip files of packages in order to use third-party stuff.
  • Cannot source from views. Athena Spark reads the underlying table files rather than executing SQL, so it doesn't work with views at all.
  • Very new integration. Support was only added in the most recent version of dbt-athena.
  • Hard to debug
  • Ugly linting

IMO the pros are worth it here, but it's probably best to merge this and let it run as a test for a month or so before integrating it more deeply.
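
For reference, a dbt Python model is just a model(dbt, session) function in a .py file under the models directory. A minimal sketch of the shape of the migrated model, assuming the reporting.ratio_stats_input ref name discussed later in this thread, with the actual stats logic elided:

# Minimal sketch of a dbt Python model running on Athena Spark.
# The config values and body are illustrative, not the exact ones in this PR.
def model(dbt, session):
    # Python models are configured in-code rather than in YAML/SQL
    dbt.config(materialized="table")

    # On the Spark engine, dbt.ref() hands back a PySpark DataFrame built
    # from the upstream table's files (hence: no sourcing from views)
    ratio_stats_input = dbt.ref("reporting.ratio_stats_input")

    # ... compute COD/PRD/PRB/MKI etc. per group here ...

    # Whatever DataFrame is returned gets written out as the model's table
    return ratio_stats_input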

@dfsnow force-pushed the dfsnow/spike-ratio-stats-refactor branch from 5889e5d to 8dbec57 on May 2, 2024 14:26
.gitignore (review thread resolved)
@dfsnow force-pushed the dfsnow/spike-ratio-stats-refactor branch from b6fd45e to 15b9af7 on May 2, 2024 14:44
Comment on lines +1 to +2
# type: ignore
# pylint: skip-file
@dfsnow (Member Author) commented May 2, 2024

Annoyingly, our linters and pre-commit checks absolutely hate this setup. There are two problems:

  1. sc is undefined in the script (it's added when you submit the job).
  2. The imports necessarily come after addPyFile(), which loads the packages from S3.

I'm not sure if there's a prettier way to apply these lint suppressions.
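
For illustration, the shape of the pattern that trips both linters looks roughly like this (the S3 path is a placeholder, not the real bucket):

# type: ignore
# pylint: skip-file
# flake8 flags this file too: F821 (undefined name 'sc') and
# E402 (module level import not at top of file).

# 'sc' is the SparkContext that Athena Spark injects when the job is
# submitted, so it is never defined anywhere in this file.
sc.addPyFile("s3://example-bucket/packages/assesspy.zip")  # placeholder path

# The third-party imports can only succeed after addPyFile() has loaded
# the zipped package from S3, so they have to sit below that call.
import assesspy  # noqa: E402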

@jeancochrane (Contributor) commented

[Question, non-blocking] Unfortunately this is pretty common when (1) writing code for environments with implicit context and (2) writing setup code that needs to run prior to imports. I don't think there's an obviously better way to do it, although we might consider switching to a different linter that supports block-level ignores (unlike flake8).

That being said, can you provide some context on why pylint is turned off? Is that just for your editor integration?

@dfsnow (Member Author) commented

Ah, it's because it's also mad about sc not being defined. Pylint in this case is kind of extraneous since flake8 is already running in pre-commit.


# Convert the Spark input dataframes to Pandas for compatibility
# with assesspy functions
df = df.toPandas()
@dfsnow (Member Author) commented

This line is absolutely critical to making this work. It converts the Spark table returned by dbt-athena (in this case, all of reporting.ratio_stats_input) to a pandas DataFrame. IMO, we should refactor this to use PySpark, since it will probably be much faster.
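
As a rough illustration of that possible refactor (not what this PR does), the per-group stats could stay in pandas while the grouping happens on the Spark side via applyInPandas. The column names below are borrowed from the output schema further down and the body is a stub:

import pandas as pd

def summarise_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf is one geography/stage group as a pandas DataFrame; the assesspy
    # calls from report_summarise() would go here instead of this stub
    return pd.DataFrame(
        {
            "geography_id": [pdf["geography_id"].iloc[0]],
            "sale_n": [len(pdf)],
        }
    )

# df here is the Spark DataFrame from dbt.ref(), before any toPandas() call
out = (
    df.groupBy("geography_id", "assessment_stage")
    .applyInPandas(summarise_group, schema="geography_id: string, sale_n: bigint")
)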

@@ -276,7 +174,6 @@ def report_summarise(df, geography_id, geography_type):
"mki": ccao_mki(fmv=x["fmv"], sale_price=x["sale_price"]),
"prd": ccao_prd(fmv=x["fmv"], sale_price=x["sale_price"]),
"prb": ccao_prb(fmv=x["fmv"], sale_price=x["sale_price"]),
"detect_chasing": detect_chasing(ratio=x["ratio"]),
@dfsnow (Member Author) commented

I dropped the detect_chasing() stuff because:

a) It's not actually super effective at detecting sales chasing.
b) It was causing extremely difficult-to-debug errors in the Spark job.

We can work to add it back later if it's truly needed. FYI @ccao-jardine.

dbt/profiles.yml (outdated review thread, resolved)
@dfsnow (Member Author) commented May 4, 2024

@jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like Spark can't write tables with a dash in the name. Not an issue in prod, but it doesn't work with our kebab-case schema naming convention. Do you think it's worth changing the macro to force everything to snake_case?

@jeancochrane (Contributor) commented May 6, 2024

> @jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like Spark can't write tables with a dash in the name. Not an issue in prod, but it doesn't work with our kebab-case schema naming convention. Do you think it's worth changing the macro to force everything to snake_case?

@dfsnow It's a little bit annoying to have to change our naming scheme, but there's not actually a huge difference between the hyphenated vs. underscored versions so I don't mind. Were you able to find documentation confirming this limitation with Athena PySpark? I couldn't find anything so I'm wondering if the root cause is actually an error in the underlying plugin implementation (e.g. not using backticks to escape special characters).

@dfsnow (Member Author) commented May 6, 2024

> @dfsnow It's a little bit annoying to have to change our naming scheme, but there's not actually a huge difference between the hyphenated vs. underscored versions so I don't mind. Were you able to find documentation confirming this limitation with Athena PySpark? I couldn't find anything so I'm wondering if the root cause is actually an error in the underlying plugin implementation (e.g. not using backticks to escape special characters).

Sadly, there are official docs confirming this limitation. I also forked dbt-athena and tested various backtick configurations with the fork; none of them worked. I think basically only the z_ci_ DBs ever even use hyphens, so it's not a major change. But yes, annoying.

@dfsnow requested a review from jeancochrane on May 8, 2024 03:43
@dfsnow (Member Author) commented May 8, 2024

@jeancochrane @wrridgeway This could probably use one more look before merging. I'll post the diff between the current reporting.ratio_stats and the new version tomorrow.

@dfsnow (Member Author) commented May 8, 2024

A quick diff shows that the old ratio_stats and new one are basically identical but for the following differences:

  • Bootstrapping is not deterministic and therefore all _ci columns have different values
  • The mki column has different values due to assesspy#19 (Switch MKI sorting from quicksort to mergesort)
  • detect_chasing is removed in the new version (replaced with False) for the reasons I outlined earlier: it's not a reliable measurement of chasing, and it doesn't work in Spark for some reason
  • The float columns have a slightly different precision between versions. The new version records more digits

All that said, I think this is ready to merge. I've kept it as close as I can to the original ratio_stats table. FYI @ccao-jardine

@wrridgeway (Member) commented May 8, 2024

Assuming testing comes back fine, this looks good to me, though it does raise the question: is there any point to having the input table exist, rather than just pulling that data directly in the Python script that generates the stats, now that both processes are handled by dbt on a daily basis and the input table serves no function outside of feeding the script? It probably didn't need to exist in the first place.

This seems similar to EI issues where we have a separate SQL script and read it in as a pull in the R script.

@dfsnow (Member Author) commented May 8, 2024

> Assuming testing comes back fine, this looks good to me, though it does raise the question: is there any point to having the input table exist, rather than just pulling that data directly in the Python script that generates the stats, now that both processes are handled by dbt on a daily basis and the input table serves no function outside of feeding the script? It probably didn't need to exist in the first place. Just a thought.

Probably not, tbh; we can likely ditch the input table. However, I'm going to punt that to a follow-up issue since I want to get this merged.

@jeancochrane (Contributor) left a comment

Awesome! I can't really speak to the accuracy of the underlying table, but the model definition portion of the PR looks good to go.

Comment on lines +259 to +270
schema = (
"year: bigint, triad: bigint, geography_type: string, "
+ "property_group: string, assessment_stage: string, "
+ "geography_id: string, sale_year: bigint, sale_n: bigint, "
+ "median_ratio: double, median_ratio_ci: string, cod: double, "
+ "cod_ci: string, cod_n: bigint, prd: double, prd_ci: string, "
+ "prd_n: bigint, prb: double, prb_ci: string, prb_n: bigint, "
+ "mki: double, mki_n: bigint, detect_chasing: boolean, "
+ "ratio_met: boolean, cod_met: boolean, prd_met: boolean, "
+ "prb_met: boolean, mki_met: boolean, vertical_equity_met: boolean, "
+ "within_20_pct: bigint, within_10_pct: bigint, within_05_pct: bigint"
)
@jeancochrane (Contributor) commented

[Suggestion, non-blocking] Since this is essentially just a comma-and-space-separated string of key-value pairs, it might make future development more convenient to define it as a dict and then serialize it to this format with a oneliner like so:

schema_dict = { ... }
schema = ", ".join(f"{key}: {val}" for key, val in schema_dict.items())

But I leave it up to you to decide which data structure you find more convenient!
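
For concreteness, the same idea populated with a few of the columns from the schema string above (truncated for illustration; not a full replacement):

# A handful of entries from the schema above, expressed as a dict
schema_dict = {
    "year": "bigint",
    "triad": "bigint",
    "geography_type": "string",
    "median_ratio": "double",
    "median_ratio_ci": "string",
}

# Serialize back into the same "name: type, name: type" string used above
schema = ", ".join(f"{key}: {val}" for key, val in schema_dict.items())
# -> "year: bigint, triad: bigint, geography_type: string, ..."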

@dfsnow (Member Author) commented

I don't expect that we'll persist this schema definition. It will probably only live until Nicole gets back and can confirm that Tableau is still working fine.

@dfsnow merged commit b13870e into master on May 8, 2024
11 checks passed
@dfsnow deleted the dfsnow/spike-ratio-stats-refactor branch on May 8, 2024 22:15
Labels: aws, dbt
Successfully merging this pull request may close these issues:
  • Move reporting.ratio_stats from a Glue job to a dbt Python model
3 participants