Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ratio_stats job to use Pyspark #436

Closed
dfsnow opened this issue May 9, 2024 · 3 comments · Fixed by #521
Closed

Refactor ratio_stats job to use Pyspark #436

dfsnow opened this issue May 9, 2024 · 3 comments · Fixed by #521
Assignees
Labels
dbt Related to dbt (tests, docs, schema, etc)

Comments

@dfsnow
Copy link
Member

dfsnow commented May 9, 2024

The recently merged #422 has a Python dbt model (ratio_stats.py) that runs on Athena's Spark backend. The model almost exclusively uses pandas for data munging and processing. This works well and is simple, but misses out on some of the benefits of using Spark (parallelization). We should try a quick refactor of the ratio_stats model using PySpark code to see if we can gain some of the benefits of Spark. Mainly, the current Pandas job takes 1 hour to finish, while the Spark version is likely to be much faster.

We can also make a few other enhancements here at the same time. Namely:

  • Change the data types of the ratio_stats table to be slightly more sensible
  • Possibly factor out the ratio_stats_input table entirely

These will need input from @ccao-jardine and @wrridgeway.

@dfsnow dfsnow added the dbt Related to dbt (tests, docs, schema, etc) label May 9, 2024
@dfsnow dfsnow self-assigned this May 9, 2024
@dfsnow dfsnow assigned wagnerlmichael and unassigned dfsnow Jun 7, 2024
@dfsnow
Copy link
Member Author

dfsnow commented Jun 7, 2024

@wagnerlmichael This one is yours now. Let's use it to pilot use of Spark models within dbt, since we may want to convert sales val, source-of-truth, etc to Spark. Let's also take this opportunity to clean up the ratio_stats table a little bit (get the dtypes corrected, drop extraneous columns, etc.).

@ccao-jardine
Copy link
Member

ratio_stats is used in production for our public-facing ratio study dashboards, which are published for mailed stage each reassessed township when it mails.

This is one dashboard serving all townships, with an extract of the ratio_stats table that is refreshed with each 2024 township mailing. Because of that I'd very strongly prefer to not make any changes to the production table until after we have mailed the last tri town this year.

If changes must be made now because it's blocking other work, please sequence with me on schedule so that changes aren't pushed close to a town mail date.

If it helps, the current structure of the reporting depends on no changes (data type, etc.) to the following columns in the production table:

  • geography_id
  • property_group
  • assessment_stage
  • sale_year
  • sale_n
  • detect_chasing
  • med_ratio, cod, prb, prd
  • ratio_met, cod_met, prb_met, prd_met

This table is filtered to geography_type = "Town", so if other types are added, it should be robust to those changes.

Which extraneous columns are you thinking of getting rid of?

@dfsnow
Copy link
Member Author

dfsnow commented Jun 12, 2024

Got it. @wagnerlmichael don't mess with any of the column dtypes. We'll move the cleanup stuff to a separate issue.

@wagnerlmichael wagnerlmichael linked a pull request Jun 25, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dbt Related to dbt (tests, docs, schema, etc)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants