Refactor `ratio_stats` job to use PySpark #436
Comments
@wagnerlmichael This one is yours now. Let's use it to pilot use of Spark models within dbt, since we may want to convert sales val, source-of-truth, etc. to Spark. Let's also take this opportunity to clean up the
This is one dashboard serving all townships, with an extract of the If changes must be made now because it's blocking other work, please sequence with me on schedule so that changes aren't pushed close to a town mail date. If it helps, the current structure of the reporting depends on no changes (data type, etc.) to the following columns in the production table:
This table is filtered to geography_type = "Town", so if other types are added, it should be robust to those changes. Which extraneous columns are you thinking of getting rid of?
Got it. @wagnerlmichael don't mess with any of the column dtypes. We'll move the cleanup stuff to a separate issue.
The recently merged #422 has a Python dbt model (`ratio_stats.py`) that runs on Athena's Spark backend. The model almost exclusively uses pandas for data munging and processing. This works well and is simple, but misses out on some of the benefits of using Spark (parallelization). We should try a quick refactor of the `ratio_stats` model using PySpark code to see if we can gain some of the benefits of Spark. Mainly, the current pandas job takes 1 hour to finish, while the Spark version is likely to be much faster.

We can also make a few other enhancements here at the same time. Namely:
- `ratio_stats` table to be slightly more sensible
- `ratio_stats_input` table entirely

These will need input from @ccao-jardine and @wrridgeway.
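
As a starting point for the PySpark refactor described above, here is a minimal sketch of a dbt Python model that keeps the aggregation inside Spark instead of converting to pandas. It is not the existing `ratio_stats.py` logic. The upstream ref, the view name `ratio_input`, the columns `geography_id`, `fmv`, and `sale_price`, and the median-ratio statistic are all assumptions made for illustration. Only `geography_type` comes from the discussion above.

```python
# Hypothetical sketch: the view name, the upstream ref, and the columns
# geography_id, fmv, and sale_price are assumptions, not the real schema.
RATIO_SQL = """
SELECT
    geography_type,
    geography_id,
    COUNT(*) AS sale_count,
    percentile_approx(fmv / sale_price, 0.5) AS median_ratio
FROM ratio_input
WHERE sale_price > 0
GROUP BY geography_type, geography_id
"""


def model(dbt, session):
    # Materialize the model's output as a table.
    dbt.config(materialized="table")

    # On a Spark-backed adapter, dbt.ref() hands back a Spark DataFrame,
    # so there is no need to convert it to pandas.
    input_df = dbt.ref("ratio_stats_input")  # assumed upstream model

    # Register the DataFrame as a temporary view so Spark SQL can query it.
    input_df.createOrReplaceTempView("ratio_input")

    # The grouped aggregation is planned and executed by Spark, which can
    # spread the per-geography work across executors.
    return session.sql(RATIO_SQL)
```

Statistics that are hard to express in Spark SQL could instead stay as pandas functions applied per group, at the cost of some of the parallelism the refactor is after.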