
Refactor ratio stats for build speed increase #521

Merged
42 commits, merged Nov 26, 2024

Conversation


@wagnerlmichael wagnerlmichael commented Jun 25, 2024

Summary

This is not a final PR for review, but a progress update to determine next steps.

Currently all of the assesspy functions used in this script are copied in. If we were to move forward with this solution, they would need to be refactored in the actual package rather than copy/pasted and changed in this script.

Currently, with these changes, the build time for the reporting.ratio_stats table is ~15 minutes, a large improvement over the previous ~1 hour. All of this speedup came from editing the boot_ci function; I'm not sure how much speed we could gain from editing the other functions. Switching from pandas sampling to numpy index sampling contributed about 10% of the speedup, whereas parallel processing contributed about 90%.

Dev table here: "z_ci_436_refactor_ratio_stats_job_to_use_pyspark_reporting"."ratio_stats"

Other strategies tried

Spark

I tried for a while with different Spark strategies. First I attempted to convert the data frame to a Spark data frame and sample on that, but that didn't work: it was extremely slow, presumably because of the computationally intensive conversions from pandas to Spark and back.

I tried to get around this issue by using a pandas UDF. Supposedly, this allows the Spark API to operate on the pandas data frame in a columnar format, maintaining the speed gains from distributed processing. This also resulted in much longer build times or errors I couldn't work through.

I also tried a single pandas df conversion to Spark, then editing the remaining data structures in boot_ci so that they were all Spark-compatible, but I could not get this speedup working either.

I am new to Spark, so it is very possible I missed something obvious or there are remaining workable solutions.

Numba and Dask

I tried basic Numba parallelization and Dask parallelization, but neither could be imported properly. I believe this is because they both have C bindings, and Athena doesn't allow those in third-party package additions.

concurrent.futures

I tried using this built-in Python module, but the parallelization was failing due to a pickling error. I switched to multiprocessing, and that finally worked.

Considerations on current strategy

If we were to move forward with this solution, we would need to decide how to reconcile the changed boot_ci function with the assesspy build. One option is to edit the package itself and include a boolean param that turns parallel processing on/off. Another option is to just keep the copy-pasted functions in this script, but that creates two sources of truth for the assesspy functions, which isn't ideal.
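
A minimal sketch of what the package-level option could look like. Note this is purely illustrative: the `parallel` parameter and this `boot_ci` signature are assumptions, not the actual assesspy API.

```python
import numpy as np


def boot_ci(fun, *data, nboot=100, alpha=0.05, parallel=False, num_processes=4):
    """Bootstrap CI with an opt-in parallel flag (hypothetical API sketch)."""
    arr = np.column_stack(data)
    if parallel:
        # Would dispatch to the multiprocessing implementation shown in this PR
        raise NotImplementedError("parallel path is sketched elsewhere in this PR")
    ests = []
    for _ in range(nboot):
        # Serial fallback: numpy index sampling, same as the refactored worker
        idx = np.random.choice(arr.shape[0], size=arr.shape[0], replace=True)
        sample = arr[idx]
        ests.append(fun(*(sample[:, i] for i in range(sample.shape[1]))))
    return np.percentile(ests, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```

With a flag like this, callers get the serial behavior by default and can opt into the multiprocessing path explicitly, keeping a single source of truth in the package.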

One potential upside of not using Spark is that we could maintain these functions in assesspy rather than building out an entirely new set of Spark assesspy functions.

Other ways forward

We could also continue to develop here. Two other paths forward for me could be:

  • Try to spend more time figuring out Spark
  • Try further non-Spark speedups in other functions

@wagnerlmichael wagnerlmichael linked an issue Jun 25, 2024 that may be closed by this pull request
Comment on lines 159 to 176
def bootstrap_worker(
    data_array, fun, num_kwargs, n, nboot, start, end, result_queue
):
    ests = []
    for _ in range(start, end):
        sample_indices = np.random.choice(
            data_array.shape[0], size=n, replace=True
        )
        sample_array = data_array[sample_indices]
        if fun.__name__ == "cod" or num_kwargs == 1:
            ests.append(fun(sample_array[:, 0]))
        elif fun.__name__ == "prd":
            ests.append(fun(sample_array[:, 0], sample_array[:, 1]))
        else:
            raise Exception(
                "Input function should require 1 argument or be assesspy.prd."  # noqa
            )
    result_queue.put(ests)
Member Author

This function sets up our parallel processing; it is the unit of work that a single core will do. It randomly samples just like the prior code.

Comment on lines 163 to 166
    for _ in range(start, end):
        sample_indices = np.random.choice(
            data_array.shape[0], size=n, replace=True
        )
Member Author

We substitute the old pandas sampling for a faster np.random.choice() sampling.
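
The substitution can be illustrated side by side. This is a sketch; the column name is a placeholder, not the actual script's schema.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ratio": np.random.default_rng(42).uniform(0.8, 1.2, 10_000)})
n = len(df)

# Old approach: pandas row sampling, which builds a new DataFrame per draw
pandas_sample = df.sample(n=n, replace=True)

# New approach: sample integer indices with numpy and slice the raw array,
# avoiding per-draw DataFrame construction entirely
data_array = df.to_numpy()
idx = np.random.choice(data_array.shape[0], size=n, replace=True)
numpy_sample = data_array[idx]
```

Both draw the same kind of with-replacement sample; the numpy version just skips the DataFrame overhead inside the hot loop.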

Comment on lines 178 to 189
def parallel_bootstrap(
    data_array, fun, num_kwargs, n, nboot, num_processes=4
):
    processes = []
    result_queue = mp.Queue()
    chunk_size = nboot // num_processes

    for i in range(num_processes):
        start = i * chunk_size
        end = start + chunk_size if i < num_processes - 1 else nboot
        p = mp.Process(
            target=bootstrap_worker,
Member Author

@wagnerlmichael wagnerlmichael Jun 26, 2024

This function allocates work to each process, dividing the ests bootstrap calculation across num_processes processes. The conditional if i < num_processes - 1 else nboot handles the case in which nboot isn't cleanly divisible by num_processes.
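
The chunk allocation can be isolated into a small helper to show how the remainder is absorbed by the last chunk (the helper name is illustrative, extracted from the loop above):

```python
def chunk_bounds(nboot, num_processes):
    """Split nboot iterations into contiguous (start, end) chunks; the last
    chunk absorbs the remainder when nboot isn't divisible by num_processes."""
    chunk_size = nboot // num_processes
    bounds = []
    for i in range(num_processes):
        start = i * chunk_size
        end = start + chunk_size if i < num_processes - 1 else nboot
        bounds.append((start, end))
    return bounds


# e.g. 10 bootstrap iterations over 4 processes: the last chunk gets 4 draws
print(chunk_bounds(10, 4))
```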

Comment on lines 204 to 211
    results = []
    for _ in range(num_processes):
        results.extend(result_queue.get())

    for p in processes:
        p.join()

    return results
Member Author

Grabs the data from all the processes and combines it at the end.

    result_queue.put(ests)

def parallel_bootstrap(
    data_array, fun, num_kwargs, n, nboot, num_processes=4
Member Author

@wagnerlmichael wagnerlmichael Jun 26, 2024

num_processes=4 is the optimal number of cores here. Going from 3 to 4 is a big speed increase, but 4, 8, and 16 all give similar times. I'm guessing this is because of the data transfer bottleneck between cores.

Contributor

[Question, non-blocking] I'm guessing you checked this, but do we know for sure that the machine that this was tested on had more than 4 cores available?

Member Author

We discussed this in person: there were only 4 cores available, and currently AWS only allows a single 4-core DPU for processing. If that changes, we could probably get much faster speeds with more cores.
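
A quick sanity check like this would confirm the core count on whatever machine runs the model (a small sketch; the cap of 4 just reflects the sweet spot observed above):

```python
import multiprocessing as mp

# Logical CPUs visible to this process; on the single 4-core DPU described
# above, this would report 4
available = mp.cpu_count()

# Cap the worker count at the empirically observed sweet spot
num_processes = min(4, available)
print(f"available={available}, using num_processes={num_processes}")
```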

Contributor

@jeancochrane jeancochrane left a comment

Really nice work! It's a bummer to hear that the Spark code either didn't work or wasn't faster, but I don't have enough Spark experience to advise on a path forward at this point. Maybe it would make sense to take another crack at it in the future, but in the meantime I like the improvements you've made here, and I'm on board with the plan to make these changes to the assesspy package.

My recommended path forward would be to recreate these changes in a branch of assesspy, bundle and push the code from that branch to S3 as a .zip file, and then test it out by updating the sc.addPyFile() call in this model definition to point to the new version of the package. Then once we get the assesspy branch merged and released, we can update the ratio_stats model to depend on the new version. Does that sound reasonable to everyone else?
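
The bundling step might look roughly like this. All paths and the bucket name are placeholders, and the stand-in package directory below exists only to make the sketch self-contained; in practice it would be a checkout of the assesspy branch.

```python
import os
import shutil

# Demo setup: a stand-in package directory (placeholder for a real checkout
# of the assesspy branch)
os.makedirs("assesspy_pkg/assesspy", exist_ok=True)
with open("assesspy_pkg/assesspy/__init__.py", "w") as f:
    f.write("__version__ = 'dev'\n")

# Bundle the package directory into a zip suitable for sc.addPyFile
shutil.make_archive("assesspy", "zip", root_dir="assesspy_pkg")

# Upload with the AWS CLI (bucket name is a placeholder), e.g.:
#   aws s3 cp assesspy.zip s3://<your-bucket>/packages/assesspy.zip
# Then point the model definition at the new artifact:
#   sc.addPyFile("s3://<your-bucket>/packages/assesspy.zip")
```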

@dfsnow
Member

dfsnow commented Jul 2, 2024

@wagnerlmichael I mocked up a working set of functions using only Spark-compatible abstractions and dropped the result in a gist. The result seems to be pretty fast, at least on the limited subset of columns I calculated.

You should be able to drop that code into an Athena Spark notebook and run it without modification. It runs the COD stat calculations at the township_code level in 1m45s. I didn't mock up the other functions but it should be fairly straightforward to build out from here.

Can you take another crack at this when you get some free time, building off the linked gist? I'm happy to walk through what I did in the functions/how I figured them out. If the gist is stuff you've already tried then let me know and I'll think about another way forward.

@dfsnow dfsnow force-pushed the 436-refactor-ratio_stats-job-to-use-pyspark branch from 611789d to e55970d on November 19, 2024
@dfsnow dfsnow force-pushed the 436-refactor-ratio_stats-job-to-use-pyspark branch from 89edc2e to 4ae9362 on November 25, 2024
Member

@dfsnow dfsnow left a comment

  • @jeancochrane This is ready for your review.
  • @wagnerlmichael You should take a look at the changes here.
  • @ccao-jardine I added back a sales chasing stat, but I also changed the column order (and some names). Is that okay?

This PR is a full refactor of the ratio_stats Python model. It simplifies the code by abstracting assessment metrics into the new AssessPy 2.0.0 release. It also significantly decreases the run time of the model by taking advantage of Spark parallelism (from ~15 min to ~3 min).

The refactors made in AssessPy to accommodate the needs of this model should be reusable in other contexts, and this model can serve as a template for future Python models in our dbt stack.


def ccao_metric(
Member

I condensed the individual metric functions into this wrapper. It:

  • Drops any outliers, per ccao_drop_outliers
  • Checks the min sample size requirement
  • Calculates the stat (COD, PRD, PRB, or MKI)
  • Calculates the CI for the stat (except for MKI)
  • Calculates the post-outlier-drop sample size (n)
  • Returns a dictionary for use in calc_summary

We could move this and the other ccao_ prefixed functions into the ccao package, but I'm honestly not sure it's necessary. This function and the others are sufficiently short that it doesn't seem like a big deal to have them in here.
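
A rough sketch of the wrapper's shape, using COD as the example stat. The outlier rule, sample-size threshold, and helper names here are illustrative stand-ins, not the actual ccao_drop_outliers or AssessPy implementations (the 5–15 band is the IAAO COD standard for residential property).

```python
import numpy as np

MIN_SAMPLE_SIZE = 20  # illustrative threshold, not the actual CCAO value


def drop_outliers(ratio):
    # Illustrative IQR-based trim; the real ccao_drop_outliers may differ
    q1, q3 = np.percentile(ratio, [25, 75])
    iqr = q3 - q1
    mask = (ratio >= q1 - 3 * iqr) & (ratio <= q3 + 3 * iqr)
    return ratio[mask]


def cod(ratio):
    # Coefficient of dispersion: mean absolute deviation from the median
    # ratio, as a percentage of the median
    med = np.median(ratio)
    return 100 * np.mean(np.abs(ratio - med)) / med


def metric(ratio):
    """Sketch of a ccao_metric-style wrapper returning a summary dict."""
    ratio = drop_outliers(np.asarray(ratio, dtype=float))
    n = len(ratio)
    if n < MIN_SAMPLE_SIZE:
        return {"cod": None, "n": n, "met": None}
    val = cod(ratio)
    return {"cod": val, "n": n, "met": 5.0 <= val <= 15.0}
```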

prd_ci = prd_boot(fmv_no_outliers, sale_price_no_outliers, nboot=1000)
prd_ci = f"{prd_ci[0]}, {prd_ci[1]}"
met = prd_met(prd_val)
def ccao_median(
Member

Median gets its own function because it's not included as a stat in assesspy and takes different inputs (ratio, as opposed to estimate and sale_price).

Comment on lines +138 to +141
    df.withColumn("geography_id", col(geography_id).cast("string"))
    .withColumn("geography_type", lit(geography_type))
    .groupby(group_cols)
    .applyInPandas(
Member

The key to making things faster was just using native PySpark abstractions. Here df is a Spark table, and applyInPandas does all the heavy lifting of running a lambda function across all the specified grouping columns.
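The per-group function that applyInPandas runs is just plain pandas code, which is what makes the pattern fast and simple. A hedged sketch of the shape (column names and the COD computation are illustrative, not the actual model code; the Spark wiring is commented out since it needs a live session):

```python
import pandas as pd


def cod_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Plain-pandas function that applyInPandas invokes once per group
    ratio = pdf["estimate"] / pdf["sale_price"]
    med = ratio.median()
    cod = 100 * (ratio - med).abs().mean() / med
    return pd.DataFrame(
        {"geography_id": [pdf["geography_id"].iloc[0]], "cod": [cod]}
    )


# With a live Spark session, the wiring looks roughly like:
#   df.groupby("geography_id").applyInPandas(
#       cod_per_group, schema="geography_id string, cod double"
#   )
```

Spark handles distributing the groups across executors; the function itself never sees more than one group's rows at a time.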

@@ -21,6 +23,9 @@ extend-select = ["I"]
# decide we want to import code from dbt/ to a context outside of it
known-third-party = ["dbt"]

[tool.ruff.lint.per-file-ignores]
"dbt/models/**.py" = ["E402"]
Member

Ignores import order errors only for Python dbt models.

@dfsnow dfsnow requested a review from jeancochrane November 26, 2024 19:59
Contributor

@jeancochrane jeancochrane left a comment

Awesome work 🚀

@dfsnow dfsnow marked this pull request as ready for review November 26, 2024 22:29
@dfsnow dfsnow requested a review from a team as a code owner November 26, 2024 22:29
@dfsnow dfsnow merged commit 8c6633d into master Nov 26, 2024
7 checks passed
@dfsnow dfsnow deleted the 436-refactor-ratio_stats-job-to-use-pyspark branch November 26, 2024 22:30
@ccao-jardine
Member

I also changed the column order (and some names). Is that okay?

Should be fine, thanks!

Development

Successfully merging this pull request may close these issues.

Refactor ratio_stats job to use Pyspark