Use new view to pull dependency bundle in reporting.ratio_stats #453

Conversation

@jeancochrane (Contributor) commented May 15, 2024

This PR uses the Python model dependency deployment system established in #435 to power the Python dependencies in reporting.ratio_stats, our first Python model. We create a new table python_model_dependency that gets referenced in reporting.ratio_stats via a dbt.ref() call as an indirect way of calling get_s3_dependency_dir() in the context of the Python model code.

The design is a bit counterintuitive because Python models 1) currently have no equivalent to macros that would allow us to reuse code and 2) only support accessing project context that is passed in via config variables. If it weren't for these two limitations, I would have preferred one of two alternative solutions:

  1. Defining a Python version of the get_s3_dependency_dir() macro that we could call directly from the context of the reporting.ratio_stats Python model. This is impossible due to limitation 1) above, since there is no equivalent of macros for Python models yet. We could think about deploying a separate bundle to S3 just for this one macro, but we wouldn't be able to namespace it properly by user or branch in dev/CI environments, since scripts would need to pull the bundle containing get_s3_dependency_dir() before they know the location of their S3 dependency dir in the first place. If Python macros were supported, this alternative solution would have entailed Python model code looking something like this:
from macros import get_s3_dependency_dir
sc.addPyFile(f"{get_s3_dependency_dir()}/reporting.ratio_stats.zip")

def model(dbt, spark_session):
    ...
  2. Passing in the value returned by the SQL get_s3_dependency_dir() macro via configs. This is impossible due to limitation 2) above, since only a subset of builtin macros are available at the time when schema files are compiled (see discussion here). If all macros were available at compile time, this alternative solution would have entailed a schema file looking something like this:
models:
  - name: reporting.ratio_stats
    config:
      s3_dependency_dir: '{{ get_s3_dependency_dir() }}'

Closes #439. Note the one extra task from #439 that isn't completed as part of this PR (adding docs for Python models) -- I think I'd prefer to spin that off into a follow-up issue if you're comfortable with it, since some aspects of our use of Python models are still under active consideration (e.g. which types of transformations should use Python models vs. SQL models vs. Python/R scripts).

@jeancochrane force-pushed the jeancochrane/439-refactor-reportingratio_stats-to-use-new-pattern-for-python-model-dependencies branch from dbf9ee9 to 112b7b7 on May 15, 2024 21:56
# not have sqlfluff mocks builtin, so we have to mock out any macros
# that reference those variables if they are used in code that sqlfluff
# lints
get_s3_dependency_dir = {% macro get_s3_dependency_dir() %}s3://bucket{% endmacro %}
@jeancochrane (Contributor, Author) commented May 15, 2024:

Without this mocked macro, SQLFluff raises an error like this:

== [aws-athena/views/ccao-vw_python_model_dependency.sql] FAIL
L:   1 | P:   1 |  TMP | Unrecoverable failure in Jinja templating: 'target' is
                       | undefined. Have you configured your variables?
                       | https://docs.sqlfluff.com/en/latest/configuration.html

Might be worth revisiting if we do decide to take a second pass at our SQLFluff config in the near future.

Member commented:

Let's defer a harder look at this to #456!

@@ -70,6 +70,12 @@ sources:
tags:
- type_condo

- name: python_model_dependency
@jeancochrane (Contributor, Author) commented:

I recognize the name of this model is pretty weird, so I'm open to other suggestions!

Member commented:

I think it's fine!

@jeancochrane force-pushed the jeancochrane/439-refactor-reportingratio_stats-to-use-new-pattern-for-python-model-dependencies branch from 601682f to 3dc4894 on May 15, 2024 22:29
@jeancochrane force-pushed the jeancochrane/439-refactor-reportingratio_stats-to-use-new-pattern-for-python-model-dependencies branch from 26361e4 to 58af9b1 on May 15, 2024 22:50
Comment on lines +2 to +7
def model(dbt, spark_session):
    dbt.config(materialized="table")

    # Load dependency bundle so that we can import deps
    python_model_dependency = dbt.ref("ccao.python_model_dependency")
    s3_dependency_dir = python_model_dependency.first()["s3_dependency_dir"]

    import numpy as np  # noqa: E402
    import pandas as pd  # noqa: E402
    from assesspy import boot_ci  # noqa: E402
    from assesspy import cod  # noqa: E402
    from assesspy import prd_met  # noqa: E402
    from assesspy import cod_ci as cod_boot  # noqa: E402
    from assesspy import cod_met, mki, mki_met, prb, prb_met, prd  # noqa: E402
    from assesspy import prd_ci as prd_boot  # noqa: E402
@jeancochrane (Contributor, Author) commented:

One major downside of this solution is that the sc.addPyFile() call and any subsequent import calls must be defined inside the context of the model function, since sc.addPyFile() won't know where to load the bundle from until we can pull s3_dependency_dir from the ccao.python_model_dependency table using a dbt.ref() call. This is slightly annoying insofar as it makes the Python model file look less similar to a regular Python script than it would otherwise be, but I don't actually think it's a particularly big deal.
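
For reference, a minimal sketch of the resulting structure. The bundle filename follows the ${model_identifier}.requirements.zip convention from the build script, but the exact S3 key here is an assumption:

def model(dbt, spark_session):
    dbt.config(materialized="table")

    # The bundle location is only knowable at runtime, via the table
    python_model_dependency = dbt.ref("ccao.python_model_dependency")
    s3_dependency_dir = python_model_dependency.first()["s3_dependency_dir"]

    # Hypothetical key, assuming the build script's naming convention
    sc = spark_session.sparkContext
    sc.addPyFile(f"{s3_dependency_dir}/reporting.ratio_stats.requirements.zip")

    # Imports that depend on the bundle must follow addPyFile(), hence
    # the E402 suppressions in the real model
    from assesspy import cod  # noqa: E402
    ...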

@@ -1,237 +1,234 @@
# type: ignore
@jeancochrane (Contributor, Author) commented:

The diff in this file might look pretty gnarly, but it's almost all whitespace changes, since the code needed to be indented one more level to nest inside the model() function (more on that soon). If you hide whitespace in the diff, it should be easier to review.


# Create a zip archive from the contents of the subdirectory
zip_archive_name="${model_identifier}.requirements.zip"
echo "Creating zip archive $zip_archive_name from $subdirectory_name"
- zip -q -r "$zip_archive_name" "$subdirectory_name"
+ cd "$subdirectory_name" && zip -q -r9 "../$zip_archive_name" ./* && cd ..
@jeancochrane (Contributor, Author) commented:

This change more closely aligns us with the recommended workflow for bundling external dependencies for Athena PySpark, both by using a higher compression level for zip and by removing the top level of the zip archive. There may be a cleaner way to accomplish the latter without the cd calls, but in the meantime this works and is reasonably clear.
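
If we ever want to drop the cd calls, one option (a hedged sketch, not part of this PR) is to build the archive with Python's zipfile module, which controls archive-internal paths directly; the function name is illustrative:

import os
import zipfile

def zip_package_dir(subdirectory_name: str, zip_archive_name: str) -> None:
    """Zip the package directory so entries are relative to its root."""
    with zipfile.ZipFile(
        zip_archive_name, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as zf:
        for root, _dirs, files in os.walk(subdirectory_name):
            for fname in files:
                path = os.path.join(root, fname)
                # Relative arcnames drop the top-level directory, matching
                # what the cd-based zip invocation achieves
                zf.write(path, os.path.relpath(path, start=subdirectory_name))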

@@ -147,12 +147,12 @@ while read -r item; do
subdirectory_name="${model_identifier}/"
mkdir -p "$subdirectory_name"
echo "Installing dependencies from $requirements_filename into $subdirectory_name"
- pip install -t "$subdirectory_name" -r "$requirements_filename"
+ pip install -t "$subdirectory_name" -r "$requirements_filename" --no-deps
@jeancochrane (Contributor, Author) commented:

The --no-deps flag gives us more fine-grained control over the dependencies that get installed, which is important because some more complex dependencies like numpy and pandas do not seem to install properly when installed into a target directory and then bundled into a zipfile. (assesspy depends on these two packages, hence the change as part of this PR.) It may be worth future investigation to see if there's just a trick that I'm missing for installing these kinds of packages, but in the meantime we can fall back to the prebuilt versions of these packages where necessary by avoiding installation of our dependencies' dependencies using the --no-deps flag.
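
To illustrate the fallback behavior (a hedged example, not code from this PR): inside the model, only assesspy ships in the bundle, while its dependencies resolve to the runtime's prebuilt copies:

# With --no-deps, the uploaded bundle contains only assesspy itself;
# numpy and pandas resolve to the prebuilt packages in the Athena
# Spark runtime rather than to copies inside the zipfile
import numpy as np  # prebuilt in the runtime
import pandas as pd  # prebuilt in the runtime
from assesspy import cod  # shipped via the dependency bundle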

Member commented:

For our own future reference: we found a note buried in the Athena Spark docs that says you can only use pure Python packages. Seems like anything with C code in it currently doesn't work.

@@ -0,0 +1,3 @@
{{ config(materialized='table') }}
@jeancochrane (Contributor, Author) commented:

This has to be a table because Athena views cannot be accessed in PySpark scripts.

@jeancochrane marked this pull request as ready for review May 16, 2024 20:33
@jeancochrane requested a review from a team as a code owner May 16, 2024 20:33
@jeancochrane requested a review from dfsnow May 16, 2024 20:34
@dfsnow (Member) left a comment:

I'm of two minds about this one. On one hand, I like that this gives us per-environment control over packages and that we get a package zip per model.

On the other, I'm worried that this is a lot of complexity that will ultimately make Python models a bit more brittle and hard to use. Specifically, needing to smuggle stuff through a table and putting everything inside the model definition gives me bad vibes (as the kids say).

I think this is perfectly workable in its current state, but I'm going to outline an alternative that IMO simplifies things a bit.

Global Package Directory

Rather than a per-model, per-environment zip of packages, we could create a global directory of packages by version (probably in s3://ccao-athena-dependencies-us-east-1).

CI Setup

In this setup, people would specify the specific package version they want for a model in that model's schema file. A CI job would then collect and dedupe those versions into a single list, pip install --no-deps each of the packages, zip the results, and name each file by package name and version. Finally, we'd aws s3 sync the CI package dir with the bucket.

The CI job would be triggered on any branch, meaning PRs would also populate this global directory with any new packages they use. We'd control the total size of the package directory with lifecycle rules (e.g. expire after 30 days).
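
A hedged sketch of what that CI step could look like, assuming pinned specs like assesspy==1.2.3 collected from schema files (the spec format and all names are assumptions):

import subprocess

def bundle_packages(pinned_specs: set[str]) -> None:
    """Install each deduped, pinned package on its own and zip it,
    naming the archive by package name and version."""
    for spec in sorted(pinned_specs):
        name, version = spec.split("==")
        target_dir = f"{name}-{version}"
        # --no-deps: bundle only the package itself, as in the build script
        subprocess.run(
            ["pip", "install", "--no-deps", "-t", target_dir, spec], check=True
        )
        # Zip from inside the directory so the archive has no top level,
        # then `aws s3 sync` the resulting zips to the bucket
        subprocess.run(
            ["zip", "-q", "-r9", f"../{target_dir}.zip", "."],
            cwd=target_dir,
            check=True,
        )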

Python Model Setup

Users of Python models would need to do two things to use packages:

  1. Add the package to the schema file for the model
  2. Add a sc.addPyFile() call to the top of their model for each specific package they want to use (kind of like imports; see the sketch below)
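
A hedged sketch of step 2 under this design; the bucket layout, zip name, and pinned version are illustrative assumptions:

# One addPyFile() per pinned package, pulled from the shared global
# directory (the zip name and version here are hypothetical)
sc.addPyFile("s3://ccao-athena-dependencies-us-east-1/assesspy-1.2.3.zip")

def model(dbt, spark_session):
    dbt.config(materialized="table")
    import assesspy  # noqa: E402 -- importable once the zip is loaded
    ...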

Pros and Cons

  • Pros:
    • Easier to debug, since each package exists in its own zip and they all live in a single place
    • No smuggling the target environment or putting everything in model()
    • Aligns with the existing Athena Spark setup of "Here is a set of packages and versions you can use"
  • Cons:
    • Global shared directory means PRs affect the prod env, no per model packages
    • Need to specify the version of each package you want to install, plus import that version in the model

Thoughts

I think this makes sense in a world where we'll probably use a small number of pure Python packages only in our downstream models. If we were planning to use Python models for ingest with complex packages like GeoPandas, then I'd be all in on the current solution.

Ultimately, @jeancochrane it's up to you how to proceed. I'm perfectly happy to have this merged basically as-is.


Comment on lines +6 to +7
python_model_dependency = dbt.ref("ccao.python_model_dependency")
s3_dependency_dir = python_model_dependency.first()["s3_dependency_dir"]
Member commented:

praise: This is a clever hack!

@jeancochrane (Contributor, Author) commented:

Agreed with your proposed design @dfsnow! I'm going to merge this in as-is so that we can preserve a commit with the current design, but I'm going to open up a fast follow that refactors to simplify things.

@jeancochrane merged commit cd5062a into master on May 20, 2024
12 checks passed
@jeancochrane deleted the jeancochrane/439-refactor-reportingratio_stats-to-use-new-pattern-for-python-model-dependencies branch on May 20, 2024 15:44
Linked issue: #439 (Refactor reporting.ratio_stats to use new pattern for Python model dependencies)