Use new view to pull dependency bundle in reporting.ratio_stats #453

Conversation

@jeancochrane (Contributor) commented May 15, 2024

This PR uses the Python model dependency deployment system established in #435 to power the Python dependencies in reporting.ratio_stats, our first Python model. We create a new table python_model_dependency that gets referenced in reporting.ratio_stats via a dbt.ref() call as an indirect way of calling get_s3_dependency_dir() in the context of the Python model code.

The design is a bit counterintuitive because Python models 1) currently have no equivalent to macros that would allow us to reuse code and 2) only support accessing project context that is passed in via config variables. If it weren't for these two limitations, I would have preferred one of two alternative solutions:

  1. Defining a Python version of the get_s3_dependency_dir() macro that we could call directly from the context of the reporting.ratio_stats Python model. This is impossible due to limitation 1) above, since there is no equivalent of macros for Python models yet. We could think about deploying a separate bundle to S3 just for this one macro, but we wouldn't be able to namespace it properly by user or branch in dev/CI environments, since scripts would need to pull the bundle containing get_s3_dependency_dir() before they know the location of their S3 dependency dir in the first place. If Python macros were supported, this alternative solution would have entailed Python model code looking something like this:
from macros import get_s3_dependency_dir
sc.addPyFile(f"{get_s3_dependency_dir()}/reporting.ratio_stats.zip")

def model(dbt, spark_session):
    ...
  2. Passing in the value returned by the SQL get_s3_dependency_dir() macro via configs. This is impossible due to limitation 2) above, since only a subset of builtin macros are available at the time when schema files are compiled (see discussion here). If all macros were available at compile time, this alternative solution would have entailed a schema file looking something like this:
models:
  - name: reporting.ratio_stats
    config:
      s3_dependency_dir: '{{ get_s3_dependency_dir() }}'

Closes #439. Note the one extra task from #439 that isn't completed as part of this PR (adding docs for Python models) -- I think I'd prefer to spin that off into a follow-up issue if you're comfortable with it, since some aspects of our use of Python models are still under active consideration (e.g. which types of transformations should use Python models vs. SQL models vs. Python/R scripts).

@jeancochrane force-pushed the jeancochrane/439-refactor-reportingratio_stats-to-use-new-pattern-for-python-model-dependencies branch from dbf9ee9 to 112b7b7 on May 15, 2024 21:56
# not have sqlfluff mocks builtin, so we have to mock out any macros
# that reference those variables if they are used in code that sqlfluff
# lints
get_s3_dependency_dir = {% macro get_s3_dependency_dir() %}s3://bucket{% endmacro %}
@jeancochrane (Contributor, Author) commented May 15, 2024:

Without this mocked macro, SQLFluff raises an error like this:

== [aws-athena/views/ccao-vw_python_model_dependency.sql] FAIL
L:   1 | P:   1 |  TMP | Unrecoverable failure in Jinja templating: 'target' is
                       | undefined. Have you configured your variables?
                       | https://docs.sqlfluff.com/en/latest/configuration.html

Might be worth revisiting if we do decide to take a second pass at our SQLFluff config in the near future.

Member commented:

Let's defer a harder look at this to #456!

@@ -70,6 +70,12 @@ sources:
tags:
- type_condo

- name: python_model_dependency
@jeancochrane (Contributor, Author) commented:

I recognize the name of this model is pretty weird, so I'm open to other suggestions!

Member commented:

I think it's fine!

@jeancochrane force-pushed the jeancochrane/439-refactor-reportingratio_stats-to-use-new-pattern-for-python-model-dependencies branch from 601682f to 3dc4894 on May 15, 2024 22:29
@jeancochrane force-pushed the jeancochrane/439-refactor-reportingratio_stats-to-use-new-pattern-for-python-model-dependencies branch from 26361e4 to 58af9b1 on May 15, 2024 22:50
Comment on lines +2 to +7
def model(dbt, spark_session):
    dbt.config(materialized="table")

    # Load dependency bundle so that we can import deps
    python_model_dependency = dbt.ref("ccao.python_model_dependency")
    s3_dependency_dir = python_model_dependency.first()["s3_dependency_dir"]

    import numpy as np  # noqa: E402
    import pandas as pd  # noqa: E402
    from assesspy import boot_ci  # noqa: E402
    from assesspy import cod  # noqa: E402
    from assesspy import prd_met  # noqa: E402
    from assesspy import cod_ci as cod_boot  # noqa: E402
    from assesspy import cod_met, mki, mki_met, prb, prb_met, prd  # noqa: E402
    from assesspy import prd_ci as prd_boot  # noqa: E402
@jeancochrane (Contributor, Author) commented:

One major downside of this solution is that the sc.addPyFile() call and any subsequent import calls must be defined inside the context of the model function, since sc.addPyFile() won't know where to load the bundle from until we can pull s3_dependency_dir from the ccao.python_model_dependency table using a dbt.ref() call. This is slightly annoying insofar as it makes the Python model file look less similar to a regular Python script than it would otherwise be, but I don't actually think it's a particularly big deal.
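
For reference, a minimal sketch of the resulting structure. The bundle filename follows the ${model_identifier}.requirements.zip convention from the build script, but the exact S3 key here is an assumption:

def model(dbt, spark_session):
    dbt.config(materialized="table")

    # The bundle location is only knowable at runtime, via the table
    python_model_dependency = dbt.ref("ccao.python_model_dependency")
    s3_dependency_dir = python_model_dependency.first()["s3_dependency_dir"]

    # Hypothetical key, assuming the build script's naming convention
    sc = spark_session.sparkContext
    sc.addPyFile(f"{s3_dependency_dir}/reporting.ratio_stats.requirements.zip")

    # Imports that depend on the bundle must follow addPyFile(), hence
    # the E402 suppressions in the real model
    from assesspy import cod  # noqa: E402
    ...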

@@ -1,237 +1,234 @@
# type: ignore
@jeancochrane (Contributor, Author) commented:

The diff in this file might look pretty gnarly, but it's almost all whitespace changes, since the code needed to be indented one more level to nest inside the model() function (more on that soon). If you hide whitespace in the diff, it should be easier to review.


# Create a zip archive from the contents of the subdirectory
zip_archive_name="${model_identifier}.requirements.zip"
echo "Creating zip archive $zip_archive_name from $subdirectory_name"
- zip -q -r "$zip_archive_name" "$subdirectory_name"
+ cd "$subdirectory_name" && zip -q -r9 "../$zip_archive_name" ./* && cd ..
@jeancochrane (Contributor, Author) commented:

This change more closely aligns us with the recommended workflow for bundling external dependencies for Athena PySpark, both by using a higher compression level for zip and by removing the top level of the zip archive. There may be a cleaner way to accomplish the latter without the cd calls, but in the meantime this works and is reasonably clear.
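
If we ever want to drop the cd calls, one option (a hedged sketch, not part of this PR) is to build the archive with Python's zipfile module, which controls archive-internal paths directly; the function name is illustrative:

import os
import zipfile

def zip_package_dir(subdirectory_name: str, zip_archive_name: str) -> None:
    """Zip the package directory so entries are relative to its root."""
    with zipfile.ZipFile(
        zip_archive_name, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as zf:
        for root, _dirs, files in os.walk(subdirectory_name):
            for fname in files:
                path = os.path.join(root, fname)
                # Relative arcnames drop the top-level directory, matching
                # what the cd-based zip invocation achieves
                zf.write(path, os.path.relpath(path, start=subdirectory_name))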

@@ -147,12 +147,12 @@ while read -r item; do
subdirectory_name="${model_identifier}/"
mkdir -p "$subdirectory_name"
echo "Installing dependencies from $requirements_filename into $subdirectory_name"
- pip install -t "$subdirectory_name" -r "$requirements_filename"
+ pip install -t "$subdirectory_name" -r "$requirements_filename" --no-deps
@jeancochrane (Contributor, Author) commented:

The --no-deps flag gives us more fine-grained control over the dependencies that get installed, which is important because some more complex dependencies like numpy and pandas do not seem to install properly when installed into a target directory and then bundled into a zipfile. (assesspy depends on these two packages, hence the change as part of this PR.) It may be worth future investigation to see if there's just a trick that I'm missing for installing these kinds of packages, but in the meantime we can fall back to the prebuilt versions of these packages where necessary by avoiding installation of our dependencies' dependencies using the --no-deps flag.
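
To illustrate the fallback behavior (a hedged example, not code from this PR): inside the model, only assesspy ships in the bundle, while its dependencies resolve to the runtime's prebuilt copies:

# With --no-deps, the uploaded bundle contains only assesspy itself;
# numpy and pandas resolve to the prebuilt packages in the Athena
# Spark runtime rather than to copies inside the zipfile
import numpy as np  # prebuilt in the runtime
import pandas as pd  # prebuilt in the runtime
from assesspy import cod  # shipped via the dependency bundle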

Member commented:

For our own future reference: we found a note buried in the Athena Spark docs that says you can only use pure Python packages. Seems like anything with C code in it currently doesn't work.

@@ -0,0 +1,3 @@
{{ config(materialized='table') }}
@jeancochrane (Contributor, Author) commented:

This has to be a table because Athena views cannot be accessed in PySpark scripts.

@jeancochrane marked this pull request as ready for review May 16, 2024 20:33
@jeancochrane requested a review from a team as a code owner May 16, 2024 20:33
@jeancochrane requested a review from dfsnow May 16, 2024 20:34
@dfsnow (Member) left a comment:

I'm of two minds about this one. On one hand, I like that this gives us per-environment control over packages and that we get a package zip per model.

On the other, I'm worried that this is a lot of complexity that will ultimately make Python models a bit more brittle and hard to use. Specifically, needing to smuggle stuff through a table and putting everything inside the model definition gives me bad vibes (as the kids say).

I think this is perfectly workable in its current state, but I'm going to outline an alternative that IMO simplifies things a bit.

Global Package Directory

Rather than a per-model, per-environment zip of packages, we could create a global directory of packages by version (probably in s3://ccao-athena-dependencies-us-east-1).

CI Setup

In this setup, people would specify the specific package version they want for a model in that model's schema file. A CI job would then collect and dedupe those versions into a single list, pip install --no-deps each of the packages, zip the results, and name each file by package name and version. Finally, we'd aws s3 sync the CI package dir with the bucket.

The CI job would be triggered on any branch, meaning PRs would also populate this global directory with any new packages they use. We'd control the total size of the package directory with lifecycle rules (e.g. expire after 30 days).
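
A hedged sketch of what that CI step could look like, assuming pinned specs like assesspy==1.2.3 collected from schema files (the spec format and all names are assumptions):

import subprocess

def bundle_packages(pinned_specs: set[str]) -> None:
    """Install each deduped, pinned package on its own and zip it,
    naming the archive by package name and version."""
    for spec in sorted(pinned_specs):
        name, version = spec.split("==")
        target_dir = f"{name}-{version}"
        # --no-deps: bundle only the package itself, as in the build script
        subprocess.run(
            ["pip", "install", "--no-deps", "-t", target_dir, spec], check=True
        )
        # Zip from inside the directory so the archive has no top level,
        # then `aws s3 sync` the resulting zips to the bucket
        subprocess.run(
            ["zip", "-q", "-r9", f"../{target_dir}.zip", "."],
            cwd=target_dir,
            check=True,
        )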

Python Model Setup

Users of Python models would need to do two things to use packages:

  1. Add the package to the schema file for the model
  2. Add a sc.addPyFile() call to the top of their model for each specific package they want to use (kind of like imports; see the sketch below)
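
A hedged sketch of step 2 under this design; the bucket layout, zip name, and pinned version are illustrative assumptions:

# One addPyFile() per pinned package, pulled from the shared global
# directory (the zip name and version here are hypothetical)
sc.addPyFile("s3://ccao-athena-dependencies-us-east-1/assesspy-1.2.3.zip")

def model(dbt, spark_session):
    dbt.config(materialized="table")
    import assesspy  # noqa: E402 -- importable once the zip is loaded
    ...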

Pros and Cons

  • Pros:
    • Easier to debug, since each package exists in its own zip and they all live in a single place
    • No smuggling the target environment or putting everything in model()
    • Aligns with the existing Athena Spark setup of "Here is a set of packages and versions you can use"
  • Cons:
    • Global shared directory means PRs affect the prod env, no per model packages
    • Need to specify the version of each package you want to install, plus import that version in the model

Thoughts

I think this makes sense in a world where we'll probably use a small number of pure Python packages only in our downstream models. If we were planning to use Python models for ingest with complex packages like GeoPandas, then I'd be all in on the current solution.

Ultimately, @jeancochrane it's up to you how to proceed. I'm perfectly happy to have this merged basically as-is.


Comment on lines +6 to +7
python_model_dependency = dbt.ref("ccao.python_model_dependency")
s3_dependency_dir = python_model_dependency.first()["s3_dependency_dir"]
Member commented:

praise: This is a clever hack!

@jeancochrane (Contributor, Author) commented:

Agreed with your proposed design @dfsnow! I'm going to merge this in as-is so that we can preserve a commit with the current design, but I'm going to open up a fast follow that refactors to simplify things.

@jeancochrane merged commit cd5062a into master on May 20, 2024
12 checks passed
@jeancochrane deleted the jeancochrane/439-refactor-reportingratio_stats-to-use-new-pattern-for-python-model-dependencies branch on May 20, 2024 15:44
Linked issue: #439 (Refactor reporting.ratio_stats to use new pattern for Python model dependencies)