Use new view to pull dependency bundle in reporting.ratio_stats
#453
Conversation
# not have sqlfluff mocks builtin, so we have to mock out any macros
# that reference those variables if they are used in code that sqlfluff
# lints
get_s3_dependency_dir = {% macro get_s3_dependency_dir() %}s3://bucket{% endmacro %}
Without this mocked macro, SQLFluff raises an error like this:
== [aws-athena/views/ccao-vw_python_model_dependency.sql] FAIL
L: 1 | P: 1 | TMP | Unrecoverable failure in Jinja templating: 'target' is
| undefined. Have you configured your variables?
| https://docs.sqlfluff.com/en/latest/configuration.html
Might be worth revisiting if we do decide to take a second pass at our SQLFluff config in the near future.
Let's defer a harder look at this to #456!
@@ -70,6 +70,12 @@ sources:
    tags:
      - type_condo

  - name: python_model_dependency
I recognize the name of this model is pretty weird, so I'm open to other suggestions!
I think it's fine!
…f testing" This reverts commit 58e415b.
def model(dbt, spark_session):
    dbt.config(materialized="table")

    import numpy as np  # noqa: E402
    import pandas as pd  # noqa: E402
    from assesspy import boot_ci  # noqa: E402
    from assesspy import cod  # noqa: E402
    from assesspy import prd_met  # noqa: E402
    from assesspy import cod_ci as cod_boot  # noqa: E402
    from assesspy import cod_met, mki, mki_met, prb, prb_met, prd  # noqa: E402
    from assesspy import prd_ci as prd_boot  # noqa: E402
    # Load dependency bundle so that we can import deps
    python_model_dependency = dbt.ref("ccao.python_model_dependency")
    s3_dependency_dir = python_model_dependency.first()["s3_dependency_dir"]
One major downside of this solution is that the sc.addPyFile() call and any subsequent import calls must be defined inside the context of the model function, since sc.addPyFile() won't know where to load the bundle from until we can pull s3_dependency_dir from the ccao.python_model_dependency table using a dbt.ref() call. This is slightly annoying insofar as it makes the Python model file look less like a regular Python script than it otherwise would, but I don't actually think it's a particularly big deal.
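To make the resulting structure concrete, here's a rough sketch of how the pieces end up ordered inside model(); the bundle filename follows the deploy script's "<model_identifier>.requirements.zip" naming but is illustrative here:

```python
def model(dbt, spark_session):
    dbt.config(materialized="table")

    # The bundle location is only known after this ref, which is why
    # addPyFile() and the dependency imports have to live inside model()
    python_model_dependency = dbt.ref("ccao.python_model_dependency")
    s3_dependency_dir = python_model_dependency.first()["s3_dependency_dir"]

    # Illustrative bundle name; the deploy script produces
    # "<model_identifier>.requirements.zip"
    spark_session.sparkContext.addPyFile(
        f"{s3_dependency_dir}/reporting.ratio_stats.requirements.zip"
    )

    # Bundled dependencies can only be imported after addPyFile()
    from assesspy import cod  # noqa: E402

    # ...compute stats with the bundled package and return a DataFrame
```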
@@ -1,237 +1,234 @@
# type: ignore
The diff in this file might look pretty gnarly, but it's almost all whitespace changes due to the fact that the code needed to be indented one more layer to nest inside the context of the model() function (more on that soon). If you hide whitespace in the diff, it should be easier to review.
# Create a zip archive from the contents of the subdirectory
zip_archive_name="${model_identifier}.requirements.zip"
echo "Creating zip archive $zip_archive_name from $subdirectory_name"
zip -q -r "$zip_archive_name" "$subdirectory_name"
cd "$subdirectory_name" && zip -q -r9 "../$zip_archive_name" ./* && cd ..
This change more closely aligns us with the recommended workflow for bundling external dependencies for Athena PySpark, both by using a higher compression level for zip and by removing the top level of the zip archive. There may be a cleaner way to implement that latter goal without requiring the cd calls, but in the meantime this works and is reasonably clear.
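For reference, a quick way to sanity-check the resulting archive layout (the filename here is illustrative); packages need to sit at the zip root for sc.addPyFile() to make them importable:

```python
import zipfile

# List top-level entries in the bundle; package directories (e.g. "assesspy/")
# should appear here directly, not nested under an extra
# "<model_identifier>/" directory
with zipfile.ZipFile("reporting.ratio_stats.requirements.zip") as zf:
    top_level = sorted({name.split("/")[0] for name in zf.namelist()})
    print(top_level)
```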
@@ -147,12 +147,12 @@ while read -r item; do
subdirectory_name="${model_identifier}/"
mkdir -p "$subdirectory_name"
echo "Installing dependencies from $requirements_filename into $subdirectory_name"
pip install -t "$subdirectory_name" -r "$requirements_filename"
pip install -t "$subdirectory_name" -r "$requirements_filename" --no-deps
The --no-deps flag gives us more fine-grained control over the dependencies that get installed, which is important because some more complex dependencies like numpy and pandas do not seem to install properly when installed into a target directory and then bundled into a zipfile. (assesspy depends on these two packages, hence the change as part of this PR.) It may be worth future investigation to see if there's just a trick that I'm missing for installing these kinds of packages, but in the meantime we can fall back to the prebuilt versions of these packages where necessary by avoiding installation of our dependencies' dependencies using the --no-deps flag.
For our own future reference: we found a note buried in the Athena Spark docs that says you can only use pure Python packages. Seems like anything with C code in it currently doesn't work.
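A quick way to check whether an installed dependency tree is likely to work under that constraint (the directory name is illustrative) is to look for compiled extension modules:

```python
from pathlib import Path

# Packages that ship compiled extensions (.so/.pyd) won't work when bundled
# for Athena Spark, per the note above; pure Python trees should be clean
pkg_dir = Path("reporting.ratio_stats")  # illustrative pip install target dir
compiled = [p for ext in ("*.so", "*.pyd") for p in pkg_dir.rglob(ext)]
print("pure Python" if not compiled else f"compiled extensions: {compiled[:5]}")
```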
@@ -0,0 +1,3 @@
{{ config(materialized='table') }}
This has to be a table because Athena views cannot be accessed in PySpark scripts.
I'm of two minds about this one. On one hand, I like that this gives us per-environment control over packages and that we get a package zip per model.
On the other, I'm worried that this is a lot of complexity that will ultimately make Python models a bit more brittle and hard to use. Specifically, needing to smuggle stuff through a table and putting everything inside the model definition gives me bad vibes (as the kids say).
I think this is perfectly workable in its current state, but I'm going to outline an alternative that IMO simplifies things a bit.
Global Package Directory
Rather than a per-model, per-environment zip of packages, we could create a global directory of packages by version (probably in s3://ccao-athena-dependencies-us-east-1).
CI Setup
In this setup, people would specify the specific package version they want for a model in that model's schema file. We'd then have a CI job collect and dedupe those versions into a single list. We can then pip install --no-deps each of the packages, zip the results, and name the file by the package name and version. Finally, we'd aws s3 sync the CI package dir with the bucket.
The CI job would be triggered on any branch, meaning PRs would also populate this global directory with any new packages they use. We'd control the total size of the package directory with lifecycle rules (e.g. expire after 30 days).
Python Model Setup
Users of Python models would need to do two things to use packages:
- Add the package to the schema file for the model
- Add a sc.addPyFile() call to the top of their model for each specific package they want to use (kind of like imports; see the sketch below)
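A rough sketch of what that might look like inside a model under this proposal; the bucket comes from above, but the zip naming scheme and version are illustrative:

```python
def model(dbt, spark_session):
    dbt.config(materialized="table")

    # One addPyFile() call per package, pulled from the shared dependency
    # bucket; zip names would encode package name + version (illustrative)
    deps_bucket = "s3://ccao-athena-dependencies-us-east-1"
    spark_session.sparkContext.addPyFile(f"{deps_bucket}/assesspy-1.0.0.zip")

    from assesspy import cod  # noqa: E402

    # ...rest of the model
```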
Pros and Cons
- Pros:
  - Easier to debug, since each package exists in its own zip and they all live in a single place
  - No smuggling the target environment or putting everything in model()
  - Aligns with the existing Athena Spark setup of "Here is a set of packages and versions you can use"
- Cons:
  - Global shared directory means PRs affect the prod env, no per-model packages
  - Need to specify the version of each package you want to install, plus import that version in the model
Thoughts
I think this makes sense in a world where we'll probably use a small number of pure Python packages only in our downstream models. If we were planning to use Python models for ingest with complex packages like GeoPandas, then I'd be all in on the current solution.
Ultimately, @jeancochrane it's up to you how to proceed. I'm perfectly happy to have this merged basically as-is.
python_model_dependency = dbt.ref("ccao.python_model_dependency")
s3_dependency_dir = python_model_dependency.first()["s3_dependency_dir"]
praise: This is a clever hack!
…tats-to-use-new-pattern-for-python-model-dependencies
Agreed with your proposed design @dfsnow! I'm going to merge this in as-is so that we can preserve a commit with the current design, but I'm going to open up a fast follow that refactors to simplify things.
This PR uses the Python model dependency deployment system established in #435 to power the Python dependencies in reporting.ratio_stats, our first Python model. We create a new table python_model_dependency that gets referenced in reporting.ratio_stats via a dbt.ref() call, as an indirect way of calling get_s3_dependency_dir() in the context of the Python model code.

The design is a little bit counterintuitive due to the fact that Python models 1) currently have no equivalent to macros that would allow us to reuse code and 2) only support accessing project context that is passed in via config variables. If it weren't for these two limitations, I would have preferred one of two alternative solutions:

1. A Python version of the get_s3_dependency_dir() macro that we could call directly from the context of the reporting.ratio_stats Python model. This is impossible due to limitation 1) above, since there is no equivalent of macros for Python models yet. We could think about deploying a separate bundle to S3 just for this one macro, but we wouldn't be able to namespace it properly by user or branch in dev/CI environments, since scripts would need to pull the bundle containing get_s3_dependency_dir() before they know the location of their S3 dependency dir in the first place. If Python macros were supported, this alternative solution would have entailed Python model code looking something like this:
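(Sketch only; the direct get_s3_dependency_dir() call below is the hypothetical, unsupported piece, and the bundle filename is illustrative.)

```python
def model(dbt, spark_session):
    dbt.config(materialized="table")

    # Hypothetical: call a Python version of the get_s3_dependency_dir()
    # macro directly instead of smuggling its value through a table
    s3_dependency_dir = get_s3_dependency_dir()  # noqa: F821 -- doesn't exist

    spark_session.sparkContext.addPyFile(
        f"{s3_dependency_dir}/reporting.ratio_stats.requirements.zip"
    )

    from assesspy import cod  # noqa: E402

    # ...
```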
2. Passing the value of the get_s3_dependency_dir() macro to the model via configs. This is impossible due to limitation 2) above, since only a subset of builtin macros are available at the time when schema files are compiled (see discussion here). If all macros were available at compile time, this alternative solution would have entailed a schema file that passes the result of get_s3_dependency_dir() to the model as a config value.

Closes #439. Note the one extra task from #439 that isn't completed as part of this PR (adding docs for Python models) -- I think I'd prefer to spin that off into a follow-up issue if you're comfortable with it, since some aspects of our use of Python models are still under active consideration (e.g. which types of transformations should use Python models vs. SQL models vs. Python/R scripts).