Add script and update `build-and-test-dbt` workflow to push dbt Python dependencies to S3 #435
Conversation
Closing this to test the cleanup step.
```sh
dbt list \
  --quiet -t "$TARGET" -s state:modified state:new \
  --state "$STATE_DIR" --resource-types model \
  --output json --output-keys name | \
```
This modified/new state selection logic is shared in a couple of places. Might be worth a small follow-up issue to factor them all out into a shared YAML selector.
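For reference, a shared selector could live in a `selectors.yml` along these lines (a rough sketch; the selector name and description here are hypothetical):

```yaml
# selectors.yml -- hypothetical sketch of a shared "modified or new" selector
selectors:
  - name: modified-or-new-models
    description: Nodes that are new or modified relative to the saved state artifacts
    definition:
      union:
        - method: state
          value: modified
        - method: state
          value: new
```

Each call site could then run `dbt list --selector modified-or-new-models --state "$STATE_DIR"` instead of repeating the selection flags.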
Absolutely, let's give it to one of the juniors. Can you write up a brief issue?
Done, see #446! Note that I expect a junior would need some detailed guidance for this one, since the issue requires mastery of node selection syntax and YAML anchors in order to get the design right.
```sql
-- Get the S3 location that is used to store Python model dependencies
-- for the current target
```
Ideally we would just add a new attribute `s3_dependency_dir` to our profiles defined in `profiles.yml`, and then we could have full control over the location where dependencies get stored for each target. That doesn't appear to be possible without extending the `dbt-athena` plugin itself to support a new profile variable, so this macro represents a workaround that allows us to compute a dependency dir based on the `s3_data_dir` variables set in our profiles.
I think we'll probably need a Python version of this macro as well, but I'm leaving that for myself to figure out later when I tackle #439.
dbt/macros/get_s3_dependency_dir.sql (outdated)
```jinja
{% endif %}
{% set dir_suffix = "/" ~ head_ref %}
{% endif %}
{% set s3_dependency_dir = target.s3_data_dir ~ "packages" ~ dir_suffix %}
```
One downside of this approach is that it stores packages inside the data dir, when ideally they would be stored alongside the data dir. For clarity, here are the three paths for the three different targets we support:
`dev`
- data dir: `s3://ccao-dbt-athena-dev-us-east-1/data/`
- dependency dir: `s3://ccao-dbt-athena-dev-us-east-1/data/packages/<username>/`

`ci`
- data dir: `s3://ccao-dbt-athena-ci-us-east-1/data/`
- dependency dir: `s3://ccao-dbt-athena-ci-us-east-1/data/packages/<branch_slug>/`

`prod`
- data dir: `s3://ccao-athena-ctas-us-east-1/`
- dependency dir: `s3://ccao-athena-ctas-us-east-1/packages/`
I think `prod` is probably the most reasonable of the three. They all feel acceptable right now, but not perfect. What do you think?
suggestion (non-blocking): Well, if we're following our existing schema for the dev, ci, prod split, something like this might be better:
`dev`
- data dir: `s3://ccao-dbt-athena-dev-us-east-1/data/`
- dependency dir: `s3://ccao-dbt-athena-dev-us-east-1/packages/<username>/`

`ci`
- data dir: `s3://ccao-dbt-athena-ci-us-east-1/data/`
- dependency dir: `s3://ccao-dbt-athena-ci-us-east-1/packages/<branch_slug>/`

`prod`
- data dir: `s3://ccao-athena-ctas-us-east-1/`
- dependency dir: `s3://ccao-athena-packages-us-east-1/`
This gives us tighter control over permissions and reduces the blast radius of accidental removal of prod stuff.
@dfsnow I agree, that pattern would be cleaner, but per my comment above we aren't actually able to configure the dependency directory as a dedicated profile variable, so it needs to be derivable from other variables like `s3_data_dir`. This is why I proposed the organization sketched out in this PR, as imperfect as it is; if you have suggestions for a better one that is derivable from existing profile variables, I would be grateful!
Edited to add: I suppose we could implement a scheme that is independent of profile variables by just hardcoding the pattern into the `get_s3_dependency_dir` macro based on the target, but it would make me a bit nervous to decouple this particular S3 path config from `profiles.yml` when the rest of our S3 path configs are stored there. Still, it's one option for working around the limitations we're facing.
Oh oh, I gotcha, sorry I didn't read closer. What about adding some vars to `dbt_project.yml`:
```yaml
vars:
  s3_package_dir_dev: s3://ccao-dbt-athena-dev-us-east-1/packages/
  s3_package_dir_ci: s3://ccao-dbt-athena-ci-us-east-1/packages/
  s3_package_dir_prod: s3://ccao-athena-packages-us-east-1/
```
And then conditionally selecting the directory based on the suffix of the var name, as in the sketch below. Feels pretty hacky, but at least then the vars are declared and not stored in the conditional logic itself.

Otherwise, I think your schema is fine! All of this should be relatively easy to change if we ever want/need to move package locations.
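For concreteness, a minimal sketch of that conditional selection (the macro body here is hypothetical, assuming the `s3_package_dir_*` vars above and target names `dev`, `ci`, and `prod`):

```jinja
{# Hypothetical sketch: pick the package dir whose var name suffix matches the target #}
{% macro get_s3_dependency_dir() %}
    {% set s3_dependency_dir = var("s3_package_dir_" ~ target.name) %}
    {{ return(s3_dependency_dir) }}
{% endmacro %}
```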
I like that idea! Done in 02cc83a. Note that I've been using `dependency` instead of `package` to name these resources since I don't want them to be confused with dbt packages, but I don't feel super strongly about it, particularly since the name of the `config` attribute is `packages` anyway.
This is awesome @jeancochrane. Just a minor issue and a suggestion and I think this is set to go.
```sh
)

echo "Deleting Python model requirements at $s3_dependency_dir"
aws s3 rm "$s3_dependency_dir" --recursive
```
issue (non-blocking): Recursive removal makes me nervous. Let's make sure the dbt Athena role is very tightly scoped to only allow removals of the bare minimum. Can be addressed in a follow up PR to the infra repo.
This is a bit of a tricky issue: Unfortunately, dbt-athena requires `DeleteObject` permissions over all objects in `s3_data_dir` (docs). It's possible these docs are misleading and there's a way to scope the permissions more tightly, but I think the only way to figure it out would be through trial and error.
One alternative would be to remove this step from `cleanup_dbt_resources.sh` and rely on the existing S3 management policy that deletes everything in the CI `data/` directory after 30 days. This wouldn't be perfect from a cost management perspective, but it might be good enough. It would also have the negative side effect that Python model dependencies don't get properly cleaned up when a user runs this script against the `dev` target, but we could handle that by gating this block behind a `$TARGET == 'dev'` conditional (sketched below).
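For illustration, a minimal sketch of that gating, assuming `$TARGET` and `$s3_dependency_dir` are set earlier in the script:

```sh
# Hypothetical sketch: only delete dependency artifacts for the dev target,
# leaving CI cleanup to the bucket's 30-day lifecycle policy
if [[ "$TARGET" == "dev" ]]; then
    echo "Deleting Python model requirements at $s3_dependency_dir"
    aws s3 rm "$s3_dependency_dir" --recursive
fi
```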
I'm open to any of these options -- curious what you think!
Hmmm alright, as long as there's no way for that call to ever target anything other than the target env buckets then I think it's okay. I can dig a little more on this when we do our permissions cleanup.
Sounds good! I think we're also slightly more protected by the switch to variables for defining `s3_dependency_dir` in 02cc83a, since now our S3 paths for dependencies don't overlap with our paths for data.
This PR adds a new script, `deploy_dbt_model_dependencies.sh`, that downloads dependencies for dbt Python models, bundles them into zip archives, and pushes them to S3. It also updates the `build-and-test-dbt` workflow to use the new script.

Note that this PR leaves out of scope the additional step to refactor the existing `reporting.ratio_stats` model to use this new dependency management system. That step will be handled in a follow-up issue (#439).

Closes #417.
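For reference, a minimal sketch of the bundle-and-push flow the script implements (the per-model `requirements.txt` layout, paths, and naming below are assumptions, not the script's actual contents):

```sh
#!/usr/bin/env bash
# Hypothetical sketch of the core loop of deploy_dbt_model_dependencies.sh
set -euo pipefail

s3_dependency_dir="$1"  # e.g. the path computed by the get_s3_dependency_dir macro

find dbt/models -name requirements.txt | while read -r reqs; do
    model="$(basename "$(dirname "$reqs")")"
    workdir="$(mktemp -d)"
    # Download the model's dependencies into an isolated directory
    pip install --quiet --target "$workdir/python" -r "$reqs"
    # Bundle the dependencies into a zip archive named after the model
    (cd "$workdir" && zip -rq "$model.zip" python)
    # Push the archive to the dependency dir for the current target
    aws s3 cp "$workdir/$model.zip" "$s3_dependency_dir/$model.zip"
done
```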
Testing

- Evidence of `build-and-test-dbt` pushing packages for changed models to S3 on a `synchronize` event: Link
- Evidence of `build-and-test-dbt` skipping dependencies that have not changed on a `synchronize` event: Link
- Evidence of `build-and-test-dbt` skipping models that have no Python dependencies on a `workflow_dispatch` event: Link
- Evidence of `cleanup-dbt-resources` removing packages from S3 on a `closed` event: Link