
Add script and update build-and-test-dbt workflow to push dbt Python dependencies to S3 #435

Conversation

@jeancochrane (Contributor) commented on May 8, 2024:

This PR adds a new script, deploy_dbt_model_dependencies.sh, that downloads dependencies for dbt Python models, bundles them into zip archives, and pushes them to S3. It also updates the build-and-test-dbt workflow to use the new script, including the following features (a rough sketch of the per-model deploy loop follows the list):

  • If a dbt cache exists, only deploy packages for models that are new or have changed
  • If a caller has dispatched the workflow for a specific set of models, only deploy packages for those models
  • In all cases, only deploy package bundles that are new or have changed
  • Deploy a separate package bundle for each model
  • Deploy package bundles to separate S3 locations for dev, CI, and prod, and isolate dev/CI environments by user or branch
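
The core of the script is a per-model deploy loop, roughly like the sketch below. This is illustrative only: the requirements-file-per-model layout, temp paths, and hash-object naming are assumptions rather than the script's exact contents, and the environment-specific S3 prefix is assumed to be passed in by the workflow.

#!/usr/bin/env bash
# Illustrative sketch of deploy_dbt_model_dependencies.sh, not the real script
set -euo pipefail

S3_DEPENDENCY_DIR="$1"  # e.g. s3://ccao-dbt-athena-ci-us-east-1/packages/<branch_slug>

for reqs in dbt/models/*/requirements.txt; do   # hypothetical per-model requirements files
    model_name="$(basename "$(dirname "$reqs")")"
    build_dir="$(mktemp -d)"

    # Download the model's Python dependencies and bundle them into one zip archive
    pip install --quiet --target "$build_dir" -r "$reqs"
    (cd "$build_dir" && zip -r -q "/tmp/${model_name}.zip" .)

    # Only deploy bundles that are new or have changed: compare a hash of the
    # requirements file against the hash stored alongside the existing bundle
    # (zip archives embed timestamps, so hashing the archive itself is unreliable)
    new_hash="$(sha256sum "$reqs" | cut -d' ' -f1)"
    old_hash="$(aws s3 cp "${S3_DEPENDENCY_DIR}/${model_name}.sha256" - 2>/dev/null || true)"
    if [ "$new_hash" != "$old_hash" ]; then
        aws s3 cp "/tmp/${model_name}.zip" "${S3_DEPENDENCY_DIR}/${model_name}.zip"
        echo "$new_hash" | aws s3 cp - "${S3_DEPENDENCY_DIR}/${model_name}.sha256"
    else
        echo "Skipping ${model_name}: dependencies unchanged"
    fi
done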

Note that this PR leaves out of scope the additional step to refactor the existing reporting.ratio_stats model to use this new dependency management system. That step will be handled in a follow-up issue (#439).

Closes #417.

Testing

Evidence of build-and-test-dbt pushing packages for changed models to S3 on synchronize event: Link

Evidence of build-and-test-dbt skipping dependencies that have not changed on synchronize event: Link

Evidence of build-and-test-dbt skipping models that have no Python dependencies on workflow_dispatch event: Link

Evidence of cleanup-dbt-resources removing packages from S3 on closed event: Link

@jeancochrane linked an issue on May 8, 2024 that may be closed by this pull request
@jeancochrane force-pushed the jeancochrane/417-add-workflow-to-deploy-python-dependencies-to-s3 branch from 2650718 to 074685c on May 8, 2024 16:03
@jeancochrane force-pushed the jeancochrane/417-add-workflow-to-deploy-python-dependencies-to-s3 branch from 074685c to 09cc186 on May 8, 2024 16:07
@jeancochrane changed the title from "Add deploy-dbt-requirements workflow for pushing dbt Python requirements to S3" to "Add deploy-dbt-dependencies workflow for pushing dbt Python dependencies to S3" on May 8, 2024
@jeancochrane force-pushed the jeancochrane/417-add-workflow-to-deploy-python-dependencies-to-s3 branch from ae19253 to e6b44ea on May 8, 2024 16:16
@jeancochrane force-pushed the jeancochrane/417-add-workflow-to-deploy-python-dependencies-to-s3 branch 2 times, most recently from 49e1231 to 82fd339 on May 9, 2024 17:35
@jeancochrane force-pushed the jeancochrane/417-add-workflow-to-deploy-python-dependencies-to-s3 branch from 82fd339 to 960d2c3 on May 9, 2024 17:37
@jeancochrane (Contributor, Author) commented:

Closing this to test the cleanup step.

@jeancochrane reopened this on May 9, 2024
@jeancochrane changed the title from "Add deploy-dbt-dependencies workflow for pushing dbt Python dependencies to S3" to "Add script and update build-and-test-dbt workflow to push dbt Python dependencies to S3" on May 9, 2024
.github/scripts/cleanup_dbt_resources.sh (review thread resolved)
Comment on lines +69 to +72
dbt list \
--quiet -t "$TARGET" -s state:modified state:new \
--state "$STATE_DIR" --resource-types model \
--output json --output-keys name | \
@jeancochrane (Contributor, Author) commented:

This modified/new state selection logic is shared in a couple of places. Might be worth a small follow-up issue to factor them all out into a shared YAML selector.
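
As a rough illustration of that idea (the selector name and exact definition here are hypothetical; the real design would be worked out in the follow-up issue), the shared selection could live in selectors.yml:

# selectors.yml (hypothetical sketch)
selectors:
  - name: modified-or-new-models
    description: Models that are new or modified relative to the saved state
    definition:
      union:
        - method: state
          value: modified
        - method: state
          value: new

The script and workflow could then run dbt list --selector modified-or-new-models --state "$STATE_DIR" instead of repeating the -s state:modified state:new arguments in each place.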

@dfsnow (Member) replied:

Absolutely, let's give it to one of the juniors. Can you write up a brief issue?

@jeancochrane (Contributor, Author) replied:

Done, see #446! Note that I expect a junior would need some detailed guidance for this one, since the issue requires mastery of node selection syntax and YAML anchors in order to get the design right.

.github/workflows/build_and_test_dbt.yaml (three review threads resolved)
dbt/macros/get_s3_dependency_dir.sql (outdated review thread resolved)
Comment on lines +1 to +2
-- Get the S3 location that is used to store Python model dependencies
-- for the current target
@jeancochrane (Contributor, Author) commented:

Ideally we would just add a new attribute s3_dependency_dir to our profiles defined in profiles.yml, and then we could have full control over the location where dependencies get stored for each target. That doesn't appear to be possible without extending the dbt-athena plugin itself to support a new profile variable, so this macro represents a workaround that allows us to compute a dependency dir based on the s3_data_dir variables set in our profiles.

I think we'll probably need a Python version of this macro as well, but I'm leaving that for myself to figure out later when I tackle #439.

{% endif %}
{% set dir_suffix = "/" ~ head_ref %}
{% endif %}
{% set s3_dependency_dir = target.s3_data_dir ~ "packages" ~ dir_suffix %}
@jeancochrane (Contributor, Author) commented:
One downside of this approach is that it stores packages inside the data dir, when ideally they would be stored alongside the data dir. For clarity, here are the three paths for the three different targets we support:

  • dev
    • data dir: s3://ccao-dbt-athena-dev-us-east-1/data/
    • dependency dir: s3://ccao-dbt-athena-dev-us-east-1/data/packages/<username>/
  • ci
    • data dir: s3://ccao-dbt-athena-ci-us-east-1/data/
    • dependency dir: s3://ccao-dbt-athena-ci-us-east-1/data/packages/<branch_slug>/
  • prod
    • data dir: s3://ccao-athena-ctas-us-east-1/
    • dependency dir: s3://ccao-athena-ctas-us-east-1/packages/

I think prod is probably the most reasonable of the three. They all feel acceptable right now, but not perfect. What do you think?

@dfsnow (Member) replied:

suggestion (non-blocking): Well, if we're following our existing schema for the dev, ci, prod split, something like this might be better:

  • dev
    • data dir: s3://ccao-dbt-athena-dev-us-east-1/data/
    • dependency dir: s3://ccao-dbt-athena-dev-us-east-1/packages/<username>/
  • ci
    • data dir: s3://ccao-dbt-athena-ci-us-east-1/data/
    • dependency dir: s3://ccao-dbt-athena-ci-us-east-1/packages/<branch_slug>/
  • prod
    • data dir: s3://ccao-athena-ctas-us-east-1/
    • dependency dir: s3://ccao-athena-packages-us-east-1/

This gives us tighter control over permissions and reduces the blast radius of accidental removal of prod stuff.

@jeancochrane (Contributor, Author) commented May 13, 2024:

@dfsnow I agree, that pattern would be cleaner, but per my comment above we aren't actually able to configure the dependency directory as a dedicated profile variable, so it needs to be derivable from other variables like s3_data_dir. This is why I proposed the organization sketched out in this PR, as imperfect as it is; if you have suggestions for a better one that is derivable from existing profile variables I would be grateful!

Edited to add: I suppose we could implement a scheme that is independent of profile variables by just hardcoding the pattern into the get_s3_dependency_dir macro based on the target, but it would make me a bit nervous to decouple this particular S3 path config from profiles.yml when the rest of our S3 path configs are stored there. Still, it's one option for working around the limitations we're facing.

@dfsnow (Member) replied:

Oh, I gotcha, sorry I didn't read closer. What about adding some vars to dbt_project.yml:

vars:
  s3_package_dir_dev: s3://ccao-dbt-athena-dev-us-east-1/packages/
  s3_package_dir_ci: s3://ccao-dbt-athena-ci-us-east-1/packages/
  s3_package_dir_prod: s3://ccao-athena-packages-us-east-1/

And then conditionally selecting the directory based on the var whose suffix matches the current target. Feels pretty hacky, but at least then the vars are declared rather than buried in the conditional logic itself.

Otherwise, I think your schema is fine! All of this should be relatively easy to change if we ever want/need to move package locations.

@jeancochrane (Contributor, Author) commented May 14, 2024:

I like that idea! Done in 02cc83a. Note that I've been using dependency instead of package to name these resources since I don't want them to be confused with dbt packages, but I don't feel super strongly about it, particularly since the name of the config attribute is packages anyway.
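
For reference, a macro along these lines would implement that approach. This is a hypothetical sketch only; the actual macro committed in 02cc83a may differ in var names, environment variables, and suffix handling.

-- Hypothetical sketch of dbt/macros/get_s3_dependency_dir.sql after 02cc83a
{% macro get_s3_dependency_dir() %}
    {# Pick the declared project var that matches the current target #}
    {% set base_dir = var("s3_dependency_dir_" ~ target.name) %}
    {# Isolate dev and CI deployments by user or branch #}
    {% if target.name == "dev" %}
        {% set dir_suffix = "/" ~ env_var("USER", "default") %}
    {% elif target.name == "ci" %}
        {% set dir_suffix = "/" ~ env_var("HEAD_REF", "default") %}
    {% else %}
        {% set dir_suffix = "" %}
    {% endif %}
    {{ return(base_dir ~ dir_suffix) }}
{% endmacro %}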

.github/workflows/build_and_test_dbt.yaml (review thread resolved)
@jeancochrane marked this pull request as ready for review on May 9, 2024 22:17
@jeancochrane requested a review from a team as a code owner on May 9, 2024 22:17
@jeancochrane requested a review from dfsnow on May 9, 2024 22:18
@dfsnow (Member) left a review:

This is awesome @jeancochrane. Just a minor issue and a suggestion and I think this is set to go.

)

echo "Deleting Python model requirements at $s3_dependency_dir"
aws s3 rm "$s3_dependency_dir" --recursive
@dfsnow (Member) commented:

issue (non-blocking): Recursive removal makes me nervous. Let's make sure the dbt Athena role is very tightly scoped to only allow removals of the bare minimum. Can be addressed in a follow up PR to the infra repo.

@jeancochrane (Contributor, Author) replied:

This is a bit of a tricky issue: Unfortunately, dbt-athena requires DeleteObject permissions over all objects in s3_data_dir (docs). It's possible these docs are misleading and there's a way to scope the permissions more tightly, but I think the only way to figure it out would be through trial and error.

One alternative would be to remove this step from cleanup_dbt_resources.sh and rely on the existing S3 management policy that deletes everything in the CI data/ directory after 30 days. That wouldn't be perfect from a cost management perspective, but it might be good enough. It would also have the negative side effect that Python model dependencies don't get properly cleaned up when a user runs this script against the dev target, though we could handle that by gating this block behind a $TARGET == 'dev' conditional.
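
For concreteness, that gating would be a small guard around the removal (a sketch, assuming the $TARGET and $s3_dependency_dir variables already defined in cleanup_dbt_resources.sh):

# Only clean up dependency bundles for the dev target; CI bundles would be
# left to the bucket's existing expiration policy
if [ "$TARGET" = "dev" ]; then
    echo "Deleting Python model dependencies at $s3_dependency_dir"
    aws s3 rm "$s3_dependency_dir" --recursive
fi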

I'm open to any of these options -- curious what you think!

@dfsnow (Member) replied:

Hmmm alright, as long as there's no way for that call to ever target anything other than the target env buckets then I think it's okay. I can dig a little more on this when we do our permissions cleanup.

@jeancochrane (Contributor, Author) replied:

Sounds good! I think we're also slightly more protected by the switch to variables for defining s3_dependency_dir in 02cc83a, since now our S3 paths for dependencies don't overlap with our paths for data.

.github/scripts/deploy_dbt_model_dependencies.sh (outdated review thread resolved)
@jeancochrane (Contributor, Author) commented May 14, 2024:

@dfsnow Want to give this one more look before I merge? Never mind, needs one more change!

@jeancochrane requested review from dfsnow and removed request for dfsnow on May 14, 2024 17:16
@dfsnow self-requested a review on May 14, 2024 17:31
@jeancochrane merged commit b06af57 into master on May 14, 2024
11 checks passed
@jeancochrane deleted the jeancochrane/417-add-workflow-to-deploy-python-dependencies-to-s3 branch on May 14, 2024 22:23