Check use cases for dbt-athena Python support #388

Closed
2 tasks done
dfsnow opened this issue Apr 16, 2024 · 3 comments
Labels
aws Change something related to AWS dbt Related to dbt (tests, docs, schema, etc)

Comments

@dfsnow
Member

dfsnow commented Apr 16, 2024

We may be able to:

  • Move some of the ingest queries / raw processing directly to dbt
  • Move the res ratio stats reporting to dbt
@dfsnow dfsnow added dbt Related to dbt (tests, docs, schema, etc) aws Change something related to AWS labels Apr 16, 2024
@jeancochrane
Contributor

jeancochrane commented Apr 17, 2024

I'm going to keep some running notes here as I investigate:

Note

I stopped investigating individual transformation scripts in detail after gaining confidence that I had considered all of the possible edge cases.

@jeancochrane
Contributor

jeancochrane commented Apr 17, 2024

@dfsnow After some investigation, my expectation is that Python models will have only limited utility for our ingests and transformations. Python models via PySpark are still very experimental (there have been no new releases since the one that added preliminary support for them in February), and Athena PySpark has two limitations that are currently dealbreakers for many of our most complex ingest/transformation scripts:

  • Limited support for external libraries that aren't built-in (external libraries must be stored on S3 in order to be accessible)
  • No ability to access data or connect to databases that are stored locally or on private network drives

Still, I think there are some ingest scripts that could be refactored to use Python models, particularly ones that read data from public URLs and perform limited transformations on them (e.g. housing.ihs_index and sale.mydec). I propose we pull out the easiest of these, try them out with Python models, and use that as a spike to test the stability of the approach. In my estimation, that would be these scripts:

On the transformation front, I think we should continue to focus on refactoring transformations to SQL where possible (#99), which is tractable and involves a well-supported dbt approach (SQL models). I expect that in the process of doing so, we'll also end up identifying transformations that would be good candidates for Python (i.e. transformations that don't work well in SQL but are simple enough to work in Python without needing external libraries), and our work on the ingest front will help us determine how viable those transformations would be as Python models.
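
For a sense of what the ingest spike might look like, here is a minimal sketch of a dbt Python model that reads a CSV from a public URL and applies a limited transformation. It is not taken from any of our existing scripts: `SOURCE_URL`, the `value` column, and the `parse_rows` helper are hypothetical, and it assumes the Athena Spark session can reach the public internet and has pandas available.

```python
import csv
import io
import urllib.request

# Hypothetical placeholder URL, not a real data source from this project.
SOURCE_URL = "https://example.com/index.csv"


def parse_rows(raw_csv: str) -> list[dict]:
    """Normalize column names and drop rows with an empty `value` column."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    rows = []
    for record in reader:
        cleaned = {
            key.strip().lower().replace(" ", "_"): (value or "").strip()
            for key, value in record.items()
            if key is not None  # ignore overflow fields on malformed rows
        }
        if cleaned.get("value"):  # `value` is an assumed column name
            rows.append(cleaned)
    return rows


def model(dbt, session):
    # Entry point that dbt calls; on dbt-athena, `session` is the Spark session.
    dbt.config(materialized="table")

    # Imported lazily so that parse_rows can be tested without pandas installed.
    import pandas as pd

    with urllib.request.urlopen(SOURCE_URL) as response:
        raw_csv = response.read().decode("utf-8")

    return pd.DataFrame(parse_rows(raw_csv))
```

Keeping the transformation in a plain function means it can be unit tested locally without a Spark session, which may help when judging how stable the approach is.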

Edited to add: I think we can also give the refactor of ratio_stats a try, although this is more experimental since it will require publishing an assesspy bundle to S3 so that the model can access it.
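
A rough sketch of how a model might pick up a bundle from S3, assuming the bundle is a zip of pure-Python code whose dependencies already exist in the Athena Spark runtime. The S3 path, the `attach_bundle` helper, and the upstream model name are all hypothetical; the only real call is PySpark's `SparkContext.addPyFile`, and the ratio statistics themselves are left out.

```python
# Hypothetical S3 location for the packaged library, not a real path from this project.
ASSESSPY_BUNDLE_URI = "s3://example-bucket/packages/assesspy.zip"


def attach_bundle(session, bundle_uri: str = ASSESSPY_BUNDLE_URI) -> str:
    """Make a zipped Python package stored on S3 importable in the Spark session."""
    session.sparkContext.addPyFile(bundle_uri)
    return bundle_uri


def model(dbt, session):
    # Entry point that dbt calls; on dbt-athena, `session` is the Spark session.
    dbt.config(materialized="table")

    attach_bundle(session)
    import assesspy  # noqa: F401  (assumed importable only after the bundle is attached)

    # Hypothetical upstream model name; dbt.ref returns a DataFrame.
    sales = dbt.ref("sales_with_estimates")

    # Ratio statistics would be computed here; this sketch returns the input unchanged.
    return sales
```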

I'll create issues for all of these tasks and then close this one out.

@jeancochrane
Contributor

Superseded by #393, #394, and #99.
