Check use cases for `dbt-athena` Python support #388

dfsnow · 2024-04-16T22:55:29Z

We may be able to:

- Move some of the ingest queries / raw processing directly to dbt
- Move the res ratio stats reporting to dbt

jeancochrane · 2024-04-17T17:04:12Z

I'm going to keep some running notes here as I investigate:

- Python models are still quite experimental. The first release to support them was 1.7.2 in February, and there hasn't been a release since. I think this is an important area of growth for the package, but we would be working with very fresh code and would almost certainly encounter bugs and limitations in the process.
- Python models won't work for ingest scripts that access data from the local filesystem or a remote drive/DB, e.g. spatial.school or ccao.pin_condo_char. We might be able to factor out additional extraction scripts that run locally before the model runs, and/or move some static raw data from drives to S3, but either option would increase the complexity of the pipeline.
- The reporting.ratio_stats Glue job mostly seems like it can be refactored into a model pretty easily. The one snag is that it depends on the assesspy library, and Athena PySpark only allows external libraries that are built into the environment or hosted on S3. It wouldn't be too hard to wire up a CI pipeline that publishes an assesspy zipfile to S3, but it would increase the complexity of the infrastructure.
- A related external library issue: Many of the transformation scripts handle geographic data and perform spatial transformations. While Python has a robust ecosystem for this, it doesn't seem like any of the big libraries like geopandas come preinstalled in PySpark for Athena, so we would have to package them ourselves and host them on S3 to use them.
- Based on my investigation (see below), many of our scripts could be refactored into Python models, but it's almost certain that some of them will resist migration because they require access to data that can't be moved to S3 or libraries that can't easily be zipped and stored on S3. We should think seriously about whether it's worth it to migrate some but not all of our scripts to a new system. The main value prop of Python models is that they could simplify the initial ingest/transform process by bringing everything into the DAG, but unfortunately we won't get that benefit unless and until we move everything into the DAG.
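To make the "publish a zipfile to S3" step a bit more concrete, here is a minimal sketch of the packaging half, using only the standard library. The function name `build_package_zip` is hypothetical, and the sketch assumes the library is pure Python; a package with compiled dependencies would need more than this. The upload step and the S3 location are left out because they depend on how the CI pipeline is set up.

```python
import zipfile
from pathlib import Path


def build_package_zip(package_dir, zip_path):
    """Zip the .py files of a package so the archive root contains the package folder.

    Hypothetical helper for illustration; assumes a pure-Python package.
    Returns the list of archive member names that were written.
    """
    package_dir = Path(package_dir)
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package_dir.rglob("*.py")):
            # Store paths relative to the package's parent directory so that
            # `import <package>` resolves once the zip is on the import path.
            arcname = path.relative_to(package_dir.parent).as_posix()
            archive.write(path, arcname)
            names.append(arcname)
    return names
```

A CI job could run this against a checkout of the library and then copy the archive to a bucket. As far as I can tell from the Athena for Spark documentation, a notebook or model would then load it with something like `sc.addPyFile("s3://<bucket>/<key>.zip")`, but that part is untested here.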
Raw ingest scripts
- Scripts that I think can be converted to Python models with minimal changes:
  - housing.ihs_index (reads files from remote URL)
  - sale.mydec (reads files from remote URL)
  - schools.great_schools_rating (reads files from remote URL)
  - spatial.access (reads files from remote URL)
  - spatial.airport_noise_point_source (reads raw data from Athena)
  - spatial.building_footprint (reads OSM data from Overpass and footprints from Cook County open data and Microsoft)
  - spatial.census (loads data from TIGER)
  - spatial.economy (reads files from remote URL)
  - spatial.golf_course (reads OSM data from Overpass)
  - spatial.major_roads (reads OSM data from Overpass)
  - spatial.midway_noise_monitor (reads files from remote URL)
  - spatial.secondary_roads (reads OSM data from Overpass)
  - spatial.environment (reads data from FEMA, TIGER, and open data)
  - spatial.police (reads data from the City open data portal)
  - spatial.transit (reads data from remote URLs)
- Scripts that I think can't be converted with minimal changes:
  - spatial.school (reads files from a Cook County GIS drive)
  - ccao.pin_condo_char (reads files from the O drive)
  - ccao.condominium_parking (reads files from the O drive)
  - sale.foreclosure (reads files from the O drive)
  - spatial.ccao (reads files from a Cook County drive)
  - spatial.other (reads files from a Cook County drive)
  - spatial.parcel (reads files from a Cook County drive)
  - spatial.political (reads files from a Cook County drive)
  - spatial.school (reads data from the City data portal, but also from a Cook County drive)
  - spatial.tax (reads data from a Cook County drive)
- Scripts that I think might be convertible with minimal changes:
  - rpie.data (connects to RPIE SQL Server)
  - rpie.pin_codes (connects to RPIE SQL Server)
  - spatial.corner_lot (reads OSM data from Overpass, but takes ~24hrs to complete one township, so may be tricky to convert)
  - spatial.ohare_noise (reads files from remote URL, but also from the O drive)
  - spatial.kriging_surfaces (seems to read raw data from Athena, but also unclear whether it is actually up to date, e.g. unbound variables)
Intermediate processing scripts
- Scripts that I think can be converted to Python models with minimal changes:
  - ccao.pin_condo_char (reads data from S3 and Athena)
  - ccao.pin_nonlivable (reads data from S3 and Athena)
  - ccao.class_dict (reads data from S3)
  - ccao.land_site_rate (reads data from S3)
  - census.acs (reads data from the Census API)
  - census.decennial (reads data from the Census API)
  - census.table_dict (doesn't read any data)
    - This could be turned into a seed with a transformation on top
  - census.variable_dict (reads data from the Census API)
  - environment.airport_noise (reads data from S3)
  - environment.flood_first_street (reads data from S3 and Athena)
- Scripts that I think might be convertible with minimal changes:
  - ccao.commercial_valuation (reads first pass spreadsheets from a network drive, but we could think about moving these to S3)
  - ccao.land_nbhd_rate (reads files from S3, but also depends on the ccao R package that has no Python equivalent)
  - ccao.hie (reads from legacy CCAO SQL Server, but we might be able to refactor it to read from iasWorld data instead)

Note

I stopped investigating individual transformation scripts in detail after gaining confidence that I had considered all of the possible edge cases.
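The triage in the lists above roughly comes down to two questions per script: can Athena PySpark reach every data source, and is every library it needs available there? A rough sketch of that check follows. The `Source` enum, the `REACHABLE` set, and the `blockers` function are all hypothetical names made up for illustration, and the real categorization involved more judgment than this (e.g. runtime, or whether a script is still maintained).

```python
from enum import Enum


class Source(Enum):
    REMOTE_URL = "remote URL or public API"
    S3 = "S3"
    ATHENA = "Athena"
    NETWORK_DRIVE = "local or network drive"
    PRIVATE_DB = "private SQL Server"


# Sources the notes above treat as reachable from Athena PySpark (an assumption).
REACHABLE = {Source.REMOTE_URL, Source.S3, Source.ATHENA}


def blockers(sources, libraries=(), available_libraries=frozenset()):
    """Return the reasons a script can't be converted as-is (empty list = candidate)."""
    found = []
    for source in sources:
        if source not in REACHABLE:
            found.append(f"data source not reachable: {source.value}")
    for lib in libraries:
        if lib not in available_libraries:
            found.append(f"library not available: {lib}")
    return found
```

For example, a script that only reads from a remote URL returns no blockers, while one that also reads from a network drive returns one.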

jeancochrane · 2024-04-17T22:09:30Z

@dfsnow After some investigation, my expectation is that Python models will only have limited utility for our ingests and transformations. Python models via PySpark are still very experimental (there have been no new releases since the release that added preliminary support for them in February), and Athena PySpark has two limitations that are currently dealbreakers for many of our most complex ingest/transformation scripts:

- Limited support for external libraries that aren't built-in (external libraries must be stored on S3 in order to be accessible)
- No ability to access data or connect to databases that are stored locally or on private network drives

Still, I think there are some ingest scripts that could be refactored to use Python models, particularly ones that read data from public URLs and perform limited transformations on them (e.g. housing.ihs_index and sale.mydec). I propose we pull out the easiest of these, try them out with Python models, and use that as a spike to test the stability of the approach. In my estimation, that would be these scripts:

- housing.ihs_index (reads files from remote URL)
- sale.mydec (reads files from remote URL)
- schools.great_schools_rating (reads files from remote URL)
- spatial.access (reads files from remote URL)
- spatial.midway_noise_monitor (reads files from remote URL)
- spatial.police (reads data from the City open data portal)
- spatial.transit (reads data from remote URLs)

On the transformation front, I think we should continue to focus on refactoring transformations to SQL where possible (#99), which is tractable and involves a well-supported dbt approach (SQL models). I expect that in the process of doing so, we'll also end up identifying transformations that would be good candidates for Python (i.e. transformations that don't work well in SQL but are simple enough to work in Python without needing external libraries), and our work on the ingest front will help us determine how viable those transformations would be as Python models.

Edited to add: I think we can also give the refactor of ratio_stats a try, although this is more experimental since it will require publishing an assesspy bundle to S3 so that the model can access it.

I'll create issues for all of these tasks and then close this one out.

jeancochrane · 2024-04-17T22:20:56Z

Superseded by #393, #394, and #99.

dfsnow added the dbt and aws labels Apr 16, 2024

dfsnow assigned wrridgeway and jeancochrane Apr 16, 2024

jeancochrane mentioned this issue Apr 17, 2024

Move reporting.ratio_stats from a Glue job to a dbt Python model #393

Closed

jeancochrane closed this as completed Apr 17, 2024
