Product: ZAP
Welcome to the db-zap-opendata wiki! This pipeline downloads data from the backend of the NYC Zoning Application Portal (ZAP), processes it, and performs two exports: the first sends all records to Data Library and BigQuery; the second sends only publicly available records to Open Data.
There are 7 datasets pulled from the CRM: `dcp_projectactions`, `dcp_projectmilestones`, `dcp_projectactionbbls`, `dcp_communityboarddispositions`, `dcp_dcprojectteams`, `dcp_projects`, and `dcp_projectbbls`. All publicly available records come from a subset of `dcp_projects` and `dcp_projectbbls`.
Labs creates geometries for active, publicly visible projects that have reached the "application filed" milestone and don't already have geometries stored in the `project_geoms` table. Code that filters for this subset of records is here. After creation, a geometry is not updated unless a planner manually triggers an update (e.g. to fix a data error).
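As a rough illustration, that filter could be expressed in pandas as below. All column and table names here (`is_public`, `reached_application_filed`, `existing_geoms`) are illustrative stand-ins, not the repo's actual identifiers, which live in the linked SQL.

```python
import pandas as pd

# Hypothetical project records; the real data comes from the CRM tables.
projects = pd.DataFrame({
    "project_id": ["P1", "P2", "P3", "P4"],
    "is_public": [True, True, False, True],
    "reached_application_filed": [True, False, True, True],
})
# Stand-in for the project_geoms table of already-created geometries.
existing_geoms = pd.DataFrame({"project_id": ["P1"]})

# Public + "application filed" + no geometry yet => needs a geometry.
needs_geometry = projects[
    projects["is_public"]
    & projects["reached_application_filed"]
    & ~projects["project_id"].isin(existing_geoms["project_id"])
]
print(needs_geometry["project_id"].tolist())  # ['P4']
```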
The geometry creation broke when Labs switched the ETL to use the CRM API. The process of creating geometry data could be improved; Labs' ideal solution is a CRM enhancement that adds a field to the BBL record where they can write a JSON blob.
Additional processing is done to the two open datasets in `visible_projects.py` and `recode_id.py`. This processing is done in SQL queries executed by a SQLAlchemy engine.
The CRM saves string values in some columns as integers. To make the data usable for public consumption, these integer values must be recoded back to human-readable strings. There are two kinds of columns with integer representations: categorical and identifier.
The categorical columns are easier to recode because there is a consistent set of values these columns can take on. For example, project status can only ever be "In Public Review", "Completed", etc. There are two crosswalks to recode these columns, each of which can be accessed from a stable URL. These crosswalks are downloaded in the `get_metadata` method in `visible_projects.py`.
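A categorical recode amounts to a dictionary lookup against the downloaded crosswalk. The sketch below uses a made-up status crosswalk (the integer codes and column name are illustrative); the real crosswalks are the ones fetched in `get_metadata`.

```python
import pandas as pd

# Hypothetical crosswalk from CRM integer codes to human-readable labels.
status_crosswalk = {
    717170000: "In Public Review",
    717170001: "Completed",
}

projects = pd.DataFrame({"project_status": [717170000, 717170001, 717170000]})
# Replace each integer code with its label via a simple map.
projects["project_status"] = projects["project_status"].map(status_crosswalk)
print(projects["project_status"].tolist())
# ['In Public Review', 'Completed', 'In Public Review']
```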
The crosswalks for "id"-based integer representations are more complex, as new values can be added. For example, the field `primary_applicant` will see a new value after an applicant's first project is added to the portal. The CRM doesn't make the crosswalks for these fields available from a single link. Instead, you look up the human-readable string values for a single project at a time. Our process iterates through all public records, pulling the relevant info from the crosswalk specific to each record.
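That per-record iteration can be sketched as follows; `fetch_record` is a hypothetical stand-in for the CRM API call that returns a single project's expanded, human-readable values.

```python
def fetch_record(project_id):
    # Stand-in for a per-project CRM API call; the real pipeline fetches
    # the record's crosswalk values from the CRM one project at a time.
    fake_crm = {"2021K0001": {"primary_applicant": "Example LLC"}}
    return fake_crm[project_id]

def recode_ids(project_ids):
    # Build a crosswalk by visiting every public record individually,
    # since no single link provides the full id-to-label mapping.
    crosswalk = {}
    for pid in project_ids:
        crosswalk[pid] = fetch_record(pid)["primary_applicant"]
    return crosswalk

print(recode_ids(["2021K0001"]))  # {'2021K0001': 'Example LLC'}
```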
The downloading and processing of the data is done in `src/runner.py`. The `Runner` class is instantiated once for each dataset.
All data from the CRM is pulled on each run. The data is first written as a set of JSON files in the `download` method. These JSON files are loaded into pandas dataframes and then written to a Postgres table in the `combine` method. Finally, the data is cached as a .csv in the `export` method.
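A minimal sketch of that three-step flow, with illustrative method bodies (the real `Runner` in `src/runner.py` pulls pages from the CRM API and also writes the combined dataframe to Postgres in `combine`):

```python
import json
import pathlib

import pandas as pd

class Runner:
    """Simplified stand-in for the Runner class, one instance per dataset."""

    def __init__(self, dataset, workdir="."):
        self.dataset = dataset
        self.workdir = pathlib.Path(workdir)

    def download(self, pages):
        # In the real pipeline, each page of records comes from the CRM API.
        for i, page in enumerate(pages):
            path = self.workdir / f"{self.dataset}_{i}.json"
            path.write_text(json.dumps(page))

    def combine(self):
        # Load every downloaded JSON file into a dataframe and concatenate.
        frames = [
            pd.read_json(p)
            for p in sorted(self.workdir.glob(f"{self.dataset}_*.json"))
        ]
        self.df = pd.concat(frames, ignore_index=True)
        # The real combine also writes self.df to a Postgres table here.
        return self.df

    def export(self):
        # Cache the combined data as a .csv.
        out = self.workdir / f"{self.dataset}.csv"
        self.df.to_csv(out, index=False)
        return out
```

Used in sequence (`download` → `combine` → `export`), this mirrors the pipeline's per-dataset run.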
This repo is used to build MapZAP, a dataset of ZAP project records with spatial data.
A ZAP project has a geometry assigned to it from either of two sources:
- Based on the ULURP numbers in a ZAP Project, a geometry from the internal version of the Zoning Map Amendment represents the project
- Based on the BBLs in a ZAP project when it was referred for review (Project certified referred year), a version of PLUTO from that year is chosen and used to find and aggregate BBL geometries to represent the project.
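The PLUTO-version matching in the second source above could be sketched as below; the available year range and the clamping rule for projects referred outside it are assumptions for illustration, not taken from the repo.

```python
# MapPLUTO versions available to match against (per the source list: 2002-2022).
PLUTO_YEARS = list(range(2002, 2023))

def pluto_version_for(referred_year):
    # Clamp the project's certified-referred year to the available range,
    # so e.g. a 1998 referral falls back to the earliest MapPLUTO (2002).
    return max(min(referred_year, PLUTO_YEARS[-1]), PLUTO_YEARS[0])

print(pluto_version_for(1998), pluto_version_for(2015))
# 2002 2015
```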
The source datasets are:
- ZAP Projects
- ZAP Project BBLs
- MapPLUTO (versions from 2002 - 2022)

Note: All source data is in BigQuery as a result of the ZAP Open Data export workflows in this repo.
```mermaid
%%{ init: { 'flowchart': { 'curve': 'curveMonotoneY' } } }%%
flowchart TB
    %% source data
    src_projects[("ZAP Projects\n(1975 - 2022)")]
    src_project_bbls[("ZAP Project BBLs\n(1975 - 2022)")]
    src_pluto_years(PLUTO versions\nby year)
    src_MapPLUTO[("MapPLUTO\n(2002 - 2022)")]
    %% intermediate data
    projects("ZAP Projects")
    projects_details("ZAP Project\nDetails")
    project_bbls("ZAP Project BBLs")
    project_info_bbls("ZAP Project BBLs")
    pluto_geo("BBL geographies")
    project_bbl_geo("Project BBLs\ngeography")
    %% final data
    project_geo("MapZAP")
    %%
    src_pluto_years --> projects
    src_projects ---> projects
    src_project_bbls ---> project_bbls
    src_MapPLUTO ---> pluto_geo
    %%
    projects --> project_info_bbls
    projects --> projects_details
    project_bbls --> project_info_bbls
    %%
    project_info_bbls --> project_bbl_geo
    pluto_geo ---> project_bbl_geo
    %%
    projects_details --> project_geo
    project_bbl_geo --> project_geo
```
- Clone the repo and create `.devcontainer/.env`
- Open the repo in the defined dev container in VS Code
- Run the following dbt commands to build a BigQuery dataset named `dbt_mapzap_dev_YOUR_DBT_USER`:

  ```shell
  dbt debug
  dbt deps
  dbt seed --full-refresh
  dbt run
  dbt test
  ```

  Note: Use of dbt requires a personal keyfile at `.dbt/bigquery-dbt-dev-keyfile.json`. This can be generated and shared by a Google Cloud Platform admin.
- Review outputs in BigQuery and export as CSV to Google Cloud Storage