
Product: ZAP

Damon McCullough edited this page Aug 25, 2023 · 1 revision

Home

Welcome to the db-zap-opendata wiki! This pipeline downloads data from the backend of the NYC Zoning Application Portal (ZAP), processes it, and performs two exports. The first export sends all records to Data Library and BigQuery. The second export sends only publicly available records to Open Data.

There are 7 datasets pulled from the CRM: dcp_projectactions, dcp_projectmilestones, dcp_projectactionbbls, dcp_communityboarddispositions, dcp_dcprojectteams, dcp_projects, and dcp_projectbbls. All publicly available records come from a subset of dcp_projects and dcp_projectbbls.


Info on how geometries are created for ZAP projects

Labs creates a geometry for records that are active, publicly visible projects that have reached the "application filed" milestone and don't already have geometries stored in the project_geoms table. Code that filters for this subset of records is here. After creation, a geometry is not updated unless a planner manually triggers an update (e.g. to fix a data error).
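The eligibility rule above can be sketched as a simple predicate. This is a hypothetical illustration, not the actual filter code — field names like `is_active`, `is_public`, and `milestones` are made up and do not match the real CRM columns or the project_geoms table.

```python
# Hypothetical sketch of the geometry-eligibility filter; field names
# (is_active, is_public, milestones) are illustrative, not the actual
# CRM column names.

def needs_geometry(project: dict, existing_geom_ids: set) -> bool:
    """Return True if a geometry should be created for this project."""
    return (
        project["is_active"]                               # active project
        and project["is_public"]                           # publicly visible
        and "application filed" in project["milestones"]   # reached the milestone
        and project["id"] not in existing_geom_ids         # no geometry stored yet
    )

projects = [
    {"id": "P1", "is_active": True, "is_public": True, "milestones": ["application filed"]},
    {"id": "P2", "is_active": True, "is_public": False, "milestones": ["application filed"]},
    {"id": "P3", "is_active": True, "is_public": True, "milestones": []},
]
eligible = [p["id"] for p in projects if needs_geometry(p, existing_geom_ids={"P0"})]
print(eligible)  # -> ['P1']
```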

Geometry creation broke when Labs switched the ETL to use the CRM API. The process of creating geometry data could be improved. Labs' ideal solution is a CRM enhancement that adds a field to the BBL record where they can write a JSON blob.


Open data processing

Additional processing is done on the two open datasets, in visible_projects.py and recode_id.py.

Rename fields

This is done in SQL queries executed by a sqlalchemy engine.
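A minimal sketch of that rename step is below. The pipeline runs similar SQL through a sqlalchemy engine against Postgres; sqlite3 is used here only so the sketch is self-contained, and the column names are hypothetical, not the real CRM field names.

```python
# Illustrative only: the pipeline executes similar SQL via a sqlalchemy
# engine against Postgres. sqlite3 keeps this sketch runnable, and the
# column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dcp_projects (dcp_projectname TEXT)")
conn.execute("INSERT INTO dcp_projects VALUES ('Example Rezoning')")

# Rename a CRM-style field to its open data name.
conn.execute("ALTER TABLE dcp_projects RENAME COLUMN dcp_projectname TO project_name")

cols = [row[1] for row in conn.execute("PRAGMA table_info(dcp_projects)")]
print(cols)  # -> ['project_name']
```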

Recode fields with integer representation

The CRM saves string values in some columns as integers. To make the data usable for public consumption, these integer values must be recoded back to human-readable strings. There are two kinds of columns with integer representations: categorical and identifier.

The categorical columns are easier to recode because there is a consistent set of values these columns can take on. For example, project status can only ever be "In Public Review", "Completed", etc. There are two crosswalks used to recode these columns, each of which can be accessed from a stable URL. These crosswalks are downloaded in the get_metadata method in visible_projects.py.
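Recoding a categorical column then reduces to a dictionary lookup, as in this sketch. The integer codes and labels here are made up for illustration; the real mappings come from the downloaded crosswalks.

```python
# Minimal sketch of recoding a categorical column with a crosswalk.
# The integer codes and labels are made up for illustration; the real
# mappings are downloaded in get_metadata.
STATUS_CROSSWALK = {
    717170000: "In Public Review",
    717170001: "Completed",
    717170002: "On Hold",
}

records = [{"project_status": 717170000}, {"project_status": 717170001}]
for rec in records:
    rec["project_status"] = STATUS_CROSSWALK[rec["project_status"]]

print([r["project_status"] for r in records])  # -> ['In Public Review', 'Completed']
```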

The crosswalks for identifier-based integer representations are more complex because new values can be added; for example, the field primary_applicant gains a new value after an applicant's first project is added to the portal. The CRM doesn't make the crosswalks for these fields available from a single link. Instead, you look up the human-readable string values one project at a time. Our process iterates through all public records and pulls the relevant info from the crosswalk specific to each record.
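That per-record lookup can be sketched as building the crosswalk incrementally while iterating. Everything here is hypothetical: `fetch_record` stands in for the real CRM API call, and the field names are illustrative, not the actual response shape.

```python
# Sketch of the per-record lookup for identifier-based fields. The CRM
# has no single crosswalk URL for these, so each record's own payload
# maps the id to its display name. fetch_record and its field names are
# hypothetical stand-ins for the real CRM API.
def fetch_record(project_id: str) -> dict:
    # Stand-in for a CRM API response containing both the raw id and
    # the formatted (human-readable) value for the lookup field.
    fake_api = {
        "2021K0001": {"_primary_applicant_value": "a1", "primary_applicant": "Acme Dev LLC"},
        "2021K0002": {"_primary_applicant_value": "a2", "primary_applicant": "Borough Homes"},
    }
    return fake_api[project_id]

crosswalk: dict = {}
for pid in ["2021K0001", "2021K0002"]:
    rec = fetch_record(pid)
    crosswalk[rec["_primary_applicant_value"]] = rec["primary_applicant"]

print(crosswalk["a1"])  # -> Acme Dev LLC
```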


Python Runner

The downloading and processing of the data is done in src/runner.py. The Runner class is instantiated once for each dataset.

All data from the CRM is pulled on each run. The data is first written as a set of JSON files in the download method. These JSON files are read into pandas DataFrames and then written to a Postgres table in the combine method. Finally, the data is cached as a .csv in the export method.
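The download/combine/export flow can be condensed into a sketch like the one below. This is not the real class in src/runner.py: the Postgres write in combine is elided, and the file names and fields are illustrative.

```python
# Condensed sketch of the Runner flow (download -> combine -> export).
# The real class lives in src/runner.py and loads DataFrames into
# Postgres in combine, which is elided here. Names are illustrative.
import csv
import json
import tempfile
from pathlib import Path

class Runner:
    def __init__(self, dataset: str, workdir: Path):
        self.dataset = dataset
        self.workdir = workdir

    def download(self, pages: list) -> list:
        # Each CRM response page is written out as its own JSON file.
        paths = []
        for i, page in enumerate(pages):
            path = self.workdir / f"{self.dataset}_{i}.json"
            path.write_text(json.dumps(page))
            paths.append(path)
        return paths

    def combine(self, paths: list) -> list:
        # Concatenate the JSON files; the real method reads them into
        # pandas DataFrames and writes a Postgres table.
        rows = []
        for path in paths:
            rows.extend(json.loads(path.read_text()))
        return rows

    def export(self, rows: list) -> Path:
        # Cache the combined data as a .csv.
        out = self.workdir / f"{self.dataset}.csv"
        with out.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
        return out

workdir = Path(tempfile.mkdtemp())
runner = Runner("dcp_projects", workdir)
paths = runner.download([[{"id": 1}], [{"id": 2}]])
rows = runner.combine(paths)
out = runner.export(rows)
print(out.read_text().splitlines()[0])  # -> id
```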


MapZAP

Overview

This repo is used to build MapZAP, a dataset of ZAP project records with spatial data.

A ZAP project has a geometry assigned to it from either of two sources:

  • Based on the ULURP numbers in a ZAP project, a geometry from the internal version of the Zoning Map Amendment represents the project.
  • Based on the BBLs in a ZAP project when it was referred for review (project certified referred year), a version of PLUTO from that year is chosen and used to find and aggregate BBL geometries to represent the project.
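The choice between the two sources can be sketched as a small decision function. This is a hypothetical illustration of the logic, not the dbt models that actually build MapZAP; all names and values are made up.

```python
# Hypothetical sketch of choosing a geometry source for a project:
# prefer a Zoning Map Amendment geometry matched by ULURP number,
# otherwise aggregate BBL geometries from the PLUTO version for the
# project's certified-referred year. All names/values are illustrative.
def pick_geometry(project: dict, zma_geoms: dict, pluto_by_year: dict):
    for ulurp in project.get("ulurp_numbers", []):
        if ulurp in zma_geoms:
            return ("zoning_map_amendment", zma_geoms[ulurp])
    year = project["certified_referred_year"]
    pluto = pluto_by_year[year]
    bbl_geoms = [pluto[bbl] for bbl in project["bbls"] if bbl in pluto]
    return ("pluto_bbls", bbl_geoms)

project = {"ulurp_numbers": ["C210123ZMK"], "certified_referred_year": 2021, "bbls": []}
source, _ = pick_geometry(project, {"C210123ZMK": "geom"}, {2021: {}})
print(source)  # -> zoning_map_amendment
```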

Data sources

  • ZAP Projects
  • ZAP Project BBLs
  • MapPLUTO (versions from 2002 - 2022)

Note: All source data is in BigQuery as a result of the ZAP Open Data export workflows in this repo.

Data flow diagram

```mermaid
%%{ init: { 'flowchart': { 'curve': 'curveMonotoneY' } } }%%
flowchart TB
%% source data
src_projects[("ZAP Projects\n(1975 - 2022)")]
src_project_bbls[("ZAP Project BBLs\n(1975 - 2022)")]
src_pluto_years(PLUTO versions\nby year)
src_MapPLUTO[("MapPLUTO\n(2002 - 2022)")]

%% intermediate data
projects("ZAP Projects")
projects_details("ZAP Project\nDetails")
project_bbls("ZAP Project BBLs")
project_info_bbls("ZAP Project BBLs")
pluto_geo("BBL geographies")
project_bbl_geo("Project BBLs\ngeography")

%% final data
project_geo("MapZAP")

%% 
src_pluto_years --> projects
src_projects ---> projects

src_project_bbls ---> project_bbls

src_MapPLUTO ---> pluto_geo

%% 
projects --> project_info_bbls
projects --> projects_details
project_bbls --> project_info_bbls

%% 
project_info_bbls --> project_bbl_geo
pluto_geo ---> project_bbl_geo

%% 
projects_details --> project_geo
project_bbl_geo --> project_geo
```

Build process (local only)

  1. Clone the repo and create .devcontainer/.env

  2. Open the repo in the defined dev container in VS Code

  3. Run the following dbt commands to build a BigQuery dataset named dbt_mapzap_dev_YOUR_DBT_USER

    dbt debug
    dbt deps
    dbt seed --full-refresh
    dbt run
    dbt test

    Note: Use of dbt requires a personal keyfile at .dbt/bigquery-dbt-dev-keyfile.json. This can be generated and shared by a Google Cloud Platform admin.

  4. Review outputs in BigQuery and export as CSV to Google Cloud Storage