The data flow coordinates data across a few resources. These resources are:

- S3-compatible storage (Digital Ocean Spaces)
  - Data are managed mostly by the Data Engineering team, though the Application Engineering team manages a few of its own resources.
- Data flow runner (GitHub Runner or Local Machine)
  - The data flow is coordinated from this service. During development, it is run on a developer's machine. In staging and production, it is a GitHub Action Runner.
- Data flow database (Docker container within the Runner)
  - Both versions of the data flow runner (GitHub Action Runner and Local Machine) contain a database within a Docker container.
  - It is referred to as the "Flow" database.
- API database (Docker container on the Local Machine or Digital Ocean Postgres Cluster)
  - This is the database that actually serves the data to applications. Locally, it exists within a Docker container. In live environments, it exists on a Digital Ocean Postgres cluster. Two steps are run from the "Flow" database against the "API" database. When running locally, these databases communicate over a Docker network. When running against the Digital Ocean cluster, they communicate over the internet.
  - It is referred to as the "API" or "Target" database.
The entire data flow contains 10 steps. They are found in the `scripts` section of the `package.json`. With some exceptions, they follow the format `<tool used>:<targeted resource>:<operation performed>`. In the flow steps diagram, they are numbered 0 through 9. They are described in greater detail in the list below.

Steps 1 through 9 can be run with no configuration. They are part of the `flow` command listed in the README. Step 0 requires some configuration and is not part of the `flow` command; more context is given in the step 0 description. The remaining steps can each be run individually using their listed Command. This is helpful when a step fails: after fixing the failure, the rest of the flow can be run without rerunning the steps that already succeeded. In addition to the individual commands, multiple commands can be run together as a Group.
The available groups are `download`, `configure`, `seed`, and `populate`. `download` will retrieve the source files and convert the shapefiles to csvs. `configure` will install dependencies on the flow database. `seed` will initialize the source tables in the flow database and fill them with source data. `populate` will transform the source data within the flow database and then transfer the transformed data to the target database.
- Pull custom types from target database
  - Command: `drizzle:api:pull`
  - Group: none
  - Tool: Drizzle
  - Run from: Runner, drizzle (api config)
  - Run against: API database
  - Description: The API schemas include several custom enum types. These types need to be defined in the data flow database before it can replicate the API schema. Drizzle introspection is performed against the api database, looking only at the custom types. The resulting introspection is saved in the `drizzle/migrations` folder in the `schema.ts` file. There are a few things to note. First, the introspection also produces meta and sql files. However, the data flow does not use these files and they are ignored by source control. Second, Drizzle does not automatically import the `pgEnum` function in `schema.ts`; any developer who reruns the introspection needs to manually import this method, as in the sketch below. Third, this step should only need to be run when there are changes to the custom types. Consequently, it is excluded from the `flow` command. Finally, changes to the custom types should happen rarely. When they do happen, they will require updates to the `drizzle/migration/schema.ts` file, which will need to be committed and integrated into the `main` branch.
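
  A minimal sketch of that manual import (the enum name and values are hypothetical, not the project's actual custom types):

  ```ts
  // drizzle/migrations/schema.ts (excerpt)
  // this import is not generated by the introspection and must be added by hand
  import { pgEnum } from "drizzle-orm/pg-core";

  // hypothetical custom type, for illustration only
  export const boroughCode = pgEnum("borough_code", ["1", "2", "3", "4", "5"]);
  ```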
- Download source files from Digital Ocean Spaces
  - Command: `minio:download`
  - Group: `download`
  - Tool: minio.js
  - Run from: Data flow runner, minio/download.ts
  - Run against: S3 storage
  - Description: Use the minio node module to download source files from Digital Ocean. Files are saved on the Data flow runner in the data/download folder. A sketch of this step is shown below.
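
  A minimal sketch of the download idea with the minio client (the endpoint, credentials, bucket, and object names are assumptions, not the project's actual values):

  ```ts
  import { Client } from "minio";

  const spaces = new Client({
    endPoint: "nyc3.digitaloceanspaces.com", // assumed Spaces endpoint
    useSSL: true,
    accessKey: process.env.SPACES_ACCESS_KEY ?? "",
    secretKey: process.env.SPACES_SECRET_KEY ?? "",
  });

  async function download() {
    // fGetObject streams the object directly to a local file path
    await spaces.fGetObject("source-data", "boroughs.shp", "data/download/boroughs.shp");
  }

  download();
  ```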
- Convert shapefiles into csv files
  - Command: `shp:convert`
  - Group: `download`
  - Tool: shapefile.js
  - Run from: Data flow runner, shp/convert.ts
  - Run against: Data flow runner
  - Description: Several source files are stored as shapefiles. However, the database copy functions better handle csv files. In this step, we use the shapefile node module to convert the shapefiles to csvs. The resulting files are stored in data/convert. A sketch of the conversion is shown below.
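
  A simplified sketch of the conversion using the shapefile module (file names are placeholders, and a real implementation would handle csv escaping and typing more carefully):

  ```ts
  import { writeFileSync } from "node:fs";
  import { open } from "shapefile";

  async function convert(shpPath: string, csvPath: string) {
    const source = await open(shpPath);
    const lines: string[] = [];
    // read one GeoJSON feature at a time until the source is exhausted
    for (let result = await source.read(); !result.done; result = await source.read()) {
      const properties = result.value.properties ?? {};
      if (lines.length === 0) {
        lines.push([...Object.keys(properties), "geometry"].join(","));
      }
      const fields = [...Object.values(properties), JSON.stringify(result.value.geometry)];
      lines.push(fields.map((field) => `"${String(field).replace(/"/g, '""')}"`).join(","));
    }
    writeFileSync(csvPath, lines.join("\n"));
  }

  convert("data/download/boroughs.shp", "data/convert/boroughs.csv");
  ```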
- Activate PostGIS and other necessary extensions
  - Command: `pg:flow:configure`
  - Group: `configure`
  - Tool: pg.js
  - Run from: Data flow runner, pg/configure
  - Run against: Data flow database
  - Description: Run a sql command against the data flow database to activate PostGIS. A sketch of this step is shown below.
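
  A minimal sketch of this step (the connection string variable is an assumption):

  ```ts
  import { Client } from "pg";

  async function configure() {
    const client = new Client({ connectionString: process.env.FLOW_DATABASE_URL }); // assumed env var
    await client.connect();
    // activate PostGIS; any other required extensions would be created the same way
    await client.query("CREATE EXTENSION IF NOT EXISTS postgis;");
    await client.end();
  }

  configure();
  ```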
- Push custom types and enums to the flow database
  - Command: `drizzle:flow:push`
  - Group: `configure`
  - Tool: drizzle
  - Run from: Data flow runner, drizzle (flow config)
  - Run against: Flow database
  - Description: Push the enums stored in `drizzle/migration/schema.ts`. A sketch of a possible flow config is shown below.
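
  A hedged sketch of what a drizzle-kit config for the flow database might look like (the file name, environment variable, and exact options are assumptions, not the project's actual config):

  ```ts
  // drizzle.flow.config.ts (hypothetical file name)
  import { defineConfig } from "drizzle-kit";

  export default defineConfig({
    dialect: "postgresql",
    // the schema file produced by step 0, containing the custom enum types
    schema: "./drizzle/migration/schema.ts",
    out: "./drizzle/migration",
    dbCredentials: {
      url: process.env.FLOW_DATABASE_URL ?? "", // assumed env var
    },
  });
  ```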
- Create tables in flow database to hold source data
  - Command: `pg:source:create`
  - Group: `seed`
  - Tool: pg.js, pg/source-create
  - Run from: Data flow runner
  - Run against: Flow database
  - Description: Run sql commands to create tables that hold data as they are stored in their source files. The source tables also create constraints that will check data validity as it is copied into the source tables. If any source tables already exist, drop them before adding them again.
- Load source tables with source data
  - Command: `pg:source:load`
  - Group: `seed`
  - Tool: pg.js, pg/source-load
  - Run from: Data flow runner
  - Run against: Flow database
  - Description: Copy the source data from the `data` folder to the source tables within the flow database. One possible implementation is sketched below.
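
  One possible way to stream a csv into a source table, using the pg-copy-streams helper (the helper, table, file, and environment variable names are assumptions about the implementation, not the project's actual code):

  ```ts
  import { createReadStream } from "node:fs";
  import { pipeline } from "node:stream/promises";
  import { Client } from "pg";
  import { from as copyFrom } from "pg-copy-streams";

  async function loadBoroughs() {
    const client = new Client({ connectionString: process.env.FLOW_DATABASE_URL }); // assumed env var
    await client.connect();
    // COPY ... FROM STDIN accepts the csv streamed from the data folder
    const copyStream = client.query(
      copyFrom("COPY borough_source FROM STDIN WITH (FORMAT csv, HEADER true)")
    );
    await pipeline(createReadStream("data/download/boroughs.csv"), copyStream);
    await client.end();
  }

  loadBoroughs();
  ```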
- Create tables in the flow database that model the api database tables
  - Command: `db:pg:model:create`
  - Group: `seed`
  - Tool: pg_dump and psql, db/pg/model-create
  - Run from: Flow database
  - Run against: Flow database
  - Description: Run `pg_dump` and `psql` from the flow database docker container. Use `pg_dump` to extract the API table schemas into a `sql` file stored in the flow database docker container. Use `psql` to read the `sql` file of the dump into the flow database. If any model tables already exist in the flow database, drop them before adding them again. The dump-and-restore idea is sketched below.
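
  A rough sketch of the dump-and-restore idea, expressed here as a node script even though the real step runs `pg_dump` and `psql` from the flow database container (table names, file path, and connection variables are placeholders):

  ```ts
  import { execSync } from "node:child_process";

  // hypothetical list of api tables to model in the flow database
  const tableFlags = ["borough", "tax_lot"].map((table) => `--table=${table}`).join(" ");

  // dump only the schema of the target tables from the api database ...
  execSync(`pg_dump --schema-only ${tableFlags} "$API_DATABASE_URL" > /tmp/api-schema.sql`);

  // ... then replay that schema into the flow database
  execSync(`psql "$FLOW_DATABASE_URL" --file=/tmp/api-schema.sql`);
  ```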
- Transform the source data and insert it into the model tables
  - Command: `pg:model:transform`
  - Group: `populate`
  - Tool: pg.js, pg/model-transform
  - Run from: Data flow runner
  - Run against: Flow database
  - Description: Use pg node to run the `sql` files that transform the `source` columns into their respective `model` columns. Export the populated `model` tables to `.csv` files within the flow database docker container. Truncate any data that may already exist in the `model` tables before inserting the data again. A sketch of this step is shown below.
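
  A minimal sketch of running one transform file and exporting its model table (file, table, and path names are placeholders; the server-side COPY writes the csv inside the flow database container):

  ```ts
  import { readFileSync } from "node:fs";
  import { Client } from "pg";

  async function transformBoroughs() {
    const client = new Client({ connectionString: process.env.FLOW_DATABASE_URL }); // assumed env var
    await client.connect();

    // run the transform sql, which truncates and repopulates the model table
    await client.query(readFileSync("pg/model-transform/borough.sql", "utf8"));

    // export the populated model table to a csv inside the flow database container
    await client.query(
      "COPY borough_model TO '/var/lib/postgresql/export/borough.csv' WITH (FORMAT csv, HEADER true)"
    );

    await client.end();
  }

  transformBoroughs();
  ```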
- Move the data from the model tables in the flow database to their corresponding target tables in the api database
  - Command: `db:pg:target:populate`
  - Group: `populate`
  - Tool: psql, db/pg/target-populate
  - Run from: Flow database
  - Run against: API database
  - Description: Use `psql` to truncate the target tables in the api database and then copy data from the `csv` files into the target tables. Perform the truncation and copy as a single transaction, allowing it to ROLLBACK under failures. The single-transaction idea is sketched below.
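
  A rough sketch of the single-transaction idea (the sql file path and connection variable are placeholders; the real step runs from the flow database container):

  ```ts
  import { execSync } from "node:child_process";

  // --single-transaction wraps the truncate and copy statements in one transaction,
  // so a failure in either rolls the whole run back
  execSync(`psql "$TARGET_DATABASE_URL" --single-transaction --file=db/pg/target-populate/populate.sql`);
  ```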
The data flow can be used to either initialize a full data suite or update a portion of the full suite. These portions are grouped into "Domains". The data flow is divided into four Domains: "all", "admin", "capital-planning", and "pluto". They are visualized in the domains diagram.

The "all" domain contains every other domain plus "Boroughs". The "admin" domain contains administrative boundaries, other than Boroughs. Boroughs are excluded from the "admin" domain because tax lots depend on the Borough Ids existing; rebuilding Boroughs would also require rebuilding tax lots.
The "capital-planning" domain applies to tables derived from the capital planning database. The "pluto" domain applies to tables derived from the "pluto" dataset.
A domain must be specified with a `BUILD` environment variable when running the data flow.
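
A sketch of the idea behind the `BUILD` variable (the domain names come from the list above; the helper itself is hypothetical, not the project's actual code):

```ts
// hypothetical helper that validates the BUILD environment variable
const DOMAINS = ["all", "admin", "capital-planning", "pluto"] as const;
type Domain = (typeof DOMAINS)[number];

export function getBuildDomain(): Domain {
  const build = process.env.BUILD as Domain | undefined;
  if (!build || !DOMAINS.includes(build)) {
    throw new Error(`BUILD must be one of: ${DOMAINS.join(", ")}`);
  }
  return build;
}
```

For example, a run scoped to the pluto-derived tables might be invoked as `BUILD=pluto npm run flow` (an assumed invocation).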