# Documentation

## Components

The data flow coordinates data across a few resources. These resources are:

  • S3-compatible storage (Digital Ocean Spaces)
    • Data are managed mostly by the Data Engineering team, though the Application Engineering team manages a few of its own resources.
  • Data flow runner (GitHub Runner or Local Machine)
    • The data flow is coordinated from this service. During development, it is run on a developer's machine. In staging and production, it is a GitHub Action Runner.
  • Data flow database (Docker Container within the Runner)
    • Both versions of the data flow runner (GitHub Action and local machine) contain a database within a Docker container.
    • It is referred to as the "Flow" database
  • API database (Docker container on the Local Machine or Digital Ocean Postgres Cluster)
    • This is the database that actually serves the data to applications. Locally, it exists within a Docker container. In live environments, it exists on a Digital Ocean Postgres cluster. Two steps are run from the "Flow" database against the "API" database. When running locally, these databases communicate via a Docker network. When running against the Digital Ocean cluster, they communicate over the internet. (A connection sketch follows this list.)
    • It is referred to as the "API" or "Target" database
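
To make the relationships concrete, here is a minimal sketch of how the two database connections might be configured from the runner. The environment variable names, fallback connection string, and SSL handling are assumptions for illustration, not the project's actual configuration.

```typescript
// Sketch of the two database connections used by the data flow runner.
// The environment variable names and fallback URL are illustrative assumptions.
import { Pool } from "pg";

// Flow database: the Postgres container that runs alongside the runner.
const flowDb = new Pool({
  connectionString:
    process.env.FLOW_DATABASE_URL ?? "postgresql://postgres:postgres@localhost:5433/flow",
});

// API ("target") database: a local Docker container during development,
// or a Digital Ocean Postgres cluster (reached over the internet) in live environments.
const targetDb = new Pool({
  connectionString: process.env.TARGET_DATABASE_URL,
  ssl: process.env.NODE_ENV === "production" ? { rejectUnauthorized: false } : undefined,
});

export { flowDb, targetDb };
```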

## Steps

The entire data flow contains 10 steps. They are found in the scripts section of the package.json. With some exceptions, they follow the format <tool used>:<targeted resource>:<operation performed>. In the flow steps diagram, they are numbered 0 through 9. They are described in greater detail in the list below.

Steps 1 through 9 can be run with no configuration. They are part of the flow command listed in the README. Step 0 requires some configuration and is not part of the flow command; more context is given in the step 0 description. The remaining steps can be run individually using their listed Command. This is helpful when a step fails: after fixing the failure, the rest of the flow can be run without rerunning the steps that already succeeded. In addition to the individual commands, multiple commands can be run together as a Group.

The available groups are download, configure, seed, and populate. download will retrieve the source files and convert the shapefiles to csvs. configure will install dependencies on the flow database. seed will initialize the source tables in the flow database and fill them with source data. populate will transform the source data within the flow database and then transfer the transformed data to the target database.

  0. Pull custom types from the target database

    • Command: drizzle:api:pull
    • Group: none
    • Tool: Drizzle
    • Run from: Runner, drizzle (api config)
    • Run against: API database
    • Description: The API schemas include several custom enum types. These types need to be defined in the data flow database before it can replicate the API schema. Drizzle introspection is performed against the API database, looking only at the custom types. The resulting introspection is saved in the schema.ts file in the drizzle/migrations folder. There are a few things to note. First, the introspection also produces meta and sql files; the data flow does not use these files and they are ignored by source control. Second, Drizzle does not automatically import the pgEnum function in schema.ts; any developer who reruns the introspection needs to add this import manually. Third, this step should only need to be run when there are changes to the custom types. Consequently, it is excluded from the flow command. Finally, changes to the custom types should happen rarely. When they do happen, they will require updates to the drizzle/migrations/schema.ts file, which will need to be committed and integrated into the main branch.
  1. Download source files from Digital Ocean Spaces

    • Command: minio:download
    • Group: download
    • Tool: minio.js
    • Run from: Data flow runner, minio/download.ts
    • Run against: S3 storage
    • Description: Use the minio node module to download source files from Digital Ocean. Files are saved on the data flow runner in the data/download folder. (A sketch follows this list.)
  2. Convert shapefiles into CSV files

    • Command: shp:convert
    • Group: download
    • Tool: shapefile.js
    • Run from: Data flow runner, shp/convert.ts
    • Run against: Data flow runner
    • Description: Several source files are stored as shapefiles, but the database copy functions handle CSV files better. In this step, we use the shapefile node module to convert the shapefiles to CSVs. The resulting files are stored in data/convert. (A sketch follows this list.)
  3. Activate PostGIS and other necessary extensions

    • Command: pg:flow:configure
    • Group: configure
    • Tool: pg.js
    • Run from: Data flow runner, pg/configure
    • Run against: Data flow database
    • Description: Run a SQL command against the data flow database to activate PostGIS. (A sketch follows this list.)
  4. Push custom types and enums to the flow database

  5. Create tables in the flow database to hold source data

    • Command: pg:source:create
    • Group: seed
    • Tool: pg.js, pg/source-create
    • Run from: Data flow runner
    • Run against: Flow database
    • Description: Run SQL commands to create tables that hold data as they are stored in their source files. The source tables also include constraints that check data validity as it is copied into them. If any source tables already exist, drop them before adding them again. (A sketch follows this list.)
  6. Load source tables with source data

    • Command: pg:source:load
    • Group: seed
    • Tool: pg.js, pg/source-load
    • Run from: Data flow runner
    • Run against: Flow database
    • Description: Copy the source data from the data folder into the source tables within the flow database. (A sketch follows this list.)
  7. Create tables in the flow database that model the API database tables

    • Command: db:pg:model:create
    • Group: seed
    • Tool: pg_dump and psql, db/pg/model-create
    • Run from: Flow database
    • Run against: Flow database
    • Description: Run pg_dump and psql from the flow database docker container. Use pg_dump to extract the API table schemas into a sql file stored in the flow database docker container. Use psql to read the sql file of the dump into the flow database. If any model tables already exist in the flow database, drop them before adding them again. (A sketch follows this list.)
  8. Transform the source data and insert it into the model tables

    • Command: pg:model:transform
    • Group: populate
    • Tool: pg.js, pg/model-transform
    • Run from: Data flow runner
    • Run against: Flow database
    • Description: Use the pg node module to run the SQL files that transform the source columns into their respective model columns. Export the populated model tables to .csv files within the flow database docker container. Truncate any data that may already exist in the model tables before inserting the data again. (A sketch follows this list.)
  9. Move the data from the model tables in the flow database to their corresponding target tables in the API database

    • Command: db:pg:target:populate
    • Group: populate
    • Tool: psql, db/pg/target-populate
    • Run from: Flow database
    • Run against: API database
    • Description: Use psql to truncate the target tables in the API database and then copy data from the CSV files into the target tables. Perform the truncation and copy as a single transaction, so that a failure rolls back the entire operation. (A sketch follows this list.)
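
The sketches below illustrate what several of the steps might look like. They are rough, assumption-laden outlines rather than the project's actual scripts: bucket names, table names, file paths, and environment variable names are all invented for illustration. First, step 1 (minio:download):

```typescript
// Sketch of step 1 (minio:download): pull source files from Digital Ocean Spaces.
// The endpoint, bucket, object names, and env variable names are illustrative.
import { Client } from "minio";
import { mkdirSync } from "node:fs";
import path from "node:path";

const client = new Client({
  endPoint: "nyc3.digitaloceanspaces.com", // Spaces exposes an S3-compatible endpoint
  useSSL: true,
  accessKey: process.env.SPACES_ACCESS_KEY ?? "",
  secretKey: process.env.SPACES_SECRET_KEY ?? "",
});

const DOWNLOAD_DIR = path.join("data", "download");
const BUCKET = "data-engineering"; // hypothetical bucket
const OBJECTS = ["borough.csv", "tax_lot.zip"]; // hypothetical object names

async function download() {
  mkdirSync(DOWNLOAD_DIR, { recursive: true });
  for (const objectName of OBJECTS) {
    // fGetObject streams the object directly to a file on the runner
    await client.fGetObject(BUCKET, objectName, path.join(DOWNLOAD_DIR, objectName));
    console.log(`downloaded ${objectName}`);
  }
}

download().catch((error) => {
  console.error(error);
  process.exit(1);
});
```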
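
A sketch of step 2 (shp:convert), using the shapefile module to stream features out of a shapefile and write a simple CSV. A real script would use a proper CSV writer; the paths are illustrative.

```typescript
// Sketch of step 2 (shp:convert): convert a shapefile into a CSV.
import * as shapefile from "shapefile";
import { createWriteStream } from "node:fs";

// Naive CSV quoting, good enough for a sketch
const csvField = (value: unknown) => `"${String(value ?? "").replace(/"/g, '""')}"`;

async function convert(shpPath: string, csvPath: string) {
  const source = await shapefile.open(shpPath);
  const out = createWriteStream(csvPath);
  let headerWritten = false;

  while (true) {
    const result = await source.read(); // yields one GeoJSON feature per call
    if (result.done) break;

    const properties = result.value.properties ?? {};
    if (!headerWritten) {
      out.write([...Object.keys(properties), "geometry"].map(csvField).join(",") + "\n");
      headerWritten = true;
    }
    // Keep the geometry as a GeoJSON string so PostGIS can parse it during the transform step
    const row = [...Object.values(properties), JSON.stringify(result.value.geometry)];
    out.write(row.map(csvField).join(",") + "\n");
  }
  out.end();
}

convert("data/download/tax_lots.shp", "data/convert/tax_lots.csv").catch(console.error);
```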
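
Step 3 (pg:flow:configure) is essentially a single SQL statement; a sketch, again with an assumed connection string variable:

```typescript
// Sketch of step 3 (pg:flow:configure): activate PostGIS on the flow database.
import { Client } from "pg";

async function configure() {
  const client = new Client({ connectionString: process.env.FLOW_DATABASE_URL });
  await client.connect();
  try {
    // Other necessary extensions would be activated the same way
    await client.query("CREATE EXTENSION IF NOT EXISTS postgis;");
  } finally {
    await client.end();
  }
}

configure().catch(console.error);
```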
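
A sketch of step 5 (pg:source:create) for a single hypothetical source table; the actual tables, columns, and constraints come from the real source files.

```typescript
// Sketch of step 5 (pg:source:create): recreate a source table that mirrors a source file,
// with constraints that validate data as it is copied in. Table and columns are hypothetical.
import { Client } from "pg";

async function createSourceTables() {
  const client = new Client({ connectionString: process.env.FLOW_DATABASE_URL });
  await client.connect();
  try {
    // Drop anything left over from a previous run before recreating the table
    await client.query("DROP TABLE IF EXISTS source_borough;");
    await client.query(`
      CREATE TABLE source_borough (
        id    text NOT NULL CHECK (id ~ '^[1-5]$'),
        title text NOT NULL CHECK (length(title) > 0)
      );
    `);
  } finally {
    await client.end();
  }
}

createSourceTables().catch(console.error);
```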
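
A sketch of step 6 (pg:source:load). It streams a CSV from the runner into the flow database with pg-copy-streams; the actual project may instead mount the data folder into the container and use a server-side COPY.

```typescript
// Sketch of step 6 (pg:source:load): stream a CSV into a source table.
// Table and file names are illustrative.
import { createReadStream } from "node:fs";
import { pipeline } from "node:stream/promises";
import { Client } from "pg";
import { from as copyFrom } from "pg-copy-streams";

async function loadSourceTable(table: string, csvPath: string) {
  const client = new Client({ connectionString: process.env.FLOW_DATABASE_URL });
  await client.connect();
  try {
    const copyStream = client.query(
      copyFrom(`COPY ${table} FROM STDIN WITH (FORMAT csv, HEADER true)`)
    );
    await pipeline(createReadStream(csvPath), copyStream);
  } finally {
    await client.end();
  }
}

loadSourceTable("source_borough", "data/download/borough.csv").catch(console.error);
```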
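
A sketch of step 7 (db:pg:model:create), shelling out to pg_dump and psql from inside the flow database container. The table list, dump path, and env variable names are assumptions.

```typescript
// Sketch of step 7 (db:pg:model:create): dump the API table schemas and replay them
// into the flow database.
import { execFileSync } from "node:child_process";

const tables = ["borough", "tax_lot"]; // hypothetical list of API tables to model

// Schema-only dump of the selected tables; --clean/--if-exists add DROP statements
// so existing model tables are dropped before being recreated.
execFileSync(
  "pg_dump",
  [
    process.env.TARGET_DATABASE_URL ?? "",
    "--schema-only",
    "--clean",
    "--if-exists",
    ...tables.map((table) => `--table=${table}`),
    "--file=/tmp/api-schema.sql",
  ],
  { stdio: "inherit" }
);

// Read the dump into the flow database, recreating the model tables
execFileSync(
  "psql",
  [process.env.FLOW_DATABASE_URL ?? "", "--set=ON_ERROR_STOP=1", "--file=/tmp/api-schema.sql"],
  { stdio: "inherit" }
);
```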
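
A sketch of step 8 (pg:model:transform): run a transform SQL file, then export the model table with a server-side COPY so the CSV lands inside the flow database container. Names and paths are illustrative.

```typescript
// Sketch of step 8 (pg:model:transform).
import { readFile } from "node:fs/promises";
import { Client } from "pg";

async function transformModel(table: string, sqlFile: string) {
  const client = new Client({ connectionString: process.env.FLOW_DATABASE_URL });
  await client.connect();
  try {
    // Clear any rows from a previous run, then rebuild the model rows from the source tables
    await client.query(`TRUNCATE ${table};`);
    await client.query(await readFile(sqlFile, "utf-8"));

    // Server-side COPY writes the file inside the flow database container,
    // where step 9 picks it up with psql
    await client.query(`COPY ${table} TO '/tmp/${table}.csv' WITH (FORMAT csv, HEADER true);`);
  } finally {
    await client.end();
  }
}

transformModel("borough", "pg/model-transform/borough.sql").catch(console.error);
```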
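
Finally, a sketch of step 9 (db:pg:target:populate), using psql's --single-transaction so the truncate and copy either both succeed or both roll back. It runs psql from the flow database container; table names and paths are illustrative.

```typescript
// Sketch of step 9 (db:pg:target:populate).
import { execFileSync } from "node:child_process";
import { writeFileSync } from "node:fs";

function populateTarget(table: string) {
  const script = [
    // Real code may need to order tables or cascade to satisfy foreign keys
    `TRUNCATE ${table};`,
    // \copy reads the CSV from the flow database container's filesystem
    `\\copy ${table} FROM '/tmp/${table}.csv' CSV HEADER`,
  ].join("\n");
  writeFileSync(`/tmp/populate-${table}.sql`, script);

  // --single-transaction wraps the script in BEGIN/COMMIT; with ON_ERROR_STOP,
  // any failure aborts and rolls the whole operation back
  execFileSync(
    "psql",
    [
      process.env.TARGET_DATABASE_URL ?? "",
      "--single-transaction",
      "--set=ON_ERROR_STOP=1",
      `--file=/tmp/populate-${table}.sql`,
    ],
    { stdio: "inherit" }
  );
}

populateTarget("borough");
```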

## Domains

The data flow can be used to either initialize a full data suite or update a portion of the full suite. These portions are grouped into "Domains". The data flow is divided into four Domains: "all", "admin", "capital-planning", and "pluto". They are visualized in the domains diagram.

The "all" domain contains every other domain plus "Boroughs". The "admin" domain contains administrative boundaries, other than Boroughs. Boroughs are excluded from the "admin" domain because tax lots depend on the Borough Ids existing. Rebuilding Boroughs would require also require rebuilding tax lots.

The "capital-planning" domain applies to tables derived from the capital planning database. The "pluto" domain applies to tables derived from the "pluto" dataset.

A domain must be specified with a BUILD environment variable when running the data flow.
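
As a rough illustration of how the BUILD variable might gate the work, the sketch below maps each domain to a hypothetical list of tables; the real domain definitions live in the data flow code and will differ.

```typescript
// Sketch of how the BUILD environment variable might select a domain.
// The table lists are illustrative assumptions.
const domains = {
  admin: ["city_council_district", "community_district"], // administrative boundaries, minus boroughs
  "capital-planning": ["capital_project", "capital_commitment"],
  pluto: ["tax_lot", "zoning_district"],
  all: [] as string[],
};
// "all" is every other domain plus Boroughs
domains.all = ["borough", ...domains.admin, ...domains["capital-planning"], ...domains.pluto];

const build = process.env.BUILD;
if (!build || !(build in domains)) {
  throw new Error(`BUILD must be one of: ${Object.keys(domains).join(", ")}`);
}
const tablesToRebuild = domains[build as keyof typeof domains];
console.log(`Rebuilding domain "${build}":`, tablesToRebuild);
```

An invocation would then look something like BUILD=pluto placed before the flow command listed in the README.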