Ingest Migration - Scripts #1302

Closed
wants to merge 7 commits into from
23 changes: 0 additions & 23 deletions dcpy/library/script/dcp_facilities_with_unmapped.py

This file was deleted.

33 changes: 0 additions & 33 deletions dcpy/library/templates/dcp_facilities_with_unmapped.yml

This file was deleted.

56 changes: 0 additions & 56 deletions dcpy/lifecycle/ingest/dev_templates/bpl_libraries.yml

This file was deleted.

27 changes: 27 additions & 0 deletions dcpy/lifecycle/ingest/templates/bpl_libraries.yml
@@ -0,0 +1,27 @@
id: bpl_libraries
acl: public-read

attributes:
  name: BPL Libraries
  url: https://www.bklynlibrary.org/locations

ingestion:
  source:
    type: api
    endpoint: https://www.bklynlibrary.org/locations/json
    format: json
  file_format:
    type: json
    json_read_fn: normalize
    json_read_kwargs: { "record_path": ["locations"] }
    geometry:
      geom_column: data.position
      crs: EPSG:4326
      format:
        point_xy_str: "y, x"
  processing_steps:
  - name: clean_column_names
    args: {"replace": {"data.": ""}, "lower": True}
  - name: rename_columns
    args: {"map": {"geom": "wkb_geometry"} }

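The `point_xy_str: "y, x"` setting above appears to take over the string splitting that `bpl_libraries` in `pipelines.py` performs on `position`. As a rough sketch of that parsing, assuming the value is a `"lat, lon"` string: `parse_point_xy_str` and `to_wkt` are hypothetical helpers written for illustration and are not dcpy functions.

```python
def parse_point_xy_str(value: str, fmt: str = "y, x") -> tuple[float, float]:
    """Return (x, y) from a comma-delimited coordinate string.

    `fmt` names the order of the two parts, e.g. "y, x" for "lat, lon".
    Hypothetical helper for illustration; not part of dcpy.
    """
    names = [part.strip() for part in fmt.split(",")]
    parts = [part.strip() for part in value.split(",")]
    if sorted(names) != ["x", "y"] or len(parts) != 2:
        raise ValueError(f"cannot parse {value!r} with format {fmt!r}")
    coords = dict(zip(names, (float(p) for p in parts)))
    return coords["x"], coords["y"]


def to_wkt(value: str, fmt: str = "y, x") -> str:
    """Format the parsed point as WKT, with x (longitude) first."""
    x, y = parse_point_xy_str(value, fmt)
    return f"POINT ({x} {y})"
```

With the sample value `"40.6, -73.9"` (made up, not taken from the BPL feed), `to_wkt` yields `POINT (-73.9 40.6)`, i.e. longitude first, which is the axis order the `x`/`y` naming implies.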
52 changes: 52 additions & 0 deletions dcpy/lifecycle/ingest/templates/dcp_facilities_with_unmapped.yml
@@ -0,0 +1,52 @@
id: dcp_facilities_with_unmapped
acl: public-read

attributes:
  name: Facilities Database (FacDB) with unmapped records
  description: |
    The Department of City Planning aggregates information about 30,000+ facilities and program sites that are owned, operated, funded, licensed, or certified by a City, State, or Federal agency in the City of New York into a central database called the City Planning Facilities Database (FacDB). These facilities generally help to shape quality of life in the city’s neighborhoods, and this dataset is the basis for a series of planning activities. This public data resource allows all New Yorkers to understand the breadth of government resources in their neighborhoods.
    This dataset is now complemented with the Facilities Explorer, a new interactive web map that makes the data more accessible and allows users to quickly filter the data for their needs.
    Note to Users: FacDB is only as good as the source data it aggregates, and the Department of City Planning cannot verify the accuracy of all records. Please read more about specific data and analysis limitations before using this data. Limitations include missing records, duplicate records, and the inclusion of administrative sites instead of service locations.
    This is different from dcp_facilities in that it includes records that are never assigned a geography.
  url: https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-selfac.page

ingestion:
  source:
    type: file_download
    url: https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/facilities_{{ version }}_csv.zip
  file_format:
    type: csv
    unzipped_filename: facilities_{{ version }}.csv
    #encoding: iso-8859-1
    geometry:
      crs: EPSG:4326
      geom_column:
        x: LONGITUDE
        y: LATITUDE
  processing_steps:
  - name: clean_column_names
    args: {"replace": {" ": "_", "-": "_"}, "lower": True}
  - name: coerce_column_types
    args:
      column_types: {
        "bbl": "string",
        "bin": "string",
        "ct2010": "string",
        "ct2020": "string",
        "cd": "string",
        "policeprct": "string",
        "schooldist": "string",
        "council": "string",
        "zipcode": "string", # should be int
        "capacity": "string", # should be int
        "borocode": "string", # should be int
        "latitude": "string", # should be float
        "longitude": "string", # should be float
        "xcoord": "string", # should be int
        "ycoord": "string", # should be int
      }
  - name: rename_columns
    args:
      map: {"geom": "wkb_geometry"}

columns: []
60 changes: 60 additions & 0 deletions dcpy/lifecycle/ingest/templates/dob_now_applications.yml
@@ -0,0 +1,60 @@
id: dob_now_applications
acl: public-read

attributes:
  name: DOB NOW Job Applications
  description: |
    DOB NOW job applications received from DOB via the DOB FTP which can be accessed via CyberDuck on DCP
    Windows machines (see internal documentation for credentials). The encoding standard of
    the file should be double checked before and after uploading to Digital Ocean as it
    has changed between versions (from "Windows-1252" to "utf-8", etc).

    There is an extensive writeup in a github issue about DOB NOW custom job filing data
    https://github.com/NYCPlanning/db-developments/issues/386#issue-864138806

ingestion:
  source:
    type: s3
    bucket: edm-private
    key: dob_now/dob_now_job_applications/DOB_Now_Job_Filing_Data_for_DCP_{{ version }}.csv
  file_format:
    type: csv
    delimiter: \t
    encoding: cp1252
  processing_steps:
  - name: clean_column_names
    args: {"replace": {" ": "_", "-": "_"}, "lower": True}
  - name: coerce_column_types
    args:
      column_types: {
        "bin": "string",
        "total_construction_floor_area": "string",
        "proposedoccupancyclassification": "string",
        "horizontalenlargement": "string",
        "verticalenlargement": "string",
        "existing_height": "string",
        "proposed_height": "string",
        "initial_cost": "string",
        "existing_dwelling_units": "string",
        "latitude": "string",
        "longitude": "string",
        "existingbuildingheight": "string",
        "proposed_no_of_stories": "string",
        "existing_stories": "string",
        "proposed_zsf": "string",
        "existingoccupancyclassification": "string",
        "floor_area_ratio_(far)": "string",
        "proposed_dwelling_units": "string",
        "nta": "string",
        "proposedbuildingheight": "string",
        "no_of_parking_spaces": "string",
        "council_district": "string",
        "census_tract": "string",
        "use": "string",
        "block": "string",
        "lot": "string",
        "total_floor_area": "string",
        "enlargement_sq_footage": "string",
        "total_far": "string",
        "commmunity___board": "string",
      }
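
The keys in `column_types` only match if they are spelled the way `clean_column_names` leaves them, which is why names such as `floor_area_ratio_(far)` and `commmunity___board` look odd. A minimal sketch of that step, assuming it applies the `replace` pairs in order and then lowercases; this is an illustrative stand-in and not the dcpy implementation, and the raw headers in the usage note are guesses made for the example.

```python
def clean_column_names(columns, replace=None, lower=False):
    """Apply substring replacements in order, then optionally lowercase.

    Illustrative stand-in for the `clean_column_names` processing step;
    the real dcpy function may behave differently.
    """
    cleaned = []
    for name in columns:
        for old, new in (replace or {}).items():
            name = name.replace(old, new)
        cleaned.append(name.lower() if lower else name)
    return cleaned


# Hypothetical raw headers, chosen to reproduce keys seen in the template.
raw_headers = ["Floor Area Ratio (FAR)", "Commmunity - Board", "BIN"]
cleaned_headers = clean_column_names(
    raw_headers, replace={" ": "_", "-": "_"}, lower=True
)
```

Here `cleaned_headers` is `["floor_area_ratio_(far)", "commmunity___board", "bin"]`: the space, hyphen, space run becomes three underscores, and parentheses are left untouched because they are not in the `replace` map.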
2 changes: 0 additions & 2 deletions products/facilities/facdb/pipelines.py
@@ -13,8 +13,6 @@


 def bpl_libraries(df: pd.DataFrame):
-    df["longitude"] = df.position.apply(lambda x: x.split(",")[1].strip())
-    df["latitude"] = df.position.apply(lambda x: x.split(",")[0].strip())
     df["zipcode"] = df.address.apply(lambda x: x[-6:])
     df["borough"] = "Brooklyn"
     df = sanitize_df(df)
6 changes: 3 additions & 3 deletions products/facilities/facdb/sql/_qaqc.sql
@@ -1,7 +1,7 @@
--Add mapped column to dcp_facilities_with_unmapped
ALTER TABLE dcp_facilities_with_unmapped ADD COLUMN IF NOT EXISTS mapped boolean;
UPDATE dcp_facilities_with_unmapped
-SET mapped = (latitude::numeric != 0 AND longitude::numeric != 0);
+SET mapped = (latitude != 0 AND longitude != 0);

-- QC consistency in operator information
DROP TABLE IF EXISTS qc_operator;
@@ -125,14 +125,14 @@ WITH
 new AS (
     SELECT
         captype,
-        sum(capacity::numeric)::integer AS sum_new
+        sum(capacity) AS sum_new
     FROM facdb
     GROUP BY captype
 ),
 old AS (
     SELECT
         captype,
-        sum(capacity::numeric)::integer AS sum_old
+        sum(capacity) AS sum_old
     FROM dcp_facilities_with_unmapped
     GROUP BY captype
 )
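
The `mapped` predicate in this file depends on `latitude` and `longitude` being comparable to `0`, so it is sensitive to whether those columns arrive as text or as numbers. A small Python sketch of the intended logic, assuming values may be strings, numbers, or missing; `is_mapped` is a hypothetical helper that mimics SQL's three-valued `AND` and is not part of the facdb code.

```python
from decimal import Decimal


def is_mapped(latitude, longitude):
    """Mimic `latitude != 0 AND longitude != 0` with SQL-style NULL handling.

    Values are cast through Decimal, roughly like `::numeric`.
    Returns True, False, or None (standing in for SQL NULL).
    """
    flags = [
        None if value is None else Decimal(str(value)) != 0
        for value in (latitude, longitude)
    ]
    if False in flags:
        return False  # FALSE AND anything is FALSE
    if None in flags:
        return None  # TRUE AND NULL is NULL
    return True
```

For example, `is_mapped("40.6", "-73.9")` is `True`, `is_mapped("0", "-73.9")` is `False`, and `is_mapped(None, "-73.9")` is `None`, so a row with a missing coordinate ends up with a NULL `mapped` flag and not with `false`.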