Home
Welcome to the cmdc-tools wiki!
This hosts the developer documentation.
Start here if you'd like to contribute to the project -- we'd love to have you!
To set up your system for contributing to the project, we recommend the following:

- Follow the instructions in the README for installing python
- Install docker
- Get the docker container for the project:
  - If you made changes to any file(s) in `db/schema`, run the following from the project root:

    ```
    docker build -t valorumdata/cmdc-tools-pg:latest db
    ```

    (NOTE: you need to do this, then stop and remove the container and start it again, EACH time you change one of the schema files)
  - Otherwise, grab the latest from docker hub:

    ```
    docker pull valorumdata/cmdc-tools-pg:latest
    ```
- Start a docker container for the postgres instance:

  ```
  docker run --name cmdc-tools-pg -e POSTGRES_PASSWORD=password -p 5432:5432 valorumdata/cmdc-tools-pg:latest
  ```
Then you can make edits to your scraper; suppose it is named `XYZ`. To run only the tests for the scraper you are developing:

```
PG_CONN_STR="postgresql://postgres:password@localhost:5432" pytest -v -k XYZ
```
Let's talk about scrapers
To create a scraper, do the following:
- Clone the repository and create a branch for your scraper. The branch name should indicate the geography or data source that is being scraped
- Create a Python file for the scraper.
  - Find a home for your scraper:
    - If you are scraping a US county or state government dashboard, please check `src/cmdc_tools/datasets/official` and see if there is a folder for the county's state. If there is, please use that folder. If not, please create a new folder using the state's two letter abbreviation and add an `__init__.py` file to that new directory
    - If you are scraping another data source, please create a directory in `src/cmdc_tools/datasets` that represents your data source. See the other directory names in the `datasets` directory for examples. Create an `__init__.py` file in your new directory
  - Create the python file: it doesn't matter too much what your file is called. We've either put code directly in the `__init__.py` file or in a file named `data.py`
- Create a class for your scraper
  - Determine the parent class you should use. The file `src/cmdc_tools/datasets/base.py` has two main classes:
    - `DatasetBaseNoDate`: use this parent class if your scraper gets whatever data is available when the scraper runs
    - `DatasetBaseNeedsDate`: use this parent class if your scraper can obtain data as of a date in the past. For example, if you are downloading files from a government website and the files have dates
  - Fill in class level attributes:
    - `source: str` -- A url to the source's website
    - `data_type: str` -- The type of data that is being scraped. If you are collecting information on COVID related counts (such as cases, hospitalizations, tests, etc.) use `"covid"`. Otherwise use `"general"` and we will help determine the correct value
    - For a US county scraper, you must include the following class level attributes:
      - `state_fips: int` -- An integer containing the county's state FIPS code. We use the `us` library to look this up for us based on the state name (see the example below)
      - `has_fips: bool` -- A boolean indicating if the scraper produces a DataFrame with a column named `fips` containing the fips code for the geography. If this is `False`, the scraper must have a column named `county` containing county names
  - Create the `get` method. This method is responsible for fetching the data and returning a DataFrame.
    - For COVID data scrapers, the DataFrame must include the following columns:
      - `vintage: Timestamp` -- this must be `pd.Timestamp.utcnow().normalize()`
      - `dt: Timestamp` -- The date for which the data is valid
      - `county: str` OR `fips: int` -- The indicator for the geography. See the discussion of `has_fips` above
      - `variable: str` -- A string containing variable names. For a list of valid variable names, see the `covid_us` endpoint of the api
      - `value: number` -- The value of the `variable` in the `county`/`fips` on date `dt` as of `vintage`
    - For other scrapers, please reach out to us for help structuring the data
- Add your scraper to the correct namespaces.
  - Your scraper must be listed in `src/cmdc_tools/datasets/__init__.py`
  - If your dataset is inside one or more subdirectories of `datasets`, it must also be listed in each subdirectory's `__init__.py` file. See other scrapers as examples
- Make sure tests pass. At the root of the repository run the following:

  ```
  black src
  pytest src
  ```
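To make the required output shape concrete, here is a minimal, self-contained sketch of what a `get` method for a `has_fips = False` scraper has to produce. The counties and counts below are fabricated for illustration; a real scraper would fetch them from its source:

```python
import pandas as pd

# Hypothetical wide-form data, as it might come back from a dashboard
raw = pd.DataFrame(
    {
        "county": ["Adams", "Berks"],
        "cases_total": [10, 20],
        "deaths_total": [1, 2],
    }
)

# Reshape from wide to long form: one row per (county, variable) pair
out = raw.melt(id_vars=["county"], var_name="variable", value_name="value")

# Add the `dt` and `vintage` columns required for COVID data scrapers
now = pd.Timestamp.utcnow().normalize()
out = out.assign(dt=now, vintage=now)
```

This yields the five required columns (`county`, `variable`, `value`, `dt`, `vintage`) with one row per county/variable combination, which is the shape the database loader expects.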
Other Notes
- If you are scraping an ArcGIS dashboard, please use the `ArcGIS` class found in `src/cmdc_tools/datasets/official/base.py` as a parent class. Please use the methods on that class when writing your `get` method. NOTE that usage of this class requires setting some more class level attributes
- If you are adding an entirely new data source, we will have to create PostgreSQL table(s) to store the data. Please work with the core team to do this.
Let's see an example scraper
Here is the source (as of 2020-07-01) for the Pennsylvania scraper:
```python
import textwrap

import pandas as pd
import us

# Parent classes
from ...base import DatasetBaseNoDate
from ..base import ArcGIS


# class name is `Pennsylvania` indicating geography for scraper
class Pennsylvania(DatasetBaseNoDate, ArcGIS):
    # Using `ArcGIS`, so need to set this class attribute
    ARCGIS_ID = "xtuWQvb2YQnp0z3F"

    # Other required class level attributes as described above
    source = (
        "https://www.arcgis.com/apps/opsdashboard/"
        "index.html#/85054b06472e4208b02285b8557f24cf"
    )
    state_fips = int(us.states.lookup("Pennsylvania").fips)
    has_fips: bool = False

    def get(self):
        # Using `ArcGIS` parent class method to get data
        df = self.get_all_sheet_to_df(
            service="County_Case_Data_Public", sheet=0, srvid=2
        )

        # dict to have columns match the schema -- see note about
        # `covid_us` endpoint above
        column_map = {
            "COUNTY_NAM": "county",
            "Cases": "cases_total",
            "Deaths": "deaths_total",
            "AvailableBedsAdultICU": "available_icu_beds",
            "AvailableBedsMedSurg": "available_other_beds",
            "AvailableBedsPICU": "available_picu_beds",
            "COVID19Hospitalized": "hospital_beds_in_use_covid_confirmed",
            "TotalVents": "ventilators_capacity_count",
            "VentsInUse": "ventilators_in_use_any",
            "COVID19onVents": "ventilators_in_use_covid_confirmed",
        }
        renamed = df.rename(columns=column_map)

        # the column we used was non-covid, need to add covid to get total
        renamed["ventilators_in_use_any"] += renamed[
            "ventilators_in_use_covid_confirmed"
        ]
        renamed = renamed.loc[:, list(column_map.values())]

        # reshape from wide to long form
        out = renamed.melt(
            id_vars=["county"], var_name="variable_name", value_name="value"
        )

        # add the `dt` and `vintage` columns
        dt = pd.Timestamp.utcnow().normalize()
        return out.assign(dt=dt, vintage=dt)
```
This code is in the file `src/cmdc_tools/datasets/official/PA/data.py`.

The `Pennsylvania` class is added to the following namespace files:
```python
# src/cmdc_tools/datasets/official/PA/__init__.py
from .data import Pennsylvania

# src/cmdc_tools/datasets/official/__init__.py
from .PA import Pennsylvania

# src/cmdc_tools/datasets/__init__.py
from .official import (
    # many other scrapers
    Pennsylvania,
    # even more scrapers
)
```
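For contrast with the `DatasetBaseNoDate` example above, here is a hypothetical sketch of the shape a `DatasetBaseNeedsDate` scraper might take, where `get` accepts a date in the past. The class name, source URL, and data are made up, and a stand-in base class is defined so the snippet runs on its own (the real parent class lives in `src/cmdc_tools/datasets/base.py`):

```python
import pandas as pd


class DatasetBaseNeedsDate:
    """Stand-in for the real parent class in src/cmdc_tools/datasets/base.py."""


class ExampleArchive(DatasetBaseNeedsDate):
    # Required class level attributes, as described above (values are made up)
    source = "https://example.com/archive"
    data_type = "covid"
    state_fips = 42  # Pennsylvania, for illustration
    has_fips = False

    def get(self, date):
        # A real scraper would download the file published on `date`;
        # here we fabricate one row to show the required long-form shape
        vintage = pd.Timestamp.utcnow().normalize()
        return pd.DataFrame(
            {
                "vintage": [vintage],
                "dt": [pd.Timestamp(date)],
                "county": ["Adams"],
                "variable": ["cases_total"],
                "value": [10],
            }
        )


df = ExampleArchive().get("2020-07-01")
```

Note that `dt` comes from the requested date while `vintage` records when the scraper ran, which is what lets the database keep multiple historical snapshots of the same series.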