Spencer Lyon edited this page Jul 8, 2020 · 6 revisions

Welcome to the cmdc-tools wiki!

This hosts the developer documentation.

Start here if you'd like to contribute to the project -- we'd love to have you!

Developer setup

To set up your system for contributing to the project, we recommend the following:

  • Follow the instructions in the README for installing Python
  • Install Docker
  • Get the Docker image for the project:
    • If you made changes to any file(s) in db/schema, run the following from the project root: docker build -t valorumdata/cmdc-tools-pg:latest db (NOTE: EACH time you change one of the schema files you must rebuild the image, then stop and remove the running container and start a new one)
    • Otherwise, pull the latest from Docker Hub: docker pull valorumdata/cmdc-tools-pg:latest
  • Start a Docker container for the PostgreSQL instance: docker run --name cmdc-tools-pg -e POSTGRES_PASSWORD=password -p 5432:5432 valorumdata/cmdc-tools-pg:latest

Then you can make edits to your scraper; suppose it is named XYZ.

To run only the tests for the scraper you are developing, execute: PG_CONN_STR="postgresql://postgres:password@localhost:5432" pytest -v -k XYZ

Creating a scraper

Let's talk about scrapers

To create a scraper, do the following:

  • Clone the repository and create a branch for your scraper. The branch name should indicate the geography or data source that is being scraped
  • Create a Python file for the scraper.
    • Find a home for your scraper
      • If you are scraping a US county or state government dashboard, please check src/cmdc_tools/datasets/official and see if there is a folder for the county's state. If there is, please use that folder. If not, please create a new folder using the state's two letter abbreviation and add an __init__.py file to that new directory
      • If you are scraping another data source please create a directory in src/cmdc_tools/datasets that represents your data source. See the other directory names in the datasets directory for examples. Create an __init__.py file in your new directory
    • Create the Python file: the exact file name doesn't matter much. We've either put code directly in the __init__.py file or in a file named data.py
  • Create a class for your scraper
    • Determine the parent class you should use. The file src/cmdc_tools/datasets/base.py has two main classes:
      1. DatasetBaseNoDate: use this parent class if your scraper is getting whatever data is available when the scraper runs.
      2. DatasetBaseNeedsDate: use this parent class if your scraper can obtain data as of a date in the past. For example, if you are downloading files from a government website and the files have dates
    • Fill in class level attributes:
      • source: str -- A url to the source's website
      • data_type: str -- The type of data that is being scraped. If you are collecting information on COVID related counts (such as cases, hospitalizations, tests, etc.) use "covid". Otherwise use "general" and we will help you determine the appropriate type
      • For a US county scraper, you must include the following class level attributes:
        • state_fips: int -- An integer containing the state's FIPS code. We use the us library to look this up for us based on the state name (see the example below)
        • has_fips: bool -- A boolean indicating if the scraper produces a DataFrame with a column named fips containing the fips code for geography. If this is False, the scraper must have a column named county containing county names
    • Create the get method. This method is responsible for fetching the data and returning a DataFrame.
      • For COVID data scrapers, the DataFrame must include the following columns:
        • vintage: Timestamp -- this must be pd.Timestamp.utcnow().normalize()
        • dt: Timestamp -- The date for which the data is valid
        • county: str OR fips: int -- The indicator for the geography. See discussion of has_fips above
        • variable: str -- A string containing variable names. For a list of valid variable names, see the covid_us endpoint of the api
        • value: number -- The value of the variable in the county/fips on date dt as of vintage
      • For other scrapers, please reach out to us for help structuring the data
  • Add your scraper to the correct namespaces.
    • Your scraper must be listed in src/cmdc_tools/datasets/__init__.py
    • If your dataset is inside one or more subdirectories of datasets, it must also be listed in each subdirectory's __init__.py file. See other scrapers as examples
  • Make sure tests pass. At the root of the repository run the following:
    • black src
    • pytest src
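
The steps above can be sketched as a minimal scraper skeleton. Everything here is hypothetical -- the class name, URL, and data are placeholders, and for brevity the real parent class (DatasetBaseNoDate from src/cmdc_tools/datasets/base.py) is omitted so the snippet runs standalone:

```python
import pandas as pd


class XYZ:  # in the real project: class XYZ(DatasetBaseNoDate)
    # Required class level attributes (placeholder values)
    source = "https://example.com/dashboard"
    data_type = "covid"
    state_fips = 42  # in real code, look this up with the `us` library
    has_fips = False  # so the DataFrame carries a `county` column

    def get(self):
        # Fabricated rows standing in for scraped data
        raw = pd.DataFrame(
            {
                "county": ["Adams", "Allegheny"],
                "cases_total": [10, 250],
                "deaths_total": [1, 12],
            }
        )
        # Reshape wide -> long: one row per (county, variable) pair
        out = raw.melt(id_vars=["county"], var_name="variable", value_name="value")
        vintage = pd.Timestamp.utcnow().normalize()
        return out.assign(dt=vintage, vintage=vintage)
```

Calling XYZ().get() returns a long-form DataFrame with the vintage, dt, county, variable, and value columns described above.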

Other Notes

  • If you are scraping an ArcGIS dashboard, please use the ArcGIS class found in src/cmdc_tools/datasets/official/base.py as a parent class. Please use the methods on that class when writing your get method. NOTE that usage of this class requires setting some more class level attributes
  • If you are adding an entirely new datasource, we will have to create PostgreSQL table(s) to store the data. Please work with the core team to do this.

Example

Let's see an example scraper

Here is the source (as of 2020-07-01) for the Pennsylvania scraper:

import textwrap
import pandas as pd
import us

# Parent classes
from ...base import DatasetBaseNoDate
from ..base import ArcGIS


# class name is `Pennsylvania` indicating geography for scraper
class Pennsylvania(DatasetBaseNoDate, ArcGIS):
    # Using ArcGIS, so we need to set this class attribute
    ARCGIS_ID = "xtuWQvb2YQnp0z3F"

    # Other required class level attributes as described above
    source = (
        "https://www.arcgis.com/apps/opsdashboard/"
        "index.html#/85054b06472e4208b02285b8557f24cf"
    )
    state_fips = int(us.states.lookup("Pennsylvania").fips)
    has_fips: bool = False

    def get(self):
        # Using `ArcGIS` parent class method to get data
        df = self.get_all_sheet_to_df(
            service="County_Case_Data_Public", sheet=0, srvid=2
        )

        # dict to have columns match the schema -- see note about `covid_us` endpoint above
        column_map = {
            "COUNTY_NAM": "county",
            "Cases": "cases_total",
            "Deaths": "deaths_total",
            "AvailableBedsAdultICU": "available_icu_beds",
            "AvailableBedsMedSurg": "available_other_beds",
            "AvailableBedsPICU": "available_picu_beds",
            "COVID19Hospitalized": "hospital_beds_in_use_covid_confirmed",
            "TotalVents": "ventilators_capacity_count",
            "VentsInUse": "ventilators_in_use_any",
            "COVID19onVents": "ventilators_in_use_covid_confirmed",
        }
        renamed = df.rename(columns=column_map)

        # the column we used was non-covid, need to add covid to get total
        renamed["ventilators_in_use_any"] += renamed[
            "ventilators_in_use_covid_confirmed"
        ]

        renamed = renamed.loc[:, list(column_map.values())]
        # reshape from wide to long form
        out = renamed.melt(
            id_vars=["county"], var_name="variable_name", value_name="value"
        )

        # add the `dt` and `vintage` columns
        dt = pd.Timestamp.utcnow().normalize()
        return out.assign(dt=dt, vintage=dt)

This code is in the file src/cmdc_tools/datasets/official/PA/data.py
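
The dt/vintage pattern at the end of the example is worth a closer look: normalize() keeps the timezone but truncates the timestamp to midnight, so every run on a given UTC day records the same vintage. A small illustration (the fixed timestamp is just for reproducibility):

```python
import pandas as pd

now = pd.Timestamp("2020-07-01 13:45:00", tz="UTC")
print(now.normalize())  # 2020-07-01 00:00:00+00:00
```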

The Pennsylvania class is added to the following namespace files:

# src/cmdc_tools/datasets/official/PA/__init__.py
from .data import Pennsylvania
# src/cmdc_tools/datasets/official/__init__.py
from .PA import Pennsylvania
# src/cmdc_tools/datasets/__init__.py
from .official import (
    # many other scrapers
    Pennsylvania,
    # even more scrapers
)