Merge pull request #1692 from catalyst-cooperative/add-epacems-crossw…
…alk-to-etl

Add epacamd-eia crosswalk to etl
aesharpe authored Sep 15, 2022
2 parents 6413c0f + 80e560b commit f13904e
Showing 32 changed files with 533 additions and 884 deletions.
2 changes: 1 addition & 1 deletion README.rst
@@ -67,7 +67,7 @@ PUDL currently integrates data from:
early release - use with caution)
* `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__: 2001-2021 (2021 is
early release - use with caution)
* `EPA Continuous Emissions Monitoring System (CEMS) <https://ampd.epa.gov/ampd/>`__: 1995-2021
* `EPA Continuous Emissions Monitoring System (CEMS) <https://campd.epa.gov/>`__: 1995-2021
* `FERC Form 1 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__: 1994-2020
* `FERC Form 714 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__: 2006-2020
* `US Census Demographic Profile 1 Geodatabase <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__: 2010
Binary file not shown.
2 changes: 2 additions & 0 deletions docs/dev/run_the_etl.rst
@@ -163,6 +163,8 @@ You've changed the settings and renamed the file to CUSTOM_ETL.yml
$ pudl_etl settings/CUSTOM_ETL.yml
.. _add-cems-later:

Processing EPA CEMS Separately
------------------------------
As mentioned above, CEMS takes a while to process. Luckily, we've designed PUDL so that
45 changes: 40 additions & 5 deletions docs/release_notes.rst
@@ -31,8 +31,16 @@ Data Coverage
batteries) and ``net_capacity_mwdc`` (for behind-the-meter solar PV) attributes to the
:ref:`generators_eia860` table, as they appear in the :doc:`data_sources/eia860`
monthly updates for 2022.
* We've integrated several new columns into the EIA 860 and EIA 923 including several
* Integrated several new columns into the EIA 860 and EIA 923 including several
codes with coding tables (See :doc:`data_dictionaries/codes_and_labels`). :pr:`1836`
* Added the `EPACAMD-EIA Crosswalk <https://github.com/USEPA/camd-eia-crosswalk>`__ to
the database. Previously, the crosswalk was a CSV stored in ``package_data/glue``,
but now it has its own scraper
:pr:`https://github.com/catalyst-cooperative/pudl-scrapers/pull/20`, archiver
:pr:`https://github.com/catalyst-cooperative/pudl-zenodo-storage/pull/20`,
and place in the PUDL DB. For now there's an ``epacamd_eia`` output table you can use
to merge CEMS and EIA data yourself :pr:`1692`. Eventually we'll work these crosswalk
values into an output table combining CEMS and EIA.
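
As a rough illustration of what merging CEMS and EIA data through the crosswalk involves, the sketch below joins toy CEMS records to toy crosswalk rows in plain Python. The column names ``plant_id_epa``, ``generator_id``, and the join keys, along with every value shown, are assumptions made for illustration; check the actual ``epacamd_eia`` table schema before relying on them.

```python
from collections import defaultdict

# Toy crosswalk rows (hypothetical values and column names). One emissions
# unit can map to several EIA generators, so the join is one-to-many.
crosswalk = [
    {"plant_id_epa": 3, "emissions_unit_id_epa": "1", "plant_id_eia": 3, "generator_id": "1"},
    {"plant_id_epa": 3, "emissions_unit_id_epa": "1", "plant_id_eia": 3, "generator_id": "2"},
    {"plant_id_epa": 10, "emissions_unit_id_epa": "CT1", "plant_id_eia": 55, "generator_id": "GT1"},
]

# Toy hourly CEMS records (hypothetical values).
cems = [
    {"plant_id_epa": 3, "emissions_unit_id_epa": "1", "gross_load_mw": 120.0},
    {"plant_id_epa": 10, "emissions_unit_id_epa": "CT1", "gross_load_mw": 45.0},
    {"plant_id_epa": 99, "emissions_unit_id_epa": "X", "gross_load_mw": 5.0},
]

# Assumed join keys: the EPA plant ID plus the EPA emissions unit ID.
KEYS = ("plant_id_epa", "emissions_unit_id_epa")


def merge_cems_with_crosswalk(cems_rows, crosswalk_rows):
    """Left-join CEMS rows to crosswalk rows on the EPA plant and unit IDs."""
    index = defaultdict(list)
    for row in crosswalk_rows:
        index[tuple(row[k] for k in KEYS)].append(row)

    merged = []
    for rec in cems_rows:
        matches = index.get(tuple(rec[k] for k in KEYS))
        if not matches:
            # Keep unmatched CEMS rows, with empty EIA identifiers.
            merged.append({**rec, "plant_id_eia": None, "generator_id": None})
            continue
        for match in matches:
            merged.append(
                {
                    **rec,
                    "plant_id_eia": match["plant_id_eia"],
                    "generator_id": match["generator_id"],
                }
            )
    return merged


merged = merge_cems_with_crosswalk(cems, crosswalk)
```

Because one emissions unit can be linked to more than one generator, the record for unit ``1`` appears twice in ``merged``, so summing ``gross_load_mw`` over the merged rows would double count it. With pandas, the same join would typically be a ``DataFrame.merge`` on the key columns with ``how="left"``.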

Nightly Data Builds
^^^^^^^^^^^^^^^^^^^
@@ -92,10 +100,25 @@ Database Schema Changes
non-standard codes, and fixing some reporting errors for ``PACW`` vs. ``PACE``
(PacifiCorp West vs. East) based on the state associated with the plant reporting the
code. Also added backfilling for codes in years before 2013 when BA Codes first
started being reported), but only in the output tables. See: :pr:`1906,1911`

Date Merge Helper Function
^^^^^^^^^^^^^^^^^^^^^^^^^^
started being reported, but only in the output tables. See: :pr:`1906,1911`
* Renamed and removed some columns in the :doc:`data_sources/epacems` dataset.
``unitid`` was changed to ``emissions_unit_id_epa`` to clarify the type of unit it
represents. ``unit_id_epa`` was removed because it is a unique identifier for
``emissions_unit_id_epa`` and not otherwise useful or transferable to other datasets.
``facility_id`` was removed because it is specific to EPA's internal database and does
not aid in connection with other data. :pr:`1692`

Data Accuracy
^^^^^^^^^^^^^
* Retain NA values for :doc:`data_sources/epacems` fields ``gross_load_mw`` and
``heat_content_mmbtu``. Previously, these fields converted NA to 0, but this is not
accurate, so we removed this step.
* Update the ``plant_id_eia`` field from :doc:`data_sources/epacems` with values from
the newly integrated ``epacamd_eia`` crosswalk, because not all of EPA's ORISPL codes are
correct.
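
To see why keeping NA values matters, here is a minimal sketch comparing an average that skips missing hours with one that first fills them with 0. The readings and both helper functions are hypothetical and are not part of PUDL.

```python
import math

# Hypothetical hourly gross load readings in MW; NaN marks a missing hour.
gross_load_mw = [100.0, math.nan, 80.0, math.nan]


def mean_ignoring_missing(values):
    """Average only the hours that actually reported a value."""
    reported = [v for v in values if not math.isnan(v)]
    return sum(reported) / len(reported) if reported else math.nan


def mean_with_zero_fill(values):
    """Average after replacing missing hours with 0, mimicking NA-to-0 conversion."""
    filled = [0.0 if math.isnan(v) else v for v in values]
    return sum(filled) / len(filled) if filled else math.nan


mean_na = mean_ignoring_missing(gross_load_mw)  # 90.0
mean_zero = mean_with_zero_fill(gross_load_mw)  # 45.0
```

Filling the two missing hours with 0 halves the apparent average load, even though nothing indicates the unit was idle during those hours.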

Helper Function Updates
^^^^^^^^^^^^^^^^^^^^^^^
* Replaced the PUDL helper function ``clean_merge_asof`` that merged two dataframes
reported on different temporal granularities, for example monthly vs yearly data.
The reworked function, :mod:`pudl.helpers.date_merge`, is more encapsulating and
@@ -110,6 +133,10 @@ Date Merge Helper Function
makes this function optionally used to generate the MCOE table that includes a full
monthly timeseries even in years when annually reported generators don't have
matching monthly data. See :pr:`1550`
* Updated the ``fix_leading_zero_gen_ids`` function by changing the name to
``remove_leading_zeros_from_numeric_strings`` because it's used to fix more than just
the ``generator_id`` column. Included a new argument to specify which column you'd
like to fix.
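
The idea behind a column-agnostic leading-zero fix can be sketched as follows. This is not the PUDL implementation: ``strip_leading_zeros`` and its behavior are hypothetical, and it assumes that only purely numeric strings should change while alphanumeric IDs stay as they are.

```python
import re

# Matches strings made up only of digits that begin with one or more zeros
# followed by at least one more digit, e.g. "0012" or "000".
_NUMERIC_WITH_LEADING_ZEROS = re.compile(r"0+(\d+)")


def strip_leading_zeros(records, column):
    """Return copies of the records with leading zeros removed from one column.

    Only purely numeric strings are changed; other values pass through as-is.
    """
    cleaned = []
    for rec in records:
        value = rec.get(column)
        if isinstance(value, str):
            match = _NUMERIC_WITH_LEADING_ZEROS.fullmatch(value)
            if match:
                value = match.group(1)
        cleaned.append({**rec, column: value})
    return cleaned


records = [
    {"generator_id": "0012"},
    {"generator_id": "0A1"},
    {"generator_id": "000"},
    {"generator_id": None},
]
cleaned = strip_leading_zeros(records, "generator_id")
```

Passing the column name as an argument is what lets the same helper be reused for identifiers other than ``generator_id``.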

Plant Parts List Module Changes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -138,6 +165,14 @@ Metadata
* Used the data source metadata class added in release 0.6.0 to dynamically generate
the data source documentation (See :doc:`data_sources/index`). :pr:`1532`

Documentation
^^^^^^^^^^^^^
* Fixed broken links in the documentation since the Air Markets Program Data (AMPD)
changed to Clean Air Markets Data (CAMD).
* Added graphics and clearer descriptions of EPA data and reporting requirements to the
:doc:`data_sources/epacems` page. Also included information about the ``epacamd_eia``
crosswalk.

Bug Fixes
^^^^^^^^^
* `Dask v2022.4.2 <https://docs.dask.org/en/stable/changelog.html#v2022-04-2>`__
99 changes: 64 additions & 35 deletions docs/templates/epacems_child.rst.jinja
@@ -8,21 +8,20 @@ EPACEMS Intake catalog.
{% block browse_online %}- Available via `PUDL Data Catalog <https://github.com/catalyst-cooperative/pudl-catalog>`__{% endblock %}

{% block background %}
As depicted by the EPA, `Continuous Emissions Monitoring Systems (CEMS)
<https://www.epa.gov/emc/emc-continuous-emission-monitoring-systems>`__ are the
“total equipment necessary for the determination of a gas or particulate matter
concentration or emission rate.” They are used to determine compliance with EPA
emissions standards and are therefore associated with a given “smokestack” and are
categorized in the raw data by a corresponding ``unitid``. Because point sources of
pollution are not always correlated on a one-to-one basis with generation units, the
CEMS ``unitid`` serves as its own unique grouping. The EPA in collaboration with the
EIA has developed `a crosswalk table <https://github.com/USEPA/camd-eia-crosswalk>`__
that maps the EPA’s ``unitid`` onto EIA’s ``boiler_id``, ``generator_id``, and
``plant_id_eia``. This file has been integrated into the SQL database.

The EPA `Clean Air Markets Division (CAMD) <https://www.epa.gov/airmarkets>`__ has
collected emissions data from CEMS units stretching back to 1995. Among the data
included in CEMS are hourly SO2, CO2, NOx emission and gross load.
`Continuous Emissions Monitoring Systems
<https://www.epa.gov/emc/emc-continuous-emission-monitoring-systems>`__ (CEMS) are used
to determine the rate of gas or particulate matter exiting a point source of emissions.
The EPA `Clean Air Markets Division (CAMD) <https://www.epa.gov/airmarkets>`__
has collected data on power plant emissions from CEMS units stretching back to 1995. The
CEMS dataset includes hourly gross load, SO2, CO2, and NOx emissions associated with
a given point source, usually a boiler. Read more about this in "Notable
Irregularities"; it gets complicated.
{% endblock %}

{% block downloadable_pdfs %}
{% for filename in download_paths %}
* :download:`{{ filename.stem.replace("_", " ").title() }} (PDF) <{{ filename }}>`
{% endfor %}
{% endblock %}

{% block accessible %}
@@ -38,34 +37,64 @@ Who is required to install CEMS and report to EPA?
{% endblock %}
{% block fill_out_form %}
`Part 75 <https://www.ecfr.gov/cgi-bin/retrieveECFR?gp=&SID=d20546b42dd4ea978d0de7eabe15cbf4&mc=true&n=pt40.18.75&r=PART&ty=HTML#se40.18.75_12>`__
of the Federal Code of Regulations (FRC), the backbone of the Clean Air Act Title IV and
Acid Rain Program, requires coal and other solid-combusting units (see §72.2) to install
and use CEMS (see §75.2, §72.6). Certain low-sulfur fueled gas and oil units (see §72.2)
may seek exemption or alternative means of monitoring their emissions if desired (see
§§75.23, §§75.48, §§75.66). Once CEMS are installed, Part 75 requires hourly data
recording, including during startup, shutdown, and instances of malfunction as well as
quarterly data reporting to the EPA. The regulation further details the protocol for
missing data calculations and backup monitoring for instances of CEMS failure (see
§§75,31-37).
of the Code of Federal Regulations (CFR), the backbone of the Clean Air Act's Acid Rain
Program, requires fossil-combustion units to install and use CEMS. The qualifications
(§75.2(a), §72.6(a)) are closely followed by a myriad of exceptions (§75.2(b), §72.6(b),
§72.7, §72.8). Among the many extenuating circumstances described are exemptions for
retired units; old, simple combustion turbine units; non-utility units; units supplying
generators with 25 MW or less in capacity; units that have never sold their electricity;
and units burning low-sulfur fuels.

Once CEMS are installed, Part 75 requires hourly data recording, including during
startup, shutdown, and instances of malfunction as well as quarterly data reporting to
the EPA. The regulation further details the protocol for missing data calculations and
backup monitoring for instances of CEMS failure (see §§75.31-37).

A plain English explanation of the requirements of Part 75 is available in section
`2.0 Overview of Part 75 Monitoring Requirements <https://www.epa.gov/sites/production/files/2015-05/documents/plain_english_guide_to_the_part_75_rule.pdf>`__.
{% endblock %}

{% block original_data %}
EPA CAMD publishes the CEMS data in an online `data portal <https://ampd.epa.gov/ampd/>`__
. The files are available in a prepackaged format, accessible via a `user interface <https://ampd.epa.gov/ampd/>`__
or `FTP site <ftp://newftp.epa.gov/DMDnLoad>`__ with each downloadable zip file
EPA CAMD publishes the CEMS data in an online `data portal <https://campd.epa.gov/>`__.
The files are available in a prepackaged format, accessible via a `user interface <https://campd.epa.gov/data/custom-data-download>`__
or `FTP site <https://gaftp.epa.gov/DMDnLoad/>`__ with each downloadable zip file
encompassing a year of data.
{% endblock %}

{% block notable_irregularities %}
CEMS is by far the largest dataset in PUDL at the moment with hourly records for
thousands of plants spanning decades. Note that the ETL process can easily take all
day for the full dataset. PUDL also provides a script that converts the raw EPA CEMS
data into Apache Parquet files that can be read and queried very efficiently with
Dask. Check out the `EPA CEMS example notebook <https://github.com/catalyst-cooperative/pudl-examples/blob/main/notebooks/03-pudl-parquet.ipynb>`__
in our
`pudl-examples repository <https://github.com/catalyst-cooperative/pudl-examples>`__
on GitHub for pointers on how to access this big dataset efficiently using :mod:`dask`.

CEMS is enormous
----------------
CEMS is by far the largest dataset in PUDL, with hourly records for
thousands of plants spanning decades. For this reason, we house CEMS data in `Apache
Parquet <https://parquet.apache.org/>`__ files rather than the main PUDL database.
Still, running the ETL with all of the CEMS data can take a long time. Note that you can
:ref:`process CEMS data separately <add-cems-later>` from the main ETL
script if you'd like.

Check out the `EPA CEMS example notebook <https://github.com/catalyst-cooperative/pudl-examples/blob/main/notebooks/03-pudl-parquet.ipynb>`__
in our `pudl-examples repository <https://github.com/catalyst-cooperative/pudl-examples>`__
on GitHub for pointers on how to access this dataset efficiently using :mod:`dask`.

EPA units vs. EIA units
-----------------------
Another important thing to note is the difference between EPA "units" and EIA "units".
Power plants are complex entities that have multiple subcomponents. In fossil-powered
plants, emissions come from the combustion of fuel. This occurs in the boiler for coal
plants or the gas turbine for gas plants. When the EPA uses the term "unit" it is
referring to the emissions unit or smokestack where the CEMS equipment is (i.e., the
boiler or gas turbine). When the EIA refers to a "unit" it's usually referring to the
electricity generating unit (i.e., the generator). Some plants have a one-to-one
relationship between boilers and generators or gas turbines and generators, but many do
not.

The EPA and EIA have addressed this discrepancy by creating a `crosswalk
<https://github.com/USEPA/camd-eia-crosswalk>`__ between the
various sub-plant groupings reported to them. The ``plant_id_eia`` values from the
crosswalk are integrated into the EPA CEMS Parquet files available in PUDL.

Take a look at this helpful depiction of plant types from the EPA's crosswalk repo.

.. image:: /data_sources/epacems/plant_configuration.png

{% endblock %}
6 changes: 3 additions & 3 deletions notebooks/work-in-progress/explore-CEMS.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion src/pudl/__init__.py
@@ -26,7 +26,7 @@
import pudl.extract.excel
import pudl.extract.ferc1
import pudl.extract.ferc714
import pudl.glue.eia_epacems
import pudl.glue.epacamd_eia
import pudl.glue.ferc1_eia
import pudl.helpers
import pudl.load