
Instructions for GEOS-Chem Users

Liam Bindle edited this page Feb 22, 2022 · 25 revisions

🚧 bashdatacatalog v0.1.0 will be released on February 28, 2022. This page is still a work in progress.

This page is a standalone summary of bashdatacatalog instructions for GEOS-Chem users.

Terminology

There are two important terms you should understand before starting:

  • data collection - A data collection is an input data directory. For example, HEMCO/CEDS/v2021-06 is a single data collection.
  • catalog file - A file that groups data collections together. For example, the following catalog file specifies the data collections for emissions inputs for GEOS-Chem v13.2: EmissionsInputs.csv.

Overview

GEOS-Chem input data is divided into four catalogs: MeteorologicalInputs.csv, EmissionsInputs.csv, ChemistryInputs.csv, and InitialConditions.csv. Scientific updates in X.Y versions of GEOS-Chem often introduce new collections with updated emissions, chemistry inputs, and initial conditions, so three of the four catalogs are version-specific:

Catalog                     GEOS-Chem X.Y version-specific?
MeteorologicalInputs.csv    No
EmissionsInputs.csv         Yes
ChemistryInputs.csv         Yes
InitialConditions.csv       Yes

In general, the workflow for downloading GEOS-Chem input data is:

  1. Download the catalog files you need. These catalogs define the data collections (input data) that GEOS-Chem needs.
    1. (Optional) If you use any optional emissions or specialty simulations, enable the appropriate collections in your catalogs (column 3 of a catalog file).
  2. Fetch collection metadata by running bashdatacatalog-fetch. This downloads the metadata for each active collection in your catalogs, including the list of files in the collection, their checksums, and other details.
  3. Generate a list of the files you need to download by running bashdatacatalog-list. You can choose from several output formats (e.g., a download list for cURL, wget, or Globus) and several options (e.g., filtering by a date range, or listing only missing files).

Installation

Refer to the Installation Instructions. In brief, run the following command, answer the prompts, and restart your terminal.

$ bash <(curl -s https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/install.sh)
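
After restarting your terminal, you may want to confirm that the commands are on your PATH. The following is a minimal sketch, assuming the installer exposes the bashdatacatalog-fetch and bashdatacatalog-list commands named in the overview. check_tools is a hypothetical helper for illustration and is not part of bashdatacatalog.

```shell
# Hypothetical helper: report which of the given commands are on PATH.
# Returns 0 if all were found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Command names are taken from the overview above (assumed to be on PATH
# after installation).
check_tools bashdatacatalog-fetch bashdatacatalog-list ||
    echo "Some commands were not found; try restarting your terminal."
```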

Set Up Instructions

Note: These set up instructions work for pre-existing (in-place set up) and new copies of the GEOS-Chem input data.

Download input data catalogs

First, create a directory to house your catalog files. Navigate to the top level of the GEOS-Chem data directory (the directory that contains HEMCO/, CHEM_INPUTS/, etc.) and create a new directory called InputDataCatalogs. Inside this directory, create a subdirectory for each GEOS-Chem version you work with.

liam@~$ cd /ExtData  # navigate to GEOS-Chem data
liam@/ExtData$ mkdir InputDataCatalogs       # new directory for catalog files
liam@/ExtData$ mkdir InputDataCatalogs/13.2  # " for 13.2-specific catalogs
liam@/ExtData$ mkdir InputDataCatalogs/13.3  # " for 13.3-specific catalogs

Next, download the catalogs for the GEOS-Chem versions you work with. Since MeteorologicalInputs.csv doesn't change between GEOS-Chem versions, put it in InputDataCatalogs/. Since EmissionsInputs.csv, ChemistryInputs.csv, and InitialConditions.csv do change between GEOS-Chem versions, put those in the version-specific subdirectories. Input data catalogs can be downloaded from http://geoschemdata.wustl.edu/ExtData/DataCatalogs/.

Examples of downloading catalog files:

Download the catalog for metfields:

liam@/ExtData$ cd InputDataCatalogs
liam@/ExtData/InputDataCatalogs$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/MeteorologicalInputs.csv

Download the 13.2-specific catalogs:

liam@/ExtData/InputDataCatalogs$ cd 13.2
liam@/ExtData/InputDataCatalogs/13.2$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.2/ChemistryInputs.csv
liam@/ExtData/InputDataCatalogs/13.2$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.2/EmissionsInputs.csv
liam@/ExtData/InputDataCatalogs/13.2$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.2/InitialConditions.csv

Download the 13.3-specific catalogs:

liam@/ExtData/InputDataCatalogs/13.2$ cd ../13.3
liam@/ExtData/InputDataCatalogs/13.3$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/ChemistryInputs.csv
liam@/ExtData/InputDataCatalogs/13.3$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/EmissionsInputs.csv
liam@/ExtData/InputDataCatalogs/13.3$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/InitialConditions.csv
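
If you work with several versions, the per-version downloads above can be generated in a loop. This is a sketch under stated assumptions: version_catalog_urls is a hypothetical helper, and the URL pattern is inferred from the 13.2 and 13.3 examples, so it may not hold for other versions. The block itself only prints URLs (a dry run) and does not touch the network.

```shell
# Base URL taken from the examples above.
base_url="http://geoschemdata.wustl.edu/ExtData/DataCatalogs"

# Hypothetical helper: print the URLs of the three version-specific catalogs
# for one GEOS-Chem X.Y version. The pattern <base>/<version>/<catalog>.csv
# is an assumption based on the 13.2 and 13.3 examples.
version_catalog_urls() {
    version=$1
    for catalog in ChemistryInputs EmissionsInputs InitialConditions; do
        echo "$base_url/$version/$catalog.csv"
    done
}

# Dry run: show what would be downloaded for each version.
for v in 13.2 13.3; do
    version_catalog_urls "$v"
done

# To actually download (run from the top level of the data directory):
#   version_catalog_urls 13.2 | wget -P InputDataCatalogs/13.2 -i -
```

Here wget's -i - reads the URL list from standard input and -P sets the directory the files are saved into.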

(Optional) Modify catalogs to activate/disable certain collections

Column 3 of a catalog file is a boolean flag that enables or disables each collection. By default, only the collections required for an out-of-the-box GC-Classic or GCHP simulation are enabled (they have a 1 in column 3). Column 3 is where you customize the active collections according to the simulations you plan to run. You will need to modify column 3 if you use:

  • GEOS-FP or nested grid metfields,
  • optional emissions, or
  • specialty simulations.
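
To check which collections are currently enabled, you can filter a catalog on column 3. This is a sketch only: list_enabled is a hypothetical helper, the stand-in catalog uses the same four-column layout as the real files (path, URL, enabled flag, notes), and it assumes the first three columns contain no quoted commas.

```shell
# Hypothetical helper: print the path (column 1) of every collection whose
# Enabled flag (column 3) is 1, skipping the header row.
list_enabled() {
    awk -F, 'NR > 1 && $3 == 1 { print $1 }' "$1"
}

# Small stand-in catalog so the example runs without touching real files.
sample_catalog=$(mktemp)
cat > "$sample_catalog" <<'EOF'
Path to collection,Canonical collection (URL),Enabled,Notes
GEOS_0.25x0.3125_AS/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_AS/GEOS_FP,1,
GEOS_0.25x0.3125_CH/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_CH/GEOS_FP,0,
EOF

list_enabled "$sample_catalog"
```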

As an example, let's configure MeteorologicalInputs.csv so that the GEOS_0.25x0.3125_AS/GEOS_FP collection (GEOS-FP 0.25°x0.3125° fields) is activated. Open MeteorologicalInputs.csv with your preferred text or CSV editor and make the following modification:

 Path to collection,Canonical collection (URL),Enabled,Notes
-GEOS_0.25x0.3125_AS/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_AS/GEOS_FP,0,
+GEOS_0.25x0.3125_AS/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_AS/GEOS_FP,1,
 GEOS_0.25x0.3125_CH/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_CH/GEOS_FP,0,
 GEOS_0.25x0.3125_EU/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_EU/GEOS_FP,0,
 GEOS_0.25x0.3125/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125/GEOS_FP,0,

Note: Over time we will improve the documentation of which collections are needed when. In the meantime, refer to column 4 for notes on when a collection is needed. If you have questions, please open an issue on GitHub.
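
If you prefer to script the column 3 edit rather than use an editor, awk can flip the flag for one collection. This is a minimal sketch: enable_collection is a hypothetical helper, it runs here against a temporary stand-in catalog, and it assumes a plain comma-separated file with no quoted commas in the first three columns.

```shell
# Hypothetical helper: set the Enabled flag (column 3) to 1 for the row whose
# path (column 1) matches exactly. Rewrites the catalog via a temporary file.
enable_collection() {
    catalog=$1
    collection=$2
    awk -F, -v OFS=, -v target="$collection" \
        '$1 == target { $3 = 1 } { print }' "$catalog" > "$catalog.tmp" &&
        mv "$catalog.tmp" "$catalog"
}

# Stand-in catalog with both collections disabled.
demo_catalog=$(mktemp)
cat > "$demo_catalog" <<'EOF'
Path to collection,Canonical collection (URL),Enabled,Notes
GEOS_0.25x0.3125_AS/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_AS/GEOS_FP,0,
GEOS_0.25x0.3125_CH/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_CH/GEOS_FP,0,
EOF

enable_collection "$demo_catalog" GEOS_0.25x0.3125_AS/GEOS_FP
cat "$demo_catalog"
```

Matching on the exact path in column 1 keeps the edit from touching similarly named collections such as the _CH and _EU variants.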

Set up instructions

For an existing GEOS-Chem input data repository

This section covers how you can start using the bashdatacatalog in an existing GEOS-Chem input data repository (ExtData/) on your local machine.

  1. Navigate to your local GEOS-Chem input data repository (ExtData/).
  2. Create a directory called CatalogFiles/ (or whatever you want to name it). This is where you will store the catalog files that specify your input data requirements (you maintain these files).
  3. Download the catalog files for the versions of GEOS-Chem that you want to use. Put them in your CatalogFiles/ directory. Make any edits you want to the catalog files (e.g., enabling the metfield collections that you need).
  4. Run bashdatacatalog CatalogFiles/*.csv fetch from the root level of your data repository.

Important limitations

There are only two mechanisms for selecting and filtering data files with the bashdatacatalog.

The first mechanism is the enable/disable switch in column 3 of a catalog file. This mechanism operates at the collection level, and it is the only way to "activate" or "deactivate" data collections according to the types of simulations you run. Simulation type-specific and grid-specific collections are not handled automatically! Instead, you need to "activate" the appropriate collections in your catalog file. The collections enabled in the default catalog files are those for a "standard" GEOS-Chem simulation (MERRA-2, full chemistry). For example, if you plan to run nested NA simulations with MERRA-2 metfields, you will need to put a 1 in column 3 of your meteorological inputs catalog for the GEOS_0.5x0.625_NA/MERRA2 collection.

The second mechanism is the optional date range in catalog queries. This mechanism operates at the file level, and it is the only way to filter out temporal files that aren't needed for your simulation period.

Caveats

Queries on GEOS-Chem input data catalogs don't give the exact minimum input file requirements like a dry-run would. Instead, the data catalogs are meant to organize input data at a high level, so that you (a human) can select the data collections you need. This means query results will include more input files than strictly necessary, but in practice the overhead is relatively small. By restricting the granularity of selection and filtering, the cataloging system is a lot simpler to use and maintain.

Here are the specifics of the "extra" data that query results will include:

  • For climatological data collections, the first and last year of data are considered "always required" (i.e., results of queries with date ranges will always include the first and last year of climatology data).
  • Emissions collections do not distinguish grid-specific or meteorology-specific files. For example, HEMCO/OFFLINE_BIOVOC/v2019-10 is a single collection which includes both 0.25°x0.3125° and 0.5°x0.625° files.
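
To make the climatology rule in the first bullet concrete, the sketch below lists which years a date-range query would return under that rule. It is an illustration of the stated behavior only, not bashdatacatalog's implementation: selected_years is a hypothetical function, the years are made up, and it assumes each file in the collection covers one calendar year.

```shell
# Hypothetical illustration: given the first and last year available in a
# climatological collection and a query range, print the years selected.
# The first and last available years are always included.
selected_years() {
    first=$1
    last=$2
    query_start=$3
    query_end=$4
    year=$first
    while [ "$year" -le "$last" ]; do
        if [ "$year" -eq "$first" ] || [ "$year" -eq "$last" ] ||
           { [ "$year" -ge "$query_start" ] && [ "$year" -le "$query_end" ]; }; then
            echo "$year"
        fi
        year=$((year + 1))
    done
}

# Collection spans 2000-2010; the query asks for 2004-2005.
selected_years 2000 2010 2004 2005
```

For this made-up collection the result is 2000, 2004, 2005, and 2010: two years more than the query range strictly needs.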