Instructions for GEOS Chem Users

Liam Bindle edited this page Feb 25, 2022 · 25 revisions

This page is a standalone summary of bashdatacatalog instructions for GEOS-Chem users.

Terminology

There are two important terms you should understand before starting:

  • data collection - A data collection is an input data directory. For example, HEMCO/CEDS/v2021-06 is a single data collection.
  • catalog file - A file that groups data collections together. For example, the following catalog file specifies the data collections for emissions inputs for GEOS-Chem v13.2: EmissionsInputs.csv.

Overview

GEOS-Chem input data is divided into four catalogs: MeteorologicalInputs.csv, EmissionsInputs.csv, ChemistryInputs.csv, and InitialConditions.csv. Scientific updates in X.Y versions of GEOS-Chem often introduce new collections with updated emissions, chemistry inputs, and initial conditions, so certain catalogs are version-specific:

Catalog                     GEOS-Chem X.Y Version-Specific?
MeteorologicalInputs.csv    No
EmissionsInputs.csv         Yes
ChemistryInputs.csv         Yes
InitialConditions.csv       Yes

In general, the workflow for downloading GEOS-Chem input data is:

  1. Download catalog files you need. These catalogs define the data collections (input data) that GEOS-Chem needs.
    1. (Optional) If you are using any optional emissions or specialty simulations, enable the appropriate collections in your catalogs (column 3 of a catalog file).
  2. Fetch collection metadata by running bashdatacatalog-fetch. This downloads metadata for each active collection in your catalogs, including the list of files in each collection, their checksums, and other details.
  3. Generate a list of the files you need to download by running bashdatacatalog-list. There are a variety of output formats to choose from (e.g., a download list for cURL, wget, or Globus) along with a variety of options (e.g., filtering by a date range, or listing only missing files).

Installation

Refer to the Installation Instructions. In brief, run the following command, answer the prompts, and restart your terminal.

$ bash <(curl -s https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/install.sh)

💡 Set Up Instructions

Note: These set up instructions work for pre-existing (in-place set up) and new copies of the GEOS-Chem input data.

Download input data catalogs

First, create a directory to house your catalog files. Navigate to the top-level of the GEOS-Chem data directory (the directory with HEMCO/, CHEM_INPUTS/, etc.) and create a new directory called InputDataCatalogs. Inside this directory, create subdirectories for the GEOS-Chem versions you work with.

liam@~$ cd /ExtData  # navigate to GEOS-Chem data
liam@/ExtData$ mkdir InputDataCatalogs       # new directory for catalog files
liam@/ExtData$ mkdir InputDataCatalogs/13.2  # " for 13.2-specific catalogs
liam@/ExtData$ mkdir InputDataCatalogs/13.3  # " for 13.3-specific catalogs
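
The same layout can also be created in one step. The sketch below assumes the data root is held in a shell variable named EXTDATA (a hypothetical name, not something bashdatacatalog reads); it defaults to a scratch directory so the snippet is safe to try. It relies on mkdir -p, which creates missing parent directories and does not fail if a directory already exists.

```shell
# EXTDATA is a hypothetical variable for this sketch; set it to your GEOS-Chem
# data root (e.g., /ExtData). It defaults to a scratch directory for a dry run.
EXTDATA="${EXTDATA:-$(mktemp -d)}"

# Create the catalog directory plus one subdirectory per GEOS-Chem version.
for version in 13.2 13.3; do
    mkdir -p "$EXTDATA/InputDataCatalogs/$version"
done

# Show the result.
ls "$EXTDATA/InputDataCatalogs"
```

Add or remove version numbers in the loop to match the GEOS-Chem versions you work with.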

Next, download the catalogs for the GEOS-Chem versions you work with. Since MeteorologicalInputs.csv doesn't change between GEOS-Chem versions, put it in InputDataCatalogs/. Since EmissionsInputs.csv, ChemistryInputs.csv, and InitialConditions.csv do change between GEOS-Chem versions, put those in the version-specific subdirectories. Input data catalogs can be downloaded from http://geoschemdata.wustl.edu/ExtData/DataCatalogs/.

Examples of downloading catalog files:

Download the catalog for metfields:

liam@/ExtData$ cd InputDataCatalogs
liam@/ExtData/InputDataCatalogs$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/MeteorologicalInputs.csv

Download the 13.2-specific catalogs:

liam@/ExtData/InputDataCatalogs$ cd 13.2
liam@/ExtData/InputDataCatalogs/13.2$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.2/ChemistryInputs.csv
liam@/ExtData/InputDataCatalogs/13.2$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.2/EmissionsInputs.csv
liam@/ExtData/InputDataCatalogs/13.2$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.2/InitialConditions.csv

Download the 13.3-specific catalogs:

liam@/ExtData/InputDataCatalogs/13.2$ cd ../13.3
liam@/ExtData/InputDataCatalogs/13.3$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/ChemistryInputs.csv
liam@/ExtData/InputDataCatalogs/13.3$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/EmissionsInputs.csv
liam@/ExtData/InputDataCatalogs/13.3$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/InitialConditions.csv
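
If you work with several versions, the per-version downloads above follow one URL pattern, so they can be generated rather than typed. The following is a minimal sketch: catalog_urls is a hypothetical helper (not part of bashdatacatalog) that only prints the three version-specific catalog URLs for a given version, using the same base URL as the examples above.

```shell
# Base URL taken from the download examples in this section.
BASE_URL="http://geoschemdata.wustl.edu/ExtData/DataCatalogs"

# catalog_urls is a hypothetical helper for this sketch.
# Usage: catalog_urls VERSION   (e.g., catalog_urls 13.3)
# Prints the URLs of the three version-specific catalogs, one per line.
catalog_urls() {
    local version="$1"
    local name
    for name in ChemistryInputs EmissionsInputs InitialConditions; do
        echo "$BASE_URL/$version/$name.csv"
    done
}

catalog_urls 13.3
```

Because the helper only prints URLs, you can inspect the list before downloading anything. To download, one option is to pipe it to wget, for example catalog_urls 13.3 | wget -P InputDataCatalogs/13.3 -i - (here -i - reads URLs from standard input and -P sets the target directory).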

(Optional) Modify catalogs to activate/disable certain collections

Column 3 of a catalog file is a boolean flag that enables or disables each collection. By default, only the collections that are required for an out-of-the-box GC-Classic and GCHP simulation are enabled (they have a 1 in column 3). Column 3 is where you can customize active collections according to the simulations you plan on running. You will need to modify column 3 if you use:

  • GEOS-FP or nested grid metfields,
  • optional emissions, or
  • specialty simulations

As an example, let's configure MeteorologicalInputs.csv so that the collection with GEOS-FP 0.25°x0.3125° global fields is activated. Open MeteorologicalInputs.csv with your preferred text or CSV editor and make the following modification:

 GEOS_0.25x0.3125_AS/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_AS/GEOS_FP,0,
 GEOS_0.25x0.3125_CH/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_CH/GEOS_FP,0,
 GEOS_0.25x0.3125_EU/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_EU/GEOS_FP,0,
-GEOS_0.25x0.3125/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125/GEOS_FP,0,
+GEOS_0.25x0.3125/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125/GEOS_FP,1,
 GEOS_0.25x0.3125_NA/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_NA/GEOS_FP,0,
 GEOS_0.5x0.625_AS/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.5x0.625_AS/GEOS_FP,0,
 GEOS_0.5x0.625_AS/MERRA2,http://geoschemdata.wustl.edu/ExtData/GEOS_0.5x0.625_AS/MERRA2,0,
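
If you prefer the command line to an editor, the same change can be scripted. The sketch below assumes the comma-separated layout shown above (collection path in column 1, enable flag in column 3); enable_collection is a hypothetical helper, not part of bashdatacatalog. It prints the modified catalog to standard output rather than editing the file in place, so you can review the result first.

```shell
# enable_collection is a hypothetical helper for this sketch.
# Usage: enable_collection COLLECTION CATALOG
# Prints CATALOG with column 3 set to 1 on the row whose column 1 equals
# COLLECTION exactly. All other rows are printed unchanged.
enable_collection() {
    awk -F, -v OFS=, -v target="$1" '$1 == target { $3 = 1 } { print }' "$2"
}

# Example (commented out): write a new file, review the change, then replace.
# enable_collection GEOS_0.25x0.3125/GEOS_FP MeteorologicalInputs.csv > MeteorologicalInputs.csv.new
# diff MeteorologicalInputs.csv MeteorologicalInputs.csv.new
# mv MeteorologicalInputs.csv.new MeteorologicalInputs.csv
```

The match on column 1 is exact, so enabling GEOS_0.25x0.3125/GEOS_FP leaves the regional collections (e.g., GEOS_0.25x0.3125_NA/GEOS_FP) untouched.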

Note: Over time we will improve the documentation of which collections are needed when. In the meantime, refer to column 4 for notes on when a collection is needed. If you have questions, please open an issue on GitHub.

💡 Usage Instructions

⚠️ IMPORTANT: You should always run bashdatacatalog commands from the top-level of your GEOS-Chem data directory (the directory with HEMCO/, CHEM_INPUTS/, etc.), otherwise, relative paths to data collections (column 1) will be interpreted incorrectly.

Navigate to the top-level of your GEOS-Chem data directory. First, you need to fetch collection metadata (the information about what files exist in each collection). This is done with the bashdatacatalog-fetch command which takes catalog files as its arguments. See bashdatacatalog-fetch -h for more details.

liam@~$ cd /ExtData  # IMPORTANT: navigate to top-level of GEOS-Chem input data
liam@/ExtData$ bashdatacatalog-fetch InputDataCatalogs/*.csv InputDataCatalogs/**/*.csv
Expected output:
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/CHEM_INPUTS/FastJ_201204/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/CHEM_INPUTS/FAST_JX/v2020-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/CHEM_INPUTS/Linoz_200910/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/CHEM_INPUTS/Olson_Land_Map_201203/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/CHEM_INPUTS/UCX_201403/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125/GEOS_FP/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/GEOS_0.5x0.625/MERRA2/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/GEOS_2x2.5/MERRA2/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/GEOSCHEM_RESTARTS/GC_13.0.0/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/GEOSCHEM_RESTARTS/v2021-09/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/ACET/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/AEIC/v2015-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/AFCID/v2018-04/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/ALD2/v2017-03/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/AnnualScalar/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/APEI/v2016-11/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/BIOFUEL/v2019-08/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/BROMINE/v2015-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/C2H6_2010/v2019-06/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/CEDS/v2021-06/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/CH3I/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/CMIP6/v2020-03/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/DICE_Africa/v2016-10/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/DMS/v2015-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/DUST_DEAD/v2019-06/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/EDGARv42/v2015-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/EDGARv43/v2016-11/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/GEIA/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/GFED4/v2015-10/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/GFED4/v2020-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/GMI/v2015-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/HTAP/v2015-03/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/IODINE/v2020-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/LIGHTNOX/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MASKS/v2018-09/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MASKS/v2019-05/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MEGAN/v2018-05/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MEGAN/v2020-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MODIS_CHLR/v2019-11/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MOH/v2019-12/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MULTI_ICE/v2021-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/NEI2005/v2014-09/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/NEI2011/v2017-02-MM_for_GCHP/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/NEI2016/v2021-06/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/NH3/v2018-04/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/NH3/v2019-08/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/NOAA_GMD/v2018-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OCEAN_O3_DRYDEP/v2020-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_BIOVOC/v2019-10/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_DUST/v2019-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_DUST/v2021-08/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_LIGHTNING/v2020-03/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_SEASALT/v2019-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_SOILNOX/v2019-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OLSON_MAP/v2019-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OMOC/v2018-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/PARANOX/v2015-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/POET/v2017-03/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/RCP/v2020-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/RONO2/v2019-05/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/RRTMG/v2018-11/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/SfcFix/v2019-12/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/SOA/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/SOILNOX/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/STRAT/v2015-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/TIMEZONES/v2015-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/TrashEmis/v2015-03/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/UVALBEDO/v2019-06/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/VerticalScaleFactors/v2021-05/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/VOLCANO/v2019-08/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/VOLCANO/v2021-09/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/XIAO/v2014-09/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/Yuan_XLAI/v2021-06/'

Note: Fetching should take about 1 minute. Recently, the GEOS-Chem data portal has had intermittent periods where it is slow to respond, so if fetching is taking a long time, try again later.

Fetching downloads the latest metadata for every active collection in your catalogs. You should run bashdatacatalog-fetch whenever you add or modify a catalog, as well as periodically so you get updates to your collections (e.g., new meteorological data that is processed and added to the meteorological collections).
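
The fetch command above passes two glob patterns. In bash, unless the globstar option is enabled, ** behaves like *, so InputDataCatalogs/**/*.csv matches catalogs exactly one directory below InputDataCatalogs/, which fits the layout from the set up instructions. If your catalogs are nested differently, a find-based listing does not depend on glob settings. The sketch below uses a hypothetical helper, list_catalogs, which is not part of bashdatacatalog.

```shell
# list_catalogs is a hypothetical helper for this sketch.
# Usage: list_catalogs [DIR]   (DIR defaults to InputDataCatalogs)
# Prints every .csv file under DIR, at any depth, in sorted order.
list_catalogs() {
    find "${1:-InputDataCatalogs}" -type f -name '*.csv' | sort
}
```

From the top level of the data directory, the result can be passed along as bashdatacatalog-fetch $(list_catalogs). This relies on shell word splitting, so it assumes the catalog paths contain no spaces.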

Now that you have fetched the metadata, you can run bashdatacatalog-list commands. You can tailor this command to generate various types of file lists using its command-line arguments. See bashdatacatalog-list -h for details. A common use case is generating a list of required input files that are missing from your local file system.

liam@/ExtData$ bashdatacatalog-list -am -r 2018-06-30,2018-08-01 InputDataCatalogs/*.csv InputDataCatalogs/**/*.csv

Here, -a means "all" files (temporal files and static files), -m means "missing" (list files that are absent locally), -r START,END is the date-range of your simulation (you should add an extra day before/after your simulation), and the remaining arguments are the paths to your catalog files.

The command can be easily modified so that it generates a list of missing files that is compatible with xargs curl to download all the files you are missing:

liam@/ExtData$ bashdatacatalog-list -am -r 2018-06-30,2018-08-01 -f xargs-curl InputDataCatalogs/*.csv InputDataCatalogs/**/*.csv | xargs curl

Here, -f xargs-curl means the output file list should be formatted for piping into xargs curl.

📖 You can find a list of useful list commands here: Useful Commands.

Caveats

Using the -r option to download data for the date range of a simulation

When using the -r option in bashdatacatalog-list (date range filtering), you need to subtract/add 1 day to the start/end date of your simulation. This is because time interpolation at midnight needs the files for both days.

For example, if you plan to run a simulation for 2019-01-01 to 2019-12-31, you should use an -r range like

liam@/ExtData$ bashdatacatalog-list -am -r 2018-12-31,2020-01-01 InputDataCatalogs/*.csv InputDataCatalogs/**/*.csv
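
The padded range can be computed instead of worked out by hand. This is a minimal sketch that assumes GNU date (its -d option is not available in BSD/macOS date); padded_range is a hypothetical helper, not part of bashdatacatalog.

```shell
# padded_range is a hypothetical helper for this sketch. Requires GNU date.
# Usage: padded_range START END   (dates as YYYY-MM-DD)
# Prints "START minus 1 day,END plus 1 day" in the form expected by -r.
padded_range() {
    local start end
    start=$(date -u -d "$1 -1 day" +%F) || return 1
    end=$(date -u -d "$2 +1 day" +%F) || return 1
    echo "$start,$end"
}

# Simulation from 2019-01-01 to 2019-12-31, as in the example above.
padded_range 2019-01-01 2019-12-31
```

The output can be substituted directly into the list command, for example bashdatacatalog-list -am -r "$(padded_range 2019-01-01 2019-12-31)" followed by your catalog paths.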

Static Overhead

Emission data collections that use the C flag in HEMCO (i.e., "recycling", where the first/last year of data is recycled for simulation dates before/after the data's temporal coverage) need to classify the first/last year of data as static assets rather than temporal assets. As a result, the OFFLINE_BIOVOC/v2019-10 and OFFLINE_SEASALT/v2019-01 collections each have ~100 GB of static data. This means that the minimum size of ExtData using the bashdatacatalog is ~300 GB (static data plus data that isn't easily classified from the top down). This is essentially a one-time overhead, and in practice it's a small price. Once you have this "static overhead" downloaded, subsequent downloads will include only the data you need.

You can see the static overhead of each collection here: Summary of Input Data Collection Sizes (see the "Total Size (Static)" column).

FAQ

How do I get data for optional emissions or specialty simulations?

By default, the catalogs enable (column 3) only the collections that are required for out-of-the-box GC-Classic and GCHP simulations. To enable data collections for optional emissions or specialty simulations, search column 4 (collection comments) for keywords related to the emissions or specialty simulation. Column 4 of EmissionsInputs.csv has comments that indicate the corresponding switch in HEMCO_Config.rc. For example, for APEI emissions:

liam@/ExtData$ grep -r 'APEI' InputDataCatalogs/*.csv InputDataCatalogs/**/*.csv
13.2/EmissionsInputs.csv:HEMCO/MASKS/v2018-09,http://geoschemdata.wustl.edu/ExtData/HEMCO/MASKS/v2018-09,1,For China mask and APEI and NEI2016_MONMEAN and DICE_Africa
13.2/EmissionsInputs.csv:HEMCO/APEI/v2016-11,http://geoschemdata.wustl.edu/ExtData/HEMCO/APEI/v2016-11,0,Optional; APEI Canada
13.3/EmissionsInputs.csv:HEMCO/MASKS/v2018-09,http://geoschemdata.wustl.edu/ExtData/HEMCO/MASKS/v2018-09,1,For China mask and APEI and NEI2016_MONMEAN and DICE_Africa
13.3/EmissionsInputs.csv:HEMCO/APEI/v2016-11,http://geoschemdata.wustl.edu/ExtData/HEMCO/APEI/v2016-11,0,Optional; APEI Canada

This shows that the HEMCO/APEI/v2016-11 collection is currently disabled (0 in column 3), but you could enable it by setting column 3 to 1.

Note: Technically, only one catalog needs to enable it, but it would be good practice to enable it in both catalogs above.
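
To browse everything that is currently disabled, rather than searching for one keyword, you can filter on column 3 directly. The sketch below assumes the comma-separated layout shown in the grep output above; list_disabled is a hypothetical helper, not part of bashdatacatalog.

```shell
# list_disabled is a hypothetical helper for this sketch.
# Usage: list_disabled CATALOG...
# For rows with 0 in column 3, prints the collection path (column 1)
# followed by the notes (column 4 onward).
list_disabled() {
    awk -F, '$3 == "0"' "$@" | cut -d, -f1,4-
}
```

For example, list_disabled InputDataCatalogs/13.3/EmissionsInputs.csv lists each disabled emissions collection with its comment, which makes it easier to spot the collections your simulation needs.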

After running the bashdatacatalog I'm still missing files. What should I do?

Please report it by opening an issue. Indexing all of the GEOS-Chem data has been a major organizational effort, and it is likely we have a few minor errors. It shouldn't take long to fix once you have reported the issue.

Where should I go for help?

Feel free to open an issue if you have any questions or need help.