Instructions for GEOS Chem Users
🚧 bashdatacatalog v0.1.0 will be released on February 28 2022. This page is still a work in progress.
This page is a standalone summary of bashdatacatalog instructions for GEOS-Chem users.
There are two important terms you should understand before starting:
- data collection - A data collection is an input data directory. For example, HEMCO/CEDS/v2021-06 is a single data collection.
- catalog file - A file that groups data collections together. For example, the following catalog file specifies the data collections for emissions inputs for GEOS-Chem v13.2: EmissionsInputs.csv.
GEOS-Chem input data is divided into 4 catalogs: MeteorologicalInputs.csv, EmissionsInputs.csv, ChemistryInputs.csv, and InitialConditions.csv. Scientific updates in X.Y versions of GEOS-Chem often introduce new collections with updated emissions, chemistry inputs, and initial conditions, so certain catalogs are version-specific:
Catalog | GEOS-Chem X.Y Version-Specific? |
---|---|
MeteorologicalInputs.csv | No |
EmissionsInputs.csv | Yes |
ChemistryInputs.csv | Yes |
InitialConditions.csv | Yes |
In general, the workflow for downloading GEOS-Chem input data is:
- Download the catalog files you need. These catalogs define the data collections (input data) that GEOS-Chem needs.
- (Optional) If you are using any optional emissions or specialty simulations, enable the appropriate collections in your catalogs (column 3 of a catalog file).
- Fetch collection metadata by running bashdatacatalog-fetch. This downloads metadata for each active collection in your catalogs, which includes the files in every collection, their checksums, and other details.
- Generate a list of the files you need to download by running bashdatacatalog-list. There are a variety of output formats to choose from (e.g., a download list for cURL, wget, or Globus) along with a variety of options (e.g., filtering by a date range, or only listing missing files).
Refer to the Installation Instructions. In brief, run the following command, answer the prompts, and restart your terminal.
$ bash <(curl -s https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/install.sh)
Note: These setup instructions work for both pre-existing copies (in-place setup) and new copies of the GEOS-Chem input data.
First, create a directory to house your catalog files. Navigate to the top level of the GEOS-Chem data directory (the directory with HEMCO/, CHEM_INPUTS/, etc.) and create a new directory called InputDataCatalogs. Inside this directory, create subdirectories for the GEOS-Chem versions you work with.
liam@~$ cd /ExtData # navigate to GEOS-Chem data
liam@/ExtData$ mkdir InputDataCatalogs # new directory for catalog files
liam@/ExtData$ mkdir InputDataCatalogs/13.2 # " for 13.2-specific catalogs
liam@/ExtData$ mkdir InputDataCatalogs/13.3 # " for 13.3-specific catalogs
Next, download the catalogs for the GEOS-Chem versions you work with. Since MeteorologicalInputs.csv doesn't change between GEOS-Chem versions, put it in InputDataCatalogs/. Since EmissionsInputs.csv, ChemistryInputs.csv, and InitialConditions.csv do change between GEOS-Chem versions, put those in the version-specific subdirectories. Input data catalogs can be downloaded from http://geoschemdata.wustl.edu/ExtData/DataCatalogs/.
Examples of downloading catalog files:
Download the catalog for metfields:
liam@/ExtData$ cd InputDataCatalogs
liam@/ExtData/InputDataCatalogs$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/MeteorologicalInputs.csv
Download the 13.2-specific catalogs:
liam@/ExtData/InputDataCatalogs$ cd 13.2
liam@/ExtData/InputDataCatalogs/13.2$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.2/ChemistryInputs.csv
liam@/ExtData/InputDataCatalogs/13.2$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.2/EmissionsInputs.csv
liam@/ExtData/InputDataCatalogs/13.2$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.2/InitialConditions.csv
Download the 13.3-specific catalogs:
liam@/ExtData/InputDataCatalogs/13.2$ cd ../13.3
liam@/ExtData/InputDataCatalogs/13.3$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/ChemistryInputs.csv
liam@/ExtData/InputDataCatalogs/13.3$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/EmissionsInputs.csv
liam@/ExtData/InputDataCatalogs/13.3$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/InitialConditions.csv
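If you work with several versions, the downloads above can be scripted. The following is a minimal sketch and not part of bashdatacatalog: `catalog_urls` is a hypothetical helper, and the URL layout is assumed to follow the examples above. It only prints target directories and URLs, so you can review them before downloading anything.

```shell
#!/usr/bin/env bash
# Sketch: print "<target directory> <URL>" pairs for the catalogs of several versions.
# catalog_urls is a hypothetical helper, not a bashdatacatalog command.
BASE_URL="http://geoschemdata.wustl.edu/ExtData/DataCatalogs"

catalog_urls() {
    # The meteorology catalog is shared by all versions, so it goes in the top-level directory.
    echo "InputDataCatalogs ${BASE_URL}/MeteorologicalInputs.csv"
    # The remaining catalogs are version-specific and go in per-version subdirectories.
    local version catalog
    for version in "$@"; do
        for catalog in ChemistryInputs EmissionsInputs InitialConditions; do
            echo "InputDataCatalogs/${version} ${BASE_URL}/${version}/${catalog}.csv"
        done
    done
}

catalog_urls 13.2 13.3
```

To perform the downloads from the top level of the data directory, the output could be piped into a loop such as `while read -r dir url; do mkdir -p "$dir"; wget -P "$dir" "$url"; done`.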
Column 3 of a catalog file is a boolean flag that enables or disables each collection. By default, only the collections that are required for an out-of-the-box GC-Classic and GCHP simulation are enabled (they have a 1 in column 3). Column 3 is where you can customize active collections according to the simulations you plan on running. You will need to modify column 3 if you use
- GEOS-FP or nested grid metfields,
- optional emissions, or
- specialty simulations
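Before editing a catalog, it can help to see which collections are currently enabled. The sketch below is an illustration rather than a bashdatacatalog feature: `list_enabled` is a hypothetical helper, the sample rows and their flags are for demonstration only, and the simple comma splitting assumes that columns 1 and 2 contain no commas.

```shell
#!/usr/bin/env bash
# Sketch: show which collections are enabled (column 3 equals 1) in a catalog file.
# The sample catalog is for illustration; point CATALOG at a real catalog file instead.
CATALOG="$(mktemp)"
cat > "$CATALOG" <<'EOF'
Path to collection,Canonical collection (URL),Enabled,Notes
GEOS_0.25x0.3125_AS/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_AS/GEOS_FP,1,
GEOS_0.25x0.3125_CH/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_CH/GEOS_FP,0,
EOF

list_enabled() {
    # Skip the header row (NR > 1) and print column 1 of rows whose column 3 is 1.
    awk -F, 'NR > 1 && $3 == 1 { print $1 }' "$1"
}

list_enabled "$CATALOG"
```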
As an example, let's configure MeteorologicalInputs.csv so that the collection with GEOS-FP 0.25°x0.3125° fields for the AS (Asia) nested domain is activated. Open MeteorologicalInputs.csv with your preferred text or CSV editor and make the following modification:
Path to collection,Canonical collection (URL),Enabled,Notes
-GEOS_0.25x0.3125_AS/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_AS/GEOS_FP,0,
+GEOS_0.25x0.3125_AS/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_AS/GEOS_FP,1,
GEOS_0.25x0.3125_CH/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_CH/GEOS_FP,0,
GEOS_0.25x0.3125_EU/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_EU/GEOS_FP,0,
GEOS_0.25x0.3125/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125/GEOS_FP,0,
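The same change can also be made without opening an editor. This is a minimal sketch under the assumption that column 1 identifies a collection uniquely and contains no commas; `enable_collection` is a hypothetical helper, not a bashdatacatalog command. It is demonstrated on a throwaway copy of two rows from the example above.

```shell
#!/usr/bin/env bash
# Sketch: set column 3 to 1 for one collection. enable_collection is a hypothetical helper.
enable_collection() {
    local collection="$1" catalog="$2" tmp
    tmp="$(mktemp)"
    # Match column 1 exactly, rewrite column 3, and leave every other row untouched.
    awk -F, -v OFS=, -v c="$collection" '$1 == c { $3 = 1 } { print }' "$catalog" > "$tmp" \
        && mv "$tmp" "$catalog"
}

# Demo on a temporary catalog so that no real file is modified.
CATALOG="$(mktemp)"
cat > "$CATALOG" <<'EOF'
Path to collection,Canonical collection (URL),Enabled,Notes
GEOS_0.25x0.3125_AS/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_AS/GEOS_FP,0,
GEOS_0.25x0.3125_CH/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_CH/GEOS_FP,0,
EOF

enable_collection "GEOS_0.25x0.3125_AS/GEOS_FP" "$CATALOG"
cat "$CATALOG"
```

Writing to a temporary file and then moving it into place avoids truncating the catalog if awk fails partway through.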
Note: Over time we will improve the documentation of which collections are needed when. In the meantime, refer to column 4 for notes on when a collection is needed. If you have questions, please open an issue on GitHub.
This section covers how you can start using the bashdatacatalog in an existing GEOS-Chem input data repository (ExtData/) on your local machine.
- Navigate to your local GEOS-Chem input data repository (ExtData/).
- Create a directory called CatalogFiles/ (or whatever you want to name it). This is where you will store the catalog files that specify your input data requirements (you maintain these files).
- Download the catalog files for the versions of GEOS-Chem that you want to use. Put them in your CatalogFiles/ directory. Make any edits you want to the catalog files (e.g., enabling the metfield collections that you need).
- Run bashdatacatalog CatalogFiles/*.csv fetch (run this at the root level of your data repository).
There are only two mechanisms for selecting and filtering data files with the bashdatacatalog.
The first mechanism is the enable/disable switch in column 3 of a catalog file. This mechanism operates at the collection level, and it is the only way to "activate" or "deactivate" data collections according to the types of simulations you run. Simulation type-specific and grid-specific collections are not handled automatically! Instead, you need to "activate" the appropriate collections in your catalog file. By default, the active collections in the default catalog files are for a "standard" GEOS-Chem simulation (MERRA-2, full chemistry). For example, if you plan to run nested NA simulations with MERRA-2 metfields, you will need to put a 1 in column 3 of your meteorological inputs catalog for the GEOS_0.5x0.625_NA/MERRA2 collection.
The second mechanism is the optional date range in catalog queries. This mechanism operates at the file level, and it is the only way to filter out temporal files that aren't needed for your simulation period.
Queries on GEOS-Chem input data catalogs don't give the exact minimum input file requirements like a dry-run would. Instead, the data catalogs are meant to organize input data at a high level, so that you (a human) can select the data collections you need. This means query results will include more input files than strictly necessary, but in practice the overhead is relatively small. By restricting the granularity of selection and filtering, the cataloging system is a lot simpler to use and maintain.
Here are the specifics of the "extra" data that query results will include:
- For climatological data collections, the first and last year of data are considered "always required" (i.e., results of queries with date ranges will always include the first and last year of climatology data).
- Emissions collections do not distinguish grid-specific or meteorology-specific files. For example, HEMCO/OFFLINE_BIOVOC/v2019-10 is a single collection which includes both 0.25°x0.3125° and 0.5°x0.625° files.
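To make the climatology rule concrete, the sketch below illustrates which years of a climatological collection a date-range query would keep according to the description above. It illustrates the rule as stated and is not the bashdatacatalog implementation; `select_years` and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the rule: keep years inside the query range, plus the collection's first and last year.
# Usage: select_years FIRST_YEAR LAST_YEAR QUERY_START QUERY_END
select_years() {
    local first="$1" last="$2" start="$3" end="$4" year
    for (( year = first; year <= last; year++ )); do
        if (( year == first || year == last || (year >= start && year <= end) )); then
            echo "$year"
        fi
    done
}

# A collection covering 2000-2019, queried for 2005-2006.
select_years 2000 2019 2005 2006
```

In this example, 2000 and 2019 are listed even though they fall outside the queried range, which is the "extra" data the first bullet describes.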
Consider giving the bashdatacatalog a Star ⭐ if you find it useful. This increases visibility and helps justify maintaining this repository.