ESGF Data Publishing
Data can be published (or unpublished) from an ESGF Node in multiple ways. The net effect of publishing data to a Node is that additional datasets and files are available as results for a search initiated at that Node, either through the web portal user interface, or the underlying back-end search services. Also, because of the distributed nature of the ESGF search services, those same results will be immediately available to a search initiated at any other Node in the federation.
Note that when publishing or unpublishing data, there is a maximum delay of 60 seconds between the execution of the operation and the time when the Node search service becomes aware of the changes.
Useful documentation for getting started with the ESGF publisher is listed in the references at the end of this document.
The most complete workflow for publishing data involves using the ESGF publishing client to parse data on the local Data Node disk, generate THREDDS catalogs that are served by that Data Node's THREDDS Data Server, and send a request to an ESGF Index Node to harvest the THREDDS catalogs and make them available for searching. Because the Data Node and Index Node can be hosted on separate servers, the catalog harvesting request requires a security setup that guarantees that the person running the ESGF publisher on the Data Node is authorized to publish that data to the Index Node.
Assuming for now that the security prerequisites have been successfully completed (see below for details on how to configure security), publishing to an ESGF Node is no different from publishing to a traditional ESG gateway: exactly the same steps must be followed, the only difference being the web service endpoint referenced in the esg.ini configuration file. In short, execute the following steps (please consult the references at the end of this document for details on how to configure and run the ESGF publishing client):
Edit the configuration file esg.ini and change the endpoint of the Hessian publication service to your specific ESGF Node setup:
- hessian_service_port = 443
- hessian_service_url = https://<hostname>/esg-search/remote/secure/client-cert/hessian/publishingService
Alternatively, use the esg-node installation script to modify this field. Use the --set-index-peer or --set-publication-peer flags to change this value:
* %> esg-node --set-index-peer <HOSTNAME>
This will not only edit the esg.ini file accordingly (as described above), but will also fetch the certificate of the endpoint and import it into your ESGF truststore (used by Tomcat).
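For example, if your Index Node were running at the hypothetical host esgf-index.example.org (a placeholder, not a real ESGF node), the relevant esg.ini lines would look like this:
{{{
hessian_service_port = 443
hessian_service_url = https://esgf-index.example.org/esg-search/remote/secure/client-cert/hessian/publishingService
}}}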
Make sure the thredds_file_services are defined in your esg.ini file. Here's an example defining the "fileservice":
{{{
thredds_file_services =
    HTTPServer | /thredds/fileServer/ | HTTPServer | fileservice
    GridFTP | gsiftp://myhost.domain:2811/ | GRIDFTP | fileservice
}}}
Obtain a digital certificate from an ESGF trusted MyProxy server, and save it to whatever path you have defined in esg.ini.
- Example: myproxy-logon -s jpl-esg.jpl.nasa.gov -l rootAdmin -T -o ~/user.pem
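To confirm that a valid certificate was retrieved, you can inspect it with the standard openssl tool (not part of ESGF, shown here only as a convenience check):
{{{
# Print the certificate subject and validity dates
openssl x509 -in ~/user.pem -noout -subject -dates
}}}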
Run the usual commands to parse the data on the local Data Node, ingest it into the local Postgres database, and send it for harvesting to the configured Index Node. For example, for the NASA AIRS collection:
- esgscan_directory --project AIRS -o airs.txt /esgdata/airs_level3-update
- esgpublish --map ./airs.txt --project AIRS --service fileservice
- esglist_datasets AIRS
- esgpublish --map ./airs.txt --project AIRS --noscan --thredds --publish --service fileservice
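For repeated publications of the same collection, the four commands above can be wrapped in a small shell script. This is only a sketch; the project name and paths are the AIRS examples from above and should be adapted to your own setup:
{{{#!/bin/bash
# Sketch: scan, ingest, and publish one collection in a single run.
set -e  # stop at the first failing step

PROJECT=AIRS
DATADIR=/esgdata/airs_level3-update
MAPFILE=./airs.txt

# 1. Scan the data directory and generate the mapfile
esgscan_directory --project $PROJECT -o $MAPFILE $DATADIR

# 2. Ingest the mapfile into the local Postgres database
esgpublish --map $MAPFILE --project $PROJECT --service fileservice

# 3. List the datasets ingested so far (visual check)
esglist_datasets $PROJECT

# 4. Generate THREDDS catalogs and publish to the Index Node
esgpublish --map $MAPFILE --project $PROJECT --noscan --thredds --publish --service fileservice
}}}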
After publishing, you can verify that new metadata has been indexed into the system in a variety of ways:
- Make a direct query through the master Solr instance admin interface running on port 8984: http://<hostname>:8984/solr/admin/ (use * as your query string)
- After the default replication time of 60 seconds, make the same query at the slave Solr instance admin interface: http://<hostname>:8983/solr/admin/
- Check the results returned by the default ESGF Node search services REST URL: http://<hostname>/esg-search/search/
- Use the web front end to formulate a specific user query: http://<hostname>/esgf-web-fe/
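For example, a quick command line check against the search REST service might look like this (a sketch: <hostname> is your Node, and the project constraint just reuses the AIRS example from above):
{{{
# Query the ESGF search service for datasets from project AIRS
curl -s "http://<hostname>/esg-search/search?project=AIRS"
}}}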
It is also possible to publish data to an ESGF Index Node by running a program on the Index Node machine that harvests THREDDS catalogs already generated on a local or remote Data Node. For example, you may want to use this option if you have already published data to an ESG Gateway, resulting in a hierarchy of THREDDS catalogs that you now want to push to the ESGF P2P Node. To do so, you can cd to the directory containing the compiled Java classes for the ESGF search services application and run the following commands (assuming that the Solr engine is accessible on localhost:8984):
- cd /usr/local/tomcat/webapps/esg-search/WEB-INF/classes
- java -Dlog4j.configuration=./log4j.xml -Djava.ext.dirs=../lib esg.search.publish.impl.PublishingServiceMain <catalog URL> <filter> THREDDS true [optional log file]
- filter is a regular expression that is used to publish/unpublish only those catalogs that match it. To apply no filtering, i.e. publish all catalogs, use the special value * or ALL.
- true is used for publishing, false for unpublishing the same catalog hierarchy.
- Optionally, the path to a log file can be supplied as the last invocation argument; the log file will keep track of which THREDDS catalogs were published successfully, and which failed.
Example: to publish the full hierarchy of NASA-JPL THREDDS catalogs:
- java -Dlog4j.configuration=./log4j.xml -Djava.ext.dirs=../lib esg.search.publish.impl.PublishingServiceMain http://esg-datanode.jpl.nasa.gov/thredds/esgcet/catalog.xml '*' THREDDS true /tmp/publishing.log
Example: to publish only those catalogs in the NASA-JPL hierarchy that match the regular expression ".*AIRS.*" (note the leading and trailing wildcards):
- java -Dlog4j.configuration=./log4j.xml -Djava.ext.dirs=../lib esg.search.publish.impl.PublishingServiceMain http://esg-datanode.jpl.nasa.gov/thredds/esgcet/catalog.xml '.*AIRS.*' THREDDS true /tmp/publishing.log
Example: to publish the full hierarchy of PCMDI THREDDS catalogs:
- java -Dlog4j.configuration=./log4j.xml -Djava.ext.dirs=../lib esg.search.publish.impl.PublishingServiceMain http://pcmdi9.llnl.gov/thredds/esgcet/catalog.xml '*' THREDDS true /tmp/publishing.log
The esgf-crawl command, which is installed as part of the standard ESGF installation, provides a handy shortcut to running the PublishingServiceMain. To publish a hierarchy of THREDDS catalogs, simply run:
- esgf-crawl --outdir <directory> -- <catalog URL>
* Example: esgf-crawl -- http://esg-datanode.jpl.nasa.gov/thredds/esgcet/catalog.xml
You can also run the program by pointing it to a text file containing a list of THREDDS catalog URLs, one per line:
- esgf-crawl --outdir <directory> --file <catalog list file>
* Example: esgf-crawl --outdir /tmp --file /usr/home/cinquini/workspace-esg/scripts/catalogs.txt
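If you don't already have such a list, one way to build it is to extract the catalogRef links from a top-level catalog. This is only a sketch, assuming the usual THREDDS catalog XML layout; note that catalogRef links are often relative, so you may need to prepend the base URL before feeding the list to esgf-crawl:
{{{#!/bin/bash
# Sketch: extract catalogRef hrefs from a top-level THREDDS catalog
# into a file usable with "esgf-crawl --file".
TOP=http://esg-datanode.jpl.nasa.gov/thredds/esgcet/catalog.xml
curl -s $TOP | grep -o 'xlink:href="[^"]*"' \
             | sed -e 's/xlink:href="//' -e 's/"$//' > /tmp/catalogs.txt
}}}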
The most complete way to unpublish a dataset and all its files is to use the ESGF publisher application. Because the ESGF publisher deletion request does not distinguish between the id of a dataset and its version, the ESGF Node will unpublish all local datasets that match that "master_id" or "instance_id". For example:
- esgunpublish --database-delete obs4MIPs.NASA-JPL.AIRS.mon.v1 will delete a specific dataset version (since no "master_id" matches the given string)
- esgunpublish --database-delete obs4MIPs.NASA-JPL.AIRS.mon will delete all dataset versions with the supplied "master_id" (since no "instance_id" matches that string)
- esgunpublish --database-delete --map <mapfile> will delete all datasets with "master_id" contained in the given map file (because the ESGF publisher parses the map file for the unversioned ids, and requests their deletion)
In all cases, the esgunpublish command has the following effects:
- It removes the metadata records (datasets and files) from the Solr index, so they can no longer be found by a search.
- It deletes the corresponding THREDDS catalogs, so the data can no longer be downloaded.
- If the option --database-delete is supplied, the datasets and files will also be removed from the local Postgres database.
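As a concrete end-to-end sketch (reusing the AIRS dataset id from above; adapt the id and hostname to your own data, and assuming master_id is accepted as a query constraint by your search service, as in standard ESGF installations):
{{{
# Unpublish all versions of the dataset and remove them from Postgres
esgunpublish --database-delete obs4MIPs.NASA-JPL.AIRS.mon

# Verify via the search REST URL that no matching records remain
curl -s "http://<hostname>/esg-search/search?master_id=obs4MIPs.NASA-JPL.AIRS.mon"
}}}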
Alternatively, you can execute the Java program PublishingServiceMain on the local machine to parse a hierarchy of THREDDS catalogs and unpublish them from the local Index Node. This is one of the options available to you if, at the time of publishing, you crawled existing catalogs from a local or remote THREDDS Data Server instead of running the ESGF publisher application.
The PublishingServiceMain program can be run as follows:
- cd /usr/local/tomcat/webapps/esg-search/WEB-INF/classes
- java -Dlog4j.configuration=./log4j.xml -Djava.ext.dirs=../lib/:../lib/fetched esg.search.publish.impl.PublishingServiceMain <catalog URL> <filter> THREDDS false [optional log file]
For example:
- java -Dlog4j.configuration=./log4j.xml -Djava.ext.dirs=../lib/:../lib/fetched esg.search.publish.impl.PublishingServiceMain http://test-datanode.jpl.nasa.gov/thredds/esgcet/catalog.xml '*' THREDDS false
- java -Dlog4j.configuration=./log4j.xml -Djava.ext.dirs=../lib esg.search.publish.impl.PublishingServiceMain http://esg-datanode.jpl.nasa.gov/thredds/esgcet/catalog.xml '.*AIRS.*' THREDDS false /tmp/publishing.log
The full hierarchy of THREDDS catalogs starting at the supplied URL will be parsed and unpublished from the local Index Node, i.e. from the underlying Solr metadata repository.
Please note the following:
- Any other version of the same datasets that is indexed on the system will be left completely untouched: these other versions need to be unpublished or republished independently.
- The PublishingServiceMain does nothing to the local or remote THREDDS Data Server: existing catalogs will remain unchanged.
- Similarly, the PublishingServiceMain has no interaction with the local Postgres database: if the datasets were published locally, they will remain in the database until they are unpublished through the ESGF publisher.
- The specified filter does not prevent the full hierarchy of catalogs from being crawled, but it does prevent publishing or unpublishing of records for those catalogs that don't match the regular expression.
The esgf-crawl command can be used for unpublishing data as well, as a handy alternative to running the PublishingServiceMain directly. To unpublish a hierarchy of THREDDS catalogs, simply run:
- esgf-crawl --remove --outdir /tmp -- <catalog URL>
* Example: esgf-crawl --remove --outdir /tmp -- 'http://test-datanode.jpl.nasa.gov/thredds/esgcet/catalog.xml'
The same considerations as in the discussion of the PublishingServiceMain apply.
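To double-check that the records have actually been removed from the index, you can query the local Solr server directly. A minimal sketch, assuming the slave Solr instance runs on port 8983 as in a standard installation, and reusing the AIRS master_id from the earlier examples:
{{{
# Count remaining dataset records matching the master_id (expect numFound=0)
curl -s 'http://localhost:8983/solr/datasets/select?q=master_id:"obs4MIPs.NASA-JPL.AIRS.mon"&rows=0'
}}}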
The most powerful (and dangerous!) method to unpublish metadata records from the local metadata index consists of interacting directly with the Solr server. From the local machine, you can send an HTTP request to the local Solr server to remove all records matching any possible search criteria, including removing all records at once. Note that separate unpublishing requests need to be sent to all local Solr cores ("datasets", "files" and "aggregations"). Just like when using the PublishingServiceMain, these operations will remove metadata from the Solr index, but will affect neither the THREDDS catalogs nor the local Postgres database.
Examples:
Remove all records from project "CMIP5":
- curl -s http://localhost:8984/solr/datasets/update?commit=true -H "Content-Type:text/xml" --data-binary '<delete><query>project:CMIP5</query></delete>'
- curl -s http://localhost:8984/solr/files/update?commit=true -H "Content-Type:text/xml" --data-binary '<delete><query>project:CMIP5</query></delete>'
- curl -s http://localhost:8984/solr/aggregations/update?commit=true -H "Content-Type:text/xml" --data-binary '<delete><query>project:CMIP5</query></delete>'
Remove ALL records from the local ESGF index:
- curl -s http://localhost:8984/solr/datasets/update?commit=true -H "Content-Type:text/xml" --data-binary '<delete><query>*:*</query></delete>'
- curl -s http://localhost:8984/solr/files/update?commit=true -H "Content-Type:text/xml" --data-binary '<delete><query>*:*</query></delete>'
- curl -s http://localhost:8984/solr/aggregations/update?commit=true -H "Content-Type:text/xml" --data-binary '<delete><query>*:*</query></delete>'
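Since the same delete request must be sent to every core, a small loop avoids repeating the command. This is just a sketch; set QUERY to any valid Solr query (the examples above use project:CMIP5 and *:*):
{{{#!/bin/bash
# Send the same delete query to all three local Solr cores
QUERY='project:CMIP5'
for core in datasets files aggregations; do
  curl -s "http://localhost:8984/solr/${core}/update?commit=true" \
       -H "Content-Type:text/xml" \
       --data-binary "<delete><query>${QUERY}</query></delete>"
done
}}}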
Publication to an Index Node via the Publisher application is secured by the standard ESGF security infrastructure: in order for a user to be able to publish a data collection, the following prerequisites must be fulfilled:
- The user must obtain an X509 certificate from the MyProxy server at the Node where he/she registered.
- The user certificate, sent by the Publisher application as part of the publishing request to the Index Node, must be trusted by that Node (which entails that the root CA certificate of the MyProxy server that issued the user certificate must be present in the Index Node's trust store).
- The Index Node administrator must configure the access control so that the specific data collection can be published by members of a particular group that are granted the "publisher" role. This process is described in detail in the reference Configuring Access Control for Publishing and Downloading Data.
- The user executing the publishing must be granted membership in that group with the role of "publisher" (note that if the collection is controlled by more than one group, the user needs only be a publisher in one of them).
Note that the security constraints only apply when publishing data via the ESGF Publisher, because this involves a request from a Data Node to a generally remote Index Node. There is no security applied when publishing data through the local PublishingServiceMain class or esgf-crawl command, since any local access is assumed to be authorized already (as it already is for ingesting data into the local Postgres database).
As data is routinely published into an Index Node, the underlying Solr index becomes fragmented into several pieces or segments, which causes searches to be slower. It is good practice to periodically issue a directive to Solr to optimize its index, which will reduce the number of segments from tens or hundreds to about 11 in the current release. Optimization is done on the "master" Solr, and will be automatically replicated to the "slave" Solr. The "master" Solr indexes are located in /esg/solr-index-master/datasets/index and /esg/solr-index-master/files/index for the two cores, respectively.
A simple script to optimize the "master" Solr follows. The script takes a few seconds to run on an index that is about 2GB in size; it might take longer (even hours) for bigger indexes. It can be run from the command line or as a cron job, on the same server where the Index Node is running.
{{{#!/bin/bash
curl -s http://localhost:8984/solr/datasets/update?commit=true -H "Content-Type:text/xml" --data-binary '<optimize/>'
curl -s http://localhost:8984/solr/files/update?commit=true -H "Content-Type:text/xml" --data-binary '<optimize/>'
}}}
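To run the optimization periodically as a cron job, a crontab entry like the following could be used (the script path and schedule are placeholders; adjust them to your own setup):
{{{
# Run the Solr optimization script every Sunday at 3am
0 3 * * 0 /esg/scripts/optimize-solr.sh >> /var/log/solr-optimize.log 2>&1
}}}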
References:
- Configuring Access Control for Publishing and Downloading Data
- ESG Publication Scripts (ESG Publisher documentation at PCMDI)
- ESG Publisher Configuration (ESG Publisher documentation at PCMDI)
- Customizing the ESG Publisher (ESG Publisher documentation at PCMDI)