
A Guide to Getting Names into the ALA

jloomisVCE edited this page Apr 22, 2020 · 15 revisions

How It Works

The ALA is basically a great big data cube. The ALA uses names as a way of indexing this cube so that users can structure data according to taxonomy (e.g. "I only want records from the tribe Cassidini"). This guide shows you how to set up the Biodiversity Information Explorer (BIE) and name matching indexes with your own taxonomies.

The biocache holds occurrence records: this animal/plant was seen here, at this time, by this person (the what, where, when and who). The information in the biocache is indexed by a Solr index, which allows people to search the biocache for things they are interested in. The name matching index contains a taxonomy suitable for processing. The supplied information in every occurrence record in the biocache is matched against the name matching index, and the occurrence record is annotated with things like the matched name, higher taxonomy and quality of match. The link between the name matching index and the biocache is the taxonID or guid, which gives a unique identifier for each taxon, suitable for indexing in Solr.

The BIE holds organising information – this species, this dataset, this locality, this region, this webpage. A person can search the BIE as a first entry point into the ALA and get back a number of references to things that might be of interest. In particular, the BIE holds species and taxonomy information. It also holds references to, more or less, anything that can be used to search the biocache.

The collectory holds metadata about the datasets, data providers, collections and institutions that provide data to the biocache. The metadata holds, in particular, a description, URLs, contact information, and licensing and copyright information. As well as datasets, it can also hold metadata about things like webpages, lists of species, etc.

The basic use of the ALA is that a user goes to the BIE and types in a name. The BIE will search for the name and give the user a set of options. The user can then click on the link that is closest to what they want. If that link is a species (or genus etc.) page then the user can ask to be shown all the records in the biocache which match the taxonID.

What You Will Need

Installing and Using New Names

In the examples, we are building an archive for sibbr. You can use any name that suits you.

Step 1: Build the Darwin Core Archive

You can do this any way you want to. What you need as an output is a Darwin Core Archive (DwCA) containing information covered in the Taxon profile of Darwin Core. The result needs to follow the conventions described in https://github.com/AtlasOfLivingAustralia/bie-index/blob/master/doc/nameology/index.md. At a minimum, though, you will need a taxon.csv taxonomy file, a meta.xml description and an eml.xml metadata description. The DwCA needs to be structured to have a taxonID, parentNameUsageID (for accepted names), acceptedNameUsageID (for synonyms), nomenclaturalCode, scientificName, scientificNameAuthorship, taxonRank and taxonomicStatus, following the conventions listed above. You can add other information as you see fit.
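As a quick sanity check on the required terms listed above, the following is a minimal sketch that reports which of them are missing from a taxon file. The function name missing_terms is hypothetical, and the sketch assumes that the first line of the file is a comma-separated header of bare term names; a real DwCA may instead omit the header and map columns by index in meta.xml.

```shell
# Hypothetical helper: list required Darwin Core terms that are absent from
# the header row of a comma-separated taxon file.
# Assumption: the first line is a header of bare term names. A real DwCA may
# omit the header and map columns by index in meta.xml instead.
missing_terms() {
    # Strip carriage returns and double quotes so quoted headers also match.
    header=$(head -n 1 "$1" | tr -d '\r"')
    for term in taxonID parentNameUsageID acceptedNameUsageID \
        nomenclaturalCode scientificName scientificNameAuthorship \
        taxonRank taxonomicStatus; do
        case ",$header," in
            *",$term,"*) ;;                 # term present as a whole column
            *) printf '%s\n' "$term" ;;     # term missing: report it
        esac
    done
}
```

Running missing_terms taxon.csv prints one missing term per line and prints nothing when all eight terms are present.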

The GBIF Darwin Core Archive Assistant can help you decide on the terms and structure of the archive.

The ALA uses Talend to pull together the various data sources, transform them into Darwin Core following the nameology conventions and build a DwCA. You don’t have to do this; you can use anything that achieves the correct result.

If you have multiple, overlapping taxonomies, things get more complicated. You will need to use the Large Taxon Collider, described at https://github.com/AtlasOfLivingAustralia/ala-name-matching/blob/master/doc/large-taxon-collider.md

Step 2: Build the Name Matching Index

Where you have unzipped the name matching distribution, run the command:

java -jar ala-name-matching-2.4.7.jar -all -dwca /path/to/DwCA

where /path/to/DwCA is the path to the directory where the unzipped DwCA is. If you want to see all the possible options, run

java -jar ala-name-matching-2.4.7.jar -h

The resulting name index will be found in /data/lucene/namematching. Any previous name matching index will be renamed. For copying around, it’s usually best to zip up the namematching directory, for example with zip -r namematching.zip namematching

You can also use the nameindexer role to perform that task.
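Before running the indexing command above, it can help to confirm that the directory you pass to -dwca really is an unzipped archive. The following is a minimal sketch under that assumption; the function name check_dwca_dir is hypothetical and the check only looks for the files named in Step 1.

```shell
# Hypothetical pre-flight check for an unzipped DwCA directory.
# Returns 0 when meta.xml is present; prints a message and returns 1 otherwise.
# A missing eml.xml only produces a warning.
check_dwca_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    if [ ! -f "$dir/meta.xml" ]; then
        echo "missing meta.xml in $dir" >&2
        return 1
    fi
    [ -f "$dir/eml.xml" ] || echo "warning: no eml.xml in $dir" >&2
    return 0
}
```

It could be chained in front of the indexer, for example check_dwca_dir /path/to/DwCA && java -jar ala-name-matching-2.4.7.jar -all -dwca /path/to/DwCA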

On a VM that has nameindexer installed and includes the default DwCA from the Catalogue of Life, you can create your own name index with vernacular names as follows:

  • rename /data/lucene/sources/col_vernacular.txt (so nameindexer can't find this default file)
  • put your DwCA in a sub-folder and include at least these:
    • (eml.xml does not appear to be required.)
    • meta.xml with column-mappings for your species file and vernacular file
    • Species file (csv/txt. Header not required. See Step 1 above for required fields.)
    • Vernacular file (csv/txt. Header not required.)

Then build the index with:

nameindexer -all -dwca /path/to/your/dwca

Note that you do NOT pass the -common switch when your vernacular file is declared in your meta.xml file. Declaring it there overrides the default behavior for including a vernacular file. If you see errors like this:

2020-01-01 12:00:00,000 INFO : [DwcaNameIndexer] - Issue on line 10000  1234567

This is likely the result of trying to use the -common switch with your own vernacular file whose columns do not match the default column-mapping expected by nameindexer.
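To make the meta.xml column mappings mentioned above more concrete, here is a minimal sketch that writes one for a headerless species file and a headerless vernacular file. The function name write_meta_xml, the file names taxon.csv and vernacular.csv, the column order and the CSV settings are all assumptions for illustration. The term and rowType URIs are the standard Darwin Core and GBIF ones, but whether nameindexer needs exactly this layout is not confirmed here.

```shell
# Hypothetical sketch: write a meta.xml that maps columns for a headerless
# species file and a headerless vernacular file.
# Assumptions: the file names (taxon.csv, vernacular.csv), the column order
# and comma-separated, double-quoted fields are all illustrative.
write_meta_xml() {
    cat > "$1/meta.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n"
        fieldsEnclosedBy='"' ignoreHeaderLines="0"
        rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>taxon.csv</location></files>
    <id index="0"/>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/taxonID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/parentNameUsageID"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/acceptedNameUsageID"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
    <field index="7" term="http://rs.tdwg.org/dwc/terms/taxonomicStatus"/>
  </core>
  <extension encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n"
             fieldsEnclosedBy='"' ignoreHeaderLines="0"
             rowType="http://rs.gbif.org/terms/1.0/VernacularName">
    <files><location>vernacular.csv</location></files>
    <coreid index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/vernacularName"/>
    <field index="2" term="http://purl.org/dc/terms/language"/>
  </extension>
</archive>
EOF
}
```

Calling write_meta_xml /path/to/your/dwca creates or overwrites meta.xml in that directory. The coreid column of the vernacular file holds the taxonID of the species each common name belongs to.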

Step 3: Copy the Data to the Server

Make sure that no occurrence records are being imported or processed while you carry out the next steps.

Copy the DwCA and namematching to the server.

Put the contents of the DwCA into /data/bie/import/sibbr and change ownership to tomcat7 via chown -R tomcat7:tomcat7 /data/bie/import/sibbr. It is important that you change ownership, otherwise the BIE may have trouble importing the archive.

Put the contents of the namematching.zip into /data/lucene/namematching. If you change name matching indexes often, it is good practice to datestamp the directories (e.g. namematching-20180921) and use a symbolic link from /data/lucene/namematching
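The datestamp-and-symlink layout described above can be sketched as follows. The function name install_index and its arguments are hypothetical. The sketch assumes the index has already been unzipped into a source directory, and it refuses to touch an existing namematching entry that is a real directory rather than a symbolic link.

```shell
# Hypothetical sketch of the datestamp-and-symlink layout.
# install_index SRC LUCENE_DIR STAMP
#   copies SRC to LUCENE_DIR/namematching-STAMP and points the
#   LUCENE_DIR/namematching symlink at it.
install_index() {
    src=$1
    lucene_dir=$2
    stamp=$3
    target="$lucene_dir/namematching-$stamp"
    link="$lucene_dir/namematching"
    # Refuse to clobber a real directory; move it aside by hand first.
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        echo "$link exists and is not a symlink" >&2
        return 1
    fi
    if [ -e "$target" ]; then
        echo "$target already exists" >&2
        return 1
    fi
    cp -R "$src" "$target" || return 1
    # -n replaces the link itself instead of following it into the old index.
    ln -sfn "namematching-$stamp" "$link"
}
```

For example, install_index ./namematching /data/lucene "$(date +%Y%m%d)" installs today's index. Older datestamped directories are left in place, so rolling back is a matter of repointing the link.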

Step 4: Import into the BIE

Before doing this, have a look in /data/bie-index/config. The bie-index-config.properties or bie-index-config.yml file contains the list of steps that the import process goes through. You can adjust these steps to suit what you have. For example,

import.sequence=collectory,taxonomy-all,denormalise,conservation-lists,link-identifiers,images,occurrences

You may also want to modify the contents of conservation-lists.json and image-lists.json in the same directory. These are documented at https://github.com/AtlasOfLivingAustralia/bie-index

Once you are happy, go to http://localhost/bie-index/admin on your server and choose the "Import All" option. Click on the button and watch the log expand as it steps through the elements in the import sequence. The above sequence will, in order:

  • pull in all the data providers, data resources, collections, institutions, etc. from the collectory
  • import all the taxa from the DwCAs in /data/bie/import
  • denormalise the taxonomy and link synonyms
  • load conservation status information
  • scan for unique human-readable links to species
  • scan for images
  • load an estimate of the occurrences for each taxon into the index
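To review which steps a server is configured to run before starting the import, the import.sequence line can be split into one step per line. This is a minimal sketch; the function name show_import_sequence is hypothetical and it assumes the .properties form of the config, with the whole sequence on a single line. The .yml variant is not handled.

```shell
# Hypothetical helper: print each step of import.sequence on its own line.
# Assumes the .properties form (key=value on one line); the .yml variant of
# the config is not handled.
show_import_sequence() {
    sed -n 's/^import\.sequence[[:space:]]*=[[:space:]]*//p' "$1" | tr ',' '\n'
}
```

For example, show_import_sequence /data/bie-index/config/bie-index-config.properties prints the steps in the order in which the import will run them.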

Step 5: Configure your Species Subgroups

You should probably also configure your species subgroups to match the hierarchy in the new name index.

Step 6: Reprocess and Reindex the biocache

Since we have a new name matching index, the entire biocache needs to be reprocessed to match the supplied names against the new index.

biocache process-local-node

Once you have reprocessed the biocache, you need to re-index it with

biocache index-local-node

The result will be a new biocache index.
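Because re-indexing only makes sense after reprocessing has finished cleanly, the two commands above can be chained so that the second runs only if the first succeeds. This is a minimal sketch; the function name reprocess_and_reindex and the BIOCACHE override variable are hypothetical, and it assumes the biocache command returns a non-zero exit status on failure.

```shell
# Hypothetical wrapper: re-index only if reprocessing succeeded.
# BIOCACHE can be overridden to point at a different command (an assumption
# for illustration); by default the biocache command from this guide is used.
reprocess_and_reindex() {
    "${BIOCACHE:-biocache}" process-local-node &&
        "${BIOCACHE:-biocache}" index-local-node
}
```

Both steps can take a long time on a large biocache, so it may be worth running the wrapper inside a terminal multiplexer or under nohup.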

Step 7: Swap Cores

The BIE serves data from a Solr core called bie. It imports data into a core called bie-offline. Once the import is complete, the cores need to be swapped so that what was bie-offline becomes bie and what was bie becomes bie-offline, ready for the next load. To swap cores, go to http://localhost:8983/solr, choose "Core Admin" and swap the two cores.

The new biocache index first needs to be imported into Solr. Again, choose "Core Admin" and choose "Add Core" with an instance dir of /data/solr/data/biocache and a data dir of wherever the new index is located. Then swap the biocache core and the new core.
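A swap can also be requested over HTTP through Solr's CoreAdmin API instead of the admin UI. The following is a minimal sketch. The function names and the SOLR_URL override are hypothetical, and it assumes curl is installed and that Solr is reachable without authentication at the address used in this guide.

```shell
# Hypothetical sketch: build the Solr CoreAdmin SWAP URL for two cores.
# SOLR_URL defaults to the address used in this guide.
swap_cores_url() {
    printf '%s/admin/cores?action=SWAP&core=%s&other=%s\n' \
        "${SOLR_URL:-http://localhost:8983/solr}" "$1" "$2"
}

# Perform the swap with curl (requires curl and a reachable Solr).
# -f makes curl return a non-zero status on an HTTP error response.
swap_cores() {
    curl -fsS "$(swap_cores_url "$1" "$2")"
}
```

For the BIE, swap_cores bie bie-offline requests the swap described above. Printing the URL first with swap_cores_url is a cheap way to check the core names before anything is changed.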

More details and screenshots are available on the SOLR Admin tasks page.

More information

  • This bie-index wiki page about the full reindex tasks is outdated but quite informative about the whole reindexing process.