A Guide to Getting Names into the ALA
The ALA is basically a great big data cube. The ALA uses names as a way of indexing this cube so that users can structure data according to taxonomy (e.g. "I only want records from the family Cassidini"). This guide allows you to set up the Biodiversity Information Explorer (`BIE`) and name matching indexes with your own taxonomies.
The `biocache` holds occurrence records – this animal/plant was seen here, at this time, by this person: what, where, when and who. The information in the `biocache` is indexed by a `solr` index, which allows people to search the `biocache` for things they are interested in. The name matching index contains a taxonomy suitable for processing. The supplied information in every occurrence record in the `biocache` is matched against the name matching index, and the occurrence record is annotated with things like the matched name, higher taxonomy, quality of match, etc. The link between the name matching index and the `biocache` is the `taxonID` or `guid`, which gives a unique identifier for each species, suitable for indexing in `solr`.
The `BIE` holds organising information – this species, this dataset, this locality, this region, this webpage. A person can search the `BIE` as a first entry point into the ALA and get back a number of references to things that might be of interest. In particular, the `BIE` holds species and taxonomy information. It also holds references to, more or less, anything that can be used to search the `biocache`.
The `collectory` holds metadata about the datasets, data providers, collections and institutions that provide data to the `biocache`. The metadata particularly holds a description, URLs, contact information, and licensing and copyright information. As well as datasets, it can also hold metadata about things like webpages, lists of species, etc.
The basic use of the ALA is that a user goes to the `BIE` and types in a name. The `BIE` will search for the name and give the user a set of options. The user can then click on the link that is closest to what they want. If that link is a species (or genus, etc.) page then the user can ask to be shown all the records in the `biocache` which match the `taxonID`.
- To begin with, an installed instance of the ALA, with the `BIE`, `biocache` and `collectory`. See the LA Quick Start Guide.
- Ensure that the following directories exist: `/data/bie/import`, `/data/lucene/namematching` and `/data/bie-index/config`
- The `ala-name-matching` library and programs from https://nexus.ala.org.au
  - What you want for the current installation is https://nexus.ala.org.au/service/local/repositories/releases/content/au/org/ala/ala-name-matching/2.4.7/ala-name-matching-2.4.7-distribution.zip
  - You can unzip this distribution anywhere you want to build the name matching index. This can be your personal computer; you just need a Java 8 installation.
  - See https://github.com/AtlasOfLivingAustralia/ala-name-matching for more information
- You will also need the `IRMNG` genera `DwCA`, for homonym detection. You can get this from http://www.irmng.org/export/ Unzip this into `/data/lucene/sources/IRMNG_DWC_HOMONYMS` on the machine where you plan to run `ala-name-matching`
- (Optionally) Talend Open Studio from https://www.talend.com This is useful for building the taxonomy Darwin Core Archive described below.
In the examples, we are building an archive for `sibbr`. You can use any name that suits you.
You can do this any way you want to. What you need as an output is a Darwin Core Archive (`DwCA`) containing information covered in the Taxon profile of Darwin Core. The result needs to follow the conventions described in https://github.com/AtlasOfLivingAustralia/bie-index/blob/master/doc/nameology/index.md

At a minimum, though, you will need a `taxon.csv` taxonomy file, a `meta.xml` description and an `eml.xml` metadata description. The `DwCA` needs to be structured to have a `taxonID`, `parentNameUsageID` (for accepted), `acceptedNameUsageID` (for synonyms), `nomenclaturalCode`, `scientificName`, `scientificNameAuthorship`, `taxonRank` and `taxonomicStatus`, following the conventions listed above. You can add other information as you see fit.
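To make the structure concrete, a minimal `meta.xml` describing a `taxon.csv` with exactly the columns listed above might look like the following sketch. The file name and the column order are assumptions for illustration; your own archive can order the columns however you like, as long as the `index` attributes match.

```xml
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n"
        fieldsEnclosedBy="&quot;" ignoreHeaderLines="1"
        rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files>
      <location>taxon.csv</location>
    </files>
    <!-- Column 0 is both the record id and the taxonID -->
    <id index="0"/>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/taxonID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/parentNameUsageID"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/acceptedNameUsageID"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
    <field index="7" term="http://rs.tdwg.org/dwc/terms/taxonomicStatus"/>
  </core>
</archive>
```

Any extra columns you add to `taxon.csv` get further `field` entries with the appropriate Darwin Core term URIs.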
The GBIF Darwin Core Archive Assistant can help you decide on the terms and structure of the archive.
The ALA uses `Talend` to pull together the various data sources, transform them into Darwin Core following the nameology conventions and build a `DwCA`. You don't have to do this; you can use anything that produces the correct result.
If you have multiple, overlapping taxonomies, things get more complicated. You will need to use the Large Taxon Collider, described at https://github.com/AtlasOfLivingAustralia/ala-name-matching/blob/master/doc/large-taxon-collider.md
Where you have unzipped the name matching distribution, run the command:

```shell
java -jar ala-name-matching-2.4.7.jar -all -dwca /path/to/DwCA
```

where `/path/to/DwCA` is the path to the directory where the unzipped `DwCA` is. If you want to see all the possible options, run

```shell
java -jar ala-name-matching-2.4.7.jar -h
```

The resulting name index will be found in `/data/lucene/namematching`. Any previous name matching index will be renamed.
For copying around, it's usually best to zip up the namematching directory, say `zip -r namematching.zip namematching`
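Putting the prerequisites and the commands above together, a typical build session might look like the sketch below. The working directory and the `DwCA` location are assumptions; any machine with Java 8 will do.

```shell
# The IRMNG genera DwCA must already be unzipped here (see prerequisites)
ls /data/lucene/sources/IRMNG_DWC_HOMONYMS

# Build the name matching index from the unzipped taxonomy DwCA
# (/data/dwca/sibbr is a hypothetical path)
java -jar ala-name-matching-2.4.7.jar -all -dwca /data/dwca/sibbr

# Package the resulting index for copying to the server
cd /data/lucene
zip -r namematching.zip namematching
```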
You can also use the nameindexer role to perform that task.
Also: on a VM with `nameindexer` installed and which includes the default `DwCA` from the Catalogue of Life, to create your own name index with vernacular names:
- rename `/data/lucene/sources/col_vernacular.txt` (so `nameindexer` can't find this default file)
- put your `DwCA` in a sub-folder and include at least these:
  - `meta.xml` with column mappings for your species file and vernacular file
  - species file (csv/txt; header not required; see Step 1 above for required fields)
  - vernacular file (csv/txt; header not required)
  - (`eml.xml` does not appear to be required.)

```shell
nameindexer -all -dwca /path/to/your/dwca
```

Note that you do NOT include the `-common` switch to include/process your vernacular file via your `meta.xml` file. This method overrides the default behavior for including a vernacular file. If you see errors like this:

```
2020-01-01 12:00:00,000 INFO : [DwcaNameIndexer] - Issue on line 10000 1234567
```

this is likely the result of trying to use the `-common` switch with your own vernacular file whose columns do not match the default column mapping expected by `nameindexer`.
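The VM procedure above condenses to a few commands. The backup file name and the `DwCA` path here are assumptions; the point is that the default CoL vernacular file must be out of the way before `nameindexer` runs.

```shell
# Hide the default CoL vernacular file so nameindexer cannot find it
mv /data/lucene/sources/col_vernacular.txt \
   /data/lucene/sources/col_vernacular.txt.bak

# The DwCA sub-folder should contain meta.xml, the species file
# and the vernacular file
ls /data/dwca/sibbr

# Build the index; note there is no -common switch
nameindexer -all -dwca /data/dwca/sibbr
```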
Do not have any occurrence records being imported or processed while you are doing the next steps.
Copy the `DwCA` and namematching to the server. Put the contents of the `DwCA` into `/data/bie/import/sibbr` and change ownership to `tomcat7` via `chown -R tomcat7:tomcat7 /data/bie/import/sibbr`. It is important that you change ownership, otherwise the `BIE` may have trouble importing the archive.

Put the contents of the `namematching.zip` into `/data/lucene/namematching`. If you are changing name matching indexes often, it is good practice to datestamp the directories (e.g. `namematching-20180921`) and use a symbolic link from `/data/lucene/namematching`
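In practice the deployment steps above might look like this. The source paths under `/tmp` and the datestamp are assumptions; adjust them to wherever you copied the files.

```shell
# Taxonomy DwCA for the BIE import; ownership matters for the import step
cp -r /tmp/sibbr /data/bie/import/sibbr
chown -R tomcat7:tomcat7 /data/bie/import/sibbr

# Unpack the name matching index into a datestamped directory
# and point a symbolic link at it
unzip /tmp/namematching.zip -d /data/lucene
mv /data/lucene/namematching /data/lucene/namematching-20180921
ln -sfn /data/lucene/namematching-20180921 /data/lucene/namematching
```

The symbolic link makes switching to a future index a one-command operation.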
Before doing this, have a look in `/data/bie-index/config`. The `bie-index-config.properties` or `bie-index-config.yml` file contains a list of steps through which the import process goes. You can adjust these steps to suit what you have. For example:

```
import.sequence=collectory,taxonomy-all,denormalise,conservation-lists,link-identifiers,images,occurrences
```

You may also want to modify the contents of `conservation-lists.json` and `image-lists.json` in the same directory. These are documented at https://github.com/AtlasOfLivingAustralia/bie-index
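As an illustration of adjusting the sequence, a fresh installation with no conservation lists, images or occurrence records loaded yet could use a trimmed sequence such as the following. This is a hypothetical example, not a recommended default; drop only the steps whose data sources you genuinely do not have.

```properties
import.sequence=collectory,taxonomy-all,denormalise,link-identifiers
```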
Once you are happy, go to http://localhost/bie-index/admin on your server and choose the "Import All" option. Click on the button and watch the log expand as it steps through the elements in the import sequence. The above sequence will first pull in all the data providers, data resources, collections, institutions, etc. from the `collectory`, then import all the taxa from `DwCAs` in `/data/bie/import`, then denormalise the taxonomy and link synonyms, then load conservation status information, then scan for unique human-readable links to species, then scan for images, and finally load an estimate of the occurrences for each taxon into the index.
You should probably also configure your species subgroups to match the hierarchy of the new name index.
Since we have a new name matching index, the entire biocache needs to be reprocessed to match the supplied names against the new index:

```shell
biocache process-local-node
```

Once you have reprocessed the biocache, you need to re-index it with

```shell
biocache index-local-node
```

The result will be a new biocache index.
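On a biocache of any size these two commands can run for a long time, so it can be worth detaching them from your terminal. The `nohup` wrapper and log file names below are conveniences of this sketch, not part of the ALA tooling; the commands must still run sequentially.

```shell
# Reprocess, then re-index; each survives a dropped SSH session
nohup biocache process-local-node > /tmp/biocache-process.log 2>&1
nohup biocache index-local-node > /tmp/biocache-index.log 2>&1
```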
The `BIE` serves data from a `solr` core called `bie`. It imports data into a core called `bie-offline`. Once complete, the cores need to be swapped so that what was `bie-offline` becomes `bie` and what was `bie` becomes `bie-offline`, ready for the next load. To swap cores, go to http://localhost:8983/solr, choose "Core Admin" and swap the two cores.
The new `biocache` index first needs to be imported into `solr`. Again, choose "Core Admin" and choose "Add Core" with an instance dir of `/data/solr/data/biocache` and a data dir of wherever the new index is located. Then swap the `biocache` and new cores.
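If you prefer the command line to the admin UI, the same operations are available through solr's CoreAdmin HTTP API. The core names follow the examples above; the `biocache-new` name and its data dir are assumptions for this sketch.

```shell
# Swap the freshly loaded bie-offline core into service
curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=bie&other=bie-offline'

# Register the new biocache index as a core, then swap it in
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=biocache-new&instanceDir=/data/solr/data/biocache&dataDir=/data/solr/data/biocache-new/data'
curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=biocache&other=biocache-new'
```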
More details and screenshots can be found on the SOLR Admin tasks page.
- The bie-index wiki page about the full reindex tasks is outdated but quite informative about the whole re-indexing process.