Skip to content

Latest commit

 

History

History
127 lines (106 loc) · 4.98 KB

README.md

File metadata and controls

127 lines (106 loc) · 4.98 KB

GATEWAy database from v.1 to v.2

Emilio Berti

Change log

V.2.0

  1. Removed problematic characters from fields: - , replaced with ; when within ". - " removed. - Multiple spaces trimmed to one space. - Convert all uppercase letters to lowercase.
  2. Add data from Mulder et al. (citation needed).
  3. Add data from Vinagre et al. (citation_needed).
  4. Harmonize taxonomy
  5. Rename columns following the Darwin Core standard: http://rs.tdwg.org/dwc/terms.htm#4.

Additionally, tables and shapefiles are created to be inserted into a PostGIS database to feed the website server.

SQL database

I re-structured the GATEWAy database into <…> tables:

  1. Foodwebs
  2. Species
  3. Interactions

The old database was a single csv file. I colored the columns of this file with the same color of the tables above. The symbol => show columns that have been renamed. E.g., A => B means A was renamed to B, following DarwinCore standards: http://rs.tdwg.org/dwc/terms.htm#4. All column names have been converted to camelCase. Strike-through columns have been dropped.

foodwebName
longitude => decimalLongitude
latitude => decimalLatitude ecosystemType
geographicLocation => locality
studySite
altitude => verbatimElevation
depth => verbatimDepth
samplingTime
sampling.start.year => EarliestDateCollected
sampling.end.year => LatestDateCollected

~~key~~
taxonID,

identified by the combination of all fields related to species
taxonomy => acceptedTaxonName
common => vernacularName
lifeStage
metabolic.type => metabolicType
movement.type => movementType
mass.min.g. => lowestMass, in grams
mass.max.g. => highestMass, in grams
mass.mean.g. => meanMass, in grams
length.min.cm. => shortestLength, in cm
length.max.cm. => longestLength, in cm
length.mean.cm. => meanLength, in cm
size.method => sizeMethod
size.citation => sizeReference
taxonomy.status => taxonomicStatus
taxonomy.level => taxonRank

~~autoid~~
style="color:red">interactionID, identified by unique taxonID and foodwebName
interactionType
interactionDimensionality
link.methodology => interactionMethod
link.citation => interactionReference
interaction.classification => basisOfRecord
link.citation => interactionReference
notes => interactionRemarks

Pipeline

To run the whole pipeline at once:

bash pipeline.sh

This runs (in order):

  1. Clean v.1.0 for encoding/parsing errors: scripts/clean-gateway.
  2. Process new data to add to the database: scripts/mulder.R and scripts/tagus.R.
  3. Extract species names: scripts/extract-species-names.R.
  4. Query the species names against GBIF: harmonize-taxonomy.py.
  5. Combine v.1.0 with new data: scripts/combine.R.
  6. Harmonize taxonomy of the database: harmonize-taxonomy.R.
  7. Saves the new database as gateway-v.2.0.csv.
  8. Rename fields and split table into relational tables: scripts/rename.R.

Some of this steps can take some time. To avoid re-running already completed steps, once the step is completed successfully an hidden (empty) file is added to the steps folder. Steps that have such files will not be re-ran. You can re-run the whole pipeline from scratch specifying the option –clean:

bash pipeline.sh --clean

To see available options and usage: bash pipeline.sh --help.

PostGIS database

Create database

psql -U postgres
CREATE DATABASE gateway;
\c gateway;
CREATE EXTENSION postgis;