Tools to prep EMu data for IPT
These scripts are part of the FMNH workflow for publishing specimen data from EMu to the Field Museum IPT. Information on how to structure data/reports from EMu is at the top of each script.
- prepares EMu Catalogue and Multimedia data as an Audiovisual Core extension for multimedia associated with occurrences.
- IPTac_v1.R is an older version (pre-CT scan data) for preparing Multimedia with a simpler record structure
- prepares a [draft] Collections Description dataset for inventories, accessions, or other data not yet resolved to 'occurrence-level' specificity.
- includes checks to help prepare EMu Catalogue data as a Darwin Core dataset.
- converts EMu's DarDateLastModified values to ISO 8601 format for dwc:modified.
- converts EMu's ColDateCollectedFrom (or DarYear, DarMonth, DarDay) to ISO 8601 format for dwc:eventDate.
- replaces carriage returns with pipes within all fields.
- checks for duplicate GUIDs and, if any are found, outputs a CSV of the duplicates for review.
- prepares EMu Catalogue data from the Relationship Tab (AllRelNhTab) as a Resource Relationship extension for occurrences with interactions.
- prepares EMu Catalogue data (pre-EMu-development) as a Resource Relationship extension for occurrences with interactions.
- see the Relationship Data workflows doc for help with EMu data handling and IPT mapping.
- try an online version here
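The Darwin Core checks listed above (ISO dates, carriage returns, duplicate GUIDs) can be sketched roughly as follows. This is a minimal base-R illustration, not the code in the scripts: the function names, the default GUID column name (DarGlobalUniqueIdentifier), and the default input date format are all assumptions, and the sketch treats CR, LF, and CRLF line breaks alike.

```r
# Minimal sketch of the Darwin Core checks, in base R.
# ASSUMPTIONS: all function names, the default column name, and the default
# input date format are hypothetical; adjust them to match your EMu report.

# dwc:modified -- reformat a date string as ISO 8601 (YYYY-MM-DD).
# Values that do not match `input_format` become NA.
iso_modified <- function(x, input_format = "%d/%m/%Y") {
  format(as.Date(x, format = input_format), "%Y-%m-%d")
}

# dwc:eventDate -- build an ISO 8601 date from year/month/day parts,
# dropping trailing parts that are missing (e.g. "1998-07" or "1998").
iso_event_date <- function(year, month, day) {
  y <- suppressWarnings(as.integer(year))
  m <- suppressWarnings(as.integer(month))
  d <- suppressWarnings(as.integer(day))
  out <- rep(NA_character_, length(y))
  has_y <- !is.na(y)
  out[has_y] <- sprintf("%04d", y[has_y])
  has_m <- has_y & !is.na(m)
  out[has_m] <- paste0(out[has_m], "-", sprintf("%02d", m[has_m]))
  has_d <- has_m & !is.na(d)
  out[has_d] <- paste0(out[has_d], "-", sprintf("%02d", d[has_d]))
  out
}

# Replace line breaks with pipes in every character column.
pipe_line_breaks <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], function(x) gsub("\r\n?|\n", "|", x))
  }
  df
}

# Return every row whose GUID appears more than once (missing GUIDs ignored).
find_duplicate_guids <- function(df, guid_col = "DarGlobalUniqueIdentifier") {
  guids <- df[[guid_col]]
  dup <- duplicated(guids) | duplicated(guids, fromLast = TRUE)
  df[dup & !is.na(guids), , drop = FALSE]
}

# Example use (the output filename is hypothetical):
# dups <- find_duplicate_guids(catalogue)
# if (nrow(dups) > 0) {
#   write.csv(dups, "data02output/duplicate_guids.csv", row.names = FALSE)
# }
```

Keeping each check in its own small function makes it easy to run one on a sample of rows before running the full script.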
EMu-IPT-prep scripts primarily use the tidyverse's tidyr and readr packages. For more info, check the tidyverse site.
- Download and install R and RStudio
- In RStudio, install the required tidyverse packages in the 'Console' pane (usually lower-left) by typing the following and hitting enter:
install.packages('tidyverse')
- To clone the repo, UChicago's steps here are helpful.
Or:
- Simply download the EMu-IPT-prep repo as a .zip, and unzip it
- Open RStudio, and create a new project by going to File --> New Project --> Existing Directory (select the 'EMu-IPT-prep' directory), and clicking 'Create Project'
The input files for the EMu-IPT-prep scripts are CSV datasets generated from EMu reports. In this repo:
- First, create a data01raw directory.
- Second, create a data02output directory.
- Run the script's corresponding EMu CSV report and put the output CSVs in the location described below:
  - For Audubon Core scripts, e.g. IPTac.R:
    - EMu Catalogue report = 'IPT Audubon Core' CSV report
    - Location for all EMu CSVs from this report: data01raw/
  - For Darwin Core scripts, e.g. IPTdwc.R:
    - Run an EMu Catalogue 'IPT_General' CSV report (or IPT_[Collection Area])
    - Note that the file should keep the default EMu report name, ecatalog.csv
    - Location for EMu CSV: data01raw/iptSpec/
  - For Resource Relationship scripts, e.g. IPTrr.R:
    - Run an EMu Catalogue 'IPT Resource Relationship' CSV report
    - Location for EMu CSV: data01raw/relationships
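The directory layout described above can also be created from the R console. The following is a minimal sketch; the helper name make_input_dirs is hypothetical, and it assumes the working directory is the root of this repo unless another root is passed in.

```r
# Create the input/output directories described above.
# ASSUMPTION: `make_input_dirs` is a hypothetical helper, not part of the repo.
make_input_dirs <- function(root = ".") {
  dirs <- file.path(
    root,
    c("data01raw", "data01raw/iptSpec", "data01raw/relationships", "data02output")
  )
  for (d in dirs) {
    # recursive = TRUE creates parent directories as needed;
    # showWarnings = FALSE keeps re-runs quiet if a directory already exists.
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dirs)
}

# Example use, from the repo root:
# make_input_dirs()
# file.exists("data01raw/iptSpec/ecatalog.csv")  # TRUE once the Darwin Core report is in place
```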
- Open a command line (cmd, terminal, etc.), and check that R can run there by typing Rscript and hitting enter.
  - If a 'command not found' warning appears, add Rscript.exe's path (e.g. C:\Program Files\R\R-4.1.2\bin) to the Path environment variable
  - Steps to add a path are here
- cd to the root directory of this repo
  - Use Rscript to run a script in the command line -- e.g.: Rscript IPTac.R
  - Use --verbose to see more info while the script runs -- e.g.: Rscript --verbose IPTac.R
- When the script finishes, check for the output file/s in the data02output directory in this repo.
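If it is unclear which path to add to the Path environment variable, R itself can report where its executables live. This is a small sketch to run in the R or RStudio console; the two helper names are hypothetical.

```r
# Find the directory that holds Rscript, and check whether the system
# can already find Rscript on its search path.
# ASSUMPTION: both helper names are hypothetical, for illustration only.

# Directory containing R's executables (this is the path to add to Path).
rscript_dir <- function() {
  normalizePath(R.home("bin"))
}

# TRUE if an Rscript executable is already on the search path.
rscript_on_path <- function() {
  unname(nzchar(Sys.which("Rscript")))
}

# Example use:
# rscript_dir()
# rscript_on_path()
```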
Scripts can be run using R's source() function if input files are named properly and in the right directory.
When running source, setting verbose=TRUE can be useful if warnings or errors pop up. After running a script, cross-checking the input and output data in a text editor -- or in RStudio's 'Environment' pane (usually upper right) -- is recommended.
- In RStudio, make sure you're in the EMu-IPT-prep project. (The top of the RStudio window should show the project directory path. If it's wrong, go to File -> Open Project -> go to the EMu-IPT-prep dir, and open its '.Rproj' file.)
- Run the source function in the Console pane by typing source("[script-filename]", verbose=TRUE) and hitting enter -- e.g.:
  source("IPTac.R", verbose=TRUE) # For Audubon Core
  source("IPTdwc.R", verbose=TRUE) # For Darwin Core
- While the script is running, a small red 'stop sign' icon will display in the Console pane's upper-right corner. When the script is finished, the stop sign will disappear.
- When the script finishes, check for the output file/s in the data02output directory in this repo.
- Rename the output file Catalog2.csv to the corresponding collection, e.g. field_ipt_insects
- Zip the file
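The rename step can also be done from the R console. The sketch below is one possible approach, not part of the scripts: the helper name rename_output is hypothetical, and it assumes the renamed file keeps its .csv extension.

```r
# Rename the script's output file to a collection-specific name.
# ASSUMPTIONS: `rename_output` is a hypothetical helper, and the renamed
# file is assumed to keep the .csv extension.
rename_output <- function(collection_name,
                          out_dir = "data02output",
                          old_name = "Catalog2.csv") {
  from <- file.path(out_dir, old_name)
  to <- file.path(out_dir, paste0(collection_name, ".csv"))
  if (!file.exists(from)) stop("Expected output file not found: ", from)
  if (file.exists(to)) stop("Refusing to overwrite existing file: ", to)
  if (!file.rename(from, to)) stop("Could not rename ", from, " to ", to)
  invisible(to)
}

# Example use, from the repo root:
# csv_path <- rename_output("field_ipt_insects")
#
# Zipping from R is possible with utils::zip(), which relies on a zip
# program being available on the system:
# utils::zip(zipfile = "data02output/field_ipt_insects.zip", files = csv_path)
```

Stopping when the target file already exists avoids silently replacing an earlier export.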
- Try raising guess_max, like this:
cat <- read_csv(file = "data01raw/iptSpec/ecatalog.csv", guess_max = 1000000)
- guess_max tells readr to look at more rows before guessing which data types to assign to columns. A stricter schema is possible, but this should be enough for now.
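For a stricter schema, one option is to skip type guessing altogether and read every column as text using readr's col_types argument. The sketch below is an illustration rather than what the scripts currently do; the function name read_catalogue_as_text is hypothetical.

```r
# Read the EMu report with every column as character, so readr never has
# to guess a type.
# ASSUMPTION: `read_catalogue_as_text` is a hypothetical helper name.
read_catalogue_as_text <- function(path = "data01raw/iptSpec/ecatalog.csv") {
  readr::read_csv(
    file = path,
    col_types = readr::cols(.default = readr::col_character())
  )
}

# Example use:
# catalogue <- read_catalogue_as_text()
```

Reading everything as text keeps values such as identifiers with leading zeros intact. Columns that need to be numbers or dates can be converted explicitly afterwards.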
Error: package or namespace load failed for ‘readr’ in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
 there is no package called ‘hms’
In addition: Warning message:
package ‘readr’ was built under R version 3.5.3
- Try install.packages("tidyverse")
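To see which packages are missing before reinstalling, a quick check like the following can help. It is a minimal sketch; the helper name missing_packages is hypothetical, and the package list in the example is taken from the packages and the error message mentioned above.

```r
# Report which of the given packages cannot be loaded.
# ASSUMPTION: `missing_packages` is a hypothetical helper name.
missing_packages <- function(pkgs) {
  # requireNamespace() returns FALSE instead of raising an error
  # when a package is not installed.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Example use:
# missing_packages(c("tidyr", "readr", "hms"))
```

Any names it returns can be passed to install.packages().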
- Add example input/output data
- More how-to, validation, error logging...
- Finish draft-CD script