GitHub - jannebor/dd_forecast: Code for predicting probabilities of threat for Data Deficient species of the IUCN Red List of Threatened Species

Extinction Risk of Data Deficient Species

Numerous species on the IUCN Red List of Threatened Species are classified as Data Deficient. This code was used to predict the probability of being threatened by extinction for Data Deficient species for which range map data are available from the IUCN spatial data download. The classifier can be applied to individual species using our web application (alpha version).

Predictor data

Note: To reproduce the study, the following datasets need to be downloaded individually from third-party sources; otherwise, skip to Model preparation:

Water scarcity footprints (Boulay et al. 2018)
Freshwater connectivity indices (Barbarossa et al. 2020)
Global Database of Power Plants (Byers et al. 2019)
Global dataset of more than 38,000 georeferenced dams (Mulligan et al. 2020)
Human development index
Corruption Perceptions Index 2020
Global threats from invasive alien species (Early et al. 2016)
ESA CCI Land cover
Marine data layers for ecological modelling: Bio-ORACLE
Climatologies at high resolution (Karger et al. 2018)
Global terrestrial Human Footprint maps for 2009 (Venter et al. 2016)
Human modification gradient (Kennedy et al. 2019)
Urban expansion probabilities (Seto et al. 2012)
Forest Cover Change (Hansen et al. 2013)
Habitat heterogeneity (Tuanmu & Jetz 2015)
Pesticide application rates (Maggi et al. 2019)
Freshwater environmental variables (Domisch et al. 2015)
Human Impacts on Marine Ecosystems (Halpern et al. 2008)
World Database on Protected Areas (UNEP-WCMC & IUCN 2021)

Scripts for data pre-processing (e.g., calculating land-use fractions) and for stacking all spatial layers are stored in workflow/preparation/raster_preparation and need to be adjusted individually.

The underlying function for retrieving predictor data for a single species from tables, web sources (i.e., IUCN, GBIF & OBIS), and the downloaded spatial datasets listed above is workflow/preparation/data_extraction.R. We applied this function to entire spatial datasets in workflow/preparation/data_extraction_batch.R. The resulting full dataframe (df_ml_v2) is stored as an R object in dataframes/full_data.

Model preparation

Full reproducibility (based on code only) is given from this point onwards:

Training (75%) and testing (25%) data were prepared (workflow/preparation/model_prep.R) for each partition (partition 1: all species; partition 2: marine and non-marine species separately) and stored as R objects in dataframes/Partition 1 and dataframes/Partition 2. For each partition-specific dataframe, features were selected (workflow/preparation/feature_selection.R) using the Boruta algorithm (Kursa & Rudnicki 2010). Only features identified as relevant were considered during model building.
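The 75/25 split described above can be illustrated with a minimal sketch. This is not the code in workflow/preparation/model_prep.R (which is written in R); it is a standard-library Python illustration, and the stratification by class, the function name `stratified_split`, the seed, and the toy labels are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=42):
    """Split row indices into train/test, keeping class proportions.

    Stratifying by class is an assumption of this sketch; the source
    only states that 75% of the data were used for training.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels: 40 threatened and 120 non-threatened species.
labels = ["threatened"] * 40 + ["not_threatened"] * 120
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In this sketch each class is shuffled and cut separately, so the share of threatened species is the same in both subsets.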

Model building

In total, 510 models were fitted using AutoML in H2O: 222 using all species (workflow/training/model_partition 1.R), 134 using only marine species, and 154 using only non-marine species (workflow/training/model_partition 2.R). All models were calibrated using 10-fold cross-validation and ranked by AUC on the held-out testing data (25%), e.g., for partition 1:

model_id	auc	logloss	aucpr	mean_per_class_error	rmse	mse
StackedEnsemble_AllModels_3_AutoML_1	0.912	0.314	0.795	0.174	0.311	0.097
StackedEnsemble_AllModels_6_AutoML_1	0.912	0.315	0.795	0.175	0.311	0.097
StackedEnsemble_AllModels_4_AutoML_1	0.912	0.315	0.795	0.175	0.311	0.097
StackedEnsemble_AllModels_5_AutoML_1	0.910	0.318	0.791	0.176	0.313	0.098
StackedEnsemble_BestOfFamily_4_AutoML_1	0.909	0.318	0.793	0.184	0.313	0.098
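Ranking models by AUC on held-out data, as in the table above, can be sketched as follows. This is a standard-library Python illustration, not the H2O leaderboard code used in the repository; the model names, labels, and scores are hypothetical toy values, and the AUC is computed with the pairwise (Mann-Whitney) formulation.

```python
def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative; tied pairs count as half a win."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out labels (1 = threatened) and model scores.
y_test = [1, 0, 1, 0, 0, 1]
predictions = {
    "model_a": [0.9, 0.2, 0.8, 0.4, 0.1, 0.7],
    "model_b": [0.6, 0.5, 0.4, 0.7, 0.2, 0.9],
}

# Sort models from highest to lowest test AUC.
leaderboard = sorted(
    ((name, auc(y_test, scores)) for name, scores in predictions.items()),
    key=lambda row: row[1],
    reverse=True,
)
for name, value in leaderboard:
    print(f"{name}\t{value:.3f}")
```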

Model evaluation

Performance metrics were calculated on the testing data (workflow/evaluation/model_performance.R) and on reclassified Data Deficient species (workflow/evaluation/dd_performance.R). Permutation variable importance was calculated as the loss in performance after a feature was permuted, relative to the performance before permutation (workflow/evaluation/variable_importance.R).
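The idea behind permutation variable importance can be sketched in a few lines. This is a standard-library Python illustration under stated assumptions, not the code in workflow/evaluation/variable_importance.R: the toy model, the accuracy metric, the number of repeats, and all names (`permutation_importance`, `toy_predict`) are hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def permutation_importance(predict, rows, y_true, n_features, metric,
                           n_repeats=20, seed=0):
    """Mean drop in the metric after shuffling one feature column at a time."""
    rng = random.Random(seed)
    baseline = metric(y_true, [predict(row) for row in rows])
    importances = []
    for j in range(n_features):
        losses = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild the rows with only feature j permuted.
            permuted = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            score = metric(y_true, [predict(row) for row in permuted])
            losses.append(baseline - score)
        importances.append(sum(losses) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical model that uses feature 0 only and ignores feature 1."""
    return 1 if row[0] > 0.5 else 0


rows = [[i / 10, (i * 7 % 10) / 10] for i in range(10)]
y_true = [toy_predict(row) for row in rows]
importances = permutation_importance(toy_predict, rows, y_true, 2, accuracy)
print(importances)
```

Because the toy model ignores feature 1, permuting it leaves the score unchanged and its importance is zero, whereas permuting feature 0 lowers accuracy.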

Predictions

The generated predictions for Data Deficient species are stored in dd_predictions.csv and give, for each species, the probability of being threatened by extinction:

Species	Last Assessed	Taxonomic class	Red List Category	Probability of being threatened
Chirostoma grandocule	2018	Actinopterygii	Data Deficient	95.8%
Sarcohyla miahuatlanensis	2019	Amphibia	Data Deficient	95.8%
Crossodactylus dantei	2008	Amphibia	Data Deficient	95.4%
Nyctibatrachus sholai	2008	Amphibia	Data Deficient	95.2%
Colostethus alacris	2016	Amphibia	Data Deficient	95.2%
…
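A table like the one above could be summarised per taxonomic class as sketched below. The sketch assumes a comma-separated layout with the column headers shown in the table and percentages written with a trailing `%`; the actual layout of dd_predictions.csv may differ. The sample rows are the five rows listed above, and `parse_percent` is a hypothetical helper.

```python
import csv
import io

# Sample rows copied from the table above; the comma-separated layout
# and header names are assumptions about the file format.
sample = """Species,Last Assessed,Taxonomic class,Red List Category,Probability of being threatened
Chirostoma grandocule,2018,Actinopterygii,Data Deficient,95.8%
Sarcohyla miahuatlanensis,2019,Amphibia,Data Deficient,95.8%
Crossodactylus dantei,2008,Amphibia,Data Deficient,95.4%
Nyctibatrachus sholai,2008,Amphibia,Data Deficient,95.2%
Colostethus alacris,2016,Amphibia,Data Deficient,95.2%
"""


def parse_percent(text):
    """Convert a string such as '95.8%' to a fraction such as 0.958."""
    return float(text.strip().rstrip("%")) / 100.0


by_class = {}
for row in csv.DictReader(io.StringIO(sample)):
    probability = parse_percent(row["Probability of being threatened"])
    by_class.setdefault(row["Taxonomic class"], []).append(probability)

# Mean predicted probability per taxonomic class.
mean_by_class = {name: sum(values) / len(values)
                 for name, values in by_class.items()}
for name, value in sorted(mean_by_class.items()):
    print(f"{name}\t{value:.3f}")
```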
