In this section you will learn how to prepare the data to train models using SDMtune
. We will use the virtualSp
dataset included in the package and environmental predictors from the WorldClim dataset.
First let’s load some packages that we will use to visualize the data:
library(ggplot2) # To plot locations
library(maps) # To access useful maps
library(rasterVis) # To plot raster objects
We use the climate data of WorldClim version 1.4 (Hijmans et al. 2005) and the terrestrial ecoregions from WWF (Olson et al. 2001) included in the dismo
package:
<- list.files(path = file.path(system.file(package = "dismo"), "ex"),
files pattern = "grd", full.names = TRUE)
We convert these files in a raster stack
object that will be used later in the analysis:
<- raster::stack(files) predictors
There are nine environmental variables, eight continuous and one categorical:
names(predictors)
We can plot bio1 using the gplot
function from the rasterVis
package:
gplot(predictors$bio1) +
geom_tile(aes(fill = value)) +
coord_equal() +
scale_fill_gradientn(colours = c("#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c"),
na.value = "transparent",
name = "°C x 10") +
labs(title = "Annual Mean Temperature",
x = "longitude",
y = "latitude") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank())
Let’s load the SDMtune package:
library(SDMtune)
For demonstrating how to use SDMtune
we use the random generated virtualSp
dataset included in the package.
help(virtualSp)
<- virtualSp$presence
p_coords <- virtualSp$background bg_coords
Plot the study area together with the presence locations:
ggplot(map_data("world"), aes(long, lat)) +
geom_polygon(aes(group = group), fill = "grey95", color = "gray40", size = 0.2) +
geom_jitter(data = p_coords, aes(x = x, y = y), color = "red",
alpha = 0.4, size = 1) +
labs(x = "longitude", y = "latitude") +
theme_minimal() +
theme(legend.position = "none") +
coord_fixed() +
scale_x_continuous(limits = c(-125, -32)) +
scale_y_continuous(limits = c(-56, 40))
To plot the background locations run the following code:
ggplot(map_data("world"), aes(long, lat)) +
geom_polygon(aes(group = group), fill = "grey95", color = "gray40", size = 0.2) +
geom_jitter(data = as.data.frame(bg_coords), aes(x = x, y = y),
color = "blue", alpha = 0.4, size = 0.5) +
labs(x = "longitude", y = "latitude") +
theme_minimal() +
theme(legend.position = "none") +
coord_fixed() +
scale_x_continuous(limits = c(-125, -32)) +
scale_y_continuous(limits = c(-56, 40))
Before training a model we have to prepare the data in the correct format. The prepareSWD
function creates an SWD
object that stores the species’ name, the coordinates of the species at presence and absence/background locations and the value of the environmental variables at the locations. The argument categorical
indicates which environmental variables are categorical. In our example biome is categorical (we can pass a vector if we have more than one categorical environmental variable). The function extracts the value of the environmental variables for each location and excludes those locations that have NA
value for at least one environmental variable.
<- prepareSWD(species = "Virtual species", p = p_coords, a = bg_coords,
data env = predictors, categorical = "biome")
Let’s have a look at the SWD
object that we have just created:
data
When we print an SWD
object we get a bunch of information:
The object contains four slots: @species
, @coords
@data
and @pa
. @pa
contains a vector with 1 for presence and 0 for absence/background locations. To visualize the data we run:
head(data@data)
We can visualize the coordinates with:
head(data@coords)
or the name of the species with:
@species data
We can save the SWD
object in a .csv file using the function swd2csv
(the function saves the file in the working directory). There are two possibilities:
swd2csv(data, file_name = "data.csv")
swd2csv(data, file_name = c("presence.csv", "background.csv"))
SDMtune
supports four methods for model training:
nnet
package (Venables and Ripley 2002);gbm
package (Greenwell et al. 2019);dismo
package (Hijmans et al. 2017);maxnet
package (Phillips 2017);randomForest
package (Liaw and Wiener 2002).The code necessary to train a model is the same for all the implementations. We will show how to train a Maxent model, you can adapt the code for the other methods or check the presence absence models
vignette.
We use the function train
to train a Maxent model. We need to provide two arguments:
method
: “Maxent” in our case;data
: the SWD
object with the presence and background locations.<- train(method = "Maxent", data = data) model
The function trains the model using default settings that are:
We will see later how to change the default settings, for the moment let’s have a look at the model
object.
The output of the function train
is an object of class SDMmodel
. Let’s print the model
object we’ve just created:
model
When we print an SDMmodel
object we get the following information:
An SDMmodel
object has two slots:
slotNames(model)
SWD
object with the presence absence/background locations used to train the model;Maxent
object, in our case, with all the model configurations.The slot model
contains the configurations of the model plus other information used to make predictions.
slotNames(model@model)
The most important are: fc, reg and iter that contain the values of the model configuration.
The function train()
accepts optional arguments that can be used to change the default model settings. In our previous example we could have trained the same model using:
<- train(method = "Maxent", data = data, fc = "lqph", reg = 1, iter = 500) model
In the following example we train a model using linear and hinge as feature class combination, 0.5 as regularization multiplier and 700 iterations:
<- train(method = "Maxent", data = data, fc = "lh", reg = 0.5, iter = 700) model
By default Maxent models are trained using the arguments “removeduplicates=false” and “addsamplestobackground=false”. The user should have the full control of the data used to train the model, so is expected that duplicated locations are already removed and that the presence locations are already included in the background locations, when desired. You can use the function thinData
to remove duplicated locations and the function addSamplesToBg
to add the presence locations to the background locations.
New locations are predicted with the function predict
. The function takes three main arguments:
SDMmodel
object;data.frame
, an SWD
object or a raster stack
object);Next we get the prediction for our training locations using the cloglog output type:
<- predict(model, data = data, type = "cloglog") pred
The output in this case is a vector containing all the predicted values for the training locations:
head(pred)
We can get the prediction only for the presence location with:
<- data@data[data@pa == 1, ]
p <- predict(model, data = p, type = "cloglog")
pred tail(pred)
For models trained with the Maxent method, the function performs the prediction in R without calling the MaxEnt Java software. This results in a faster computation for large datasets and might result in a slightly different output compared to the Java software.
We can use the same function to create a distribution map starting from the predictors
raster `stack’ object:
<- predict(model, data = predictors, type = "cloglog") map
In this case the output is a raster
object:
map
The map can be saved in a file directly when running the prediction, we just have to pass additional arguments to the predict
function. In the next example we save the map in a file called “my_file” in the GeoTIFF format:
<- predict(model, data = predictors, type = "cloglog", file = "my_map",
map format = "GTiff")
The function predict
has other arguments useful when predicting large datasets:
"text"
to visualize a progress bar;In the next example we restrict the prediction to Chile and plot the prediction:
# First create the extent that surrounds Cile
= raster::extent(c(-77, -60, -56, -15))
e # Now use the extent to make the prediction
<- predict(model, data = predictors, type = "cloglog", extent = e) map_e
To plot the distribution map we can use the function plotPred
:
plotPred(map)
The function plotPred
plots a map with a color ramp similar to the one used by the MaxEnt Java software. We can pass additional arguments to customize the map. In the next example we provide a custom color ramp and we add a title to the legend:
plotPred(map, lt = "Habitat\nsuitability",
colorramp = c("#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c"))
To plot a presence/absence map we need a threshold value that splits the prediction in presence and absence values. The function thresholds
returns some commonly used threshold values starting from an SDMmodel
object. In the next example we print the threshold values using the type "cloglog"
:
<- thresholds(model, type = "cloglog")
ths ths
For example we want to create a presence/absence map using the threshold that maximize the training sensitivity plus specificity. We use the function plotPA
passing the threshold value as argument:
plotPA(map, th = ths[3, 2])
We can also save the map in a file with the following code:
plotPA(map, th = ths[3, 2], filename = "my_pa_map", format = "GTiff")
Both functions plotPred
and plotPA
have the argument hr
to plot the the map with high resolution, useful when the map is used in a scientific publication.
SDMtune
implements three evaluation metrics:
We will compute the value of the metrics on the training dataset.
The AUC can be calculated using the function auc
:
auc(model)
We can also plot the ROC curve using the function plotROC
:
plotROC(model)
The TSS is computed with the function tss
:
tss(model)
For the AICc we use the function aicc
. In this case we need to pass to the env
argument the predictors
raster stack
object:
aicc(model, env = predictors)
It’s always a good practice to split the species locations into two parts and use one part to train the model and the remaining part to evaluate it. We can use the trainValTest
function for this purpose. Let’s say we want to use 80% of the species locations to train our model and 20% as testing dataset to evaluate it:
library(zeallot) # For unpacking assignment
c(train, test) %<-% trainValTest(data, test = 0.2, only_presence = TRUE,
seed = 25)
Now we train the model using the train
dataset:
<- train("Maxent", data = train) model
The only_presence
argument is used to split only the presence and not the background locations. We can now evaluate the model using the testing dataset that has not been used to train the model:
auc(model)
auc(model, test = test)
We can plot the ROC curve for both, training and testing datasets, with:
plotROC(model, test = test)
This approach is valid when we have a large dataset. In our case, with only few observations, the evaluation depends strongly on how we split our presence locations. Let’s run a small experiment in which we perform different train/test splits and we compute the AUC:
# Create an empty data.frame
<- data.frame(matrix(NA, nrow = 10, ncol = 3))
output colnames(output) <- c("seed", "trainAUC", "testAUC")
# Create 10 different random seeds
set.seed(25)
<- sample.int(1000, 10)
seeds
# Loop through the seeds
for (i in 1:length(seeds)) {
# Make the train/test split
c(train, test) %<-% trainValTest(data, test = 0.2, seed = seeds[i],
only_presence = TRUE)
# train the model
<- train("Maxent", data = train)
m # Populate the output data.frame
1] <- seeds[i]
output[i, 2] <- auc(m)
output[i, 3] <- auc(m, test = test)
output[i,
}
# Print the output
output# compute the range of the testing AUC
range(output[, 3])
The testing AUC varies for the different train/test partitions. When we have to deal with a small dataset a better approach is the cross validation.
To perform a cross validation in SDMtune we have to pass the fold
argument to the train
function. First we have to create the folds. There are several way to create them, here we explain how to make a random partition of 4 folds using the function randomFolds
:
<- randomFolds(data, k = 4, only_presence = TRUE, seed = 25) folds
The output of the function is a list containing two matrices, the first for the training and the second for the testing locations. Each column of one matrix represents a fold with TRUE
for the locations included in and FALSE
excluded from the partition.
Let’s perform a 4 fold cross validation using the Maxent method (note that we use the full dataset):
<- train("Maxent", data = data, folds = folds)
cv_model cv_model
The output in this case is an SDMmodelCV
object. It contains the four trained models in the models
slot and the fold partitions in the folds
slot. We can compute the AUC of a SDMmodelCV
object using:
auc(cv_model)
auc(cv_model, test = TRUE)
this returns the AUC value averaged across the four different models.
The train()
function accepts folds created with two other packages:
ENMeval
(Muscarella et al. 2014)blockCV
(Valavi et al. 2019)The function will convert internally the created folds into the correct format for SDMtune
. These packages have specific function to create folds partitions that are spatially or environmentally independent.
Block partition using the ENMeval
package:
library(ENMeval)
<- get.block(occ = data@coords[data@pa == 1, ],
block_folds bg.coords = data@coords[data@pa == 0, ])
<- train(method = "Maxent", data = data, fc = "l", reg = 0.8,
model folds = block_folds)
Checkerboard1 partition using the ENMeval
package:
<- get.checkerboard1(occ = data@coords[data@pa == 1, ],
cb_folds env = predictors,
bg.coords = data@coords[data@pa == 0, ],
aggregation.factor = 4)
<- train(method = "Maxent", data = data, fc = "l", reg = 0.8,
model folds = cb_folds)
Environmental block using the package blockCV
:
library(blockCV)
library(raster)
# Create spatial points data frame
<- SpatialPointsDataFrame(data@coords,
sp_df data = as.data.frame(data@pa),
proj4string = crs(predictors))
<- envBlock(rasterLayer = predictors, speciesData = sp_df,
e_folds species = "data@pa", k = 4, standardization = "standard",
rasterBlock = FALSE, numLimit = 100)
<- train(method = "Maxent", data = data, fc = "l", reg = 0.8,
model folds = e_folds)