The mushrooms dataset contains data about mushrooms (see ?mushrooms for details). The goal of our model is to predict which mushrooms are poisonous based on 22 cues, such as the mushroom’s odor and color.
Here are the first few rows of the data:
head(mushrooms)
##   poisonous cshape csurface ccolor bruises odor gattach gspace gsize gcolor
## 1      TRUE      x        s      n       t    p       f      c     n      k
## 2     FALSE      x        s      y       t    a       f      c     b      k
## 3     FALSE      b        s      w       t    l       f      c     b      n
## 4      TRUE      x        y      w       t    p       f      c     n      n
## 5     FALSE      x        s      g       f    n       f      w     b      k
## 6     FALSE      x        y      y       t    a       f      c     b      n
##   sshape sroot ssaring ssbring scaring scbring vtype vcolor ringnum ringtype
## 1      e     e       s       s       w       w     p      w       o        p
## 2      e     c       s       s       w       w     p      w       o        p
## 3      e     c       s       s       w       w     p      w       o        p
## 4      e     e       s       s       w       w     p      w       o        p
## 5      t     e       s       s       w       w     p      w       o        e
## 6      e     c       s       s       w       w     p      w       o        p
##   sporepc population habitat
## 1       k          s       u
## 2       n          n       g
## 3       n          n       m
## 4       k          s       u
## 5       n          a       g
## 6       k          n       g
Let’s create some trees using FFTrees(). We’ll use the train.p = .5 argument to split the original data into a 50% training set and a 50% testing set.
# Create FFTs from the mushrooms data
set.seed(100) # For replicability of the training / test data split
mushrooms.fft <- FFTrees(formula = poisonous ~ .,
                         data = mushrooms,
                         train.p = .5, # Split data into 50/50 training / test
                         main = "Mushrooms",
                         decision.labels = c("Safe", "Poison"))
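To make the role of train.p more concrete, here is a minimal base-R sketch of a random 50/50 split. It only illustrates the idea and is not FFTrees’ internal code; the helper split_half() and the data frame toy are hypothetical names made up for this example.

```r
# Hypothetical helper: randomly split a data frame into training and test parts.
# Illustrates the idea behind train.p; this is NOT FFTrees' own implementation.
split_half <- function(data, train.p = 0.5) {
  n <- nrow(data)
  train.rows <- sample.int(n, size = floor(n * train.p))  # random row indices
  test.rows  <- setdiff(seq_len(n), train.rows)           # all remaining rows
  list(train = data[train.rows, , drop = FALSE],
       test  = data[test.rows, , drop = FALSE])
}

# Toy data, made up for illustration
toy <- data.frame(id = 1:10, poisonous = rep(c(TRUE, FALSE), 5))

set.seed(100)  # Same seed, same split
parts <- split_half(toy, train.p = 0.5)
nrow(parts$train)  # number of training cases
nrow(parts$test)   # number of test cases
```

Because the split is random, setting a seed beforehand (as in the FFTrees() call above) is what makes the training and test sets reproducible.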
Here’s basic information about the best performing FFT:
# Print information about the best performing tree
mushrooms.fft
## Mushrooms
## FFTrees
## - Trees: 6 fast-and-frugal trees predicting poisonous
## - Outcome costs: [hi = 0, mi = 1, fa = 1, cr = 0]
##
## FFT #1: Definition
## [1] If odor = {f,s,y,p,c,m}, decide Poison.
## [2] If sporepc != {h,w,r}, decide Safe, otherwise, decide Poison.
##
## FFT #1: Prediction Accuracy
## Prediction Data: N = 4,062, Pos (+) = 1,958 (48%)
##
## |         | True +   | True -   |
## |---------|----------|----------|
## |Decide + | hi 1,958 | fa   312 | 2,270
## |Decide - | mi     0 | cr 1,792 | 1,792
## |---------|----------|----------|
##             1,958      2,104      N = 4,062
##
## acc = 92.3% ppv = 86.3% npv = 100.0%
## bacc = 92.6% sens = 100.0% spec = 85.2%
## E(cost) = 0.077
##
## FFT #1: Prediction Speed and Frugality
## mcu = 1.53, pci = 0.93
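The accuracy statistics in this printout follow directly from the four cells of the confusion table. As a rough check, the sketch below recomputes them in base R from the printed counts; the variable names are chosen for this example, and the cost calculation assumes the outcome costs shown in the printout (mi = 1, fa = 1, hi = 0, cr = 0).

```r
# Cell counts copied from the confusion table in the printed output
hi <- 1958  # hits: poisonous, decided "Poison"
fa <- 312   # false alarms: safe, decided "Poison"
mi <- 0     # misses: poisonous, decided "Safe"
cr <- 1792  # correct rejections: safe, decided "Safe"
n  <- hi + fa + mi + cr

sens <- hi / (hi + mi)  # sensitivity (hit rate)
spec <- cr / (cr + fa)  # specificity (1 - false alarm rate)

stats <- c(acc  = (hi + cr) / n,      # overall accuracy
           ppv  = hi / (hi + fa),     # positive predictive value
           npv  = cr / (cr + mi),     # negative predictive value
           bacc = (sens + spec) / 2,  # balanced accuracy
           sens = sens,
           spec = spec)

# Expected cost per case, assuming costs of 1 for misses and false alarms
e.cost <- (1 * mi + 1 * fa) / n

round(100 * stats, 1)  # percentages, comparable to the printout
round(e.cost, 3)
```

Rounded this way, the values agree with the printed acc, ppv, npv, bacc, sens, spec, and E(cost).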
Let’s look at the individual cue training accuracies with plot():
# Show mushrooms cue accuracies
plot(mushrooms.fft,
what = "cues")
It looks like the cues odor and sporepc are the best predictors. In fact, the single cue odor has a hit rate of 97% and a false alarm rate of 0%! Based on this, we should expect the final trees to use just these cues.
Now let’s plot the best training tree applied to the test dataset:
# Plot the best FFT for the mushrooms data
plot(mushrooms.fft,
data = "test")
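Indeed, it looks like the best tree only uses the odor and sporepc cues. In our test dataset, the tree had a hit rate (sensitivity) of 100% and a false alarm rate (1 - specificity) of about 15%, matching the printed output above (312 false alarms among 2,104 safe mushrooms).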
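To see how simple such a tree is, the definition of FFT #1 printed earlier can be written out by hand as two nested conditions. The sketch below does this in base R; classify_fft1() and first.six are hypothetical names for this illustration, and the small data frame just retypes the odor, sporepc, and poisonous values of the six rows shown by head(mushrooms).

```r
# Hand-coded version of FFT #1 as printed earlier (illustration only).
# Returns TRUE for "Poison" and FALSE for "Safe".
classify_fft1 <- function(odor, sporepc) {
  ifelse(odor %in% c("f", "s", "y", "p", "c", "m"),
         TRUE,                                      # Node 1: exit "Poison"
         ifelse(!(sporepc %in% c("h", "w", "r")),
                FALSE,                              # Node 2: exit "Safe"
                TRUE))                              # Node 2: otherwise "Poison"
}

# Values retyped from the six rows shown by head(mushrooms)
first.six <- data.frame(
  odor      = c("p", "a", "l", "p", "n", "a"),
  sporepc   = c("k", "n", "n", "k", "n", "k"),
  poisonous = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

pred <- classify_fft1(first.six$odor, first.six$sporepc)
data.frame(first.six, decision = ifelse(pred, "Poison", "Safe"))
```

For these six rows the hand-coded decisions agree with the poisonous column. Rows 1 and 4 exit at the first node because of their odor, and the other four reach the second node and are classified as safe.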
Now, let’s say that you talk to a mushroom expert who says that we are using the wrong cues. According to her, the best predictors for poisonous mushrooms are ringtype and ringnum. Let’s build a set of trees with these cues and see how they perform relative to our initial tree:
# Create trees using only ringtype and ringnum
mushrooms.ring.fft <- FFTrees(formula = poisonous ~ ringtype + ringnum,
                              data = mushrooms,
                              train.p = .5,
                              main = "Mushrooms (Ring Only)",
                              decision.labels = c("Safe", "Poison"))
Here is the best training tree, applied to the test data:
plot(mushrooms.ring.fft,
data = "test")
As we can see, this tree did not perform nearly as well as our earlier one.
The iris.v dataset contains data about 150 flowers (see ?iris.v). Our goal is to predict which flowers are of the class Virginica. In this example, we’ll create trees using the entire dataset (without an explicit test dataset):
iris.fft <- FFTrees(formula = virginica ~ .,
                    data = iris.v,
                    main = "Iris",
                    decision.labels = c("Not-V", "V"))
First, let’s look at the individual cue training accuracies:
plot(iris.fft,
what = "cues")
It looks like the cues pet.wid and pet.len are the best predictors. Based on this, we should expect the final trees to use one or both of these cues.
Now let’s plot the best tree:
plot(iris.fft)
Indeed, it looks like the best tree only uses the pet.wid and pet.len cues. In the training data (here, the full dataset), the tree had a sensitivity of 100% and a specificity of 95%.
Now, this tree did quite well, but what if someone wants a tree with the lowest possible false alarm rate? If we look at the ROC plot in the bottom left corner of the plot above, we can see that tree #2 has a specificity close to 100%. Let’s look at that tree:
plot(iris.fft,
tree = 2) # Show tree #2
As you can see, this tree does indeed have a higher specificity. However, this comes at the cost of a lower sensitivity.
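The same trade-off can be seen without any tree at all. The sketch below uses base R’s built-in iris data as a stand-in for iris.v (an assumption: the column names differ, for example Petal.Width rather than pet.wid) and a one-cue rule with a few arbitrarily chosen thresholds. rule_stats() is a hypothetical helper, not part of FFTrees.

```r
# Base R's built-in iris data, used here as a stand-in for iris.v
virginica <- iris$Species == "virginica"

# Sensitivity and specificity of the one-cue rule "Petal.Width > threshold"
rule_stats <- function(threshold) {
  decide.v <- iris$Petal.Width > threshold
  c(threshold = threshold,
    sens = sum(decide.v & virginica) / sum(virginica),
    spec = sum(!decide.v & !virginica) / sum(!virginica))
}

# Thresholds chosen arbitrarily for illustration
tradeoff <- t(sapply(c(1.4, 1.6, 1.8, 2.0), rule_stats))
round(tradeoff, 2)
```

Raising the threshold makes the rule stricter about deciding "V", so specificity can only stay the same or go up while sensitivity can only stay the same or go down. Choosing between tree #1 and tree #2 is the same kind of decision about which error matters more.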