Automate your work

Krystian Igras

2019-09-24

This document presents how you can customize xspliner usage to you make it more easier and automated.

Hierarchy of parameters

Local transition and effect

In previous sections we learned how to specify parameters for the effect and transition of single variable. Let’s name them local parameters.

A quick example you can see below:

library(xspliner)
library(randomForest)

rf_iris <- randomForest(Petal.Width ~  Sepal.Length + Petal.Length + Species, data = iris)
model_xs <- xspline(Petal.Width ~ 
  Sepal.Length + 
  xs(Petal.Length, effect = list(grid.resolution = 100), transition = list(bs = "cr")) + 
  xf(Species, transition = list(stat = "loglikelihood", value = 4)),
  model = rf_iris)
summary(model_xs)
## 
## Call:
## stats::glm(formula = Petal.Width ~ Sepal.Length + xs(Petal.Length) + 
##     xf(Species), family = family, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.66137  -0.07612  -0.02611   0.10377   0.45279  
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.26770    0.20907  -6.064 1.10e-08 ***
## Sepal.Length           0.03636    0.03515   1.034   0.3027    
## xs(Petal.Length)       1.76230    0.33984   5.186 7.11e-07 ***
## xf(Species)versicolor  0.13297    0.16269   0.817   0.4151    
## xf(Species)virginica   0.44571    0.22056   2.021   0.0451 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03105807)
## 
##     Null deviance: 86.5699  on 149  degrees of freedom
## Residual deviance:  4.5034  on 145  degrees of freedom
## AIC: -88.188
## 
## Number of Fisher Scoring iterations: 2

When the black box model is based on higher amount of variables it can be problematic to specify local parameters for each predictor. Also formula becomes large and hard to read.

Global transition and effect

Its more common that we use similar configuration for each variable, as it’s simple and allows us to build and experiment faster. To do this you can specify xs_opts and xf_opts of xspline function. Let’s name them global parameters.

Each of global parameters can specify effect and transition, that should be used for xs and xf transformations respectively. Then you just need to use base xs and xf symbols without parameters:

model_xs <- xspline(Petal.Width ~ 
  Sepal.Length + 
  xs(Petal.Length) + 
  xf(Species),
  model = rf_iris, 
  xs_opts = list(effect = list(grid.resolution = 100), transition = list(bs = "cr")),
  xf_opts = list(transition = list(stat = "loglikelihood", value = 4))
)
summary(model_xs)
## 
## Call:
## stats::glm(formula = Petal.Width ~ Sepal.Length + xs(Petal.Length) + 
##     xf(Species), family = family, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.66137  -0.07612  -0.02611   0.10377   0.45279  
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.26770    0.20907  -6.064 1.10e-08 ***
## Sepal.Length           0.03636    0.03515   1.034   0.3027    
## xs(Petal.Length)       1.76230    0.33984   5.186 7.11e-07 ***
## xf(Species)versicolor  0.13297    0.16269   0.817   0.4151    
## xf(Species)virginica   0.44571    0.22056   2.021   0.0451 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03105807)
## 
##     Null deviance: 86.5699  on 149  degrees of freedom
## Residual deviance:  4.5034  on 145  degrees of freedom
## AIC: -88.188
## 
## Number of Fisher Scoring iterations: 2

But still you can specify local parameters that override the global ones.

model_xs <- xspline(Petal.Width ~ 
  xs(Sepal.Length, transition = list(k = 10)) + 
  xs(Petal.Length) + 
  xf(Species),
  model = rf_iris, 
  xs_opts = list(effect = list(grid.resolution = 100), transition = list(bs = "cr")),
  xf_opts = list(transition = list(stat = "loglikelihood", value = 4))
)
summary(model_xs)
## 
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) + 
##     xf(Species), family = family, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.66224  -0.07961  -0.02923   0.09751   0.45797  
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.4789     0.2850  -5.189 7.02e-07 ***
## xs(Sepal.Length)        0.4168     0.3465   1.203   0.2310    
## xs(Petal.Length)        1.6848     0.3622   4.651 7.37e-06 ***
## xf(Species)versicolor   0.1562     0.1668   0.936   0.3507    
## xf(Species)virginica    0.4808     0.2276   2.112   0.0364 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03097806)
## 
##     Null deviance: 86.5699  on 149  degrees of freedom
## Residual deviance:  4.4918  on 145  degrees of freedom
## AIC: -88.575
## 
## Number of Fisher Scoring iterations: 2

In this case last_evaluation variable will be transformed with thin plate regression spline (bs = "tp" is default for mgcv::s) with basis dimension equal to 10. At the same time average_monthly_hours will be transformed with cubic splines.

What if you forget specifying local and global parameters?

Default transition and effect

As you can see in project objects reference, you may find there xs_opts_edfault and xf_opts_default objects. These objects specify default parameters for xs and xf transformations.

xs_opts_default
## $effect
## $effect$type
## [1] "pdp"
## 
## 
## $transition
## $transition$alter
## [1] "always"
## 
## $transition$monotonic
## [1] "not"
xf_opts_default
## $effect
## $effect$type
## [1] "ice"
## 
## 
## $transition
## $transition$alter
## [1] "always"
## 
## $transition$stat
## [1] "GIC"
## 
## $transition$value
## [1] 3

Default parameters are overwritten by global and local ones. So the hierarchy of parameters importance is as follows:

LOCAL > GLOBAL > DEFAULT

But having model based on the huge amount of variables still requires a lot of our work to build the final model. Especially we need to use xs and xf symbols for each variable. How can we make it easier?

Automatic transformation

Transform each formula predictor

If you want to transform each predictor of specified formula and not using xs and xf variables you can omit it using consider parameter for xspline function.

Possible values are "specials" the default one and "all". For automatic transformation of each variable without specifying xs and xf symbols just set consider = "all" and pass standard formula into xspline function:

model_xs <- xspline(Petal.Width ~ Sepal.Length  + Petal.Length + Species,
  model = rf_iris,
  consider = "all"
)
summary(model_xs)
## 
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) + 
##     xf(Species), family = family, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.66641  -0.07500  -0.03058   0.09789   0.46614  
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.4846     0.2875  -5.163 7.87e-07 ***
## xs(Sepal.Length)        0.4397     0.3483   1.262    0.209    
## xs(Petal.Length)        1.6579     0.3583   4.627 8.15e-06 ***
## xf(Species)versicolor   0.1689     0.1644   1.027    0.306    
## xf(Species)virginica    0.4984     0.2246   2.219    0.028 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03099086)
## 
##     Null deviance: 86.5699  on 149  degrees of freedom
## Residual deviance:  4.4937  on 145  degrees of freedom
## AIC: -88.513
## 
## Number of Fisher Scoring iterations: 2

Then each predictor is transformed with xs and xf symbols and use of default parameters or global ones when specified.

xs is used for integer and numeric variables - xf for factors.

By default xspline function tries to extract the data from model (rf_model) call and xspline’s parent.frame() environment then uses it to determine predictor types. So to be sure that variable types are sourced correctly a good practice here is to add data parameter, storing black box’s training data.

model_xs <- xspline(Petal.Width ~ Sepal.Length  + Petal.Length + Species,
  model = rf_iris,
  data = iris,
  consider = "all"
)
summary(model_xs)
## 
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) + 
##     xf(Species), family = family, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.66641  -0.07500  -0.03058   0.09789   0.46614  
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.4846     0.2875  -5.163 7.87e-07 ***
## xs(Sepal.Length)        0.4397     0.3483   1.262    0.209    
## xs(Petal.Length)        1.6579     0.3583   4.627 8.15e-06 ***
## xf(Species)versicolor   0.1689     0.1644   1.027    0.306    
## xf(Species)virginica    0.4984     0.2246   2.219    0.028 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03099086)
## 
##     Null deviance: 86.5699  on 149  degrees of freedom
## Residual deviance:  4.4937  on 145  degrees of freedom
## AIC: -88.513
## 
## Number of Fisher Scoring iterations: 2

Transform only continuous or discrete variables

In some cases you may want to transform only quantitative or qualitative predictors. Looking into default parameters xs_opts_default and xf_opts_default we may find alter parameter for transition.

The parameter is used to specify if predictor for which xs or xf was specified needs to be transformed or used as a bare variable in formula. You can specify it in the local or global transition parameter. In this case using the global one is more reasonable.

So, in order to transform only continuous variables just set alter = "always" (what is default) for xs_opts and alter = "never" for xf_opts:

model_xs <- xspline(Petal.Width ~ Sepal.Length + Petal.Length + Species,
  model = rf_iris,
  data = iris,
  consider = "all",
  xf_opts = list(transition = list(alter = "never"))
)
summary(model_xs)
## 
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) + 
##     Species, family = family, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.66641  -0.07500  -0.03058   0.09789   0.46614  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.4846     0.2875  -5.163 7.87e-07 ***
## xs(Sepal.Length)    0.4397     0.3483   1.262    0.209    
## xs(Petal.Length)    1.6579     0.3583   4.627 8.15e-06 ***
## Speciesversicolor   0.1689     0.1644   1.027    0.306    
## Speciesvirginica    0.4984     0.2246   2.219    0.028 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03099086)
## 
##     Null deviance: 86.5699  on 149  degrees of freedom
## Residual deviance:  4.4937  on 145  degrees of freedom
## AIC: -88.513
## 
## Number of Fisher Scoring iterations: 2

For transformation of factors only:

model_xs <- xspline(Petal.Width ~ Sepal.Length + Petal.Length + Species,
  model = rf_iris,
  data = iris,
  consider = "all",
  xs_opts = list(transition = list(alter = "never"))
)
summary(model_xs)
## 
## Call:
## stats::glm(formula = Petal.Width ~ Sepal.Length + Petal.Length + 
##     xf(Species), family = family, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.63796  -0.07848  -0.01257   0.09863   0.47749  
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -0.084592   0.172136  -0.491  0.62387    
## Sepal.Length          -0.001693   0.044135  -0.038  0.96945    
## Petal.Length           0.231921   0.052796   4.393 2.15e-05 ***
## xf(Species)versicolor  0.432659   0.125048   3.460  0.00071 ***
## xf(Species)virginica   0.834121   0.173212   4.816 3.65e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03249376)
## 
##     Null deviance: 86.5699  on 149  degrees of freedom
## Residual deviance:  4.7116  on 145  degrees of freedom
## AIC: -81.41
## 
## Number of Fisher Scoring iterations: 2

Even having learned the above listed options, we still can have problems with handling model with a huge amount of variables.

Model based dot formula

For many existing models in R we usually can specify “dot formulas”, when used predictors are sourced from provided data. xspliner can also handle the form. Let’s return here for iris random forest model.

model_xs <- xspline(Petal.Width ~ ., model = rf_iris)
summary(model_xs)
## 
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) + 
##     xf(Species), family = family, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.66641  -0.07500  -0.03058   0.09789   0.46614  
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.4846     0.2875  -5.163 7.87e-07 ***
## xs(Sepal.Length)        0.4397     0.3483   1.262    0.209    
## xs(Petal.Length)        1.6579     0.3583   4.627 8.15e-06 ***
## xf(Species)versicolor   0.1689     0.1644   1.027    0.306    
## xf(Species)virginica    0.4984     0.2246   2.219    0.028 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03099086)
## 
##     Null deviance: 86.5699  on 149  degrees of freedom
## Residual deviance:  4.4937  on 145  degrees of freedom
## AIC: -88.513
## 
## Number of Fisher Scoring iterations: 2

Good practice here is to provide data parameter as well to detect predictors classes, and model type (classification or regression).

How predictors are sourced?

xspline function tries to establish data in the following order (if any way is not possible, it tries the next one):

To assure correct predictors usage, you may also specify predictor names vector in predictors parameter, and data (optional) to assure source of variable classes:

model_xs <- xspline(Petal.Width ~ ., 
                    model = rf_iris,
                    predictors = colnames(iris)[-c(2, 4)],
                    data = iris)
summary(model_xs)
## 
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) + 
##     xf(Species), family = family, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.66641  -0.07500  -0.03058   0.09789   0.46614  
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.4846     0.2875  -5.163 7.87e-07 ***
## xs(Sepal.Length)        0.4397     0.3483   1.262    0.209    
## xs(Petal.Length)        1.6579     0.3583   4.627 8.15e-06 ***
## xf(Species)versicolor   0.1689     0.1644   1.027    0.306    
## xf(Species)virginica    0.4984     0.2246   2.219    0.028 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03099086)
## 
##     Null deviance: 86.5699  on 149  degrees of freedom
## Residual deviance:  4.4937  on 145  degrees of freedom
## AIC: -88.513
## 
## Number of Fisher Scoring iterations: 2

In above examples each predictor is transformed by default. You can exclude needed, by specifying global alter = "never" parameters, or bare.

This way we are ready to work with many predictors. Can the approach be simpler?

Omit formula

As we could see in previous section, using dot formula the predictors are sourced from provided black box. Why cannot we fully extract formula from black box? We can.

Let’s use previously built rf_iris model:

model_xs <- xspline(rf_iris)
summary(model_xs)
## 
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) + 
##     xf(Species), family = family, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.66641  -0.07500  -0.03058   0.09789   0.46614  
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.4846     0.2875  -5.163 7.87e-07 ***
## xs(Sepal.Length)        0.4397     0.3483   1.262    0.209    
## xs(Petal.Length)        1.6579     0.3583   4.627 8.15e-06 ***
## xf(Species)versicolor   0.1689     0.1644   1.027    0.306    
## xf(Species)virginica    0.4984     0.2246   2.219    0.028 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03099086)
## 
##     Null deviance: 86.5699  on 149  degrees of freedom
## Residual deviance:  4.4937  on 145  degrees of freedom
## AIC: -88.513
## 
## Number of Fisher Scoring iterations: 2

Works! Can it be simpler? Actually not because of black box based transformation and theory, but we can provide some model based parameters upfront using DALEX’s explainer object (see next section).

Unfortunately this simplicity works thanks to a few approaches that randomForest follows.

They are:

Most of R based ML models satisfy the above condition. One of exceptions is XGBoost which in actual xspliner version needs to be handled in the more complex xspline call. You can see it in use cases

Excluding predictors from transformation

For this example consider again Boston Housing Data from pdp package, and build simple svm model for predicting chas variable:

library(pdp)
library(e1071)
data(boston)
svm_boston <- svm(chas ~ cmedv + rad + lstat, data = boston, probability = TRUE)
str(boston[, "rad"])
##  int [1:506] 1 2 2 3 3 3 5 5 5 5 ...
unique(boston$rad)
## [1]  1  2  3  5  4  8  6  7 24

As we can see rad variable is integer and has only 9 unique values. As a result spline approximation may be misleading, and not possible to perform. We decide here to omit rad variable when performing transformation, nevertheless remaining predictors should be transformed.

At first setup model based transformation:

xs_model <- xspline(svm_boston)
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): A term has fewer unique covariate combinations than specified maximum degrees of freedom

As we can see, the error was returned due to the form of rad variable.

How to exclude rad from transformation? We can use xspline’s bare parameter, responsible for specifying predictor which shouldn’t be transformed.

xs_model <- xspline(
  svm_boston,
  bare = "rad")
summary(xs_model)
## 
## Call:
## stats::glm(formula = chas ~ xs(cmedv) + rad + xs(lstat), family = family, 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8194  -0.3415  -0.3391  -0.3367   2.4276  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -8.734e+01  9.695e+01  -0.901  0.36765   
## xs(cmedv)   -7.638e+01  2.954e+01  -2.586  0.00972 **
## rad         -8.446e-04  2.149e-02  -0.039  0.96864   
## xs(lstat)    1.126e+01  9.231e+01   0.122  0.90294   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 254.5  on 505  degrees of freedom
## Residual deviance: 244.4  on 502  degrees of freedom
## AIC: 252.4
## 
## Number of Fisher Scoring iterations: 5

As we can see in the summary, rad variable was omitted during transformation.

Note

The option is available only for xspline build on top of model or dot formula.

Integration with DALEX

As mentioned in the previous section, xspliner is integrated with DALEX package. The main function from the package explain returns useful black box data (such as bare black model or training data) that can be used by xspline function.

Just check below example

library(DALEX)
rf_boston <- randomForest(lstat ~ cmedv + crim + chas, data = boston)
explainer <- explain(rf_boston, label = "boston")
## Preparation of a new explainer is initiated
##   -> model label       :  boston 
##   -> data              :  506  rows  4  cols (extracted from the model)
##   -> target variable   :  not specified! (WARNING)
##   -> predict function  :  yhat.randomForest  will be used (default)
##   -> predicted values  :  numerical, min =  6.107818 , mean =  12.63189 , max =  24.64083  
##   -> residual function :  difference between y and yhat (default)
## A new explainer has been created!
model <- xspline(
  explainer
)
summary(model)
## 
## Call:
## stats::glm(formula = lstat ~ xs(cmedv) + xs(crim) + xf(chas), 
##     family = family, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.3957   -2.3779   -0.6903    1.6758   21.1446  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -16.64606    1.55497 -10.705  < 2e-16 ***
## xs(cmedv)     1.71836    0.07332  23.437  < 2e-16 ***
## xs(crim)      0.59480    0.15480   3.842 0.000137 ***
## xf(chas)1     1.42345    0.69541   2.047 0.041186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 15.45014)
## 
##     Null deviance: 25752  on 505  degrees of freedom
## Residual deviance:  7756  on 502  degrees of freedom
## AIC: 2827.2
## 
## Number of Fisher Scoring iterations: 2

You can provide your own xspline’s parameters that overwrite that sourced from explainer.