BranchGLM is a package for fitting GLMs and performing variable selection. Most functions in this package make use of RcppArmadillo and some of them can also make use of OpenMP to perform parallel computations. This vignette introduces the package, provides examples on how to use the main functions in the package and also briefly describes the methods employed by the functions.
BranchGLM can be installed using the
install.packages()
function
install.packages("BranchGLM")
It can also be installed via the install_github()
function from the devtools package.
::install_github("JacobSeedorff21/BranchGLM") devtools
BranchGLM()
allows fitting of gaussian, binomial,
gamma, and poisson GLMs with a variety of links available.grads
argument is for L-BFGS only and it is the
number of gradients that are stored at a time and are used to
approximate the inverse information. The default value for this is 10,
but another common choice is 5.tol
argument controls how strict the convergence
criteria are, lower values of this will lead to more accurate results,
but may also be slower.method
argument is ignored for linear regression
and the OLS solution is used.offset
, it should be a
numeric vector.### Using mtcars
library(BranchGLM)
<- mtcars
cars
### Fitting linear regression model with Fisher scoring
<- BranchGLM(mpg ~ ., data = cars, family = "gaussian", link = "identity")
LinearFit
LinearFit#> Results from gaussian regression with identity link function
#> Using the formula mpg ~ .
#>
#> Estimate SE t p.values
#> (Intercept) 12.30000 18.72000 0.6573 0.51810
#> cyl -0.11140 1.04500 -0.1066 0.91610
#> disp 0.01334 0.01786 0.7468 0.46350
#> hp -0.02148 0.02177 -0.9868 0.33500
#> drat 0.78710 1.63500 0.4813 0.63530
#> wt -3.71500 1.89400 -1.9610 0.06325 .
#> qsec 0.82100 0.73080 1.1230 0.27390
#> vs 0.31780 2.10500 0.1510 0.88140
#> am 2.52000 2.05700 1.2250 0.23400
#> gear 0.65540 1.49300 0.4389 0.66520
#> carb -0.19940 0.82880 -0.2406 0.81220
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 7.0235
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 147 on 21 degrees of freedom
#> AIC: 166
### Fitting gamma regression with inverse link with L-BFGS
<- BranchGLM(mpg ~ ., data = cars, family = "gamma", link = "inverse",
GammaFit method = "LBFGS", grads = 5)
GammaFit#> Results from gamma regression with inverse link function
#> Using the formula mpg ~ .
#>
#> Estimate SE t p.values
#> (Intercept) -6.794e-02 3.085e-02 -2.2020 0.03898 *
#> cyl 1.760e-03 1.946e-03 0.9047 0.37590
#> disp -7.947e-06 3.515e-05 -0.2261 0.82330
#> hp -6.724e-05 4.050e-05 -1.6600 0.11170
#> drat -4.283e-04 2.728e-03 -0.1570 0.87670
#> wt -9.224e-03 3.360e-03 -2.7450 0.01214 *
#> qsec 1.739e-03 1.165e-03 1.4930 0.15040
#> vs -3.087e-04 3.322e-03 -0.0929 0.92690
#> am 6.304e-04 3.441e-03 0.1832 0.85640
#> gear 4.882e-03 2.577e-03 1.8940 0.07204 .
#> carb -1.025e-03 1.481e-03 -0.6926 0.49610
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 0.0087
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 0 on 21 degrees of freedom
#> AIC: 152
#> Algorithm converged in 3 iterations using L-BFGS
coef()
to extract the coefficientslogLik()
to extract the log likelihoodAIC()
to extract the AICBIC()
to extract the BICpredict()
to obtain predictions from the fitted
modelcoefficients
slot of the fitted
modelBranchGLM
object.### Predict method
predict(GammaFit)
#> [1] 21.18646 20.58540 25.08262 19.73260 16.84361 19.66750 14.40825 22.08188
#> [9] 23.72831 18.71060 19.08321 15.34256 16.20899 16.27084 12.61678 12.23244
#> [17] 12.09701 28.31601 31.25370 32.11988 22.42958 17.18607 17.63342 13.73885
#> [25] 15.78752 29.52892 26.61243 30.53420 16.86624 19.39720 14.08588 21.53099
### Accessing coefficients matrix
$coefficients
GammaFit#> Estimate SE t p.values
#> (Intercept) -6.794026e-02 3.085396e-02 -2.20199519 0.03897971
#> cyl 1.760179e-03 1.945517e-03 0.90473592 0.37586830
#> disp -7.947123e-06 3.515071e-05 -0.22608709 0.82331968
#> hp -6.724158e-05 4.049734e-05 -1.66039498 0.11169449
#> drat -4.282730e-04 2.727545e-03 -0.15701775 0.87673070
#> wt -9.224050e-03 3.360483e-03 -2.74485832 0.01213720
#> qsec 1.739266e-03 1.165327e-03 1.49251309 0.15043663
#> vs -3.086809e-04 3.322383e-03 -0.09290948 0.92685616
#> am 6.304156e-04 3.441185e-03 0.18319724 0.85640047
#> gear 4.881920e-03 2.577117e-03 1.89433406 0.07203615
#> carb -1.025451e-03 1.480570e-03 -0.69260584 0.49614538
VariableSelection()
.VariableSelection()
can accept either a
BranchGLM
object or a formula along with the data and the
desired family and link to perform the variable selection.VariableSelection()
returns the final model and some
other information about the search.VariableSelection()
will not properly handle
interaction terms, i.e. it may keep an interaction term while removing
the lower-order terms.keep
can also be specified if any set of variables are
desired to be kept in every model.### Forward selection with mtcars
VariableSelection(GammaFit, type = "forward")
#> Variable Selection Info:
#> --------------------------------------------
#> Variables were selected using forward selection with AIC
#> The best value of AIC obtained was 142
#> Number of models fit: 27
#>
#> Order the variables were added to the model:
#>
#> 1). wt
#> 2). hp
#> --------------------------------------------
#> Final Model:
#> --------------------------------------------
#> Results from gamma regression with inverse link function
#> Using the formula mpg ~ hp + wt
#>
#> Estimate SE t p.values
#> (Intercept) -8.923e-03 2.806e-03 -3.180 0.0034910 **
#> hp -8.887e-05 2.110e-05 -4.212 0.0002245 ***
#> wt -9.826e-03 1.384e-03 -7.100 8.21e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 0.0104
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 0 on 29 degrees of freedom
#> AIC: 142
#> Algorithm converged in 3 iterations using Fisher's scoring
### Backward elimination with mtcars
VariableSelection(GammaFit, type = "backward")
#> Variable Selection Info:
#> --------------------------------------------
#> Variables were selected using backward elimination with AIC
#> The best value of AIC obtained was 142
#> Number of models fit: 49
#>
#> Order the variables were removed from the model:
#>
#> 1). vs
#> 2). drat
#> 3). am
#> 4). disp
#> 5). carb
#> 6). cyl
#> --------------------------------------------
#> Final Model:
#> --------------------------------------------
#> Results from gamma regression with inverse link function
#> Using the formula mpg ~ hp + wt + qsec + gear
#>
#> Estimate SE t p.values
#> (Intercept) -4.691e-02 1.782e-02 -2.633 0.01384 *
#> hp -6.284e-05 3.015e-05 -2.084 0.04675 *
#> wt -9.485e-03 1.819e-03 -5.213 1.719e-05 ***
#> qsec 1.299e-03 7.471e-04 1.739 0.09348 .
#> gear 2.662e-03 1.652e-03 1.612 0.11870
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 0.0091
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 0 on 27 degrees of freedom
#> AIC: 142
#> Algorithm converged in 3 iterations using Fisher's scoring
showprogress
is true, then progress of the branch
and bound algorithm will be reported occasionally.### Branch and bound with mtcars
VariableSelection(GammaFit, type = "branch and bound", showprogress = FALSE)
#> Variable Selection Info:
#> --------------------------------------------
#> Variables were selected using branch and bound selection with AIC
#> The best value of AIC obtained was 142
#> Number of models fit: 74
#>
#>
#> --------------------------------------------
#> Final Model:
#> --------------------------------------------
#> Results from gamma regression with inverse link function
#> Using the formula mpg ~ hp + wt + qsec + gear
#>
#> Estimate SE t p.values
#> (Intercept) -4.691e-02 1.782e-02 -2.633 0.01384 *
#> hp -6.284e-05 3.015e-05 -2.084 0.04675 *
#> wt -9.485e-03 1.819e-03 -5.213 1.719e-05 ***
#> qsec 1.299e-03 7.471e-04 1.739 0.09348 .
#> gear 2.662e-03 1.652e-03 1.612 0.11870
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 0.0091
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 0 on 27 degrees of freedom
#> AIC: 142
#> Algorithm converged in 3 iterations using Fisher's scoring
### Can also use a formula and data
<- VariableSelection(mpg ~ . ,data = cars, family = "gamma",
FormulaVS link = "inverse", type = "branch and bound",
showprogress = FALSE)
### Number of models fit divided by the number of possible models
$numchecked / 2^(length(FormulaVS$variables))
FormulaVS#> [1] 0.07226562
### Extracting final model
$finalmodel
FormulaVS#> Results from gamma regression with inverse link function
#> Using the formula mpg ~ hp + wt + qsec + gear
#>
#> Estimate SE t p.values
#> (Intercept) -4.691e-02 1.782e-02 -2.633 0.01384 *
#> hp -6.284e-05 3.015e-05 -2.084 0.04675 *
#> wt -9.485e-03 1.819e-03 -5.213 1.719e-05 ***
#> qsec 1.299e-03 7.471e-04 1.739 0.09348 .
#> gear 2.662e-03 1.652e-03 1.612 0.11870
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 0.0091
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 0 on 27 degrees of freedom
#> AIC: 142
#> Algorithm converged in 3 iterations using Fisher's scoring
keep
will ensure that those
variables are kept through the selection process.### Example of using keep
VariableSelection(mpg ~ . ,data = cars, family = "gamma",
link = "inverse", type = "branch and bound",
keep = c("hp", "cyl"), metric = "AIC",
showprogress = FALSE)
#> Variable Selection Info:
#> --------------------------------------------
#> Variables were selected using branch and bound selection with AIC
#> The best value of AIC obtained was 143
#> Number of models fit: 38
#> Variables that were kept in each model: hp, cyl
#>
#> --------------------------------------------
#> Final Model:
#> --------------------------------------------
#> Results from gamma regression with inverse link function
#> Using the formula mpg ~ cyl + hp + wt + qsec + gear
#>
#> Estimate SE t p.values
#> (Intercept) -6.464e-02 2.700e-02 -2.3940 0.02418 *
#> cyl 1.412e-03 1.642e-03 0.8603 0.39750
#> hp -7.523e-05 3.321e-05 -2.2650 0.03205 *
#> wt -1.037e-02 2.082e-03 -4.9830 3.517e-05 ***
#> qsec 1.816e-03 9.462e-04 1.9190 0.06603 .
#> gear 3.861e-03 2.155e-03 1.7910 0.08490 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Dispersion parameter taken to be 0.0089
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#>
#> Residual Deviance: 0 on 26 degrees of freedom
#> AIC: 143
#> Algorithm converged in 3 iterations using Fisher's scoring
Table()
creates a confusion matrix based on the
predicted classes and observed classesROC()
creates an ROC curve which can be plotted with
plot()
AUC()
and Cindex()
calculate the area
under the ROC curveMultipleROCCurves()
allows for the plotting of multiple
ROC curves on the same plot### Predicting if a car gets at least 18 mpg
<- ToothGrowth
catData
<- BranchGLM(supp ~ ., data = catData, family = "binomial", link = "logit")
catFit
Table(catFit)
#> Confusion matrix:
#> ----------------------
#> Predicted
#> OJ VC
#>
#> OJ 17 13
#> Observed
#> VC 7 23
#>
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy: 0.6667
#> Sensitivity: 0.7667
#> Specificity: 0.5667
#> PPV: 0.6389
<- ROC(catFit)
catROC
plot(catROC, main = "ROC Curve", col = "indianred")
Cindex(catFit)
#> [1] 0.7127778
AUC(catFit)
#> [1] 0.7127778
### Showing ROC plots for logit, probit, and cloglog
<- BranchGLM(supp ~ . ,data = catData, family = "binomial",
probitFit link = "probit")
<- BranchGLM(supp ~ . ,data = catData, family = "binomial",
cloglogFit link = "cloglog")
MultipleROCCurves(catROC, ROC(probitFit), ROC(cloglogFit),
names = c("Logistic ROC", "Probit ROC", "Cloglog ROC"))
BranchGLM
object.
<- predict(catFit)
preds
Table(preds, catData$supp)
#> Confusion matrix:
#> ----------------------
#> Predicted
#> OJ VC
#>
#> OJ 17 13
#> Observed
#> VC 7 23
#>
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy: 0.6667
#> Sensitivity: 0.7667
#> Specificity: 0.5667
#> PPV: 0.6389
AUC(preds, catData$supp)
#> [1] 0.7127778
ROC(preds, catData$supp) |> plot(main = "ROC Curve", col = "deepskyblue")