clere: Simultaneous Variables Clustering and Regression


Implements an empirical Bayes approach for simultaneous variable clustering and regression. This version also reimplements in C++ an R script proposed by Howard Bondell that fits the Pairwise Absolute Clustering and Sparsity (PACS) methodology (see Sharma et al. (2013), doi: 10.1080/15533174.2012.707849).

Installation

You can install the released version of clere from CRAN with:

install.packages("clere")

You can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("mcanouil/clere")

Citing clere

Yengo L, Jacques J, Biernacki C, Canouil M (2016). “Variable Clustering in High-Dimensional Linear Regression: The R Package clere.” The R Journal, 8(1), 92–106. doi: 10.32614/RJ-2016-006.

@Article{,
  title = {{Variable Clustering in High-Dimensional Linear Regression: The R Package clere}},
  author = {Loïc Yengo and Julien Jacques and Christophe Biernacki and Mickael Canouil},
  journal = {The R Journal},
  year = {2016},
  month = {apr},
  doi = {10.32614/RJ-2016-006},
  pages = {92--106},
  volume = {8},
  number = {1},
}

Example

library(clere)

x <- matrix(rnorm(50 * 100), nrow = 50, ncol = 100)
y <- rnorm(50)

model <- fitClere(y = y, x = x, g = 2, plotit = FALSE)
model
#>  ~~~ Class: Clere ~~~
#>  ~ y : [50] -0.3663  1.0417  0.8401  0.6298  1.3977 -0.4709
#>  ~ x : [50x100]
#>                  1        2        3        4        5        .    
#>         1      0.54299  0.54408  1.73588 -0.05461 -0.94133 ........
#>         2      1.11327 -1.00079 -0.71194 -2.17234  0.38946 ........
#>         3     -0.97223  0.03499 -1.20295 -1.32578 -1.12280 ........
#>         4      0.71881 -0.92304  0.22933  1.22511  0.35874 ........
#>         5     -0.53657  0.01233 -0.72067 -0.10695 -1.71511 ........
#>         .     ........ ........ ........ ........ ........ ........
#> 
#>  ~ n : 50
#>  ~ p : 100
#>  ~ g : 2
#>  ~ nItMC : 50
#>  ~ nItEM : 1000
#>  ~ nBurn : 200
#>  ~ dp : 5
#>  ~ nsamp : 200
#>  ~ sparse : FALSE
#>  ~ analysis : "fit"
#>  ~ algorithm : "SEM"
#>  ~ initialized : FALSE
#>  ~ maxit : 500
#>  ~ tol : 0.001
#>  ~ seed : 945
#>  ~ b : [2]  0.613709 -0.006548
#>  ~ pi : [2] 0.01002 0.98998
#>  ~ sigma2 : 0.5981
#>  ~ gamma2 : 0.0001097
#>  ~ intercept : 0.1022
#>  ~ likelihood : -64.18
#>  ~ entropy : 0
#>  ~ P : [100x2]
#>              Group 1 Group 2
#>         1       0       1   
#>         2       0       1   
#>         3       0       1   
#>         4       0       1   
#>         5       0       1   
#>         .    ....... .......
#> 
#>  ~ theta : [1000x8]
#>                intercept    b1        b2        pi1       pi2        .    
#>          1     -0.03965  -0.02462   0.05342   0.50000   0.50000  .........
#>          2      0.08769  -0.02585   0.05114   0.53000   0.47000  .........
#>          3      0.03731  -0.03260   0.05189   0.46000   0.54000  .........
#>          4     -0.03508  -0.05160   0.05514   0.45000   0.55000  .........
#>          5     -0.08861  -0.06464   0.05811   0.42000   0.58000  .........
#>          .     ......... ......... ......... ......... ......... .........
#> 
#>  ~ Zw : [100x200]
#>        1 2 3 4 5 .
#>      1 1 1 1 1 1 .
#>      2 1 1 1 1 1 .
#>      3 1 1 1 1 1 .
#>      4 1 1 1 1 1 .
#>      5 1 1 1 1 1 .
#>      . . . . . . .
#> 
#>  ~ Bw : [100x200]
#>                     1          2          3          4          5          .     
#>          1      -7.080e-03 -5.029e-03 -1.654e-02 -9.838e-03 -1.157e-02 ..........
#>          2       1.664e-03 -3.672e-05 -1.369e-02  5.982e-03 -1.140e-02 ..........
#>          3      -6.606e-03 -1.330e-02  3.800e-03 -1.147e-02 -1.297e-02 ..........
#>          4      -8.453e-03  8.423e-03 -1.493e-03  5.931e-03  1.637e-02 ..........
#>          5      -2.101e-02 -5.158e-03 -7.439e-03 -8.822e-03 -1.320e-02 ..........
#>          .      .......... .......... .......... .......... .......... ..........
#> 
#>  ~ Z0 : NA
#>  ~ message : NA

plot(model)

clus <- clusters(model, threshold = NULL)
clus
#>   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#>  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#>  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2

predict(model, newx = x + 1)
#>  [1] -0.41014043  0.58417440  0.07565606  0.12651690  0.10691068 -0.51438081
#>  [7] -0.53730664 -0.38719372  0.96052339  1.04059573 -0.47278984 -0.54533104
#> [13] -0.14680892  0.04529841 -0.44803772  0.48753442 -1.03283097 -0.96206984
#> [19]  0.90252948  0.35887126 -0.59158591  0.27172199  0.73862087 -0.13525905
#> [25]  1.14287637  0.37955118 -0.21296002 -0.66091713  0.22797485  0.04944170
#> [31]  0.52612573 -0.15168824 -0.78401104 -0.53532663 -0.44697030  0.19048671
#> [37]  0.10341728 -0.37691391 -0.69165509  0.52461656  0.60826835  0.01190567
#> [43] -0.50238925  0.22288924 -0.28840397 -0.43573542 -0.26704384  0.49779102
#> [49]  0.08028461  0.41752563

summary(model)
#>  -------------------------------
#>  | CLERE | Yengo et al. (2016) |
#>  -------------------------------
#> 
#>  Model object for  2 groups of variables ( user-specified )
#> 
#>  ---
#>  Estimated parameters using SEM algorithm are
#>  intercept = 0.1022
#>  b         =  0.613709   -0.006548
#>  pi        = 0.01002 0.98998
#>  sigma2    = 0.5981
#>  gamma2    = 0.0001097
#> 
#>  ---
#>  Log-likelihood =  -64.18 
#>  Entropy        =  0 
#>  AIC            =  140.36 
#>  BIC            =  151.84 
#>  ICL            =  151.84
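
The example above fits the model to pure noise, so almost every variable ends up in a single group. As a rough sketch of how the clustering might look when the data do carry signal, the snippet below simulates a response driven by the first 10 predictors, all sharing the same coefficient, and reuses only the calls shown above (fitClere, clusters). The simulation settings (seed, coefficient value of 2, number of active predictors) and the object names `beta`, `model_sim`, and `clus_sim` are illustrative assumptions, not taken from the package documentation, and the recovered grouping will vary from run to run.

```r
library(clere)

# Assumed simulation settings, chosen only for illustration
set.seed(42)
n <- 50
p <- 100

# Predictors: same dimensions as in the example above
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Hypothetical true coefficients: two groups (value 2 and value 0)
beta <- c(rep(2, 10), rep(0, p - 10))

# Response with unit-variance Gaussian noise
y <- as.numeric(x %*% beta + rnorm(n))

# Fit with two groups, using the same arguments as the example above
model_sim <- fitClere(y = y, x = x, g = 2, plotit = FALSE)

# Estimated group membership of each of the p variables
clus_sim <- clusters(model_sim, threshold = NULL)

# Cross-tabulate estimated groups against the simulated truth
table(
  estimated = clus_sim,
  truth = ifelse(beta != 0, "beta = 2", "beta = 0")
)
```

Group labels are arbitrary, so the group holding the active predictors may be numbered 1 or 2.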

Getting help

If you encounter a clear bug, please file a minimal reproducible example on GitHub.
For questions and other discussion, please contact the package maintainer.


Please note that this project is released with a Contributor Code of Conduct.
By participating in this project you agree to abide by its terms.