“MatrixExtra” is an R package which extends the sparse matrix and sparse vector classes from the Matrix package, particularly the CSR formats, by providing optimized functions, methods, and operators which exploit the storage order (COO, CSC, CSR) of the inputs and work natively with different formats, such as slicing (e.g. X[1:10,]
), concatenation (e.g. rbind(X, Y)
, cbind(X, Y)
), matrix multiplication with a dense matrix (X %*% Y
), or elementwise multiplication (X * Y
).
A typical matrix is a 2D array which has a given number of rows and columns, and stores tabular data inside - e.g. measures of different variables (columns) for many observations (rows). Behind the scenes, matrices in R are represented as a 1D array with a column-major storage, where element (i,j)
is at position i + (j-1)*nrows
. This is a straightforward concept - performing operations on such matrices is trivial, and allows easy exploitation of modern CPU capabilities such as SIMD and multi-threading.
In many applications, one finds some matrices in which most of the values are exactly zero with only a few different entries - so called “sparse” matrices. For example, in recommender systems, one might construct a matrix of user-item interactions, in which users are rows, items are columns, and values denote e.g. movie ratings or hours spent playing something. Typically, each user interacts with only a handful items, so such a matrix will typically have >99% of the entries set to zero. In such cases, using a typical matrix is wasteful, since it requires creating an array which contains mostly zeros, and doing operations on them is inefficient, since the output of e.g. X * 2
only needs to look at the non-zero entries rather than the full nrows*ncols
entries. Similar situations are encountered in natural language processing (e.g. word counts by documents), social networks (e.g. connections between users), and classification/regression with one-hot/dummy encoded features, among others.
In such cases, it’s more efficient to use a matrix representation that stores only the non-zero values and the indices which are non-zero. In many cases it might even be impossible to represent the full matrix in a computer’s memory due it’s size (e.g. 1,000,000 users and 10,000 movies = 74.5GB, but if only 1% of the entries are non-zero, can be put down to ~1.5GB or less), and it’s thus necessary to perform operations in this sparse representation instead.
Object classes for sparse matrix representations in R are provided by packages like Matrix
or SparseM
(or igraph
for more specialized topics), and those objects - particularly the ones from Matrix
- are accepted and handled efficiently by many other packages such as rsparse or glmnet.
As a general rule, if a given matrix has <5% non-zero values, it is more efficient to do common operations on it in a sparse representation, which typically comes in one of the following formats:
The COO format is the simplest form, consisting of storing all the triplets (row,column,value)
which are non-zero.
The COO format is typically not optimal to operate with, but allows easy conversion to CSR and CSC formats (see below). Nevertheless, some operations such as concatenating inputs (rbind
, cbind
) or elementwise multiplication with a dense matrix (X * Y
) are efficient with a COO representation.
The CSR format, instead of of storing triplets, stores the elements in a row-major format, keeping track only of the column indices and of the positions at which the column indices for a given row start and end. Typically the column indices are meant to be sorted within each row, but this is not strictly assumed by all software or all functions.
The CSR format is optimal for doing row-based operations, such as selecting rows (X[1:1000,]
), concatenating by rows (rbind
), or matrix multiplication with a vector (CSR %*% v
).
The CSC format is the same as the CSR format, but is column-major instead of row-major.
The CSC format is optimal for doing column-based operations, such as selecting columns (X[, 1:1000]
), concatenating by columns (cbind
), and matrix multiplication with a dense matrix in column-major format (like all R’s matrices) as the LHS (Dense %*% CSC
). Typically, tree-based methods work with CSC format.
A vector (single row or single column) can also be represented in a sparse format by keeping track of the indices which are non-zero and the values.
Sparse vectors are typically not used but some operations involving them are fast, such as inner products or matrix multiplication with a CSR matrix as the LHS (CSR %*% v
).
The Matrix
package provides S4 classes to represent all the formats above in R. These objects are handled in a rich hierarchy of different matrix types with multiple inheritance. In general, one should keep in mind the following points:
TsparseMatrix
.RsparseMatrix
.CsparseMatrix
.RsparseMatrix
, but will rather have a class which inherits from it (has RsparseMatrix
as parent class), and be of a different type depending on the type of elements (dsparseMatrix
for numeric values, lsparseMatrix
for logical values, nsparseMatrix
for binary values), and depending on whether they are symmetric, triangular-diagonal, or regular.dgTMatrix
, dgRMatrix
, and dgCMatrix
; but oftentimes when dealing with Matrix
methods, one has to refer to the parent class - e.g. as(X, "RsparseMatrix")
, but not as(X, "dgRMatrix")
(which is what one usually wants to do).Sparse matrices can be created in any of the three formats in Matrix
with the function sparseMatrix
- example:
library(Matrix)
### Will construct this Matrix
### [ 1, 0, 2 ]
### [ 0, 0, 3 ]
### [ 0, 4, 0 ]
### Non-zero coordinates are:
### [(1,1), (1,3), (2,3), (3,2)]
### Row and column coordinates go separate
<- c(1, 1, 2, 3)
row_ix <- c(1, 3, 3, 2)
col_ix <- c(1, 2, 3, 4)
values <- Matrix::sparseMatrix(
X i=row_ix, j=col_ix, x=values,
index1=TRUE, repr="T"
)
X#> 3 x 3 sparse Matrix of class "dgTMatrix"
#>
#> [1,] 1 . 2
#> [2,] . . 3
#> [3,] . 4 .
They can typically be converted to other formats through methods::as
- example:
as(X, "RsparseMatrix")
#> 3 x 3 sparse Matrix of class "dgRMatrix"
#>
#> [1,] 1 . 2
#> [2,] . . 3
#> [3,] . 4 .
Such Matrix
objects have a lot of defined operators and functions so that they could be used as drop-in replacements of base R matrices - e.g.:
+ X
X #> 3 x 3 sparse Matrix of class "dgTMatrix"
#>
#> [1,] 2 . 4
#> [2,] . . 6
#> [3,] . 8 .
The Matrix
package provides most of the functions and methods from base R which one would expect, such as +
, -
, *
, %*%
, rbind
, cbind
, [
, [<-
, sqrt
, norm
, among many others.
However, the whole package is centered around the CSC format, with the provided functions oftentimes converting the input to CSC if it isn’t already, which is inefficient and loses many optimization potentials for operations like CSR[1:100,]
or rbind(COO, CSR)
, to name a few. Examples:
<- as(X, "RsparseMatrix")
Xr ### This will forcibly convert the matrix to triplets
1:2, ]
Xr[#> 2 x 3 sparse Matrix of class "dgTMatrix"
#>
#> [1,] 1 . 2
#> [2,] . . 3
### This will forcibly convert the matrix to CSC
rbind(Xr, Xr)
#> 6 x 3 sparse Matrix of class "dgCMatrix"
#>
#> [1,] 1 . 2
#> [2,] . . 3
#> [3,] . 4 .
#> [4,] 1 . 2
#> [5,] . . 3
#> [6,] . 4 .
### This will forcibly convert the matrix to CSC
* X
X #> 3 x 3 sparse Matrix of class "dgCMatrix"
#>
#> [1,] 1 . 4
#> [2,] . . 9
#> [3,] . 16 .
### This should be a straighforward operation
as(Xr, "ngRMatrix")
#> Error in as(Xr, "ngRMatrix"): no method or default for coercing "dgRMatrix" to "ngRMatrix"
### This doesn't do what it should
as(Xr, "nsparseMatrix")
#> 3 x 3 sparse Matrix of class "ngCMatrix"
#>
#> [1,] | . |
#> [2,] . . |
#> [3,] . | .
Many of these methods can be woefully inefficient when dealing with real, large datasets, particularly when dealing with the CSR format:
library(microbenchmark)
set.seed(1)
<- Matrix::rsparsematrix(1e4, 1e4, .05, repr="C")
X_big_csc <- as(t(X_big_csc), "RsparseMatrix")
X_big_csr microbenchmark({X_slice <- X_big_csr[1:10, ]}, times=10L)
#> Unit: milliseconds
#> expr min lq mean median
#> { X_slice <- X_big_csr[1:10, ] } 137.9918 138.6661 196.1937 219.9392
#> uq max neval
#> 221.5226 226.6653 10
Compare against what should be the mirror operation in CSC format:
microbenchmark({X_slice <- X_big_csc[, 1:10]}, times=10L)
#> Unit: milliseconds
#> expr min lq mean median
#> { X_slice <- X_big_csc[, 1:10] } 4.132839 4.185929 4.494835 4.24812
#> uq max neval
#> 4.294662 6.904135 10
Some operations in Matrix
, even if done natively in CSC format with a CSC input, can still be slower than one would expect and than what could in theory be achieved with different algorithms, oftentimes due to making copies of the data:
microbenchmark({X_col <- X_big_csc[, 100, drop=FALSE]}, times=10L)
#> Unit: milliseconds
#> expr min lq mean
#> { X_col <- X_big_csc[, 100, drop = FALSE] } 4.137548 4.15453 4.341068
#> median uq max neval
#> 4.195232 4.270918 5.647169 10
It should also be kept in mind that Matrix
does not exploit multi-threading in dense-sparse matrix multiplications, which have substantial potential for acceleration:
set.seed(1)
<- matrix(rnorm(1e2*nrow(X_big_csc)), nrow=1e2)
Y_dense microbenchmark({Z <- Y_dense %*% X_big_csc}, times=10L)
#> Unit: milliseconds
#> expr min lq mean median
#> { Z <- Y_dense %*% X_big_csc } 351.0095 352.2786 356.9553 356.8541
#> uq max neval
#> 361.3904 362.7509 10
The CSR sparse format is particularly useful when dealing with machine learning applications - e.g. splitting between a train and test set, tokenizing text features, multiplying a matrix by a vector of coefficients, calculating a gradient observation-by-observation, among others. Many stochastic optimization techniques and libraries (e.g. LibSVM, VowpalWabbit) require the inputs to be in CSR format or alike (see also readsparse), which does not play well with the column-centric methods of Matrix.
In principle, one could stick with just the CSC format from Matrix and keep a mental map of the matrix as being transposed. This however gets complicated rather soon and is very prone to errors. Additionally, one might want to pass sparse matrices to another package whose code is outside of one’s control, for which the storage format can make a large difference in performance.
MatrixExtra
is a package which extends the same classes from Matrix
for COO, CSR, CSC, and sparse vectors, by providing optimized replacements for typical methods which will work without changing the storage format of the matrices when not necessary; and providing some faster replacements of many methods.
library(MatrixExtra)
#> 'MatrixExtra' modifies important behaviors from 'Matrix'. See ?MatrixExtra-options.
MatrixExtra
overrides the show
method of sparse objects with a shorter version with only summary information:
Xr#> Sparse CSR matrix (class 'dgRMatrix')
#> Dimensions: 3 x 3
#> (4 entries, 44.44% full)
This new behavior usually comes handy when one wants to examine large sparse matrices as it will not generate so much print output, but for the examples in here the matrices to examine are small and one would likely want to see them in full instead. This can be controlled with a global option in the package (see ?MatrixExtra-options
for more):
options("MatrixExtra.quick_show" = FALSE)
Xr#> 3 x 3 sparse Matrix of class "dgRMatrix"
#>
#> [1,] 1 . 2
#> [2,] . . 3
#> [3,] . 4 .
The earlier examples would now become:
### This will not change the format
1:2, ]
Xr[#> 2 x 3 sparse Matrix of class "dgRMatrix"
#>
#> [1,] 1 . 2
#> [2,] . . 3
### This will not change the format
rbind(Xr, Xr)
#> 6 x 3 sparse Matrix of class "dgRMatrix"
#>
#> [1,] 1 . 2
#> [2,] . . 3
#> [3,] . 4 .
#> [4,] 1 . 2
#> [5,] . . 3
#> [6,] . 4 .
### This will not change the format
* Xr
Xr #> 3 x 3 sparse Matrix of class "dgRMatrix"
#>
#> [1,] 1 . 4
#> [2,] . . 9
#> [3,] . 16 .
Some of these operations now become much more efficient when the inputs are large:
microbenchmark({X_slice <- X_big_csr[1:10, ]}, times=10L)
#> Unit: microseconds
#> expr min lq mean median uq
#> { X_slice <- X_big_csr[1:10, ] } 322.404 330.99 21768.11 335.9605 344.045
#> max neval
#> 214655.9 10
Other methods, despite having been fast before in Matrix
, will still be replaced with faster versions:
microbenchmark({X_col <- X_big_csc[, 100, drop=FALSE]}, times=10L)
#> Unit: microseconds
#> expr min lq mean
#> { X_col <- X_big_csc[, 100, drop = FALSE] } 353.634 364.333 906.7754
#> median uq max neval
#> 367.704 371.707 5774.207 10
microbenchmark({Z <- Y_dense %*% X_big_csc}, times=10L)
#> Unit: milliseconds
#> expr min lq mean median
#> { Z <- Y_dense %*% X_big_csc } 51.54923 52.54824 52.92596 53.14878
#> uq max neval
#> 53.40562 53.6087 10
Conversions between sparse matrix classes also become easier:
as(Xr, "ngRMatrix")
#> 3 x 3 sparse Matrix of class "ngRMatrix"
#>
#> [1,] | . |
#> [2,] . . |
#> [3,] . | .
::as.csr.matrix(Xr, binary=TRUE)
MatrixExtra#> 3 x 3 sparse Matrix of class "ngRMatrix"
#>
#> [1,] | . |
#> [2,] . . |
#> [3,] . | .
Here’s a non-comprehensive list of operations which are accelerated by MatrixExtra
:
CSR %*% dense
, dense %*% CSC
, tcrossprod(CSR, dense)
, tcrossprod(dense, CSR)
, crossprod(dense, CSC)
, CSR %*% vector
.rbind(CSR, CSR)
, rbind(CSR, COO)
, rbind(CSR, vector)
, rbind(COO, vector)
.cbind(CSR, CSR)
, cbind(CSR, vector)
, cbind(CSR, COO)
.CSR * dense
, CSR * vector
, COO * dense
, COO * vector
, CSR * scalar
, COO * scalar
(and other similarly-working operators like &
, ^
, %
, %%
, %/%
).CSR + CSR
, CSR + COO
, CSC + CSC
, CSC + COO
, CSR + CSC
(and |
).t(CSR)
, t(CSC)
.CSR[i,j]
, CSC[i,j]
, COO[i,j]
.sqrt(CSR)
, norm(CSR)
, diag(CSR)
, among others.Many of the operations with dense types in MatrixExtra
allow inputs of float32
type from the float package, which leads to faster operations; and many of the operations with vector types allow sparse vectors from the same Matrix
package and dense vectors from float
.
In addition, it also provides utility functions which come in handy when sparse matrices are manually constructed or output by a different software, such as functions for sorting the indices or for removing zero-valued and NA
elements.
When one loads MatrixExtra
through library(MatrixExtra)
, it will modify some behaviors from Matrix
in important ways which make them more efficient, but which can cause breakage in code or in packages if they make certain assumptions about Matrix
methods. Among others:
### Here Matrix would return a 'dgRMatrix'
t(Xr)
#> 3 x 3 sparse Matrix of class "dgCMatrix"
#>
#> [1,] 1 . .
#> [2,] . . 4
#> [3,] 2 3 .
### Here Matrix would return a dense vector
1,]
Xr[#> sparse vector (nnz/length = 2/3) of class "dsparseVector"
#> [1] 1 . 2
These behaviors can be changed to their less-optimal versions as would be done by Matrix
, either individually (see ?MatrixExtra-options
) or all at once:
restore_old_matrix_behavior()
set_new_matrix_behavior()
One would wonder what kind of workflows specifically does MatrixExtra
improve upon, and one obvious example would be fitting a logistic regression with gradient-based procedures.
This example here will fit a binary logistic regression with L2 regularization using the L-BFGS-B optimizer in R. For simplicity purposes, the intercept will be calculated by concatenating a column of 1s to the data, but note that this is not the most efficient way of doing it.
The dataset used is the “Real-Simulated” data, downloaded from LibSVM datasets. This is an artificially-generated toy dataset for which it’s easy to achieve almost-perfect accuracy, but it’s nevertheless a large-ish dataset in which the improved methods and operators here become noticeable.
Loading the data:
library(readsparse)
<- readsparse::read.sparse("real-sim")
data <- data$X
X <- as.numeric(factor(data$y))-1 ### convert to 0/1
y
X#> Sparse CSR matrix (class 'dgRMatrix')
#> Dimensions: 72,309 x 20,958
#> (3,709,083 entries, 0.24% full)
Adding the intercept and creating a 50-50 train-test split:
<- cbind(rep(1, nrow(X)), X) ### Accelerated by 'MatrixExtra'
X set.seed(1)
<- sample(nrow(X), floor(.5*nrow(X)), replace=FALSE)
ix_train <- X[ix_train,] ### Accelerated by 'MatrixExtra'
X_train <- y[ix_train]
y_train <- X[-ix_train,] ### Accelerated by 'MatrixExtra'
X_test <- y[-ix_train] y_test
Now fitting the model:
<- function(coefs, X, y, lambda) {
logistic_fun <- 1 / (1 + exp(-as.numeric(X %*% coefs))) ### Accelerated by 'MatrixExtra'
pred <- mean(y * log(pred) + (1 - y) * log(1 - pred))
ll <- lambda * as.numeric(coefs %*% coefs)
reg ### Don't regularize the intercept
<- reg - lambda * (coefs[1]^2)
reg return(-ll + reg)
}
<- function(coefs, X, y, lambda) {
logistic_grad <- 1 / (1 + exp(-(X %*% coefs))) ### Accelerated by 'MatrixExtra'
pred <- colMeans(X * as.numeric(pred - y)) ### Accelerated by 'MatrixExtra'
grad <- grad + 2 * lambda * as.numeric(coefs)
grad ### Don't regularize the intercept
1] <- grad[1] - 2 * lambda * coefs[1]
grad[return(as.numeric(grad))
}
<- 1e-5 ### <- Regularization parameter
lambda <- optim(numeric(ncol(X_train)),
res
logistic_fun,
logistic_grad,method="L-BFGS-B",
X_train, y_train, lambda)<- res$par fitted_coefs
Verify that the model has good performance:
<- as.numeric(X_test %*% fitted_coefs)
y_hat_test ::AUC(y_hat_test, y_test)
MLmetrics#> [1] 0.9947617
Timing the optimizer:
<- numeric(ncol(X_train))
x0 ::microbenchmark({
microbenchmark<- optim(x0,
res
logistic_fun,
logistic_grad,method="L-BFGS-B",
X_train, y_train, lambda)times=10L)
}, #> Unit: seconds
#> expr
#> { res <- optim(x0, logistic_fun, logistic_grad, method = "L-BFGS-B", X_train, y_train, lambda) }
#> min lq mean median uq max neval
#> 4.02818 4.054176 4.116148 4.077869 4.176172 4.273774 10
The same routine using Matrix
would usually take around 7 seconds (~60% slower) in this same setup, plus some extra time in the data preparation. The only thing that was needed to accelerate it was to load library(MatrixExtra)
, with everything else remaining the same as it would have been in base R or Matrix
.