This is the R-package accompanying the paper (Convex optimization for the densest subgraph and densest submatrix problems.
The problem of identifying a dense submatrix is a fundamental problem in the analysis of matrix structure and complex networks. This package provides tools for identifying the densest submatrix of a given graph using first-order optimization methods.
See the tutorials below to get started.
Let [M]={1,2,…,M} for each positive integer M. Given a matrix A∈RM×N, the densest m×n-submatrix problem seeks subsets ˉU⊆[M] and ˉV⊆[N] of cardinality |ˉU|=m and |ˉV|=n, respectively, such that the submatrix A[ˉU,ˉV] with rows index by ˉU and columns indexed by ˉV contains the maximum number of nonzero entries. That is, the densest m×n-submatrix problem seeks the densest m×n-submatrix of A.
The densest m×n-submatrix problem can be formulated as:
min
where
\mathrm{P}_{\Omega} is the projection onto the index set of zero entries of matrix \mathbf A;
\operatorname{Tr} is the matrix trace function;
\Omega is the index set of zero entries of matrix A;
\mathbf{X} is rank-one matrix with mn nonzero entries;
\mathbf{Y} is used to count the number of disagreements between \mathbf A and \mathbf X;
\mathbf{e} - all-ones vector.
Unfortunately, optimization problems involving rank and binary constraints are intractable in general.
Relaxing the rank constraint with a nuclear norm penalty term, \|\mathbf{X} \|_* = \sum_{i=1}^N \sigma_i(\mathbf{X}), i.e., the sum of the singular values of matrix, and the binary constraints with box constraints yields the convex problem:
\begin{align} \min \; & \|\mathbf{X} \|_* + \gamma \operatorname{Tr}(\mathbf{Y} \mathbf{e} \mathbf{e}^T) \\ \operatorname{s.t.}\; & \operatorname{Tr}(\mathbf{X} \mathbf{e} \mathbf{e}^T) = mn, \\ & \mathrm{P}_\Omega(\mathbf{X} - \mathbf{Y}) = \mathbf{0}, \\ & \mathbf{Y} \ge \mathbf{0}, \\ & \mathbf{0} \le \mathbf{X} \le \mathbf{e} \mathbf{e}^T, \end{align} where \gamma >0 is a regularization parameter chosen to tune between the two objectives.
It can be shown that the relaxation is exact when binary matrix \mathbf{A} contains a single, relatively large dense m\times n block. For more information, see (Convex optimization for the densest subgraph and densest submatrix problems
The alternating direction method of multipliers (ADMM) has been succesfully used in a broad spectrum of applications. The ADMM solves convex optimization problems with composite objective functions subject to equality constraints.
We direct the reader to Prof. Stephen Boyd’s website (ADMM) for a more thorough discussion of the ADMM.
To apply ADMM to our problem, we introduce artificial variables \mathbf{Q}, \mathbf{W} and \mathbf{Z} to obtain the equivalent optimization problem:
\begin{align} \min \; & \|\mathbf{X} \|_* + \gamma \|\mathbf{Y} \|_1 +{1}_{\Omega_Q}(\mathbf{Q})+{1}_{\Omega_W}(\mathbf{W})+{1}_{\Omega_Z}(\mathbf{Z})\\ \operatorname{s.t.}\; & \mathbf{X}-\mathbf{Y}=\mathbf{Q},\mathbf{X}=\mathbf{W}, \mathbf{X}=\mathbf{Z} \end{align}
where
Here {1}_{S}: R^{M\times M} \rightarrow \left \{0,+\infty \right \} is the indicator function of the set S \subseteq R^{M\times N}, such that {1}_S(\mathbf{X})=0 if \mathbf{X}\in S, and +\infty otherwise.
Since our objective function is separable, we iteratively solve this optimization program using the ADMM. The basic idea is to rotate through 3 steps:
Interested readers are referred to (Convex optimization for the densest subgraph and densest submatrix problems. We include a summary of the algorithm below.
We test this package on two different types of data: first, using random matrices sampled from the planted dense m \times n submtarix model and, second, real-world collaboration and communication networks.
We first generate a random matrix with noise obscuring the planted submatrix using the function plantedsubmatrix
. and then call the function densub
to recover the planted submatrix.
# Initialize problem size and densities
# You can play around with these parameters
M=100 #number of rows of sampled matrix
N=200 #number of columns of sampled matrix
m=50 #number of rows of dense submatrix
n=40 #number of columns of dense submatrix
p=0.25 # noise density
q=0.85 #in-group density
#Make binary matrix with mn-submatrix
random<-plantedsubmatrix(M=M, N=N,m=m,n=n,p=p,q=q)
After generating the structure random
containing the random matrix with desired planted structure, we can visually represent the matrix and planted submatrix as two-tone images, where dark pixels correspond to nonzero entries, and light pixels correspond to zero entries, using the following commands.
# Plot sampled G and matrix representations.
image(random$sampled_matrix, useRaster=TRUE, axes=FALSE, main = "Matrix A")
image(random$dense_submatrix, useRaster=TRUE, axes=FALSE, main = "Matrix X0")
image(random$disagreements, useRaster=TRUE, axes=FALSE, main = "Matrix Y0")
Tne vizualization of the randomly generated matrix \mathbf{A} helps us to understand its structure. It is clear that \mathbf{A} contains a dense 50 \times 40 block (in the bottom left corner).
We can remove all noise and isolate an image of a rank-one matrix \mathbf{X0} with mn nonzero entries.
Then we vizualize matrix \mathbf{Y0} to see the number of disagreements between original matrix \mathbf{A} and \mathbf{X0}.
We call the ADMM solver and visualize the output using the following commands.
#Call ADMM solver
admm<-densub(G=random$sampled_matrix, m=m, n=n, tau = 0.35, gamma = 6/(sqrt(m*n)*(q-p)), opt_tol = 1.0e-4,maxiter=500, quiet = TRUE)
#Plot results
image(admm$X, useRaster=TRUE, axes=FALSE, main = "Matrix X")
image(admm$Y, useRaster=TRUE, axes=FALSE, main = "Matrix Y")
The ADMM solver returns the optimal solutions \mathbf{X} and \mathbf{Y}. It must be noted that matrices \mathbf X and \mathbf Y are identical to the actual structures of \mathbf{X_0} and \mathbf{Y_0}. The planted submatrix is recovered.
The following is a simple example on how one could use the package to analyze the collaboration network found in the JAZZ dataset. It is known that this network contains a cluster of 100 musicians which performed together.
We have already prepared dataset to work with. More details can be found in the provided file JAZZ_IN_R.R.
#Load dataset
load(file="JAZZ.RData")
#Initialize problem size and densities
G=new #define matrix G equivalent to JAZZ dataset
m=100 #clique size or the number of rows of the dense submatrix
n=100 #clique size of the number of columns of the dense sumbatrix
tau=0.85 #regularization parameter
opt_tol=1.0e-2 #optimal tolerance
verbose=1
maxiter=2000 #number of iterations
gamma=8/n #regularization parameter
#call ADMM solver
admm <- densub(G = G, m = m, n = n, tau = tau, gamma = gamma, opt_tol = opt_tol, maxiter=maxiter, quiet = TRUE)
# Planted solution X0.
X0=matrix(0L, nrow=198, ncol=198) #construct rank-one matrix X0
X0[1:100,1:100]=matrix(1L, nrow=100, ncol=100)#define dense block
# Planted solution Y0.
Y0=matrix(0L, nrow=198, ncol=198) #construct matrix for counting disagreements between G and X0
Y0[1:100,1:100]=matrix(1L,nrow=100,ncol=1000)-G[1:100,1:100]
#Check primal and dual residuals.
C=admm$X-X0
a=norm(C, "F") #Frobenius norm of matrix C
b=norm(X0,"F") #Frobenius norm of matrix X0
recovery = matrix(0L,nrow=1, ncol=1)#create recovery matrix
if (a/b^2<opt_tol){
recovery=recovery+1
} else {
recovery=0 #Recovery condition
}
Our algorithm converges to the dense submatrix representing the community of 100 musicians after 50 iterations.