In this work, we provide the framework to analyze multiresolution partitions (e.g. country, provinces, subdistrict) where each individual data point belongs to only one partition in each layer (e.g. i belongs to subdistrict A, province P, and country Q).
We assume that a partition in a higher layer subsumes lower-layer partitions (e.g. a nation is at the 1st layer subsumes all provinces at the 2nd layer).
Given N individuals that have a pair of real values (x,y) that generated from independent variable X and dependent variable Y. Each individual i belongs to one partition per layer.
Our goal is to find which partitions at which highest level that all individuals in the these partitions share the same linear model Y=f(X) where f is a linear function.
The framework deploys the Minimum Description Length principle (MDL) to infer solutions.
For the newest version on github, please call the following command in R terminal.
This requires a user to install the “remotes” package before installing MRReg.
In the first step, we generate a simulation dataset.
All simulation types have three layers except the type 4 has four layers.
The type-1 simulation has all individuals belong to the same homogeneous partition in the first layer.
The type-2 simulation has four homogeneous partitions in a second layer. Each partition has its own models.
The type-3 simulation has eight homogeneous partitions in a third layer. Each partition has its own models
The type-4 simulation has one homogeneous partition in a second layer, four homogeneous partitions in a third layer, and eight homogeneous partitions in a fourth layer. Each partition has its own model.
In this example, we use type-4 simulation.
library(MRReg)
# Generate simulation data type 4 by having 100 individuals per homogeneous partition.
DataT<-SimpleSimulation(100,type=4)
gamma <- 0.05 # Gamma parameter
out<-FindMaxHomoOptimalPartitions(DataT,gamma)
Then we plot the optimal homogeneous tree.
plotOptimalClustersTree(out)
The red nodes are homogeneous partitions. All children of a homogeneous partition node share the same linear model.
Lastly, we can print the result in text form.
PrintOptimalClustersResult(out, selFeature = TRUE)
The result is below.
[1] "========== List of Optimal Clusters =========="
[1] "Layer2,ClS-C1:clustInfoRecRatio=0.08,modelInfoRecRatio=0.72, eta(C)cv=1.00"
[1] "Selected features"
[1] 2
[1] "Layer3,ClS-C11:clustInfoRecRatio=0.10,modelInfoRecRatio=0.63, eta(C)cv=1.00"
[1] "Selected features"
[1] 2
[1] "Layer3,ClS-C12:clustInfoRecRatio=0.10,modelInfoRecRatio=0.70, eta(C)cv=1.00"
[1] "Selected features"
[1] 3
[1] "Layer3,ClS-C13:clustInfoRecRatio=0.10,modelInfoRecRatio=0.68, eta(C)cv=1.00"
[1] "Selected features"
[1] 4
[1] "Layer3,ClS-C14:clustInfoRecRatio=0.09,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 5
[1] "Layer4,ClS-C21:clustInfoRecRatio=NA,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 2
[1] "Layer4,ClS-C22:clustInfoRecRatio=NA,modelInfoRecRatio=0.58, eta(C)cv=1.00"
[1] "Selected features"
[1] 3
[1] "Layer4,ClS-C23:clustInfoRecRatio=NA,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 4
[1] "Layer4,ClS-C24:clustInfoRecRatio=NA,modelInfoRecRatio=0.46, eta(C)cv=1.00"
[1] "Selected features"
[1] 5
[1] "Layer4,ClS-C25:clustInfoRecRatio=NA,modelInfoRecRatio=0.55, eta(C)cv=1.00"
[1] "Selected features"
[1] 6
[1] "Layer4,ClS-C26:clustInfoRecRatio=NA,modelInfoRecRatio=0.60, eta(C)cv=1.00"
[1] "Selected features"
[1] 7
[1] "Layer4,ClS-C27:clustInfoRecRatio=NA,modelInfoRecRatio=0.63, eta(C)cv=1.00"
[1] "Selected features"
[1] 8
[1] "Layer4,ClS-C28:clustInfoRecRatio=NA,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 9
[1] "min eta(C)cv:1.000000"
Note for selected features: 1 is reserved for an intercept, and d is a selected feature if Y[i] ~ X[i,d-1] in linear model. Note that the clustInfoRecRatio values are always NA for last-layer partitions.
INPUT: DataT$X[i,j] is the value of jth independent variable of ith individual.
INPUT: DataT$Y[i] is the value of dependent variable of ith individual.
INPUT: DataT$clsLayer[i,k] is the cluster label of ith individual in kth cluster layer.
OUTPUT: out$Copt[p,1] is equal to k implies that a cluster that is a pth member of the maximal homogeneous partition is at kth layer and the cluster name in kth layer is Copt[p,2]
OUTPUT: out$Copt[p,3] is “Model Information Reduction Ratio” of pth member of the maximal homogeneous partition: positive means the linear model is better than the null model.
OUTPUT: out$Copt[p,4] is \(\eta( {C} )_{\text{cv}}\) of pth member of the maximal homogeneous partition. The greater Copt[p,4], the higher homogeneous degree of this cluster.
OUTPUT: out$models[[k]][[j]] is the linear regression model of jth cluster in kth layer.
OUTPUT: out\(models[[k]][[j]]\)clustInfoRecRatio is the “Cluster Information Reduction Ratio” between the jth cluster in kth layer and its children clusters in (k+1)th layer: positive means current cluster is better than its children clusters. Hence, we should keep this cluster at the member of maximal homogeneous partition instead of its children.
Chainarong Amornbunchornvej, Navaporn Surasvadi, Anon Plangprasopchok, and Suttipong Thajchayapong (2021). Identifying Linear Models in Multi-Resolution Population Data using Minimum Description Length Principle to Predict Household Income. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(2), 1-30. https://doi.org/10.1145/3424670