Many numerical analyses are invalid for nominal data: the mode is the only applicable measure of central tendency, and frequency-based tests such as the Chi-square test are the most common statistical analyses that make sense. The NIMAA package (Jafari and Chen 2022) provides a comprehensive pipeline for nominal data mining, which can effectively find label relationships within each nominal variable according to its pairwise association with other nominal variables. Updates can be found on the NIMAA project page.
NIMAA uses a bipartite graph to represent how two different types of data are linked, and stores it as an incidence matrix for subsequent network analysis. Extracting large submatrices with non-missing values and predicting edges are further applications of NIMAA for exploring local and global similarities within the labels of nominal variables.
Next, two unipartite graphs are constructed from the given submatrix using a variety of network projection methods. NIMAA provides several options for clustering the projected networks and selecting the best result based on internal measures and external prior knowledge (ground truth). For weighted bipartite networks, the best clustering result serves as the benchmark for edge prediction analysis: it is used to determine which imputation method best predicts edge weights in the bipartite network, by comparing how similar the clustering results are before and after imputation. Through edge prediction analysis, more information can be extracted from the whole dataset despite the presence of missing values.
In this section, we demonstrate a NIMAA analysis of a weighted bipartite network using the beatAML dataset, one of four datasets included in the NIMAA package (Jafari and Chen 2022). This dataset has three columns: the first two contain nominal variables, while the third contains a numerical variable.
| inhibitor | patient_id | median |
|---|---|---|
| Alisertib (MLN8237) | 11-00261 | 81.00097 |
| Barasertib (AZD1152-HQPA) | 11-00261 | 60.69244 |
| Bortezomib (Velcade) | 11-00261 | 81.00097 |
| Canertinib (CI-1033) | 11-00261 | 87.03067 |
| Crenolanib | 11-00261 | 68.13586 |
| CYT387 | 11-00261 | 69.66083 |
| Dasatinib | 11-00261 | 66.13318 |
| Doramapimod (BIRB 796) | 11-00261 | 101.52120 |
| Dovitinib (CHIR-258) | 11-00261 | 33.48040 |
| Erlotinib | 11-00261 | 56.11189 |
Read the data from the package:
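The dataset ships with the package, so loading it is a one-liner; a minimal sketch, assuming the exported dataset object is named `beatAML` (as in the NIMAA documentation):

```r
# load NIMAA and the bundled beatAML dataset
# (assumes the exported object is named `beatAML`)
library(NIMAA)
beatAML_data <- NIMAA::beatAML
```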
The plotIncMatrix() function prints some information about the incidence matrix derived from the input data, such as its dimensions and the proportion of missing values, and displays an image of the matrix. It also returns the incidence matrix object.
NB: To keep the vignette small enough for CRAN rules, we do not output the interactive figure here.
beatAML_incidence_matrix <- plotIncMatrix(
  x = beatAML_data,        # original data with 3 columns
  index_nominal = c(2, 1), # the first two columns contain nominal data
  index_numeric = 3,       # the third column contains numeric data
  print_skim = FALSE,      # set to TRUE to print the skim summary
  plot_weight = TRUE,      # plot the weighted incidence matrix
  verbose = FALSE          # do not save the figures to a local folder
)
Na/missing values Proportion: 0.2603
Given that we have the incidence matrix, we can easily reconstruct
the corresponding bipartite network. In the NIMAA
package,
we have two options for visualizing the bipartite network: static or
interactive plots.
The plotBipartite()
function customizes the
corresponding bipartite network visualization based on the
igraph
package (Csardi and Nepusz
2006) and returns the igraph object.
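The call that produced the bipartGraph object shown below is not rendered above; a minimal sketch, assuming the incidence matrix is passed via an `inc_mat` argument:

```r
# build and plot the bipartite network from the incidence matrix;
# the returned value is an igraph object
bipartGraph <- plotBipartite(inc_mat = beatAML_incidence_matrix)
```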
# show the igraph object
bipartGraph
#> IGRAPH 0a7dfee UNWB 650 47636 --
#> + attr: name (v/c), type (v/l), shape (v/c), color (v/c), weight (e/n)
#> + edges from 0a7dfee (vertex names):
#> [1] Alisertib (MLN8237) --11-00261 Barasertib (AZD1152-HQPA)--11-00261
#> [3] Bortezomib (Velcade) --11-00261 Canertinib (CI-1033) --11-00261
#> [5] Crenolanib --11-00261 CYT387 --11-00261
#> [7] Dasatinib --11-00261 Doramapimod (BIRB 796) --11-00261
#> [9] Dovitinib (CHIR-258) --11-00261 Erlotinib --11-00261
#> [11] Flavopiridol --11-00261 GDC-0941 --11-00261
#> [13] Gefitinib --11-00261 Go6976 --11-00261
#> [15] GW-2580 --11-00261 Idelalisib --11-00261
#> + ... omitted several edges
The plotBipartiteInteractive()
function generates a
customized interactive bipartite network visualization based on the
visNetwork
package (Almende B.V. and
Contributors, Thieurmel, and Robert 2021).
NB: To keep the size of vignette small enough, we do not output the
interactive figure here. Instead, we show a screenshot of part of the
beatAML
dataset.
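For completeness, the interactive plot is produced by a single call; a sketch, assuming the same `inc_mat` argument as plotBipartite():

```r
# interactive visNetwork-based plot of the bipartite network
plotBipartiteInteractive(inc_mat = beatAML_incidence_matrix)
```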
The NIMAA package contains a function called analyseNetwork() that provides more details about the network topology and common centrality measures for vertices and edges.
analysis_result <- analyseNetwork(bipartGraph)
# show the general measures of network topology
analysis_result$general_stats
#> $vertices_amount
#> [1] 650
#>
#> $edges_amount
#> [1] 47636
#>
#> $edge_density
#> [1] 0.2258433
#>
#> $components_number
#> [1] 1
#>
#> $eigen_centrality_value
#> [1] 15721.82
#>
#> $hub_score_value
#> [1] 247175684
In the case of a weighted bipartite network, the submatrix with the fewest missing values should be used for the next steps, to avoid the sensitivity problems of clustering-based methods. The extractSubMatrix() function extracts submatrices that have non-missing values, or a certain percentage of missing values (not applicable to the elements-max matrix), depending on the arguments. The result is also shown as a plotly plot (Sievert 2020); screenshots for the beatAML dataset are shown below.
The extraction process is performed and visualized in two ways, which
can be chosen depending on the user’s preference: using the original
input matrix (row-wise) and using the transposed matrix (column-wise).
NIMAA
extracts the largest submatrices with non-missing
values or with a specified proportion of missing values (using the
bar
argument) in four ways predefined in the
shape
argument:
Here we extract two shapes of submatrices from the beatAML_incidence_matrix, square and rectangular, with the maximum number of elements:
sub_matrices <- extractSubMatrix(
x = beatAML_incidence_matrix,
shape = c("Square", "Rectangular_element_max"), # the selected shapes of submatrices
row.vars = "patient_id",
col.vars = "inhibitor",
plot_weight = TRUE,
print_skim = FALSE
)
#> binmatnest.temperature
#> 20.12499
#> Size of Square: 96 rows x 96 columns
#> Size of Rectangular_element_max: 87 rows x 140 columns
This function also prints a value called binmatnest.temperature, which indicates the nestedness measure of the input matrix. If the input matrix is highly nested (binmatnest.temperature >= 1), splitting the input matrix into parts and following the analysis in each part independently is recommended.
The findCluster() function performs seven commonly used network clustering algorithms, with optional preprocessing of the input incidence matrix after projecting the bipartite network into unipartite networks. Preprocessing options such as normalization and thresholding help remove weak, low-weight edges. It is also possible to use either binarized or the original edge weights before performing community detection or clustering. Finally, all applied clustering methods can be compared and the best one chosen based on internal or external cluster validity measures, such as the average silhouette width and the Jaccard similarity index. In addition to being displayed in the console, the findings are provided as bar plots.
To begin, let us examine how to use findCluster() to find clusters of patient_ids based on the Rectangular_element_max submatrix of the beatAML bipartite network, projected onto a patient similarity unipartite network.
cls1 <- findCluster(
  sub_matrices$Rectangular_element_max,
  part = 1,
  method = "all",            # all available clustering methods
  normalization = TRUE,      # normalize the input matrix
  rm_weak_edges = TRUE,      # remove the weak edges in the graph
  rm_method = 'delete',      # delete weak edges rather than lowering their weights to 0
  threshold = 'median',      # use the median of edge weights as the threshold
  set_remaining_to_1 = TRUE  # set the weights of the remaining edges to 1
)
Additionally, if external features (prior knowledge) are provided via the extra_feature argument, the function compares the clustering results with the external features in terms of similarity. For instance, let us generate some random features for the patients in the beatAML dataset. The following is an example:
external_feature <- data.frame(row.names = cls1$walktrap$names)
external_feature[, 'membership'] <- paste(
  'group',
  findCluster(sub_matrices$Rectangular_element_max,
              method = c('walktrap'),
              rm_weak_edges = TRUE,
              set_remaining_to_1 = FALSE,
              threshold = 'median',
              comparison = FALSE)$walktrap$membership
)
dset1 <- utils::head(external_feature)
knitr::kable(dset1, format = "html")
| | membership |
|---|---|
| 16-00142 | group 2 |
| 16-00145 | group 2 |
| 16-00157 | group 3 |
| 16-00217 | group 2 |
| 16-00249 | group 2 |
| 16-00265 | group 2 |
Clustering is performed on the patient similarity network again, but this time the extra_feature argument is set to provide external validation for the cluster-finding analysis.
cls <- findCluster(
sub_matrices$Rectangular_element_max,
part = 1,
method = "all",
normalization = TRUE,
rm_weak_edges = TRUE,
rm_method = 'delete',
threshold = 'median',
set_remaining_to_1 = TRUE,
extra_feature = external_feature # add an extra feature reference here
)
#> Warning in findCluster(sub_matrices$Rectangular_element_max, part = 1, method =
#> "all", : cluster_spinglass cannot work with unconnected graph
#>
#>
#> | | walktrap| louvain| infomap| label_prop| leading_eigen| fast_greedy|
#> |:------------------|---------:|---------:|---------:|----------:|-------------:|-----------:|
#> |modularity | 0.0125993| 0.0825865| 0.0000000| 0.0000000| 0.0806766| 0.0825865|
#> |avg.silwidth | 0.2109092| 0.1134990| 0.9785714| 0.9785714| 0.1001961| 0.1134990|
#> |coverage | 0.9200411| 0.5866393| 1.0000000| 1.0000000| 0.5806783| 0.5866393|
#> |corrected.rand | 0.8617545| 0.0603934| 0.2518265| 0.2518265| 0.0430117| 0.0603934|
#> |jaccard_similarity | 0.9310160| 0.4431195| 0.7981967| 0.7981967| 0.4409188| 0.4431195|
As you can see, two new indices (jaccard_similarity and corrected.rand) are included as external cluster validity measures; they represent the similarity between the clustering and the prior knowledge used as ground truth.
The following sections show how to perform clustering analysis on the other part of the beatAML bipartite network, namely inhibitor. In this case, we simply set the part argument to 2.
cls2 <- findCluster(
sub_matrices$Rectangular_element_max, # the same submatrix
part = 2 # set to 2 to use the other part of bipartite network
)
#> Warning in findCluster(sub_matrices$Rectangular_element_max, part = 2):
#> cluster_spinglass cannot work with unconnected graph
#>
#>
#> | | walktrap| louvain| infomap| label_prop| leading_eigen| fast_greedy|
#> |:------------|---------:|---------:|--------:|----------:|-------------:|-----------:|
#> |modularity | 0.0211411| 0.0740154| 0.000000| 0.000000| 0.0794650| 0.0785273|
#> |avg.silwidth | 0.1723698| 0.1264353| 0.954023| 0.954023| 0.1317522| 0.1270078|
#> |coverage | 0.8433993| 0.5633351| 1.000000| 1.000000| 0.5403528| 0.5938001|
In this section, we look at the clustering results, with a particular emphasis on cluster visualization and scoring measures.
The plotCluster()
function makes an interactive network
figure that shows nodes that belong to the same cluster in the same
color and nodes that belong to different clusters in separate
colors.
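As an illustration, plotCluster() can be applied to one of the clustering results above; a sketch, assuming the function accepts the projected graph and a community object returned by findCluster() (the `graph` and `cluster` argument names and the `cls2$graph` field are assumptions):

```r
# interactive figure coloring nodes by cluster membership
# (argument names are assumed, not confirmed by the text above)
plotCluster(graph = cls2$graph, cluster = cls2$louvain)
```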
A Sankey diagram is used to depict the connections between clusters in each part of the bipartite network. The display is also interactive; by grouping the nodes within each cluster into "summary" nodes, the visualClusterInBipartite() function emphasizes how the clusters of each part are connected.
visualClusterInBipartite(
data = beatAML_data,
community_left = cls1$leading_eigen,
community_right = cls2$fast_greedy,
name_left = 'patient_id',
name_right = 'inhibitor')
Based on the fpc package (Hennig 2020) and Almeida et al. (2011), the scoreCluster() function provides additional internal cluster validity measures such as entropy and coverage. The scoring is based on the weight fraction of all intra-cluster edges relative to the total weight of all edges in the network.
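A sketch of such a call, assuming scoreCluster() accepts the community object, the projected graph, and the distance matrix returned by findCluster() (the argument names and `cls2$graph` / `cls2$distance_matrix` fields are assumptions):

```r
# additional internal validity measures (e.g. entropy, coverage)
# for one clustering result; argument names are assumed
scoreCluster(community = cls2$infomap,
             graph = cls2$graph,
             dist_mat = cls2$distance_matrix)
```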
The validateCluster() function calculates the similarity of a given clustering result to the provided ground truth. It provides external cluster validity measures, including corrected.rand and jaccard_similarity.
validateCluster(community = cls$leading_eigen,
extra_feature = external_feature,
dist_mat = cls$distance_matrix)
#> Warning in fpc::cluster.stats(d = dist_mat, clustering = community$membership,
#> : alt.clustering renumbered because maximum != number of clusters
#> $corrected.rand
#> [1] 0.04301166
#>
#> $jaccard_similarity
#> [1] 0.4409188
The predictEdge() function uses the given imputation methods to impute missing values in the input data matrix. The output is a list in which each element is a version of the input data matrix with no missing values, produced by a separate imputation method. The imputation techniques are:
median replaces the missing values with the median of each row (observation),
knn is the method from the bnstruct
package (Franzin, Sambo, and Camillo
2017),
als and svd are methods from the
softImpute
package (Hastie and
Mazumder 2021),
CA, PCA and
FAMD are from the missMDA
package (Josse and Husson 2016),
the others are from the well-known mice
package (van Buuren and Groothuis-Oudshoorn
2011).
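The imputations object used in the validation step is produced by a call such as the following; a sketch, assuming predictEdge() takes the incidence matrix via an `inc_mat` argument and a vector of method names (the four methods listed here match those appearing in the validation output):

```r
# impute the missing values of the incidence matrix with several methods;
# returns a list with one completed matrix per method
imputations <- predictEdge(
  inc_mat = beatAML_incidence_matrix,
  method = c('svd', 'median', 'als', 'CA')
)
```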
The validateEdgePrediction() function compares the clustering results obtained after imputation with a predefined benchmark to assess their similarity. The benchmark was established in the previous step by validating the clustering on submatrices with non-missing values. This function performs the analysis for a number of user-specified imputation methods; a ranking plot also shows how each method ranks. The following measures are used to compare the methods:
Jaccard similarity index
Dice similarity coefficient
Rand index
Minkowski measure (inverted)
Fowlkes–Mallows index
validateEdgePrediction(imputation = imputations,
refer_community = cls1$fast_greedy,
clustering_args = cls1$clustering_args)
#>
#>
#> | | Jaccard_similarity| Dice_similarity_coefficient| Rand_index| Minkowski (inversed)| Fowlkes_Mallows_index|
#> |:------|------------------:|---------------------------:|----------:|--------------------:|---------------------:|
#> |median | 0.7476353| 0.8555964| 0.8628983| 1.870228| 0.8556407|
#> |svd | 0.7224792| 0.8388829| 0.8458376| 1.763708| 0.8388853|
#> |als | 0.7290576| 0.8433005| 0.8510791| 1.794478| 0.8433362|
#> |CA | 0.6935897| 0.8190765| 0.8280576| 1.670030| 0.8191111|
#> imputation_method Jaccard_similarity Dice_similarity_coefficient Rand_index
#> 1 median 0.7476353 0.8555964 0.8628983
#> 2 svd 0.7224792 0.8388829 0.8458376
#> 3 als 0.7290576 0.8433005 0.8510791
#> 4 CA 0.6935897 0.8190765 0.8280576
#> Minkowski (inversed) Fowlkes_Mallows_index
#> 1 1.870228 0.8556407
#> 2 1.763708 0.8388853
#> 3 1.794478 0.8433362
#> 4 1.670030 0.8191111
Finally, the best imputation method can be chosen, allowing for a re-examination of the label relationships or a re-run of the nominal data mining. In other words, users can use this imputed matrix as an input matrix to redo the previous steps of mining the hidden relationships in the original dataset. Note that this matrix has no missing values, hence no submatrix extraction is required.