This is an openaccess article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
It is important but challenging to understand the interactions of multiple chronic conditions (MCC) and how they develop over time in patients and populations. Clinical data on MCC can now be represented using graphical models to study their interaction and identify the path toward the development of MCC. However, the current graphical models representing MCC are often complex and difficult to analyze. Therefore, it is necessary to develop improved methods for generating these models.
This study aimed to summarize the complex graphical models of MCC interactions to improve comprehension and aid analysis.
We examined the emergence of 5 chronic medical conditions (ie, traumatic brain injury [TBI], posttraumatic stress disorder [PTSD], depression [Depr], substance abuse [SuAb], and back pain [BaPa]) over 5 years among 257,633 veteran patients. We developed 3 algorithms that utilize the second eigenvalue of the graph Laplacian to summarize the complex graphical models of MCC by removing less significant edges. The first algorithm learns a sparse probabilistic graphical model of MCC interactions directly from the data. The second algorithm summarizes an existing probabilistic graphical model of MCC interactions when a supporting data set is available. The third algorithm, which is a variation of the second algorithm, summarizes the existing graphical model of MCC interactions with no supporting data. Finally, we examined the coappearance of the 100 most common terms in the literature of MCC to validate the performance of the proposed model.
The proposed summarization algorithms demonstrate considerable performance in extracting major connections among MCC without reducing the predictive accuracy of the resulting graphical models. For the model learned directly from the data, the area under the curve (AUC) performance for predicting TBI, PTSD, BaPa, SuAb, and Depr, respectively, during the next 4 years is as follows—year 2: 79.91%, 84.04%, 78.83%, 82.50%, and 81.47%; year 3: 76.23%, 80.61%, 73.51%, 79.84%, and 77.13%; year 4: 72.38%, 78.22%, 72.96%, 77.92%, and 72.65%; and year 5: 69.51%, 76.15%, 73.04%, 76.72%, and 69.99%, respectively. This demonstrates an overall 12.07% increase in the cumulative sum of AUC in comparison with the classic multilevel temporal Bayesian network.
Using graph summarization can improve the interpretability and the predictive power of the complex graphical models of MCC.
Clinical data on multiple chronic conditions (MCC) are often complex [
Thus, instead of using a fully/densely connected network for analysis, choosing a network with fewer but more informative connections can improve the training and querying process. However, the main questions are as follows: (1) What are the least/most informative parts of the graphical models? (2) How can such information be leveraged to summarize graphical models without losing considerable predictive/inference accuracy? and (3) How can an algorithm of this type be applied to learn a compact graph directly from the data? Effective summarization algorithms are the ones that preserve the most important structures of the original graphical model, focus on major patterns/aspects of the data, and maintain the original graph distribution (the conditional probability distribution of the original graph). They should also be capable of querying or identifying substructures/patterns in a specific set of nodes/triads (local queries) of the graph structures as well as the complete graph (global queries) to study the global influence of conditioned states.
Learning sparse graphical models directly from emergence data on multiple chronic conditions using (a) the unsummarized graphical model (λ=0) and (b) the summarized graphical model using the EAGL structure learning algorithm (λ=1000) in which each node is a binary (0,1) variable representing the status (presence or absence) of a chronic condition in a particular year, that is, TBIY1 denotes the status of traumatic brain injury at year1 (base year) and BaPaY5 denotes the status of back pain in year5. BaPa: back pain; TBI: traumatic brain injury; PTSD: posttraumatic stress disorder; SuAb: substance abuse; MCC: multiple chronic conditions; EAGL: eigenvalue analysis of the graph Laplacian.
(a) Unsummarized probabilistic graphical model of the emergence of MCCs. (b) The summarized probabilistic graphical model using the EAGL summarization algorithm at a 20% summarization rate. Each node is a binary (0,1) variable representing the status (presence or absence) of a chronic condition in a particular year, that is, TBIY1 denotes the status of TBI at year1 (base year), and BaPaY5 denotes the status of BaPa in year5. BaPa, back pain. In the figure, BaPa: back pain; TBI: traumatic brain injury, PTSD: posttraumatic stress disorder and SuAb: substance abuse, MCC: multiple chronic condition, EAGL: eigen analysis of graph Laplacian.
Graph summarization is also affected by factors such as data volume and complexity (structure, heterogeneity, and abstraction), dynamic/static nature of the graph, efficiency of the inference procedure, and computational complexity of the summarization approach [
Clusteringbased approaches, which aggregate nodes into supernodes and connect them using superedges, including spectral clustering [
Communitybased approaches, which aggregate all the nodes that belong to the same community and superimpose edge weights by summing up the weights of the original edges [
Simplificationbased approaches, which remove less important nodes/edges, including OntoVis [
Pattern set mining approaches, which create subgraphs based on the extracted patterns, including VNM [
Node/edge immunization/deletion approaches, which select the best flow of the information from the source to the destination node, including MIOBI [
In this work, we propose a graph summarization approach that utilizes the second eigenvalue analysis of the graph Laplacian (EAGL) to identify and prune less informative edges of the complex graphical models of MCC interactions. The intuition behind the proposed EAGL criterion is that the eigenvalue of the graph Laplacian of a graphical model is an effective measure of the connectivity and information flow [
Visual representation of the proposed EAGL Algorithm for summarizing a directed probabilistic graphical model based on an available dataset.
A probabilistic graphical model is specified as a tuple,
The decomposition property makes the Bayesian inference process simple. This model is also known as the recursive model. Here, we use binary variables (nodes) representing having or not having a chronic condition (TBI, PTSD, BaPa, Depr, and SuAb) for the probabilistic graphical models.
The graph Laplacian is a matrix representation of a graph, which can be used to study various properties of a graph. The first and second smallest eigenvalue of the graph Laplacian can be used to extract useful information such as graph communities (first smallest eigenvalue) and sparsest cut in a graph (second smallest eigenvalue) [
For a directed graph, we can consider both in and outdegree to form the degree matrix [
Algorithm for summarizing while learning the structure of the probabilistic graphical models directly from data.
Algorithm for summarizing an existing probabilistic graphical model with supporting data.
Excluding the step/s to update the remaining conditional dependencies in
To avoid creating isolated nodes or islands (cluster of isolated nodes) that affect the accuracy of inference and prediction (especially in temporal graphical models), we use graph traversal methods, specifically depthfirst search (DFS) [
Considering the consecutive time instances of the dynamic graph, that is, t and t + 1, as a static graph, and applying appropriate structural constraints as discussed above, that is, DFS, the EAGL algorithm can be used to summarize dynamic graphical models as well.
The relationship among the emergence of MCC can be expressed effectively using probabilistic graphical models, where nodes represent the emergence of chronic conditions, that is, BaPa, Depr, and so on, and edges show the statistical relationship (conditional dependency) between them (BaPa and Depr). Here, we are interested in sparse learning of the structure and parameters of the probabilistic graphical model using the EAGL algorithm based on an available data set of the emergence of MCC. Our deidentified data were collected from a large national cohort of US military veteran patients (N=608,503), who were deployed in support of the wars in Afghanistan and Iraq and began receiving care in the Veterans Health Administration (VA) between 2002 and 2011. For the purpose of this analysis, we have only considered patients who received care each year for the first 5 years after entering VA care (N=257,633). Dropout may result from not requiring care, dropping out of VA care, or death. This study received institutional review board approval from the University of Texas Health Science Center at San Antonio and the Bedford VA Hospital, with a waiver of informed consent. A summary of the study population is shown in
Demographics of the patients included in the study.
Demographics  Serial number  

1  2  3  4  5  6  
Race  White  Black  Hispanic  Asian  Native  Unknown  



Male  148,355 (57.58)  35,758 (13.88)  25,373 (9.85)  5639 (2.19)  3081 (1.20)  2135 (0.83) 

Female  19,183 (7.45)  11,828 (4.59)  4232 (1.64)  981 (0.38)  707 (0.27)  361 (0.14) 



Married  74,487 (28.91)  23,308 (9.05)  14,523 (5.64)  3067 (1.19)  1747 (0.68)  1346 (0.52) 

Unmarried  93,051 (36.12)  24,278 (9.42)  15,082 (5.85)  3553 (1.38)  2041 (0.79)  1150 (0.45) 



1830  96,799 (37.57)  20,047 (7.78)  17,016 (6.60)  3235 (1.26)  2115 (0.82)  1062 (0.41) 

3140  36,003 (13.97)  12,468 (4.84)  6606 (2.56)  1361 (0.53)  925 (0.36)  625 (0.24) 

4150  26,167 (10.16)  12,710 (4.93)  4758 (1.85)  1564 (0.61)  564 (0.22)  673 (0.26) 

≥51  8569 (3.33)  2361 (0.92)  1225 (0.48)  460 (0.18)  184 (0.07)  136 (0.05) 



Unknown  2334 (0.91)  658 (0.26)  386 (0.15)  131 (0.05)  60 (0.02)  51 (0.02) 

Less than high school  2037 (0.79)  504 (0.20)  360 (0.14)  60 (0.02)  60 (0.02)  22 (0.01) 

High school graduate  129,921 (50.43)  37,506 (14.56)  23,592 (9.16)  4732 (1.84)  3004 (1.17)  1808 (0.70) 

Some college  16,743 (6.50)  4819 (1.87)  2933 (1.14)  598 (0.23)  376 (0.15)  287 (0.11) 

College graduate  12,024 (4.67)  3160 (1.23)  1893 (0.73)  879 (0.34)  217 (0.08)  223 (0.09) 

Post college education  4479 (1.74)  939 (0.36)  441 (0.17)  220 (0.09)  71 (0.03)  105 (0.04) 
The EAGL algorithm begins with a DAG structure provided by a scorebased algorithm [
To evaluate the model, the area under the curve (AUC) of the receiver operating characteristic (ROC) curve [
The area under the curve performance of the sparse probabilistic graphical model learned by the eigenvalue analysis of the graph Laplacian algorithm directly from the data with different choices of tuning parameters (λ=0,10^{–2},10^{–1},...,10^{5}) for predicting future comorbidities (year 2 to year 5), given the comorbidity information of the past year (year 1), along with the area under the curve performance of a comparing algorithm, namely, Akaike information criterion (AIC) as well as the associated summarization ratios.
Prediction year  Lambda  

0.00  0.01  0.10  1.00  10.00  100.00  1000.00  10000.00  100000.00  AIC  



TBI^{a}  75.69  75.69  75.69  75.88  76.69  79.57  79.91  79.88  79.88  59.49 

PTSD^{b}  78.97  78.97  79.08  79.53  81.31  83.11  84.04  83.70  83.70  63.45 

BaPa^{c}  63.16  63.16  63.16  63.63  48.57  78.29  78.83  78.82  78.82  78.51 

SuAb^{d}  72.93  72.93  73.04  73.33  75.22  74.58  82.50  85.00  85.00  61.32 

Depr^{e}  68.24  68.24  68.27  68.45  71.02  74.26  81.47  81.61  81.61  59.05 



TBI  72.22  72.22  72.19  72.63  74.82  76.28  76.23  76.11  76.11  62.28 

PTSD  76.01  76.01  76.02  76.84  78.71  80.11  80.61  80.35  80.35  61.95 

BaPa  60.92  60.92  60.98  61.82  70.07  73.15  73.51  73.84  73.84  73.27 

SuAb  70.80  70.80  70.82  71.02  73.62  68.51  79.84  81.13  81.13  61.83 

Depr  65.16  65.16  65.18  65.93  69.01  70.48  77.13  77.10  77.10  56.09 



TBI  70.86  70.86  70.81  71.11  72.96  73.20  72.38  72.39  72.39  60.71 

PTSD  73.21  73.21  73.35  74.11  75.97  78.00  78.22  77.84  77.84  61.97 

BaPa  61.81  61.81  61.82  62.50  69.96  72.98  72.96  72.61  72.61  72.84 

SuAb  68.97  68.97  68.82  68.34  70.88  74.96  77.92  79.64  79.64  60.72 

Depr  64.73  64.73  64.79  65.09  67.29  68.00  72.65  73.54  73.54  56.23 



TBI  70.88  70.88  70.92  71.78  72.50  73.47  69.51  69.38  69.38  59.70 

PTSD  72.72  72.72  72.88  73.43  75.21  76.50  76.15  74.86  74.86  60.59 

BaPa  53.46  53.46  53.41  54.09  69.63  72.64  73.04  68.52  68.52  72.30 

SuAb  63.73  63.73  63.74  61.34  63.46  73.65  76.72  77.26  77.26  61.46 

Depr  64.01  64.01  64.11  64.87  66.58  67.89  69.99  71.37  71.37  56.07 



Edges, n  141  140  139  128  107  73  39  24  24  47 

Edge removal (%)  0.00  0.71  1.42  9.22  24.11  48.23  72.34  82.98  82.98  66.67 
^{a}TBI: traumatic brain injury.
^{b}PTSD: posttraumatic stress disorder.
^{c}BaPa: back pain.
^{d}SuAb: substance abuse.
^{e}Depr: depression.
The relationship between the change in the tuning parameter (λ) and the second eigenvalue.
The relationship between the change in the tuning parameters (λ) and the area under the curve: (a) year2; (b) year3; (c) year4; (d) year5 of the study. BaPa: back pain; TBI: traumatic brain injury, PTSD: posttraumatic stress disorder and SuAb: substance abuse, MCC: multiple chronic conditions, EAGL: eigen analysis of graph Laplacian.
In many reallife situations, we are given a graphical model that could potentially be simplified. The EAGL algorithm, which is based on the second eigenvalue of the graph Laplacian, can be used to identify and prune insignificant edges of the graph to achieve the desired level of summarization. The EAGL algorithm begins by calculating the second eigenvalue of the graph Laplacian of the given graphical model. It then extracts the DFS tree to determine the edges to avoid isolated nodes. Next, from the set of edges that is not lying on the DFS tree, the algorithm (temporarily) removes edges one at a time and calculates the percentage of the change in the second eigenvalue of the remaining graph Laplacian. Subsequently, it (permanently) removes the edge, resulting in a minimum change in the second eigenvalue of the graph Laplacian. The last 2 steps are repeated until a stopping criterion is met. Once the summarized network structure is attained, the weight S of the edges (conditional probabilities) are estimated using a standard parameter estimation algorithm [
Here, we are interested in summarizing an existing probabilistic graphical model of MCC relationships attained using a scorebased method [
Although increasing the summarization ratio generally results in a sparser graphical model, for mild summarization ratios (<10%), using EAGL can also improve the predictive performance of the graphical model by preserving more informative edges/paths as it should. However, a large choice of summarization ratios (>10%) can decrease the predictive performance, depending on the topological location of the node (chronic conditions) and the associated edges that have been pruned (
The area under the curve performance of the original and summarized probabilistic graphical models at different summarization ratios (1%, 5%, 10%, and 20%) for predicting future comorbidities (year 2 to year 5), given the comorbidity information of the past year (year 1).
Prediction year  EAGL^{a}  MIOBI [ 
CHEETAH [ 


Original  1.00  5.00  10.00  20.00  Original  1.00  5.00  10.00  20.00  Original  1.00  5.00  10.00  20.00  



TBI^{b}  75.69  75.63  75.53  75.09  63.34  75.69  75.78  75.63  63.99  56.08  75.69  75.70  65.25  63.95  61.80  

PTSD^{c}  78.97  78.94  80.20  80.87  81.51  78.97  79.32  80.57  70.86  71.04  78.97  79.12  79.54  80.19  71.15  

BaPa^{d}  63.16  63.18  61.05  61.37  65.53  63.16  62.99  63.14  62.44  63.25  63.16  63.20  63.43  64.16  64.80  

SuAb^{e}  72.93  72.95  75.74  70.59  68.26  72.93  72.93  72.88  73.54  69.99  72.93  73.10  73.78  73.96  74.29  

Depr^{f}  68.24  68.23  70.48  62.88  59.27  68.24  68.24  68.03  66.20  55.50  68.24  68.36  68.51  68.74  70.22  



TBI  72.22  72.25  72.15  72.13  71.78  72.22  72.24  72.36  70.20  59.82  72.22  72.22  64.62  63.54  61.33  

PTSD  76.01  75.97  77.37  77.81  78.57  76.01  76.40  77.89  75.34  76.60  76.01  76.15  76.90  76.78  69.36  

BaPa  60.92  60.98  59.32  59.49  61.88  60.92  60.96  61.12  61.34  58.44  60.92  61.04  61.29  61.49  61.80  

SuAb  70.80  70.81  72.49  70.54  67.58  70.80  70.73  70.72  70.70  71.08  70.80  70.83  71.53  71.73  71.97  

Depr  65.16  65.20  66.95  68.29  58.28  65.16  65.13  65.16  64.45  64.24  65.16  65.37  65.84  65.99  66.85  



TBI  70.86  70.78  71.10  70.76  70.48  70.86  70.82  70.82  69.41  68.20  70.86  70.27  64.85  64.08  61.94  

PTSD  73.21  73.18  74.29  74.95  76.02  73.21  73.52  74.99  72.92  74.51  73.21  73.34  73.88  74.11  67.64  

BaPa  61.81  61.81  60.24  60.38  63.01  61.81  61.76  62.01  62.62  59.29  61.81  61.93  62.14  62.43  63.03  

SuAb  68.97  69.00  70.28  69.13  68.24  68.97  68.97  68.73  69.28  70.99  68.97  69.14  69.53  69.78  70.24  

Depr  64.73  64.71  65.48  65.53  61.55  64.73  64.73  64.63  63.97  64.03  64.73  64.88  65.09  65.25  65.68  



TBI  70.88  70.91  71.55  71.63  70.74  70.88  70.82  70.91  70.53  69.14  70.88  70.52  66.02  65.25  63.35  

PTSD  72.72  72.70  73.55  73.75  74.34  72.71  72.95  73.91  72.24  73.61  72.71  72.85  73.19  73.39  67.30  

BaPa  53.46  53.49  51.20  50.58  50.67  53.46  53.21  53.11  53.03  49.26  53.46  53.51  53.50  53.73  53.01  

SuAb  63.73  63.76  64.35  62.59  59.70  63.73  63.61  63.46  63.07  61.47  63.73  63.90  63.99  63.87  64.12  

Depr  64.01  63.99  64.78  64.50  61.23  64.01  63.96  63.74  63.38  63.32  64.01  64.19  64.50  64.70  65.19 
^{a}EAGL: Eigenvalue analysis of the graph Laplacian.
^{b}TBI: traumatic brain injury.
^{c}PTSD: posttraumatic stress disorder.
^{d}BaPa: back pain.
^{e}SuAb: substance abuse.
^{f}Depr: depression.
Decrease in the second eigenvalue with reduction in the number of edges.
The relationship between the changes in the tuning parameters (λ) and the area under the curve in the second eigenvalue (λ) over (a) year2, (b) year3, (c) year4, and (d) year5 of the study. BaPa: back pain; TBI: traumatic brain injury; PTSD: posttraumatic stress disorder; SuAb: substance abuse; MCC: multiple chronic conditions; EAGL: eigenvalue analysis of the graph Laplacian.
A lexicon graph contains a list of stems and affixes, together with basic information about them in the form of a graphical model. This is generally used to represent interconnected word pairs and their frequencies in natural language processing. Here, we are interested in exploring the opportunity to summarize a graphical model of MCCrelated terms (Lexicon graph) with no supporting data using the EAGL algorithm. The graphical model was developed based on a lexicon graph from a collection of medical journals. The journals were extracted using the following keywords: Veterans, Traumatic Brain Injury, Back Pain, PostTraumatic Stress Disorder, Depression, Substance Abuse, Chronic Diseases, Comorbidity, Multimorbidity, chronic conditions, chronic illness, and chronic pain. A total of 20 peerreviewed journal papers were collected based on Google Scholar ranking (without expert opinion).
(a) Lexicon graph of the top 200 most frequent word pairs attained from text mining of 20 medical journal papers; (b) lexicon graph after summarization algorithm (70% summarization) was performed in the graph. OEF: operation enduring freedom; OIF: operation Iraqi freedom.
Summarized lexicon graph in
In this section, we derive the time complexity of algorithms shown in
Graphical models are increasingly being used for descriptive, predictive, and prescriptive analytics in various applications, including social media, computer networks, genetics, and disease prognosis [
In this study, we have addressed the problem of summarizing complex graphical models and identifying their important patterns by modifying the edges of the graph. These types of graphical frameworks are useful for analyzing plausible interactions between disease states [
The first approach (using direct learning) results in a refined model where network analysis can be performed by an end user with specific needs and expertise. Our direct learning model (
The second approach (
Our findings revealed that the proposed summarization algorithm can indeed improve the predictive accuracy of the summarized graphical model while reducing its size and increasing the inference efficiency. We used 2 data sets of (1) 257,633 veteran patients who have been monitored for the emergence of 5 multiple conditions (TBI, PTSD, BaPa, Dep, and SuAb) over 5 years and (2) the coappearance of the 200 most frequent word pairs in the literature of MCC to validate the performance of the proposed EAGL approach.
Although the statistical details of the proposed model might be complex for some practitioners to understand, the resulting algorithm can be seen as a step toward creating more interpretable analytical models for understanding the evolution of MCC, by removing less informative edges in complex networks of MCC (resulting in a sparser network), without losing predictive accuracy. In fact, practitioners do not need to know the details of the proposed algorithm to utilize it. They can use a simple tuning parameter (λ) to control the level of resulting network sparsity (number of remaining edges), that is, setting a high value for the tuning parameter results in a very sparse network (with few edges), which is easy to understand (
Finally, the proposed EAGL approach can help medical practitioners and health care analysts not only in terms of developing a predictive tool to analyze the probability of a new chronic condition development, given the existing conditions (
Although the proposed EAGL algorithm successfully extracts important connections and controls the level of sparsity, it has a few limitations and potential problems. Algorithm presented in
In this work, we propose a graph summarization approach that utilizes the second eigenvalue of the graph Laplacian to identify and prune less informative edges of the complex graphical models of MCC interaction. We developed 3 algorithms based on the proposed approach to deal with different scenarios with respect to the availability of data and/or a graphical model. The first algorithm learns a sparse graphical model of MCC interactions directly from the data by regularizing an existing scorebased structure learning algorithm to achieve a desired level of sparsity. The second algorithm summarizes an existing graph of MCC interactions by removing less informative connections with respect to a supporting data set. The third algorithm simplifies a given MCC graph by removing the less important edges without a supporting data set. We validated the performance of the first 2 algorithms based on a large data set of veteran patients who have been monitored for over 5 years and 5 multiple chronic medical conditions, including PTSD, TBI, Depr, BaPa, and SuAb. We also validated the third algorithm based on a data set of coappearances of the 200 most frequent word pairs in the literature of MCC. The results showed that the proposed EAGLE algorithm effectively extracts important connections and dependency patterns from the complex graphical model of the interactions of MCC. It can also control the level of sparsity in the resulting graph based on the practitioners’ needs using a simple tuning parameter. Finally, it improves the predictive accuracy of the resulting summarized graphical model.
Results of the eigenvalue analysis of the graph Laplacian algorithm based on the first eigenvalue.
A sample example of the eigenvalue analysis of the graph Laplacian algorithm for a small graph.
Multiple chronic conditions term lexicon.
Akaike information criterion
area under the curve
back pain
Bayesian network
directed acyclic graph
depression
depthfirst search
eigenvalue analysis of the graph Laplacian
multiple chronic conditions
maximum weight spanning tree
posttraumatic stress disorder
receiver operating characteristic
substance abuse
traumatic brain injury
Veterans Health Administration
This research work was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number 1SC2GM11826601 and the US Department of Veterans Affairs funds, I01HX000329 and IK6HX002608 to MP. The sponsors played no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the US Government or the US Department of Veterans Affairs, and no official endorsement should be inferred.
SF, AA, and SS developed the EAGL algorithms. SF preprocessed the data, coded the algorithms, and conducted the numerical studies. CC prepared the terms for the MCC database and
None declared.