Published in Vol 10, No 4 (2022): April

Exploring Patient Multimorbidity and Complexity Using Health Insurance Claims Data: A Cluster Analysis Approach


Original Paper

1Center for Primary Care and Public Health (Unisanté), University of Lausanne, Lausanne, Switzerland

2Groupe Mutuel, Martigny, Switzerland

3Department of Actuarial Science, Faculty of Business and Economics, and Swiss Finance Institute, University of Lausanne, Lausanne, Switzerland

*these authors contributed equally

Corresponding Author:

Anna Nicolet, PhD

Center for Primary Care and Public Health (Unisanté)

University of Lausanne

Route de la Corniche

Lausanne, 1010


Phone: 41 21 314 23 4


Background: Although the trend of progressing morbidity is widely recognized, there are numerous challenges when studying multimorbidity and patient complexity. For multimorbid or complex patients, prone to fragmented care and high health care use, novel estimation approaches need to be developed.

Objective: This study aims to investigate the patient multimorbidity and complexity of Swiss residents aged ≥50 years using clustering methodology in claims data.

Methods: We adopted a clustering methodology based on random forests and used 34 pharmacy-based cost groups as the only input feature for the procedure. To detect clusters, we applied hierarchical density-based spatial clustering of applications with noise. Reasonable hyperparameters were chosen based on various metrics embedded in the algorithms (out-of-bag misclassification error, normalized stress, and cluster persistence) and the clinical relevance of the obtained clusters.

Results: Based on cluster analysis output for 18,732 individuals, we identified an outlier group and 7 clusters: individuals without diseases, patients with only hypertension-related diseases, patients with only mental diseases, complex high-cost high-need patients, slightly complex patients with inexpensive low-severity pharmacy-based cost groups, patients with 1 costly disease, and older high-risk patients.

Conclusions: Our study demonstrated that cluster analysis based on pharmacy-based cost group information from claims-based data is feasible and highlights clinically relevant clusters. Such an approach allows expanding the understanding of multimorbidity beyond simple disease counts and can identify the population profiles with increased health care use and costs. This study may foster the development of integrated and coordinated care, which is high on the agenda in policy making, care planning, and delivery.

JMIR Med Inform 2022;10(4):e34274



Health care systems worldwide are facing considerable challenges from the increasing number of chronic and multimorbid patients, characterized by complex needs and frequent transitions between care settings [1]. In Switzerland, 2.2 million people report a chronic disease and nearly 20% of the population older than 50 years have multiple chronic diseases (multimorbidity) [2]. Although the trend of progressing multimorbidity is widely recognized [3-6], it is still unclear how best to take care of patients with multimorbidity and which interventions would be effective. For more than two decades, integrated and coordinated care has been developed worldwide [7]. Nevertheless, integrated and coordinated care faces continuing challenges such as scaling-up, implementation, and sustainability difficulties. Additionally, integrated and coordinated care requires the development of novel approaches to evaluate and measure patients’ multimorbidity and complexity. This is key to stratifying the targeted population and adapting the intervention to the needs of the patients. Often, such evaluations and measures rely on morbidity indices (eg, Charlson and Elixhauser) or on the number of (self-reported) chronic conditions or comorbidities [8]. Whereas the former were developed in an inpatient setting as predictors of mortality, the latter may not comprehensively reflect the patient’s disease burden and complexity. Despite these limitations, they remain widely used because of their relative accessibility and simplicity. In settings where electronic medical (health) records, national disease registries, or data on chronic conditions are unavailable, administrative health insurance claims data represent a potentially useful source of information. In fact, they are increasingly used in health services research, especially to express multimorbidity using pharmacy-based cost groups (PCGs) [9,10].
PCGs, based on the use of prescribed drugs rather than on clinical information, were developed as a proxy measure of morbidity [11]. The approach has limitations: drug use is underestimated when medicines are unclaimed or paid out-of-pocket and thus absent from the data, and it assumes that a drug is used exclusively for treating one particular condition [11,12]. Nevertheless, it allows mapping patient profiles to reflect their morbidity status. As such mapping approaches and comorbidity counts are considered simplistic [13], researchers may consider alternative methods to investigate patient complexity more exhaustively. One such method is cluster analysis, which relies on the idea that many common conditions cluster together in the population in predictable patterns [13]. It has been shown that cluster analysis of real-world data for drug use research can be used for detecting clinically plausible subgroups [14]. Similar approaches of classification based on multimorbidity patterns have been applied in the literature [14-16], but using PCGs as the multimorbidity indicator for cluster analysis is novel. In that context, the aim of our study is to investigate patient multimorbidity and complexity beyond simple mapping and counts of PCGs, using clustering methodology in claims data of Swiss residents aged ≥50 years.

Data Source and Sample

We included data of 240,511 insured people aged ≥50 years continuously enrolled in one of the largest health insurance companies in Switzerland, Groupe Mutuel, for the 2015-2018 period. In addition to demographic information (age and gender), the data contained the PCGs for each individual, costs covered by the patient (cost sharing), type of health insurance model (with or without gatekeeping), and reimbursed health care services (number of visits to various physicians with associated costs, physicians’ specialization, and hospitalizations). To identify insured persons with cost-intensive, chronic diseases and correspondingly high health care use based on their drug consumption, health insurance companies translate the drug use data, which reflect active ingredient and quantity based on the Anatomical Therapeutic Chemical classification and defined daily dose, into PCGs. This procedure was developed and officially accepted by the Federal Office of Public Health in Switzerland [17]. In our study, patients were classified as multimorbid when they were assigned two or more PCGs, based on their yearly drug use.
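The multimorbidity rule stated above (two or more PCGs in a year) can be illustrated with a minimal sketch. The indicator matrix below is a made-up toy example, not study data, and the variable names are ours rather than the authors'.

```python
import numpy as np

# Hypothetical toy data: one row per insured person, one column per PCG,
# 1 = the PCG was assigned for the year, 0 = it was not.
pcg = np.array([
    [0, 0, 0, 0],   # no PCG
    [1, 0, 0, 0],   # a single PCG
    [1, 1, 0, 0],   # two PCGs
    [0, 1, 1, 1],   # three PCGs
])

n_pcgs = pcg.sum(axis=1)      # number of PCGs per person
multimorbid = n_pcgs >= 2     # rule from the text: two or more PCGs
share = multimorbid.mean()    # proportion of multimorbid persons

print(n_pcgs.tolist(), multimorbid.tolist(), share)
```

The same per-person count is what a "Number of PCGs, mean" row would average, and the mean of the flag gives the multimorbid share.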

Ethical Considerations

Data were deidentified by the insurance company to guarantee anonymization, and ethical approval for this study was waived by the Cantonal Commission for the Ethics of Research on Human Beings (Lausanne, Switzerland).

Cluster Analysis

We adopted a clustering methodology based on random forests (RFs) [18]—a popular classification and regression tree-based method—that includes several steps and machine learning algorithms [19-21]. The methodology is inspired by a clustering methodology designed by Breiman and Cutler [19], the creators of RFs [20,21].

In a preprocessing step, we extracted 34 PCGs as the only input feature for the clustering procedure. We grouped the 34 PCGs into 15 disease categories, which were deemed meaningful from a clinical perspective (Multimedia Appendix 1). We then considered the first year of information only and extracted a 10% random sample to allow for effective processing of the computationally expensive steps. To confirm the results, the random sampling was performed multiple times, which led to similar clusters. Finally, we discarded points showing no PCG or only one type of PCG. Since we ultimately use an algorithm that detects clusters based on the density given by the distances between points, the presence of many identical points at the same positions may perturb the algorithm and make the computation unnecessarily expensive. Keeping a small random sample of these points would reduce the perturbation without changing the results, but it would add a dispensable complication, notably for the hyperparameter selection needed to detect these additional clusters.

To initiate the clustering procedure, we created a synthetic data set of the same size as the original data by random sampling from the distributions of each input variable within the data. The idea is then to train an RF model to classify synthetic and original points, with the aim of taking advantage of the proximity measure, an embedded RF metric of similarity between points. An RF aggregates the predictions of multiple decision trees (DTs) by taking the class predicted by the majority of trees. DTs are classification models that separate the data points into subspaces (leaves) by imposing thresholds on the input variables and predicting the class within each subspace as the majority class. The proximity between two points is then computed as the number of times they fall in the same leaf across the trees in the forest. To stabilize the random effects of RFs, we trained 10 RF models, computed the proximities for all pairs of points for each model, and averaged them to obtain a mean proximity matrix characterizing the data. We then used multidimensional scaling (MDS) [22] to project the corresponding distance matrix (1 – proximity matrix / number of trees) into 2D while preserving the distances, which allows for visualization of the resulting clusters. Finally, we applied hierarchical density-based spatial clustering of applications with noise (HDBSCAN) [23] to detect clusters within the obtained 2D data, after discarding the synthetic points from the data. HDBSCAN extracts clusters as dense gatherings of points separated by sparse regions with few points.

Given that no cross-validation is possible with clustering methodologies, reasonable hyperparameters were chosen for the RF, MDS, and HDBSCAN steps based on various metrics embedded in the algorithms and the clinical relevance of the obtained clusters. The metrics include the out-of-bag (OOB) misclassification error, which shows how well the RF differentiates the original data from the synthetic data; the outcome reflects how much structure there is in the data [19]. Other metrics were the normalized stress, which measures whether the distances between points are reasonably preserved after projection [22], and the cluster persistence, an HDBSCAN-embedded metric indicating how well the clusters are defined and separated from each other [23]. In practice, we used the HDBSCAN library for the final clustering and the Scikit-learn library for all previous steps (both in Python).

After discarding individuals with missing information, our data set comprised 18,732 individuals (points). An initial examination of the data set revealed three large “single” clusters that we extracted prior to the clustering procedure: individuals with no PCGs, with only hypertension PCGs, and with only mental disease PCGs, representing 67.9% (n=12,720), 9.7% (n=1813), and 4.1% (n=765) of the population, respectively. Clustering analyses, performed on the remaining 3434 patients not included in these “single” clusters, identified four distinct clusters: Cluster 0 to Cluster 3, numbered in the order in which they were detected while applying HDBSCAN (Figure 1). The clusters can be clearly visualized in the condensed tree (Figure 2), and a good persistence of 0.29, 0.24, 0.15, and 0.24, respectively, was found. The average OOB misclassification error from the 10 RFs was 0.51, which is quite high, showing that the RF does not differentiate well between the original and the synthetic data and that there is not much structure in the data. Regarding the MDS, the normalized stress was 0.31, indicating reasonable preservation of the distances between points.

The 4 detected clusters encompass different mixes of PCGs (Table 1 and Figure 3): Cluster 0 comprises a large mix of PCGs (mental + hypertension + pain + asthma [chronic obstructive pulmonary disease]) often appearing jointly; Cluster 1 comprises PCGs (thyroid, hypertension, glaucoma, and mix of others) appearing jointly less often; Cluster 2 comprises asthma, Parkinson, cardiac diseases, and pain rarely appearing jointly; and Cluster 3 comprises a large mix of PCGs almost never appearing jointly (single diseases).

The following description and interpretation of clusters is based on the descriptive statistics of health care use and costs data (Table 1), which help to understand the underlying principle of grouping individuals into PCG clusters. First, the members of Cluster 0 (n=817, 4.4%) had the highest number of PCGs and the highest costs and health care use, and were referred to as “complex high-cost high-need patients” (for a detailed description, see Table 1). The degree of complexity in this setting was reflected by the combination of the following characteristics interpreted from descriptive statistics (Table 1): average number of PCGs, percentage of multimorbid patients, levels of health care use (eg, number of doctor consultations and hospital stays), and costs in the population subgroup. The members of Cluster 1 (n=709, 3.8%), although having multiple PCGs, had lower health care costs and use than those of Cluster 0; thus, they were referred to as “slightly complex with inexpensive low-severity PCGs.” The members of Cluster 2 (n=531, 2.8%) were the oldest and presented especially high use of hospitalizations and visits to the generalist doctor and, thus, were referred to as “oldest at high risk.” High risk, interpreted in this setting from the descriptive statistics, was reflected by relatively high use of hospital care, yet lower than in the most complex cluster: long length of stay (5.6 and 6.6 nights for clusters “Oldest at high risk” and “Complex high-cost high-need,” respectively) and high inpatient costs (CHF 2749 [US $2950] and CHF 3109 [US $3333], respectively). The members of Cluster 3 (n=1056, 5.6%) were characterized by a relatively small number of PCGs (close to 1) and the highest costs of medications and, thus, were referred to as “patients with 1 costly disease.”
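As a quick consistency check, the group sizes and percentages reported in the text can be recomputed from the counts alone. The counts below are taken from the text; the number of outliers is not stated there and is derived here by subtraction, so it should be read as an inference rather than a reported figure.

```python
total = 18732  # individuals after discarding those with missing information

# "Single" clusters extracted before the clustering procedure.
single = {"No PCGs": 12720, "Only hypertension": 1813, "Only mental diseases": 765}
# Clusters detected by HDBSCAN among the remaining patients.
clustered = {"Cluster 0": 817, "Cluster 1": 709, "Cluster 2": 531, "Cluster 3": 1056}

remaining = total - sum(single.values())          # patients entering the clustering
outliers = remaining - sum(clustered.values())    # derived, not reported in the text

# Share of each group in the full population, in percent, one decimal place.
shares = {name: round(100 * count / total, 1)
          for name, count in {**single, **clustered}.items()}

print(remaining, outliers)
print(shares)
```

The recomputed values match the reported 3434 remaining patients and the percentages 67.9, 9.7, 4.1, 4.4, 3.8, 2.8, and 5.6.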

Figure 1. MDS projection of the data in two dimensions. The four clusters found by HDBSCAN are marked by the different colors and coded with the labels 0, 1, 2, and 3. The code –1 refers to the outliers. HDBSCAN: hierarchical density-based spatial clustering of applications with noise; MDS: multidimensional scaling.
Figure 2. Condensed tree resulting from the hierarchical density-based spatial clustering of applications with noise algorithm performed on the data. Note: similar to a classical dendrogram in a hierarchical clustering setting, the first yellow rectangle represents the entire data, which is split into two parts (called “branches”) when we reduce the maximum distance allowed between points within each branch (λ value = 1 / distance). Each rectangle represents a subpart of the data after a split, with a size proportional to the number of data points in the subpart. The entire data splits into cluster 0 and the green rectangle, which further splits into cluster 1 and a turquoise rectangle when we reduce the distance allowed. The 4 detected clusters (signified by a circle and their number) are the branches that persist the most (do not split further, according to various rules of the algorithm) when the imposed maximum distance between points decreases while keeping a minimum size. The persistence is proportional to the length of the rectangles along the vertical axis. The tree can be interpreted as an upside-down probability distribution function, with each cluster being a peak in the distribution.
Table 1. Descriptive statistics of clusters.
Columns: All data; Outliers; Cluster 0 “Complex high-cost high-need”; Cluster 1 “Slightly complex with inexpensive low-severity PCGs” (a); Cluster 2 “Oldest at high risk”; Cluster 3 “Patients with 1 costly disease”; No PCGs; Hypertension “Only hypertension”; Mental health “Only mental diseases”

Statistics (All data column first):
Patients, n (%): 18,732
Age (years), mean (SD): 65.0
Sex, n (%):
Deductible (CHF; US $), mean: 794
Model with
Number of PCGs, mean: 0.
Multimorbid (yes) (b): 0.
Ambulatory costs (CHF; US $), mean: 5395
Inpatient costs (CHF; US $), mean: 1419
Costs of medications (CHF; US $), mean: 1563
Total cost (CHF; US $), mean: 8929
Number of days in the hospital, mean: 2.
Number of hospitalizations in a year, mean: 0.
Total number of consultations, mean: 11.916.
Number of consultations with generalist, mean: 7.

PCG groups in the cluster:
All data: All 34 PCGs
Outliers: Mostly pain
Cluster 0: Mental + hypertension + pain + asthma (COPD) (c)
Cluster 1: Thyroid + hypertension + glaucoma + mix of others
Cluster 2: Asthma + Parkinson + cardiac diseases + pain
Cluster 3: Cancer + diabetes + inflammatory + immune + other mental + glaucoma + HIV
No PCGs: N/A (d)
Hypertension: Hypertension
Mental health: Mental diseases

Description of the clusters based on overall descriptive statistics:
All data: N/A
Outliers: Average age, slightly fewer male patients, higher hospital costs and hospital stays
Cluster 0: Average age, slightly fewer male patients, lowest deductibles, highest amount of PCGs and multimorbidity, highest health care use and costs (except for costs of medications)
Cluster 1: Slightly older, more female patients, relatively low deductibles, high amount of PCGs (1.7) and multimorbidity (but less than cluster 0), relatively low health care use and costs
Cluster 2: Oldest, relatively low deductibles, some complexity (more than 1 PCGs on average), very high use of doctor visits (especially generalist), many hospitalizations and high inpatient costs
Cluster 3: Relatively old, on average 1 PCG, highest cost of medicaments, and high ambulatory costs, relatively low hospitalizations and doctor visits
No PCGs: Young, highest deductibles, low health care use and costs
Hypertension: Slightly older, more male patients, relatively low health care use and costs
Mental health: Youngest, more female patients, relatively low deductibles, low health care use and costs (but higher than for hypertension group), a lot of visits to doctors

(a) PCG: pharmacy-based cost group.

(b) Ratios rounded off to one decimal place.

(c) COPD: chronic obstructive pulmonary disease.

(d) N/A: not applicable.

Figure 3. Joint distributions of PCGs within the 4 clusters (group 0-3) and outliers (group –1). PCG: pharmacy-based cost group.

Our study shows that performing cluster analysis to explore patient multimorbidity and complexity is feasible. We demonstrated that individuals with single PCGs of mental diseases or hypertension, individuals with multiple PCGs, or individuals with a single high-cost PCG have different health care use patterns and represent different complexity groups.

Earlier studies focusing on chronic conditions identified from electronic health records evidenced the existence of systematic associations between chronic diseases, whereby chronic diseases, often from dissimilar disease categories, coappeared within a multimorbidity pattern or cluster [24-26]. Importantly, though, these studies showed that the complexity of multimorbidity patterns in terms of diseases and associated drug use increased with age, which holds true for both genders. Moreover, in line with our findings, multiple earlier studies used cluster analysis for identifying clinically homogenous multimorbidity patterns in the population, where clusters were composed of diagnosis-related groups [16,27-30]. However, these studies used measures of multimorbidity and comorbidity or clinical diagnosis data rather than PCGs from claims data. This makes direct comparison of results challenging, due to the differences in methodologies and level of diagnosis details. A recent systematic review confirmed that analytical methods used to identify patient profiles with multimorbid conditions are heterogeneous (including factor analysis, multiple correspondence analysis, hierarchical clustering, and three-step unified-clustering method), which may explain the variation in the multimorbidity patterns reported in various studies [31]. Despite those differences, the observed most prevalent clusters or groups are similar across studies and included hypertensive or metabolic diseases [28,29] and mental and behavioral diseases [16]. The greater prevalence of and similarities in metabolic and mental clusters were confirmed by a systematic review of multimorbidity patterns, whereby these clusters were identified in 9 and 10 of 14 reviewed articles, respectively [32]. 
One study compared multimorbidity patterns between populations of two European countries (Spain and the Netherlands) and found that, indeed, the highest similarities were observed in the cardio metabolic cluster, even though the populations are likely to differ across countries [26].

The existing literature on the use of cluster analysis to identify homogenous segments based on health care use and expenditures is limited [33-37]. Specifically, the study by Nnoaham and Cann [33] identified segments (or clusters), similar to ours, based on health care use (expressed by visits to the physicians, medications, and admissions) and complexity (expressed by long-term conditions). Other studies used cluster analysis to identify groups with high expenditures and deduced that, despite having a lot of heterogeneity, the high expenditures cluster typically exhibited fair or poor health with more medical conditions or comorbidities [34,35]. These findings confirm ours; they nevertheless need to be interpreted with caution due to differences in methodologies, age of the population, and level of details available for background individual characteristics and diagnoses. There is evidence that cluster analysis may provide more information to decision makers than a list of possible statistically significant variables or a list of individuals who are the highest users [35].

To our knowledge, this is the first study using cluster analysis to explore patients’ multimorbidity and complexity, reflected by the mix of PCGs and health care use patterns. In addition, it benefits from the richness of health care use data, a large sample size, and advanced clustering methods. However, the study has certain limitations. The first limitation stems from the need to configure multiple parameters, which increases complexity while not allowing validation of the results. Thus, the cluster interpretation has to rely on metrics from the algorithms, descriptive statistics, and clinical relevance. Second, as the data lacked clinical information, we relied only on PCG mapping, which may give an incomplete picture of drug use [9,11,12].

Our study shows that PCG-based cluster analysis of health care use claims data allows moving beyond simple comorbidity counts and can identify the population profiles with increased health care use and costs. Such results may provide insightful information for policy making, care planning, and care delivery to facilitate the transformation from procedures and guidelines focusing on a single disease toward the development of integrated and better coordinated care.


This work was supported by the Swiss National Science Foundation within the Smarter health care—National Research Programme (NRP 74) and grant 407440_183447.

Conflicts of Interest

None declared.

Multimedia Appendix 1

Lists of pharmacy-based cost groups used to identify chronic diseases in insurance claims data.

DOCX File, 15 KB

  1. Prince MJ, Wu F, Guo Y, Gutierrez Robledo LM, O'Donnell M, Sullivan R, et al. The burden of disease in older people and implications for health policy and practice. Lancet 2015 Feb 07;385(9967):549-562.
  2. Bachmann N, Burla L, Kohler D. La santé en Suisse – Le point sur les maladies chroniques: Rapport national sur la santé 2015. OBSAN. Berne: Hogrefe Verlag; 2015. URL: [accessed 2022-03-16]
  3. Smith SM, Wallace E, O'Dowd T, Fortin M. Interventions for improving outcomes in patients with multimorbidity in primary care and community settings. Cochrane Database Syst Rev 2021 Jan 15;1:CD006560.
  4. Smith SM, Wallace E, O'Dowd T, Fortin M. Interventions for improving outcomes in patients with multimorbidity in primary care and community settings. Cochrane Database Syst Rev 2016 Mar 14;3:CD006560.
  5. Souza DLB, Oliveras-Fabregas A, Minobes-Molina E, de Camargo Cancela M, Galbany-Estragués P, Jerez-Roig J. Trends of multimorbidity in 15 European countries: a population-based study in community-dwelling adults aged 50 and over. BMC Public Health 2021 Jan 07;21(1):76.
  6. Pefoyo AJK, Bronskill SE, Gruneir A, Calzavara A, Thavorn K, Petrosyan Y, et al. The increasing burden and complexity of multimorbidity. BMC Public Health 2015 Apr 23;15:415.
  7. Amelung V, Stein V, Suter E, Goodwin N, Nolte E, Balicer R, editors. Handbook Integrated Care. Cham: Springer; 2017.
  8. Sharabiani MTA, Aylin P, Bottle A. Systematic review of comorbidity indices for administrative data. Med Care 2012 Dec;50(12):1109-1118.
  9. Huber CA, Szucs TD, Rapold R, Reich O. Identifying patients with chronic conditions using pharmacy data in Switzerland: an updated mapping approach to the classification of medications. BMC Public Health 2013 Oct 30;13:1030.
  10. Huber CA, Schneeweiss S, Signorell A, Reich O. Improved prediction of medical expenditures and health care utilization using an updated chronic disease score and claims data. J Clin Epidemiol 2013 Oct;66(10):1118-1127.
  11. Lamers LM, van Vliet RCJA. The Pharmacy-based Cost Group model: validating and adjusting the classification of medications for chronic conditions to the Dutch situation. Health Policy 2004 Apr;68(1):113-121.
  12. Chini F, Pezzotti P, Orzella L, Borgia P, Guasticchi G. Can we use the pharmacy data to estimate the prevalence of chronic conditions? a comparison of multiple data sources. BMC Public Health 2011 Sep 05;11:688.
  13. Whitson HE, Johnson KS, Sloane R, Cigolle CT, Pieper CF, Landerman L, et al. Identifying patterns of multimorbidity in older Americans: application of latent class analysis. J Am Geriatr Soc 2016 Aug;64(8):1668-1673.
  14. Khalid S, Prieto-Alhambra D. Machine learning for feature selection and cluster analysis in drug utilisation research. Curr Epidemiol Rep 2019 Jul 27;6(3):364-372.
  15. Wartelle A, Mourad-Chehade F, Yalaoui F, Chrusciel J, Laplanche D, Sanchez S. Clustering of a health dataset using diagnosis co-occurrences. Appl Sci 2021 Mar 07;11(5):2373.
  16. Violán C, Roso-Llorach A, Foguet-Boreu Q, Guisado-Clavero M, Pons-Vigués M, Pujol-Ribera E, et al. Multimorbidity patterns with K-means nonhierarchical cluster analysis. BMC Fam Pract 2018 Jul 03;19(1):108.
  17. Polynomics AG. Aktualisierung der PCG-Liste für den Schweizer Risikoausgleich. Studie im Auftrag des Bundesamts für Gesundheit BAG - Schlussbericht. 2019.   URL: https:/​/www.​​dam/​bag/​en/​dokumente/​kuv-aufsicht/​pus/​risikoausgleich/​​Polynomics_Uni_Basel_Aktualisierung_PCG_Schlussbericht_2019-01-22.​pdf [accessed 2020-03-09]
  18. Breiman L. Random forests. Machine Learning 2001;45(1):5-32.
  19. Breiman L, Cutler A. Random Forests Manual v4.0. UC Berkeley. 2003. URL: [accessed 2020-10-30]
  20. Shi T, Seligson D, Belldegrun AS, Palotie A, Horvath S. Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma. Mod Pathol 2005 Apr;18(4):547-557.
  21. Allen E, Horvath S, Tong F, Kraft P, Spiteri E, Riggs AD, et al. High concentrations of long interspersed nuclear element sequence distinguish monoallelically expressed genes. Proc Natl Acad Sci U S A 2003 Aug 19;100(17):9940-9945.
  22. Kruskal JB, Wish M. Multidimensional scaling. In: Multidimensional Scaling. Quantitative Applications in the Social Sciences, ed. I. Thousand Oaks, CA: SAGE Publications; 1978.
  23. McInnes L, Healy J, Astels S. hdbscan: hierarchical density based clustering. J Open Source Software 2017 Mar;2(11):205.
  24. Ioakeim-Skoufa I, Poblador-Plou B, Carmona-Pírez J, Díez-Manglano J, Navickas R, Gimeno-Feliu LA, et al. Multimorbidity patterns in the general population: results from the EpiChron cohort study. Int J Environ Res Public Health 2020 Jun 14;17(12):4242.
  25. Mucherino S, Gimeno-Miguel A, Carmona-Pirez J, Gonzalez-Rubio F, Ioakeim-Skoufa I, Moreno-Juste A, et al. Changes in multimorbidity and polypharmacy patterns in young and adult population over a 4-year period: a 2011-2015 comparison using real-world data. Int J Environ Res Public Health 2021 Apr 21;18(9):4422.
  26. Poblador-Plou B, van den Akker M, Vos R, Calderón-Larrañaga A, Metsemakers J, Prados-Torres A. Similar multimorbidity patterns in primary care patients from two European regions: results of a factor analysis. PLoS One 2014;9(6):e100375.
  27. Déruaz-Luyet A, N'Goran AA, Senn N, Bodenmann P, Pasquier J, Widmer D, et al. Multimorbidity and patterns of chronic conditions in a primary care population in Switzerland: a cross-sectional study. BMJ Open 2017 Jul 02;7(6):e013664.
  28. Marengoni A, Rizzuto D, Wang H, Winblad B, Fratiglioni L. Patterns of chronic multimorbidity in the elderly population. J Am Geriatr Soc 2009 Feb;57(2):225-230.
  29. Guisado-Clavero M, Roso-Llorach A, López-Jimenez T, Pons-Vigués M, Foguet-Boreu Q, Muñoz MA, et al. Multimorbidity patterns in the elderly: a prospective cohort study with cluster analysis. BMC Geriatr 2018 Jan 16;18(1):16.
  30. Egan BM, Sutherland SE, Tilkemeier PL, Davis RA, Rutledge V, Sinopoli A. A cluster-based approach for integrating clinical management of Medicare beneficiaries with multiple chronic conditions. PLoS One 2019;14(6):e0217696.
  31. Ng SK, Tawiah R, Sawyer M, Scuffham P. Patterns of multimorbid health conditions: a systematic review of analytical methods and comparison analysis. Int J Epidemiol 2018 Oct 01;47(5):1687-1704.
  32. Prados-Torres A, Calderón-Larrañaga A, Hancco-Saavedra J, Poblador-Plou B, van den Akker M. Multimorbidity patterns: a systematic review. J Clin Epidemiol 2014 Mar;67(3):254-266.
  33. Nnoaham KE, Cann KF. Can cluster analyses of linked healthcare data identify unique population segments in a general practice-registered population? BMC Public Health 2020 May 27;20(1):798.
  34. Powers BW, Yan J, Zhu J, Linn KA, Jain SH, Kowalski JL, et al. Subgroups of high-cost Medicare advantage patients: an observational study. J Gen Intern Med 2019 Feb;34(2):218-225.
  35. Agterberg J, Zhong F, Crabb R, Rosenberg M. Cluster analysis application to identify groups of individuals with high health expenditures. Health Serv Outcomes Res Method 2020 Aug 01;20(2-3):140-182.
  36. Copeland LA, Zeber JE, Wang C, Parchman ML, Lawrence VA, Valenstein M, et al. Patterns of primary care and mortality among patients with schizophrenia or diabetes: a cluster analysis approach to the retrospective study of healthcare utilization. BMC Health Serv Res 2009 Jul 26;9:127.
  37. Vuik SI, Mayer E, Darzi A. A quantitative evidence base for population health: applying utilization-based cluster analysis to segment a patient population. Popul Health Metr 2016 Nov 25;14:44.

DT: decision tree
HDBSCAN: hierarchical density-based spatial clustering of applications with noise
MDS: multidimensional scaling
OOB: out-of-bag
PCG: pharmacy-based cost group
RF: random forest

Edited by C Lovis; submitted 14.10.21; peer-reviewed by W Zhang, I Ioakeim-Skoufa; comments to author 20.12.21; revised version received 04.02.22; accepted 06.02.22; published 04.04.22


©Anna Nicolet, Dan Assouline, Marie-Annick Le Pogam, Clémence Perraudin, Christophe Bagnoud, Joël Wagner, Joachim Marti, Isabelle Peytremann-Bridevaux. Originally published in JMIR Medical Informatics, 04.04.2022.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication, as well as this copyright and license information must be included.