This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
In the current era of personalized medicine, there is increasing interest in understanding the heterogeneity in disease populations. Cluster analysis is a method commonly used to identify subtypes in heterogeneous disease populations. The clinical data used in such applications are typically multimodal, which can make the application of traditional cluster analysis methods challenging.
This study aimed to review the research literature on the application of clustering multimodal clinical data to identify asthma subtypes. We assessed common problems and shortcomings in the application of cluster analysis methods in determining asthma subtypes, such that they can be brought to the attention of the research community and avoided in future studies.
We searched PubMed and Scopus bibliographic databases with terms related to cluster analysis and asthma to identify studies that applied dissimilarity-based cluster analysis methods. We recorded the analytic methods used in each study at each step of the cluster analysis process.
Our literature search identified 63 studies that applied cluster analysis to multimodal clinical data to identify asthma subtypes. The features fed into the clustering algorithms were of a mixed type in 47 (75%) studies and continuous in 12 (19%), and the feature type was unclear in the remaining 4 (6%) studies. A total of 23 (37%) studies used hierarchical clustering with Ward linkage, and 22 (35%) studies used k-means clustering. Of these 45 studies, 39 had mixed-type features, but only 5 specified dissimilarity measures that could handle mixed-type features. A further 9 (14%) studies used a preclustering step to create small clusters to feed into a hierarchical method. The original sample sizes in these 9 studies ranged from 84 to 349. The remaining studies used hierarchical clustering with other linkages (n=3), medoid-based methods (n=3), spectral clustering (n=1), and multiple kernel k-means clustering (n=1), and in 1 study, the methods were unclear. Of 63 studies, 54 (86%) explained the methods used to determine the number of clusters, 24 (38%) studies tested the quality of their cluster solution, and 11 (17%) studies tested the stability of their solution. Reporting of the cluster analysis was generally poor in terms of the methods employed and their justification.
This review highlights common issues in the application of cluster analysis to multimodal clinical data to identify asthma subtypes. Some of these issues were related to the multimodal nature of the data, but many were more general issues in the application of cluster analysis. Although cluster analysis may be a useful tool for investigating disease subtypes, we recommend that future studies carefully consider the implications of clustering multimodal data, the cluster analysis process itself, and the reporting of methods to facilitate replication and interpretation of findings.
There is mounting evidence to suggest that some disease labels are in fact
It is now understood that asthma is one such umbrella term used to encompass multiple diverse underlying disease symptoms and pathophysiology [
Cluster analysis is a statistical technique used to identify subgroups in data based on multiple variables (for convenience, herein, we have used the term
Clinical datasets are often
This review aimed to comprehensively explore whether studies applying cluster analysis to multimodal clinical data to subtype asthma are using appropriate clustering methodologies. The contribution of this study is to make recommendations for the robust application of cluster analysis to multimodal clinical data. We believed this would be of interest to the ever-growing number of asthma researchers engaging or planning to engage in disease subtyping, as well as to the wider community of researchers applying cluster techniques for the purpose of disease subtyping.
This review is reported following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.
We sought to identify studies that applied cluster analysis to multimodal clinical data with the aim of identifying subtypes of asthma. One researcher (EH) searched PubMed and Scopus databases (search queries are provided in
We excluded nonrelevant studies by first screening the abstracts, then referring to the full text where necessary. We excluded studies in which (1) none of the aims or objectives were to identify subtypes of asthma (studies looking exclusively at, eg, childhood wheeze were excluded); (2) the data were not multimodal (ie, were measured from a common source and on a common scale); and (3) none of the features were considered clinical (eg, studies concerned only with omics data). Finally, we excluded studies that used latent class analysis or mixture models to group their data to narrow the scope of this review to methods that cluster samples based on pairwise dissimilarities. The use of latent class analysis to distinguish asthma phenotypes has been reviewed previously by Howard et al [
The following query was inserted in PubMed on May 23, 2019:
English[Language] AND (“2008/01/01”[Date - Publication] : “2019/05/23”[Date - Publication]) AND (“cluster analysis”[Text Word] OR “clustering*”[Text Word]) AND “asthma*”[Text Word] NOT (comment[Publication Type] OR editorial[Publication Type] OR letter[Publication Type] OR review[Publication Type] OR meta-analysis[Publication Type])
The following query was inserted in Scopus on May 23, 2019:
PUBYEAR > 2007 AND (TITLE-ABS-KEY ( “cluster analysis” ) OR TITLE-ABS-KEY(“clustering*”)) AND TITLE-ABS-KEY (“asthma*”) AND SRCTYPE (“j”) AND DOCTYPE (“ar”) AND LANGUAGE (“English”)
In total, 2 researchers (EH and HT) independently extracted information from the full text and supplementary material of each study. Information was extracted following the steps outlined in the following
To provide context for this review, we outlined the key steps in the application of cluster analysis to multimodal clinical data.
Schematic of the typical cluster analysis steps.
The first step is to identify the set of features of interest, which we referred to as
Most common cluster analysis methods use
Despite the widespread use of cluster analysis, at present, there is no consensus regarding the minimum sample size required to ensure stable and meaningful clustering. Dolnicar et al [
The features that we may want to use in a clustering algorithm often come from multimodal clinical data. Hence, they may be of different types (eg, continuous, nominal, ordinal, binary, etc) and are likely to be measured on different scales (eg, kilogram for mass, years for age). Most dissimilarity measures and clustering algorithms assume that the features are of the same type and are measured on a common scale. These requirements can be addressed using
When dealing with categorical features, it is vital to consider how these are encoded (nominal, ordinal, or binary), as this determines how they are treated in the calculation of dissimilarities and in the clustering algorithm. A common approach is to encode ordinal features as integers and to encode nominal features as dummy binary features [
Feature scaling may be used to address 3 issues related to continuous features. The first is that continuous features may be measured in different units and should therefore be rescaled to bring them onto a common scale before calculating dissimilarities. The second is that continuous features measured in the same units may have different variances. In some cases, the differences in variance may be useful for clustering, but in others, these may obscure the true underlying cluster structure in the data. In the latter case, the continuous features should be rescaled. Common approaches to these 2 issues are to standardize features to have 0 mean and unit variance (referred to as
The third issue is that the features may not follow the desired probability distribution properties for further analysis (eg, having Gaussian-distributed features). This issue needs to be considered when statistical methods make distributional assumptions. Although few dissimilarity-based clustering methods make distributional assumptions, several methods involve the calculation of cluster means (eg, k-means, hierarchical clustering with the Ward linkage). The mean is a poor choice of summary statistic for a feature that is skewed (or a feature with multiple modes), so a power transformation may be advantageous as a preprocessing step when using such clustering methods.
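The scaling and transformation steps described above can be sketched as follows, using only the Python standard library. This is a minimal illustration and not the procedure of any reviewed study: the feature values are invented, the function names are hypothetical, and the natural logarithm (the limiting case of the Box-Cox family, applicable only to strictly positive values) is assumed as the power transformation.

```python
import math
import statistics


def z_scores(values):
    """Standardize a feature to zero mean and unit (sample) variance."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def log_transform(values):
    """Natural-log transform for a strictly positive, right-skewed feature."""
    return [math.log(v) for v in values]


# Invented, right-skewed feature with one very large value.
counts = [50.0, 80.0, 100.0, 120.0, 150.0, 2000.0]
logged = log_transform(counts)

# On the raw scale the mean is pulled far above the median by the large value.
raw_mean = statistics.mean(counts)
raw_median = statistics.median(counts)

# Back-transforming the mean of the logged feature gives the geometric mean,
# which sits much closer to the median.
geometric_mean = math.exp(statistics.mean(logged))

raw_z = z_scores(counts)
log_z = z_scores(logged)
```

In this toy example the raw mean is several times the median, whereas the back-transformed mean of the logged feature is not, which is why a transformation may help before applying methods that rely on cluster means.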
When dealing with mixed-type data, it may be necessary to scale the categorical features to avoid assigning categorical features greater weight over continuous features or vice versa. This issue is discussed in detail in the context of dissimilarity measures by Hennig and Liao [
There are generally 2 motivations for reducing the dimensionality of a dataset before applying cluster analysis. First, as previously mentioned in the
Feature selection involves selecting a subset of the available features for use in cluster analysis. Herein, we have referred to the features selected for the cluster analysis as
Feature transformation involves combining original features to create new features. Generally, a subset of these new features is selected for inclusion in the analysis. It is beyond the scope of this review to provide in-depth details on the methods of feature transformation (also known as
Model-free clustering methods rely on a
There are many different methods of cluster analysis (eg, k-means, hierarchical clustering with the Ward linkage, spectral clustering), and each method may be implemented using different algorithms. A comprehensive overview of the wide range of clustering methods can be found elsewhere [
A key challenge in cluster analysis is choosing the number of clusters to present in the final solution, which is typically unknown
Providing a detailed commentary on these strategies is beyond the scope of this review. An overview of strategies for choosing
We highlighted the possibility that there might not be meaningful clustering of the data to form groups, and thus, the entire dataset is treated as 1 cluster. This may reflect the lack of statistical power (sufficiently large sample size) to determine clusters or that the investigated problem using that dataset is not amenable to clustering using the available sample size and features. Some statistics used for choosing k, such as the Gap statistic [
Assessing the quality of a clustering solution produced using any cluster algorithm is challenging. Unlike supervised learning setups, there is no
Most importantly, it is crucial to assess the
Beyond stability, there are numerous steps one may take to ensure the integrity of their cluster analysis findings, for example, repeating the analysis in a different cohort or at a different time point, or altering the encoding of a feature. These steps are often referred to as reproducibility testing. However, we avoided this term because it implies that we seek the exact same results, which we do not feel is reasonable in all scenarios. To extract this information from the studies in this review, 2 reviewers independently extracted details of postprocessing methods, which we felt assessed the quality of the cluster results, but did not come under stability. In our schematic and results, we referred to these methods as testing the quality of the cluster results.
We identified 63 studies that used cluster analysis to identify subtypes of asthma using multimodal clinical data (
Flow of studies into review.
A total of 42 (67%) studies identified candidate features based on previous studies or clinical input (relevance to asthma subtypes, avoiding clinical redundancy, and easily measured in clinical practice). The numbers used in each method are summarized in
A total of 42 (67%) studies detailed their methods for dealing with missing data; the methods used are shown in
Initial considerations across the asthma studies we have included in this review (N=63).
Method  Values, n (%)^{a}
Identification of candidate features
Clinical intuition and understanding  33 (52)
Avoid clinical redundancy  15 (24)
Previous studies  15 (24)
Easily measured in clinical practice  8 (13)
Handling of missing data
Complete case analysis  22 (35)
Features with >x%^{b} missing values removed  14 (22)
Imputed  11 (17)
Patients with >x%^{b} missing values removed  5 (8)
No missing data present  2 (3)
Clustering methods handle missing data  1 (2)
^{a}One study may use multiple methods; some studies may use no methods.
^{b}x>0.
The sample sizes for cluster analysis ranged from 40 to 3612, with a median of 195 patients.
Number of patients versus final number of cluster features. The line corresponds to the number of patients that is equal to 70 times the number of features.
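As a rough worked example of the reference line in the figure above (number of patients equal to 70 times the number of features), the sketch below checks a sample size against that multiple. The function names are hypothetical. Pairing the median sample size (195) with the median number of cluster features (12) reported in this review is purely illustrative, because the two medians do not necessarily describe the same study.

```python
def minimum_sample_size(n_features, multiplier=70):
    """Number of patients on the reference line for a given number of features."""
    return multiplier * n_features


def meets_reference_line(n_patients, n_features, multiplier=70):
    """True if the sample size is on or above the reference line."""
    return n_patients >= minimum_sample_size(n_features, multiplier)


# Illustrative values taken from the medians and range reported in this review.
required_for_12_features = minimum_sample_size(12)
median_case = meets_reference_line(195, 12)
largest_case = meets_reference_line(3612, 12)
```

With 12 features, the line sits at 840 patients, so a study of 195 patients would fall below it and a study of 3612 patients would fall above it.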
Judging whether feature scaling and encoding were appropriate depends on the methods of cluster analysis used and vice versa. Therefore, we reported the methods of feature scaling and encoding alongside the methods of cluster analysis in
Breakdown of methods used by studies applying hierarchical clustering with Ward's linkage (N=23).
Data type, dissimilarity, and scaling of continuous features  Categorical features encoded as binary?  Value, n (%)

Not detailed  N/A^{a}  1 (4)

Scaled but method unspecified  Yes  1 (4)
Scaled to lie in the interval of 0 to 1  Yes  1 (4)
z-scores  Yes  1 (4)
Not detailed  Yes  3 (13)

z-scores  Yes  2 (9)

Gower standardisation  No  3 (13)
Scaled but method unspecified  No  1 (4)

Not detailed  No  1 (4)
^{a}N/A: not applicable (irrelevant for continuous features).
^{b}Computing the Gower coefficient normalizes the distance between feature samples by dividing by the feature range. Therefore, it is not necessary to normalize continuous features prior to computing the Gower coefficient.
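The normalization described in footnote b can be sketched as follows. This is a minimal, unweighted version of the Gower coefficient, assuming complete data and treating every categorical feature as nominal; ordinal features and missing values are not handled. The patient values, feature order, and function names are invented for illustration.

```python
def feature_ranges(rows, kinds):
    """Range (max - min) of each continuous column; None for categorical ones."""
    ranges = []
    for col, kind in enumerate(kinds):
        if kind == "continuous":
            column = [row[col] for row in rows]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)
    return ranges


def gower_dissimilarity(a, b, kinds, ranges):
    """Average per-feature dissimilarity, each term lying between 0 and 1."""
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "continuous":
            # Dividing by the range removes the need to rescale beforehand.
            total += abs(x - y) / rng if rng else 0.0
        else:
            # Nominal feature: simple mismatch.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


# Invented patients: (age in years, lung function in % predicted,
# smoking status, atopy).
kinds = ("continuous", "continuous", "nominal", "nominal")
patients = [
    (25.0, 95.0, "never", "yes"),
    (45.0, 70.0, "former", "yes"),
    (65.0, 55.0, "never", "no"),
]
ranges = feature_ranges(patients, kinds)

d01 = gower_dissimilarity(patients[0], patients[1], kinds, ranges)
d02 = gower_dissimilarity(patients[0], patients[2], kinds, ranges)
```

Because every feature contributes a term between 0 and 1, age in years and lung function in percent carry the same maximum weight as each categorical feature, whatever their original units.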
Breakdown of methods used by studies applying k-means (N=22).
Data type, dissimilarity, and scaling of continuous features  Categorical features encoded as binary?  Value, n (%)

z-scores for one feature  N/A^{a}  1 (5)

No details  N/A  3 (14)

No details  N/A  1 (5)

Scaled but method unspecified  No  1 (5)

z-scores  Yes  6 (27)

z-scores for one feature  No  1 (5)

No details  Yes  1 (5)

z-scores  Yes  1 (5)

No details  No  1 (5)

No details  No  3 (14)

z-scores  No  1 (5)
^{a}N/A: not applicable (irrelevant for continuous features).
Breakdown of methods used by studies applying SPSS TwoStep (N=7).
Data type, dissimilarity, and scaling of continuous features  Categorical features encoded as binary?  Value, n (%)

No details  N/A^{a}  1 (14)

Scaled to lie in the interval 0 to 1  Yes  1 (14)

z-scores  No  1 (14)

No details  Yes  2 (29)

Scaled but method unspecified  No  1 (14)

No details  No  1 (14)
^{a}N/A: not applicable (irrelevant for continuous features).
A total of 23 (37%) studies applied univariate feature transformation to bring features closer to a normal distribution. The most common univariate feature transformation was logarithmic transformation, applied to nonnormally distributed features in 33% of studies. Lefaudeux et al [
A total of 22 (35%) studies detailed methods of feature selection to identify their cluster features. The number of features selected in the 63 studies included in this review ranged from 2 to 120, with a median of 12 features. In addition, 47 (75%) studies had mixed-type features, 12 (19%) had continuous features, and in 4 (6%) studies, the type of features was unclear. Methods for feature selection are listed in
A total of 13 (20%) studies used PCA or factor analysis for feature selection. These are not typically methods that should be used for feature selection; we defer further elaboration on the topic for the Discussion. All but one of these studies computed the components (or factors) that represent an underlying latent feature structure, then selected 1 (or in some cases multiple [
Feature engineering methods used in the asthma studies included in this review.
Method  Values, n (%)^{a}
Univariate feature transformation
Logarithmic transformation  21 (33)
Box-Cox transformation  1 (2)
Method not explained  1 (2)
Feature selection
Factor analysis^{b}  8 (13)
Principal component analysis^{b}  5 (8)
Avoid collinearity  3 (5)
Avoid multicollinearity  3 (5)
Supervised learning methods  2 (3)
Multiple correspondence analysis  1 (2)
Feature transformation
Principal component analysis  4 (6)
Factor analysis  1 (2)
Multiple correspondence analysis  1 (2)
^{a}As a percentage of all 63 studies.
^{b}These are not typically methods of feature selection but have been used in these studies.
Three (5%) studies considered collinearity via pairwise correlations, although the exact criteria for selection features based on this were unclear [
Furthermore, 2 (3%) studies selected features using statistical hypothesis tests with respect to the outcome of interest. Sakagami et al [
A total of 6 (10%) studies performed feature transformation before cluster analysis; the methods are summarized in
Khusial et al [
Sendín-Hernández et al [
A total of 23 (37%) studies applied hierarchical clustering with the Ward method [
A total of 3 (5%) further studies (in addition to the 23 studies introduced at the start of the paragraph) applied hierarchical clustering to continuous data. Amore et al [
A total of 22 (35%) studies used k-means clustering as the principal clustering technique. A breakdown of the methods used by these 22 studies is given in
When dealing with very large sample sizes, it can be advantageous to introduce a precluster step. The aim is to group samples and to use these groups or
A total of 7 (11%) studies used the SPSS TwoStep clustering method [
None of the studies in this review adequately considered the distributional assumptions made by the SPSS TwoStep method. Ruggieri et al [
Two (3%) further studies preclustered samples (Just et al [
Three studies used k-medoid methods. A breakdown of the methods used by these 3 studies is given in
Kernel k-means and spectral clustering are different but related methods, which may be used to identify clusters that are not linearly separable in the input feature space [
Wang et al [
A total of 54 (86%) studies explained in detail the methods used to select the number of clusters. Of these, 20 (32%) studies used more than one method for choosing the number of clusters. The maximum number of methods used was 6.
A total of 27 (43%) studies used a dendrogram to choose the number of clusters to include in their study (
Of the 8 (13%) studies that specified a maximum number of clusters, the maximum number ranged between 2 and 15 clusters. Seven (11%) studies used a statistic (or multiple statistics), including the c-index [
Four studies (6%) avoided very small clusters. Approaches to this include merging 2 clusters containing 6 and 12 samples [
A total of 11 (17%) studies tested the stability of their cluster solution; the methods are detailed in
A total of 24 (38%) studies assessed the quality of their solution using methods beyond those assessing stability. The methods are detailed in
Of the 30 studies that assessed the stability or quality of their cluster analysis, 21 (70%) reported their findings. However, the reporting of these results was in many cases brief, consisting of statements such as “the clusters were shown to be stable” without providing supporting evidence.
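One way to provide such supporting evidence is to report an agreement index between the original cluster labels and the labels obtained when the analysis is repeated (eg, on a bootstrap resample, compared over the patients common to both). The sketch below computes the adjusted Rand index for two label vectors. The choice of this index is ours for illustration and is not necessarily what the reviewed studies used; the label vectors are invented.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least 2 samples are required")

    # Pairs of samples placed together in both partitions, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Both partitions are trivial (eg, a single cluster each).
        return 1.0
    return (together_both - expected) / (maximum - expected)


original = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]   # same grouping, different label names
perturbed = [0, 0, 1, 1, 1, 1]    # one sample moved to the other cluster

ari_same = adjusted_rand_index(original, relabelled)
ari_perturbed = adjusted_rand_index(original, perturbed)
```

The index equals 1 when two solutions group the samples identically, regardless of label names, and is close to 0 when agreement is no better than chance. Reporting its distribution over many resamples is more informative than a statement that clusters were stable.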
Postprocessing methods used in the asthma studies included in this review.
Method  Values, n (%)^{a}
Choosing the number of clusters
Dendrogram  27 (43)
Hierarchical clustering with Ward linkage  19 (30)
Specify a maximum number of clusters^{b}  8 (13)
Statistic(s)  7 (11)
Silhouette plot or average silhouette width  5 (8)
Bayesian information criterion  4 (6)
Specify a minimum size of smallest cluster^{b}  4 (6)
Previous studies  3 (5)
Unclear  3 (5)
Clinical interpretation  2 (3)
Scree plot  1 (2)
Testing the stability of the cluster solution
Repeated in random subset  3 (5)
Leave-one-out cross-validation  3 (5)
Bootstrap methods  3 (5)
Unclear methods  2 (3)
Train and test set  1 (2)
Testing the quality of the cluster solution
Repeated in selected subset  8 (13)
Repeated with different methods  6 (10)
Repeated with different initial configurations  5 (8)
Repeated in separate cohort  4 (6)
Repeated with altered features  3 (5)
Repeated at different time point  3 (5)
Repeated with different software  1 (2)
^{a}Studies may have used more than 1 method.
^{b}These methods were not included when calculating the number of methods used to choose the number of clusters.
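As an illustration of one entry in the table above, the average silhouette width can be computed from pairwise dissimilarities alone, so it can be combined with any dissimilarity measure, including those suited to mixed-type features. The sketch below is a minimal version under stated assumptions: the function name is hypothetical, the data are invented one-dimensional values, and singleton clusters are given a width of 0 by convention.

```python
from statistics import mean


def average_silhouette_width(points, labels, dissimilarity):
    """Mean silhouette width over all samples for a given labelling."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette width needs at least 2 clusters")

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # singleton cluster
            continue
        # a: mean dissimilarity to the other members of the sample's cluster.
        a = mean(dissimilarity(points[i], points[j]) for j in own)
        # b: smallest mean dissimilarity to the members of any other cluster.
        b = min(
            mean(dissimilarity(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        widths.append((b - a) / denominator if denominator else 0.0)
    return mean(widths)


def absolute_difference(x, y):
    return abs(x - y)


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
well_separated = average_silhouette_width(values, [0, 0, 0, 1, 1, 1], absolute_difference)
interleaved = average_silhouette_width(values, [0, 1, 0, 1, 0, 1], absolute_difference)
```

Comparing the statistic across candidate numbers of clusters, rather than reporting a single value, makes the basis for the chosen solution easier for readers to assess.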
We identified 63 studies that applied cluster analysis to multimodal clinical data to identify subtypes of asthma. We explored the clustering methodologies and their limitations in detail. The principal finding of this review was that the majority of the reviewed studies have flaws in the application of cluster analysis. Although some of these flaws were related to the multimodal nature of the clinical data, they extended to aspects of cluster analysis, which are agnostic of data type, such as sample size, stability, and reporting of the results.
These findings build on a previous review, which identified limitations such as lack of robustness in feature selection and neglect to specify distance measures in studies using cluster analysis to contribute to our understanding of the spectrum of asthma syndrome [
A widespread limitation in the reviewed studies was the small sample size. Studies had overall sample sizes as small as 40 patients, with clusters as small as 6 patients. We argue that there is limited utility in clustering data with such small sample sizes: they may result in clusters that are unstable [
In the following paragraphs, we discuss the limitations of 3 of the feature selection approaches applied by the reviewed studies. The first approach was to avoid collinearity or multicollinearity or to exclude features that were considered to be
The second was the use of PCA or factor analysis to select features, which has a similar motivation to the concept described earlier for discarding statistically correlated features. There are methodological justifications for the use of PCA, factor analysis, or other nonlinear embedding methods for feature transformation [
The third approach to feature selection was the use of statistical hypothesis tests with respect to outcomes of interest, as done in 2 studies [
Feature transformation was applied in only 6 studies, and the methods were generally poorly reported. As with cluster analysis, feature encoding and scaling are important considerations in feature transformation, but none of the studies gave adequate details in their methods. The results of feature transformation were also poorly reported. Although the key reason for applying feature transformation methods is to reduce the dimensionality of the dataset, only 2 [
Most studies explicitly stated the clustering method that they used but were less explicit regarding the preprocessing steps and choice of dissimilarity measure. Hastie et al [
We expand on this statement, further adding that preprocessing steps such as feature scaling and feature encoding are also more important in obtaining success than the choice of clustering algorithm. This is in line with the conclusions of Prosperi et al, who demonstrated that clustering using different feature sets and encodings in asthma datasets can lead to different cluster solutions [
First, the Euclidean distance was used with mixed-type data in over half of the studies (54%). Although the Euclidean distance is intended for use with continuous data, problems associated with applying it to mixed-type data may be mitigated by carefully considering feature scaling and feature encoding. However, in our review, we found that many studies did not specify their methods for rescaling, and many studies included ordinal and nominal categorical features but did not specify how these would be treated when calculating the Euclidean distances. The lack of consideration of feature scaling and encoding in these cases may have resulted in assigning an unintended weight structure to the cluster features.
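The unintended weighting described above can be made concrete with a small sketch. It compares how much of the squared Euclidean distance between two patients is attributable to a continuous feature (age in years) versus a binary-encoded feature, before and after z-scoring. The data and function names are invented, and z-scoring the binary feature is only one of several possible choices, not a recommendation.

```python
import statistics


def squared_contributions(a, b):
    """Per-feature contributions to the squared Euclidean distance."""
    return [(x - y) ** 2 for x, y in zip(a, b)]


def z_score_columns(rows):
    """Standardize each column to zero mean and unit (sample) variance."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, sds))
        for row in rows
    ]


# Invented patients: (age in years, atopy encoded as 0 or 1).
patients = [(30, 1), (33, 0), (40, 1), (52, 0), (61, 1), (68, 0)]
scaled = z_score_columns(patients)

raw = squared_contributions(patients[0], patients[1])
std = squared_contributions(scaled[0], scaled[1])

# Share of the squared distance between the first 2 patients due to age.
age_share_raw = raw[0] / sum(raw)
age_share_scaled = std[0] / sum(std)
```

For the first 2 patients, age accounts for 90% of the squared distance on the raw scale but only about 1% after z-scoring, so the same pair of patients is judged dissimilar for entirely different reasons depending on an often unreported preprocessing choice.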
Second, 4 studies used Gower coefficient in hierarchical clustering with Ward linkage [
A final point in the use of k-means and hierarchical clustering using the Ward method with mixed cluster features is that the theory underpinning these methods involves the calculation of cluster means. The mean is not an appropriate summary statistic for categorical features, which are more typically summarized by the mode. For this reason, we suggest that k-medoids may be a more appropriate method for mixed-type features used in clustering. Instead of computing each cluster’s mean (as with hierarchical clustering using Ward’s method and k-means), k-medoids computes each cluster’s medoid, defined as the sample in the cluster for which the average dissimilarity to all other samples in the cluster is minimized [
The SPSS TwoStep method was used in 7 of the 63 studies investigated here. We see 2 key limitations with the application of this method across the reviewed studies. First, none of the studies gave adequate consideration to the distributional assumptions made when using the log-likelihood distance, and most did not mention the assumptions at all. Second, this method is designed for clustering several millions of samples with many features within an acceptable time and makes a key compromise in doing so [
Only 1 study [
Selecting the number of clusters can be challenging and depends largely on the context of the application. In the case of the reviewed applications in asthma, the
Our review shows that studies rarely tested the stability and quality of their results, with a particular lack of emphasis on stability. This is concerning, as many studies use methods such as k-means, which can converge to local minima, and apply them to small sample sizes, thus increasing the risk of obtaining unstable results. We argue that because of the unsupervised nature of cluster analysis, testing the stability and quality of the results should be a key theme and would like to urge researchers and peer reviewers in this research field to carefully consider these aspects. However, we do appreciate that assessing the stability and quality of a solution in the absence of
Although this review focused on applications in subtyping asthma, the identified issues have been found in studies using cluster analysis to subtype other diseases. For example, recent studies in autism [
For a recent example of a well-considered and well-reported application of cluster analysis to multimodal clinical data, we refer the reader to Pikoula et al’s study of chronic obstructive pulmonary disease subtypes [
The literature search presented in this study is comprehensive but practically cannot be exhaustive. We restricted the search to articles that included the terms
We did not fully explore multiple kernel k-means [
This review highlights a number of issues in previous applications of cluster analysis to multimodal clinical data in asthma. We make the following key recommendations based on these findings:
Careful consideration should be given to the preprocessing of multimodal clinical data and how the scaling and encoding of features may affect their weighting in the analysis.
The choice of dissimilarity measures and cluster analysis methods are dependent on one another as well as on the scaling and encoding of the data. Certain combinations of these data analytics components may be incompatible and give unreliable results.
The stability and quality of the cluster results should be thoroughly evaluated.
The above-mentioned recommendations focus on the application of cluster analysis, but we put similar emphasis on the clear reporting of each of the above-mentioned points, as this was also found to be lacking in the reviewed papers.
Preferred Reporting Items for Systematic Reviews and Meta-Analyses checklist.
Data dictionary.
Study characteristics.
Breakdown of methods used by the 11 studies that did not use the three most common clustering methods.
Illustrative example of the use of Gower coefficient with hierarchical clustering and Ward linkage.
CCC: cubic cluster criterion
HDR: Health Data Research
MCA: multiple correspondence analysis
PCA: principal component analysis
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
This study was supported by the Health Data Research, United Kingdom (HDR UK), which receives funding from HDR UK Ltd (HDR5012) funded by the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health and Social Care (England), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), the British Heart Foundation, and the Wellcome Trust and by the Asthma UK Centre for Applied Research, which is funded by Asthma UK. The funders had no role in the study or the decision to submit this work to be considered for publication.
EH was responsible for conducting the study. EH conducted the identification of articles and screened them for eligibility. EH and HT independently extracted data according to the described methodology and synthesized the findings. EH wrote up the first draft of the manuscript, and AT, AS, and HT contributed to the final version.
AS is supported by a research grant from the Asthma UK Centre for Applied Research. All other authors have no conflict of interest pertaining to this study to declare.