Identifying Key Predictors of Cognitive Dysfunction in Older People Using Supervised Machine Learning Techniques: Observational Study

doi:10.2196/20995

Original Paper

¹School of Computing, Engineering and Intelligent Systems, Ulster University, Derry~Londonderry, United Kingdom

²School of Biomedical Sciences, Nutrition Innovation Centre for Food and Health, Ulster University, Coleraine, United Kingdom

³School of Geography and Environmental Sciences, Ulster University, Coleraine, United Kingdom

⁴School of Computing, Ulster University, Jordanstown, United Kingdom

⁵School of Health, Wellbeing and Social Care, The Open University, Belfast, United Kingdom

⁶School of Medicine, Trinity College Dublin, Dublin, Ireland

⁷Mercers Institute for Research on Ageing, St James's Hospital, Dublin, Ireland

Corresponding Author:

Debbie Rankin, BSc, PhD

School of Computing, Engineering and Intelligent Systems

Ulster University

Northland Road

Derry~Londonderry, BT48 7JL

United Kingdom

Phone: 44 287167 ext 5841

Email: d.rankin1@ulster.ac.uk

Background: Machine learning techniques, specifically classification algorithms, may be effective to help understand key health, nutritional, and environmental factors associated with cognitive function in aging populations.

Objective: This study aims to use classification techniques to identify the key patient predictors that are considered most important in the classification of poorer cognitive performance, which is an early risk factor for dementia.

Methods: Data were used from the Trinity-Ulster and Department of Agriculture study, which included detailed information on sociodemographic, clinical, biochemical, nutritional, and lifestyle factors in 5186 older adults recruited from the Republic of Ireland and Northern Ireland, a proportion of whom (987/5186, 19.03%) were followed up 5-7 years later for reassessment. Cognitive function at both time points was assessed using a battery of tests, including the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS), with a score <70 classed as poorer cognitive performance. This study trained 3 classifiers (decision trees, Naïve Bayes, and random forests) to classify the RBANS score and to identify key health, nutritional, and environmental predictors of cognitive performance and cognitive decline over the follow-up period. The performance of the classifiers was assessed, and the variables that the optimized classifiers deemed important for classification were noted.

Results: In the classification of a low RBANS score (<70), our models performed well (F₁ score range 0.73-0.93), all highlighting the individual’s score from the Timed Up and Go (TUG) test, the age at which the participant stopped education, and whether or not the participant’s family reported memory concerns to be of key importance. The classification models performed well in classifying a greater rate of decline in the RBANS score (F₁ score range 0.66-0.85), also indicating the TUG score to be of key importance, followed by blood indicators: plasma homocysteine, vitamin B6 biomarker (plasma pyridoxal-5-phosphate), and glycated hemoglobin.

Conclusions: The results suggest that it may be possible for a health care professional to make an initial evaluation, with a high level of confidence, of the potential for cognitive dysfunction using only a few short, noninvasive questions, thus providing a quick, efficient, and noninvasive way to help them decide whether or not a patient requires a full cognitive evaluation. This approach has the potential benefits of making time and cost savings for health service providers and avoiding stress created through unnecessary cognitive assessments in low-risk patients.

JMIR Med Inform 2020;8(9):e20995


Keywords

classification; supervised machine learning; cognition; diet; aging; geriatric assessment

Globally, populations are aging. By 2050, it is estimated that more than 2 billion people will be aged over 60 years [1]. Cognitive function generally declines with age and ranges in severity from mild cognitive impairment (MCI) to dementia. MCI can be defined as cognitive decline greater than that expected for an individual’s age and education level, but it does not interfere with activities of daily living, whereas dementia profoundly impacts normal functioning [2,3]. Dementia currently affects 50 million people worldwide, and it is estimated that this will increase to 152 million by 2050. The annual cost of dementia is estimated at US $1 trillion and is expected to more than double by 2030 [4]. Therefore, strategies that promote better brain health and well-being in older age are an urgent public health priority.

Alzheimer disease is the most common form of dementia, with other forms including vascular dementia, dementia with Lewy bodies, frontotemporal dementia, and mixed dementia. Risk factors for dementia are disease dependent but commonly include age, genetics, medical conditions (including cardiovascular disease and diabetes), diet, lifestyle, and environmental factors [5]. An important recent report highlighted the complexity of dementia and the potential to prevent or delay the onset of the disease through interventions targeted at modifiable risk factors [6]. In particular, nutrition has been identified as a key area of interest, and emerging evidence links lower levels of certain vitamins with cognitive dysfunction in older adults, whereas certain dietary patterns and components appear to have protective roles in maintaining cognitive health [7].

The application of data mining within health care has become increasingly popular, driven particularly by the large amount of complex data available that test the capabilities of traditional statistical approaches [8]. In health care, as in other areas, data mining has provided a means of accessing and analyzing large volumes of data to better inform and drive change. Classification models, in particular, have been utilized extensively in the understanding of MCI. These models can help us to understand patterns in the behavior of data in terms of diagnosing MCI, specifically in the consideration of key features pertaining to a diagnosis of impairment [9,10] or predicting the progression of the impairment [11]. Furthermore, models have been developed to apply a more objective approach to the MCI diagnosis [12], not to undermine but rather to support a clinician’s analysis [13]. Na [14] investigated the use of noninvasive, easy-to-collect variables that are commonly collected in community health care settings, such as sociodemographic, health, functional, and interpersonal variables, for the prediction of cognitive impairment among community-dwelling older adults, using the Korean Longitudinal Study of Aging (KLoSA) data set [15] and a gradient boosting machine classifier.

Many studies apply machine learning approaches to the popular Open Access Series of Imaging Studies [16], Alzheimer Disease Neuroimaging Initiative (ADNI) [17], and Australian Imaging Biomarkers and Lifestyle Flagship Study of Aging (AIBL) [18] data sets consisting of neuroimaging data (eg, magnetic resonance imaging [MRI] and positron emission tomography scan data) from participants ranging from no cognitive impairment to MCI to Alzheimer disease [19]. These data sets also include a range of demographic, biomarker, clinical, and cognitive assessment data. Ding et al [20] used a Bayesian network approach for the classification of Alzheimer disease with heterogeneous features from the AIBL data set and demonstrated that machine learning could be used to select features and their appropriate combinations that are relevant for Alzheimer disease severity classification with high accuracy. Korolev et al [21] used a kernel-based classifier and the ADNI data set to develop a prognostic model for predicting MCI-to-dementia progression over a 3-year period.

The aim of our study is to compare a selection of data analytics techniques to identify determinants of cognitive health in community-dwelling older adults using existing data from the Trinity-Ulster and Department of Agriculture (TUDA) study (ClinicalTrials.gov identifier: NCT02664584). The TUDA study was designed to investigate nutritional, health, and lifestyle factors in the development of diseases related to aging, including dementia. A range of analytical models were developed on the data to determine factors that may predict poorer cognitive performance and cognitive decline over time, assessed using an in-depth neuropsychiatric test.

Cross-Industry Standard Process for Data Mining Methodology

In this study, the widely used cross-industry standard process for data mining (CRISP-DM) research methodology was adopted [22]. CRISP-DM has 6 main steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. In the business understanding phase, the objective of this study was to use classification techniques to identify the key patient predictors considered most important in the classification of cognitive dysfunction, which itself is a predictor of dementia. In the data understanding phase, the data quality was examined to understand data collection methods and the features contained within the TUDA data set, as described in the next section (The Data). In the data preparation phase, the TUDA data set was preprocessed to cleanse the data set and select features relevant to the modeling phase. Feature selection methods and the results of feature selection are described in the subsequent sections (The Data and Feature Selection sections in Methods and the Feature Selection section in Results). In the modeling phase, a number of machine learning modeling techniques were selected and applied to the prepared data, and their parameters were calibrated to optimal values to increase the knowledge extracted from the data (described in the Machine Learning Techniques section in Methods and the RBANS Classification and Classifying Cognitive Decline Using the Rate of Change in the RBANS Score sections in Results). Upon building the models that produced the highest quality knowledge from the data analysis perspective, the models were thoroughly evaluated to ensure robustness and achievement of the business objectives. The knowledge gained from the models was then presented to clinical experts in a way that could be used and understood.

The Data

The TUDA cohort provides detailed nutrition and health data, along with related lifestyle, clinical, and biochemical details, on a total of 5186 community-dwelling older adults aged 60 to 102 years, making this cohort one of the most comprehensively characterized cohorts of its kind for aging research internationally. With an overall goal to address the prevention of age-related diseases, the TUDA study is aimed at investigating nutrition and related factors in the development of common diseases of aging. TUDA study participants were recruited between 2008 and 2012 from hospital outpatient or general practice clinics in the Republic of Ireland or Northern Ireland via standardized protocols for participant sampling, assessment, and data recording and with a centralized laboratory analysis. In brief, the inclusion criteria for the TUDA study were being born on the island of Ireland, aged >60 years, and not having an existing diagnosis of dementia. Nonfasting blood samples were collected from all participants, and a wide range of parameters including routine biochemistry and hematological profiles, along with biomarkers of micronutrient status, were measured. A comprehensive health and lifestyle questionnaire was administered as part of the 90-min interview to capture medical and demographic details, along with comprehensive information on medication and vitamin supplement usage. Physiological function tests, blood pressure, bone health (dual-energy x-ray absorptiometry scans), and cognitive function tests were also performed. A subset of approximately 19.03% (987/5186) of participants were reassessed 5 to 7 years after their initial assessment to investigate the progression of risk factors and disease over time.

A summary of the characteristics of the subset of the TUDA cohort (n=2869) analyzed in this study is shown in Table 1. The preprocessing and feature selection performed on the original data set to reach this subset are described in the Feature Selection subsections of the Methods and Results sections.

Cognitive function was assessed at both time points using 3 assessment tools, the Mini-Mental State Examination (MMSE), the Frontal Assessment Battery (FAB), and RBANS, and the rate of cognitive decline was calculated over the 5- to 7-year follow-up period. For the purposes of this study, the cognitive function outcome indicator is categorized based on RBANS. RBANS is an age-adjusted and sensitive neuropsychiatric battery for assessing global cognitive function [23]. This tool has also been validated to assess specific cognitive domains within the brain, including immediate and delayed memory, visual-spatial, language, and attention, which are combined to provide a total score, with lower scores generally indicative of poorer cognitive performance.

The rate of RBANS change over the 5- to 7-year period between the initial assessment and the follow-up assessment was computed as the difference between a participant’s RBANS scores at the 2 sampling points, normalized to account for the time between the assessments, which can differ by up to 2 years across participants (Figure 1).
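As an illustration of this normalization, the following minimal Python sketch assumes the rate is the follow-up score minus the baseline score, divided by the number of years between the two assessments. The exact formula in Figure 1 is not reproduced in this text, so the form used here, the function name, and the example scores are assumptions for illustration only.

```python
def rbans_rate_of_change(baseline_score, followup_score, years_between):
    """Annualized change in RBANS score between two assessments.

    Assumed form: (follow-up - baseline) / years between assessments.
    A negative value indicates a decline in score.
    """
    if years_between <= 0:
        raise ValueError("years_between must be positive")
    return (followup_score - baseline_score) / years_between


# Two hypothetical participants with the same 10-point drop but different
# follow-up intervals end up with different annualized rates.
rate_5y = rbans_rate_of_change(90, 80, 5.0)
rate_7y = rbans_rate_of_change(90, 80, 7.0)
```

Dividing by the interval keeps a participant reassessed after 7 years from appearing to decline faster than one reassessed after 5 years simply because more time has passed.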

Table 1. General characteristics of the Trinity-Ulster and Department of Agriculture study participants.

Characteristics		Males (n=1191)		Females (n=1678)
Age (years), mean (SD)		72.1 (7.8)		72.2 (7.8)
Education (years)^a, mean (SD)		16.3 (3.3)		16.1 (2.8)
Health and lifestyle
	BMI (kg/m²), mean (SD)		28.9 (4.3)		28.7 (5.7)
	Waist-to-hip ratio, mean (SD)		0.97 (0.07)		0.88 (0.07)
	Instrumental activities of daily living, mean (SD)		25.0 (4.1)		24.9 (3.5)
	Physical self-maintenance scale score, mean (SD)		23.3 (1.6)		23.1 (1.7)
	Timed Up and Go (seconds), mean (SD)		12.9 (9.1)		13.0 (8.0)
	Living alone, n (%)		260 (21.8)		632 (37.7)
	Current smoker, n (%)		122 (10.2)		194 (11.6)
	Alcohol (units/week), mean (SD)		8.8 (14.6)		2.9 (6.7)
	Socioeconomically most deprived, n (%)		291 (24.4)		426 (25.4)
Neuropsychiatric assessment
	MMSE^b score, mean (SD)		27.8 (1.4)		27.9 (1.4)
	RBANS^c score, mean (SD)		87.3 (14.5)		88.9 (15.2)
	RBANS class=“low” (target), n (%)^d		133 (11.2)		168 (10.0)
	RBANS class=“high” (target), n (%)^d		1058 (88.8)		1510 (90.0)
	FAB^e score, mean (SD)		15.7 (2.2)		15.9 (2.1)
	Depression CES-D^f score, mean (SD)		4.8 (6.2)		6.1 (7.7)
	Anxiety (HADS^g score), mean (SD)		2.6 (3.2)		3.5 (3.8)
Clinical measures
	White cell count (10⁹/L), mean (SD)		7.1 (3.6)		6.9 (3.3)
	Hemoglobin (g/dL), mean (SD)		14.2 (1.5)		13.0 (1.3)
	Mean corpuscular volume (fL^h), mean (SD)		90.7 (5.5)		90.6 (5.1)
	Platelet count (10⁹/L), mean (SD)		229 (59.0)		265 (66.9)
	Urea (mmol/L), mean (SD)		7.2 (2.9)		6.7 (2.3)
	Creatinine (μmol/L), mean (SD)		98 (31.0)		79 (22.4)
	Albumin (g/L), mean (SD)		42 (3.7)		42 (3.4)
	Gamma GT (U/L), mean (SD)		43 (47.5)		34 (36.0)
	Sodium (mmol/L), mean (SD)		140 (5.1)		139 (3.2)
	Potassium (mmol/L), mean (SD)		4.3 (0.5)		4.2 (0.4)
	Calcium (mmol/L), mean (SD)		2.3 (0.1)		2.3 (0.1)
	Phosphate (mmol/L), mean (SD)		1.0 (0.2)		1.1 (0.2)
	Alkaline phosphatase (U/L), mean (SD)		82 (34.2)		82 (25.7)
	Low-density lipoprotein (mmol/L), mean (SD)		2.23 (0.8)		2.58 (0.9)
	High-density lipoprotein (mmol/L), mean (SD)		1.23 (0.4)		1.55 (0.4)
	Triglycerides (mmol/L), mean (SD)		1.78 (1.0)		1.62 (1.0)
	C-reactive protein (mg/L), mean (SD)		6.1 (11.1)		5.5 (11.9)
	Glycated hemoglobin (%), mean (SD)		6.0 (1.0)		5.9 (0.7)
	Parathyroid hormone (pg/mL), mean (SD)		45.2 (30.8)		47.2 (31.9)
	Glomerular filtration rate (mL/min), mean (SD)		77.2 (25.3)		67.8 (22.6)
Nutritional biomarkers
	Red blood cell folate (nmol/L), mean (SD)		1053 (591.1)		1100 (582.7)
	Serum vitamin B12 (pmol/L), mean (SD)		267 (191.0)		296 (277.3)
	Plasma vitamin B6 (nmol/L), mean (SD)		74.1 (53.2)		81.5 (69.7)
	Riboflavin (EGRacⁱ), mean (SD)		1.35 (0.2)		1.34 (0.2)
	Total plasma homocysteine (μmol/L), mean (SD)		15.1 (5.9)		14.1 (5.1)
	Total vitamin D (nmol/L), mean (SD)		51.6 (25.9)		56.0 (30.1)

^aEducation refers to the age of stopping formal education.

^bMMSE: Mini-Mental State Examination.

^cRBANS: Repeatable Battery for the Assessment of Neuropsychological Status.

^dRBANS score <70 is assigned class low and an RBANS score ≥70 is assigned class high.

^eFAB: Frontal Assessment Battery.

^fCES-D: Centre for Epidemiological Studies Depression.

^gHADS: Hospital Anxiety and Depression Scale.

^hfL: femtoliter.

ⁱEGRac: erythrocyte glutathione reductase activation coefficient, with a higher EGRac value indicating poorer riboflavin status.

Figure 1. Calculating Repeatable Battery for the Assessment of Neuropsychological Status rate of change over a 5- to 7-year period between initial assessment and follow-up assessment, normalized to account for the time between each assessment.

The data set initially contained 525 variables. During preprocessing, the data were cleansed to detect and correct inaccurate values, identify missing values and ensure consistent coding of these, ensure consistent coding of categorical variables, identify spelling and coding inconsistencies and correct these, transform text variables into categorical variables where possible, ensure numeric values fell within an appropriate and accurate range, check for consistency among dependent variables and correct any errors, and finally check for duplicate data and remove any redundancy. Normalization was carried out on the data table, including nonloss decomposition to decompose the large data table into smaller tables, transforming composite attributes into separate attributes, transforming multivalued attributes, repeating columns into separate tables, and recoding text attributes to categorical attributes where possible. This process reduced the number of variables to 345 within the data set. These variables were a combination of text, categorical, and numerical variables.

Feature Selection

Dimension reduction is an important stage for understanding information in a data set. Typical dimension reduction techniques, such as principal component analysis (PCA) [24], describe all the numerical variables contained within a data set in terms of a number of linear combinations (fewer than the original number of features) of these features. Although it is a widely used and appreciated method for reducing the number of dimensions within a data set, PCA is only valid for numerical features. In addition, a more transparent feature selection method is often required to remove redundant features of various types to reduce the size of the data set without losing potentially valuable information. A range of feature selection techniques exist; however, because of the nature of the features in the TUDA data set and the prior knowledge that a large number of variables were likely to be highly correlated, a correlation analysis and clustering were used in this study to allow highly correlated features to be determined and redundant features to be removed. These methods also helped us to discuss, evaluate, and agree on the features to be retained in collaboration with the data gatekeepers and expert clinicians who had in-depth knowledge of the data. Further feature selection was not carried out as we elected to retain as many features as possible for use in training the classifiers. This section describes the feature selection techniques performed, and the results of feature selection are described in the Results section.

Manual Feature Selection

Manual feature selection was performed to remove features containing large amounts of missing data and, therefore, considered not useful for the analysis. Free-text variables that could not be encoded were also removed. On the basis of expert clinical knowledge, features deemed irrelevant to the study were removed, as well as a number of subjective features where a comparable, objective laboratory-obtained feature existed in the data set.

Correlation and Association

A correlation analysis is necessary before the development of classification models for 2 primary reasons: “Algorithms might ‘overfit’ predictions to spurious correlations in the data; multicollinear, correlated predictors could produce unstable estimates” [25] and “Perfectly correlated variables are truly redundant in the sense that no additional information is gained by adding them” [26]. In other words, as many machine learning algorithms rely on linearly independent variables, strongly correlated variables must be evaluated and removed to avoid unreliable results. Moreover, 2 variables that follow the same behavior add little to the information gained from the data set and thus are considered redundant. The correlation analysis allows the determination of highly correlated variables, which may undermine the results of the subsequent data analysis. Owing to the different types of variables within the data set, correlation coefficients were calculated for numerical-numerical pairs, whereas measures of the strength of association were necessary for categorical-categorical and categorical-numerical pairs. Correlations between numerical variables were calculated using the Spearman nonparametric correlation coefficient [27], the strength of association between categorical variables was calculated using the Cramér V statistic [28], and the coefficient of determination (R²) was calculated between categorical and numerical variables [29].
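The three measures named above can be sketched in plain Python as follows. This is an illustrative sketch only, not the study's code: the function names are hypothetical, the categorical-numerical R² is assumed to be the share of variance explained by group membership (as in a one-way model), and no correction for small samples or sparse tables is applied.

```python
import math
from collections import Counter


def _average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Numerical-numerical: Pearson correlation computed on the ranks."""
    return _pearson(_average_ranks(x), _average_ranks(y))


def cramers_v(a, b):
    """Categorical-categorical: Cramér V from two lists of category labels.

    Assumes each variable takes at least 2 distinct values.
    """
    n = len(a)
    joint, count_a, count_b = Counter(zip(a, b)), Counter(a), Counter(b)
    chi2 = 0.0
    for u in count_a:
        for v in count_b:
            expected = count_a[u] * count_b[v] / n
            chi2 += (joint[(u, v)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(count_a), len(count_b)) - 1)))


def r_squared(categories, values):
    """Categorical-numerical: share of variance explained by the groups."""
    grand = sum(values) / len(values)
    groups = {}
    for c, v in zip(categories, values):
        groups.setdefault(c, []).append(v)
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    total = sum((v - grand) ** 2 for v in values)
    return between / total


rho = spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])          # monotonic pair
v = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])   # identical coding
r2 = r_squared(["a", "a", "b", "b"], [1.0, 1.0, 3.0, 3.0])  # groups explain all
```

All three toy pairs are perfectly associated, so each measure sits at its upper bound of 1; in practice, a threshold agreed with domain experts would decide which pairs are treated as redundant.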

Clustering

Clustering is useful in feature selection [26] to analyze the data to find structural patterns. Clustering can be used together with correlation analysis to identify those variables that behave in a similar manner; thus, the information offered by the variables may prove redundant. Clustering of variables can take 1 of 2 forms: hierarchical, which outputs an informative hierarchy, and nonhierarchical, which divides the data into clusters, within which the variables may behave similarly. Owing to the nature of the information this study seeks to derive, the focus was placed on hierarchical clustering, illustrated specifically in the form of tree structures or dendrograms.

Ascendant hierarchical clustering can use a mixture of both numerical and categorical variables to arrange variables into homogeneous clusters, that is, clusters of variables that are strongly related to each other [30]. The algorithm for finding these related clusters follows the concepts of PCA and multiple correspondence analysis (MCA). In PCA and MCA, the data set is analyzed to find new linearly independent variables to describe the same set of data. In this hierarchical clustering, these new synthetic variables are used as the center points of the clusters, and each original variable is then grouped according to its similarity to the cluster center, measured using either the squared correlation, for numeric variables, or the correlation ratio, for categorical variables.

Machine Learning Techniques

Machine learning techniques are regularly employed for detecting patterns and dependencies within data, such as within health care data. Specifically, machine learning algorithms can be used to look for combinations of variables and generate rules within data that can be used to reliably predict outcomes [25]. This style of problem relies on classification algorithms, where predictor variables are used to predict an outcome or a class variable. These predictions are based on a training sample of the data, usually consisting of a random sample of about 70% to 80% of the available data. The developed model comprises rules based on these training data and is then tested against the remaining data (Figure 2). The training procedure is repeated on a number of different subsets of the data to reduce the likelihood of overfitting the model. In this study, 10-fold cross-validation was used to measure the performance of classifiers. Initially, the data were split into a training set (75%) and an evaluation set (25%). The models were trained using the training set with 10-fold cross-validation applied (with a 90%/10% train/test split at each fold). The modeling techniques of decision trees, random forests, and Naïve Bayes were selected for their ease of interpretability, as it is crucial that the results of modeling in this study can be explained to clinical experts. The individual algorithms were developed using the R caret package, specifically using the train and predict functions. The evaluation data set was used to evaluate the performance of the model found to be optimal during training for each of the 3 respective techniques considered.
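The data partitioning described above can be sketched as follows. The study itself used the R caret package; this Python sketch only illustrates the 75%/25% split followed by 10-fold cross-validation on the training portion. The function names, the fixed seed, and the use of a plain random split without stratification by class are assumptions made for the example.

```python
import random


def train_eval_split(n_rows, train_fraction=0.75, seed=0):
    """Shuffle row indices and split them into training and evaluation sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_rows * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_indices(train_indices, k=10):
    """Yield (fit, validate) index lists; each fold validates on about 1/k of rows."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validate = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validate


# 2869 rows, matching the size of the analyzed subset of the cohort.
train_idx, eval_idx = train_eval_split(2869)
folds = list(k_fold_indices(train_idx, k=10))
```

The evaluation indices are never touched during cross-validation, so the final performance estimate comes from rows that played no part in choosing the tuning parameters.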

Figure 2. Model development and testing protocol.

Decision Tree

Decision trees are one of the most common machine learning algorithms when using a combination of continuous and categorical variables, chosen for their computational efficiency and readability. The Classification and Regression Tree (CART) [31] algorithm, in particular, lends itself well to explanatory knowledge discovery [32] due to its transparency. CART decision trees are developed using a top-down recursive algorithm, where the data set is split into increasingly smaller subsets according to some predetermined metric, most commonly using either the Gini impurity index or a permutation importance measure. The measures used are described below. The rpart implementation of the CART decision tree algorithm in the R caret package was used in this study. This implementation automatically applies pruning, choosing a range of complexity parameters and automatically selecting the optimal model using the complexity parameter that provides the highest accuracy.

The resulting decision tree easily translates itself to a series of rules that can be used to classify the test data. The advantages of using a decision tree classifier lie in its ease of application, particularly as both numerical and categorical input variables require little to no preprocessing; its transparency for interpretation, as the resulting tree can be explained using Boolean logic; and its computational efficiency, particularly with large data sets. In addition, decision tree classification does not require domain knowledge or parameter setting [32]. However, traditional decision trees are also the least robust of the machine learning classification methods, as they are prone to overfitting and therefore rely substantially on the training data. Often, a small change in the training data can result in large changes in the developed tree. These shortcomings can be addressed using the random forest algorithm.

Random Forest

The random forest algorithm [33] works in a similar manner to decision trees, but where the CART algorithm results in a single tree, the random forest algorithm results in a forest of trees. Each of the maximal trees within the random forest will have been developed using a random subset of the predictor variables [34]. Each split within the tree is then calculated according to a given performance metric from only within this subset of variables. Typically, many trees are considered, thus reducing the prediction error, as the model prediction will reflect the average prediction across all trees. As a result, the random forest algorithm is considered robust, flexible, and highly suited to large data sets [35]. The random forest algorithm in the R caret package was used in this study. This implementation evaluates a range of values for the mtry parameter, where mtry is the number of variables available for splitting at each tree node, which has a strong influence on predictor variable importance estimates [36]. The mtry value providing the highest accuracy was used to select the optimal model.

Naïve Bayes

The Naïve Bayes algorithm for classification is based on Bayes’ theorem, which describes the most likely outcome (Y) based on k number of observations (X={x₁,x₂,…,x_k}). This can be written as P(Y|X) and, as the algorithm is naïve and all variables are considered independent, is calculated using the equation in Figure 3.
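For reference, the standard form of Bayes’ theorem under the independence assumption can be written as below. This is a reconstruction from the definitions in the preceding paragraph (outcome Y, observations x₁ to x_k), offered as an assumption about what the figure shows rather than a copy of it.

```latex
P(Y \mid X) = \frac{P(Y)\,\prod_{i=1}^{k} P(x_i \mid Y)}{P(X)},
\qquad X = \{x_1, x_2, \ldots, x_k\}
```

Because P(X) is the same for every candidate class, a classifier only needs to compare the numerators across classes and pick the largest.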

The probability of an outcome, P(Y); the probability of an observation being described by X, P(X); and the probability of an observation being described by X, given that it is classed as Y, P(X|Y), can all be estimated using the given data set. For its use as a classifier, an observation is assigned to the most likely class given the values of the variables it contains. A benefit of the Naïve Bayes classifier is its theoretically low error rate; however, as this relies on the assumed independence of the variables, it may not hold in practice. The Naïve Bayes algorithm in the R caret package was used in this study.

Importance and Accuracy Measures

Gini Impurity Index

The Gini impurity index describes the likelihood of an incorrect classification using a random variable (var) and is described mathematically as shown in Figure 4.

Here p_i is the probability of a correct classification according to m classes. By considering the variables resulting in a minimal Gini impurity index, this metric will therefore determine the best (most pure) variables to use to split the training data until a convergence criterion is met.
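A short worked example may help make the index concrete. The sketch below assumes the commonly used form of the Gini impurity, 1 minus the sum of the squared class proportions, reading p_i as the proportion of observations in a node that belong to class i; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2), with p_i the share of class i in `labels`."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left_labels, right_labels):
    """Size-weighted Gini impurity of the two child nodes of a candidate split."""
    n = len(left_labels) + len(right_labels)
    left_weight = len(left_labels) / n
    right_weight = len(right_labels) / n
    return left_weight * gini_impurity(left_labels) + right_weight * gini_impurity(right_labels)


mixed = gini_impurity(["low", "low", "high", "high"])            # evenly mixed node
pure = gini_impurity(["high", "high", "high"])                   # single-class node
after_split = split_impurity(["low", "low"], ["high", "high"])   # perfect split
```

An evenly mixed two-class node has impurity 0.5 and a single-class node has impurity 0, so a tree-growing procedure that minimizes the weighted impurity would prefer the split that separates the two classes completely.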

Permutation Importance

Permutation variable importance [33] measures the effect that a variable has on the overall prediction performance. This is estimated using the out-of-bag prediction error: the values of the variable are randomly permuted in the out-of-bag samples of each tree, and the resulting increase in the prediction error, averaged over all trees, is taken as the importance of that variable [35].
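The permutation idea can be illustrated with a minimal Python sketch. For simplicity, it assumes a single fixed classifier scored on one labeled data set rather than the out-of-bag samples of each tree in a forest; the toy data, the threshold rule standing in for a fitted model, and all names are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, column, repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one column of `rows`."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / repeats


# Toy data: column 0 (a hypothetical timed test, in seconds) decides the label;
# column 1 is pure noise that the model never looks at.
data_rng = random.Random(1)
rows = [[data_rng.uniform(5, 30), data_rng.random()] for _ in range(200)]
labels = ["low" if r[0] > 15 else "high" for r in rows]


def model(row):
    """Hypothetical fitted classifier: a single threshold rule on column 0."""
    return "low" if row[0] > 15 else "high"


importance_signal = permutation_importance(model, rows, labels, column=0)
importance_noise = permutation_importance(model, rows, labels, column=1)
```

Shuffling the noise column leaves every prediction unchanged, so its importance is zero, whereas shuffling the informative column breaks its link to the label and accuracy falls.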

Performance Evaluation

To compare the performance of each classification model, a variety of evaluation metrics were used. The accuracy, precision, recall, and F₁ scores were computed. Precision, recall, and F₁ scores take account of false positives and false negatives separately, whereas accuracy reflects only the overall proportion of correct classifications (true positives and true negatives) [37].
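These metrics follow directly from the counts of true and false positives and negatives, as in the following Python sketch. It uses the standard definitions; treating the “low” RBANS class as the positive class, the function name, and the toy labels are assumptions for illustration.

```python
def classification_metrics(actual, predicted, positive="low"):
    """Accuracy, precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example with an imbalanced target: 3 "low" and 7 "high" participants.
actual = ["low", "low", "low"] + ["high"] * 7
predicted = ["low", "low", "high", "low"] + ["high"] * 6
metrics = classification_metrics(actual, predicted)
```

In this toy case, the accuracy is 0.8 while precision, recall, and F₁ for the minority class are each 2/3, which shows why accuracy alone can look favorable when one class dominates.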