This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.

The accumulation of data and its accessibility through easier-to-use platforms will allow data scientists and practitioners who are less sophisticated data analysts to get answers by using big data for many purposes in multiple ways. Data scientists working with medical data are aware of the importance of preprocessing, yet in many cases, the potential benefits of using nonlinear transformations are overlooked.

Our aim is to present a semi-automated approach of symmetry-aiming transformations tailored for medical data analysis and its advantages.

We describe 10 commonly encountered data types used in the medical field and the relevant transformations for each data type. Data from the Alzheimer’s Disease Neuroimaging Initiative study, Parkinson’s disease hospital cohort, and disease-simulating data were used to demonstrate the approach and its benefits.

Symmetry-targeted monotone transformations were applied, and the advantages gained in variance, stability, linearity, and clustering are demonstrated. An open source application implementing the described methods was developed. Both the linearity of relationships and the stability of variability improved after applying a proper nonlinear transformation. Clustering simulated nonsymmetric data gave low agreement with the generating clusters (Rand value=0.681), whereas clustering after applying a nonlinear transformation to symmetry captured the original structure (Rand value=0.986).

This work presents the use of nonlinear transformations for medical data and the importance of their semi-automated choice. Using the described approach, the data analyst increases the ability to create simpler, more robust and translational models, thereby facilitating the interpretation and implementation of the analysis by medical practitioners. Applying nonlinear transformations as part of the preprocessing is essential to the quality and interpretability of results.

The volume of data collected these days is constantly growing and is expected to reach 44 zettabytes by 2020 [

With the ever-increasing numbers of variables and subjects, there is a notion in the general public that simply the amount of data will reveal all there is to understand from it. This is not necessarily so. Data analysis can be greatly simplified when (1) the distribution of a variable (feature) is symmetric across subjects, (2) variability is stable across different conditions, (3) relationships are linear between variables of interest, and (4) models are additive rather than having a more complex structure. The need for these desirable properties can be circumvented by very complex modeling, which is feasible in data-rich research, as discussed below. However, such complex modeling leads to results that are harder to interpret and less appropriate for generalizing, and the ability to extrapolate beyond the given data or to set thresholds with diagnostic values is reduced. In addition, complex modeling requires a high level of expertise in data analysis—a level most physicians and health specialists cannot devote enough time to achieve, even if they are interested in acquiring quantitative answers to their questions.

In practice, data as extracted and measured are not usually characterized by the above four desirable properties. Moreover, widely used transformations such as normalization or the Z score are essentially linear and therefore cannot change the symmetry or linearity. It is well known that income information is better analyzed after applying a logarithmic transformation, and the same is true for concentrations of substances in body fluids or the use of log odds for risk modeling. It is similarly recognized that gene expression data should be log-transformed and its variability stabilized before it can be compared across subjects and conditions. This remains true no matter what role the variable plays in the modeling effort: whether it is the variable of interest to be analyzed or is used for prediction and explanation of another quantity of interest. The principles that the data need not remain in their original scale of measurement and that monotone transformations of variables can be used to facilitate interpretation and generalization are well accepted in specific research situations. It is a cornerstone of the Exploratory Data Analysis approach to analyzing small and medium datasets [

The above principle is undervalued and rarely used when addressing bigger and more complex datasets. A search in the Web of Science database for the terms “classification OR prediction OR clustering” published in one of the three leading medical informatics journals (

We demonstrate that transforming a variable so that its distribution across individuals is approximately symmetric goes a long way towards achieving the other goals for which transformations are useful: stabilizing the variance across conditions, assuring linearity in the relationship between variables, and the additivity of the response of interest [

Sometimes it is not feasible to transform to symmetry because the variable has a more complex structure, such as multimodality (“bumps” or “humps”). Even then, transforming in a way that causes the main body of data to be in a symmetric hump is advantageous.

Restricting the search for an appropriate transformation at the preprocessing stage to symmetry alone becomes essential when confronted with big data. Detailed inspection of the relationship between every pair of variables, or of the residuals from every model under consideration, for lack of linearity or homogeneity of variability is impossible. Nor is it known ahead of time what the problem of interest to a future user will be. In contrast, when searching for symmetry-inducing transformations at the preprocessing stage, the number of manual inspections is at most linear in the number of variables and thus becomes feasible.

We therefore propose a semi-automated practice for symmetry-targeting preprocessing that enables the data scientist to select an appropriate transformation for each variable in the dataset, thereby allowing the analysis of thousands of variables in an efficient and reproducible manner.

It is important to acknowledge that for every specific analysis, more complex procedures may be used to overcome the limitations discussed above, including nonlinear methods based on splines or loess, robust methods that ignore outliers, and hierarchical methods that capture interactions [

The monotone transformations we use belong to the power transformations family, where a variable y is re-expressed as y^{p}. We recommend Tukey’s ladder of re-expression, where squares, square root, their reciprocals, and their likes are used [^{p}, where ^{–1} it is merely the speed that is recorded. The longer the time, the lower the speed.
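As an illustration of the power family, the following minimal sketch applies one rung of a ladder of powers to strictly positive values. The function name `ladder_transform`, the particular set of rungs, the use of the logarithm for p = 0, and the sign flip for negative powers (which keeps the re-expression order-preserving) are conventions assumed here, not details taken from the implementation described in this paper.

```python
import math

# Illustrative set of rungs; the exact ladder is an assumption of this sketch.
LADDER = (-2, -1, -0.5, 0, 0.5, 1, 2)


def ladder_transform(values, p):
    """Re-express strictly positive values with power p (p == 0 means log)."""
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive values")
    if p == 0:
        return [math.log(v) for v in values]
    if p > 0:
        return [v ** p for v in values]
    # A negative power reverses the order of the data, so the result is
    # negated to keep the transformation monotone increasing (assumed convention).
    return [-(v ** p) for v in values]


times = [2.0, 4.0, 8.0]
negated_speeds = ladder_transform(times, -1)  # reciprocal rung, sign-flipped
```

With p = -1 the magnitudes are the reciprocals of the recorded times, which matches the reading of a reciprocal time as a speed.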

The above transformations are further combined with other transformations that are tailored to the type of variable encountered. Tukey’s taxonomy of measured variables [

Amounts are measurements taking any positive value

Counts are counts of units and therefore can take only integer values 0,1,2,… (eg, number of adverse events, days under antibiotic treatment). Counts tend to be right-skewed and have variance increasing with their mean, stemming from the Poisson nature of their distribution. The square root is a variance stabilizing transformation for Poisson counts and often achieves symmetry.
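The variance-stabilizing property of the square root can be seen from a first-order (delta method) approximation. The sketch below assumes a count X that follows a Poisson distribution with mean λ that is not too small; the symbols X and λ are introduced here for illustration only.

```latex
\operatorname{Var}(X) = \lambda, \qquad
g(x) = \sqrt{x}, \qquad
g'(\lambda) = \frac{1}{2\sqrt{\lambda}}
\quad\Longrightarrow\quad
\operatorname{Var}\!\left(\sqrt{X}\right) \approx \left[g'(\lambda)\right]^{2}\operatorname{Var}(X)
= \frac{1}{4\lambda}\,\lambda = \frac{1}{4}
```

Under this approximation the variance of the square-rooted count no longer depends on the mean.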

Ratio is one amount ^{p}‒^{p}, if the two elements are available.

Fraction is a ratio of one amount

Counted fraction is the fraction of counts, counting how many out of how many have some property (n out of m). It is therefore bounded between 0 and 1 (eg, the number of patients admitted on a day due to a specific symptom out of the total number of admitted patients that day). Here too, the most useful transformation is the logit transformation: log((n+⅓)/(m–n+⅓)). We add a constant c=⅓ to avoid dividing by zero [
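The logit with the added constant can be sketched as follows. The function name `started_logit` is hypothetical; the constant c = 1/3 is the one given in the text. The point of the example is that the value stays finite even when n = 0 or n = m.

```python
import math


def started_logit(n, m, c=1 / 3):
    """Return log((n + c) / (m - n + c)) for n cases out of m."""
    if not 0 <= n <= m:
        raise ValueError("need 0 <= n <= m")
    return math.log((n + c) / (m - n + c))


# Hypothetical daily admission counts: (cases with the symptom, total admitted).
days = [(0, 10), (5, 10), (10, 10)]
scores = [started_logit(n, m) for n, m in days]
```

The two extreme days receive finite scores of equal magnitude and opposite sign, and the balanced day (5 out of 10) is mapped to 0.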

Amounts that are inherently bounded a≤y≤b, are essentially fractions, given by 0≤r=(y‒a)/(b‒a)≤1 (eg, the length of time a child sleeps per day). We first write them as fractions and then transform them as discussed above.

Counts that are inherently bounded, l≤n≤m, are essentially counted fractions, given by 0≤r=(n‒l)/(m‒l)≤1 (eg, the number of correctly answered questions in a questionnaire), and as such, they should be handled in the same manner explained for counted fractions, log((n‒l+⅓)/(m‒n+⅓)).
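The two bounded cases can be sketched together. For bounded amounts, the sketch assumes that the fraction r is passed through the plain logit, log(r/(1‒r)), which is undefined at the bounds themselves; for bounded counts it uses the formula with the constant ⅓ given above. Both function names are hypothetical.

```python
import math


def bounded_amount_logit(y, a, b):
    """Rescale a < y < b to r = (y - a) / (b - a) and return log(r / (1 - r))."""
    if not a < y < b:
        raise ValueError("the plain logit is undefined at or outside the bounds")
    r = (y - a) / (b - a)
    return math.log(r / (1 - r))


def bounded_count_logit(n, l, m, c=1 / 3):
    """Return log((n - l + c) / (m - n + c)) for an integer count l <= n <= m."""
    if not l <= n <= m:
        raise ValueError("need l <= n <= m")
    return math.log((n - l + c) / (m - n + c))


sleep_score = bounded_amount_logit(12, 0, 24)  # 12 hours of sleep out of 24
quiz_score = bounded_count_logit(10, 10, 20)   # lowest possible score, still finite
```

A value midway between the bounds maps to 0 in both cases, and a bounded count at its lower or upper limit remains finite because of the added constant.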

Differences in amounts, counts, or fractions may take positive or negative values (eg, the difference between the number of words forgotten in an immediate recall assignment and number of words forgotten in a delayed recall assignment). Differences are best handled by transforming the subtracted variables separately, be they amounts or counts. If the two variables that are differenced are not available, the difference variable should rarely be transformed. The exception is a difference that can, in practice, only be positive, in which case it should be handled as an amount.

Ranks are bounded counts of how many observations are below or equal to the observation out of the total number ranked (eg, the rank of a specific symptom-describing word out of all descriptive words used). They are treated exactly as bounded counts.

Ordinal variables are variables whose values are named categories that can be naturally ordered (eg, everyday cognition assessment taking the values 1-4). Each category can be assigned a numeric value (similar to ranks with ties) depending on the proportion of cases in the category and below it, compared to the background of reference distribution. An explicit formula for a logistic reference distribution is given in

The medical information dataset should, therefore, be accompanied by a metafile specifying, for each variable, its type (according to the above 10-item list) and its context-related lower and upper bounds (when these exist). Finally, an indicator of whether the variable should be reversed to ease interpretation should be added, so that variables carrying similar meaning, say “healthy,” are presented in the same direction.
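One possible shape for such a metafile is sketched below. The record layout, the field names, the type labels, and the three example variables (with their bounds) are all hypothetical; the paper does not prescribe a file format.

```python
from dataclasses import dataclass
from typing import Optional

# Labels for the 10 variable types listed above (spelling is an assumption).
VARIABLE_TYPES = {
    "amount", "count", "ratio", "fraction", "counted fraction",
    "bounded amount", "bounded count", "difference", "rank", "ordinal",
}


@dataclass(frozen=True)
class VariableMeta:
    name: str
    var_type: str
    lower: Optional[float] = None   # context-related lower bound, if any
    upper: Optional[float] = None   # context-related upper bound, if any
    reverse: bool = False           # flip direction so similar meanings align

    def __post_init__(self):
        if self.var_type not in VARIABLE_TYPES:
            raise ValueError(f"unknown variable type: {self.var_type}")
        if (self.lower is not None and self.upper is not None
                and self.lower >= self.upper):
            raise ValueError("lower bound must be below upper bound")


metafile = [
    VariableMeta("sleep_hours", "bounded amount", lower=0, upper=24),
    VariableMeta("adverse_events", "count"),
    VariableMeta("quiz_correct", "bounded count", lower=0, upper=30),
]
```

Validating the type and bounds when the metafile is read catches entry errors before any transformation is chosen.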

With the above information at hand, we recommend that the choice of transformations be performed in a semi-automated manner. The automated part can be guided by the measure of skewness that directed the analyst towards a few alternative transformations around the power

$$\mathrm{sk} = \frac{0.5\,(q_{3}+q_{1}) - q_{2}}{0.5\,(q_{3}-q_{1})}$$

where $q_{1}$, $q_{2}$, and $q_{3}$ are the lower quartile (ie, the 25th percent quantile), the median, and the upper quartile (ie, the 75th percent quantile), respectively. Half the interquartile range, $0.5\,(q_{3}-q_{1})$, which is a measure of the spread, is equal to the median absolute deviation for exactly symmetrical distributions.
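The automated part of the search can be sketched as follows: compute the quartile-based skewness index for each candidate power and order the candidates by its absolute value. This is a minimal sketch under stated assumptions: the function names, the candidate ladder, the "inclusive" quantile definition of Python's `statistics.quantiles`, and the sign flip for negative powers are illustrative choices, not the paper's R implementation. The ordering is only a suggestion to the analyst, who makes the final choice.

```python
import math
import statistics


def yule_index(values):
    """Quartile-based skewness: (0.5*(q3 + q1) - q2) / (0.5*(q3 - q1))."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_spread = 0.5 * (q3 - q1)
    if half_spread == 0:
        raise ValueError("quartiles coincide; the index is undefined")
    return (0.5 * (q3 + q1) - q2) / half_spread


def power(values, p):
    """Power re-expression of positive values; p == 0 means log."""
    if p == 0:
        return [math.log(v) for v in values]
    sign = 1 if p > 0 else -1  # keep the map increasing for negative powers
    return [sign * v ** p for v in values]


def rank_powers(values, ladder=(-1, -0.5, 0, 0.5, 1, 2)):
    """Candidate powers ordered from least to most skewed result."""
    scored = [(abs(yule_index(power(values, p))), p) for p in ladder]
    return [p for _, p in sorted(scored)]


amounts = [1, 2, 4, 8, 16]          # toy right-skewed amounts
candidates = rank_powers(amounts)   # the log (p = 0) comes first here
```

For these toy values the raw index is 1/3, and the logarithm brings it to essentially 0, so p = 0 heads the list of candidates shown to the analyst.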

A less-resistant version of the Yule index can serve well, even when the data consist of bounded counts over a small range (see

The automatic search is combined with subjective assessment in the following way. For a given variable _{i}, if |sk_{i}|˂

The above approach was implemented in R and used to preprocess data obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Alzheimer’s disease (AD) is the most common form of dementia. There is currently no known treatment that cures this disorder or slows its progression [

A second dataset involved 1575 Parkinson’s disease (PD) patients and their first degree relatives assembled in the Tel Aviv Medical Center: 1185 idiopathic PD patients, 164 PD patients who are carriers of the G2019S mutation in the

In addition to the clinical datasets described above, the benefits of transformations for cluster analysis were demonstrated using simulated, rather than real, research data, for two reasons. First, most clustering algorithms require the unknown number of clusters as input; this number is known in the simulated data scenario. Second, assessing the accuracy of the clusters is not straightforward with real data, as the true separation into subgroups is unknown. When the data are generated according to a known clustered structure, the accuracy of the results can be easily measured using, for example, the adjusted Rand index [

We constructed a simulated dataset describing a clinical situation in which patients were originally assigned to three diseases: A, B, and C. The “real” situation is that disease A has two subtypes, A1 and A2, and that diseases B and C are actually two subtypes of disease BC. The data contained 100 continuous, discrete, and binary features, of which 7 features define 6 disease subgroups (A1, A2, B, C, BC, and Normal). Each feature was created symmetrically and then transformed in order to exhibit a skewed distribution. The clustering analysis was performed on both the original and the transformed data.

In order to facilitate the use of the study methodology, an implementation in an R shiny application was developed [

The application allows data transformation regardless of modeling procedure. Visualization of the choices and selection made allows the user to transform hundreds of variables in a matter of minutes, without the burden of using scripts, remembering the available transformations, and tuning parameters.

The improvements in the results of statistical analysis methods such as analysis of variance and linear regression, after the preprocessing methods were applied, are described below. Additional benefits gained by symmetry-targeting transformations, such as more stable variability and more linear relationships, are presented next. A dissimilarity matrix is usually based on some distance metric and may therefore be highly distorted in the absence of symmetry. In a highly skewed variable, the distances between the elements in the bulk of the data will be substantially smaller than the distances involving the very few large ones, so the latter mask the differences among most of the observations. The improvement in clustering is demonstrated using simulated data in the concluding section of the Results.
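The distortion of distances by skewness can be illustrated with a toy variable. The sketch below measures what share of the total pairwise absolute distance involves the single largest observation, before and after a logarithmic re-expression; the data values and the function name are invented for illustration.

```python
import math
from itertools import combinations


def share_of_distance_from_largest(values):
    """Fraction of total pairwise absolute distance that involves the maximum."""
    top = max(values)
    total = 0.0
    with_top = 0.0
    for x, y in combinations(values, 2):
        d = abs(x - y)
        total += d
        if top in (x, y):
            with_top += d
    return with_top / total


raw = [1, 2, 4, 8, 512]                  # one very large value
logged = [math.log2(v) for v in raw]     # 0, 1, 2, 3, 9

raw_share = share_of_distance_from_largest(raw)
log_share = share_of_distance_from_largest(logged)
```

On the raw scale about 99% of the total distance comes from pairs involving the largest value; after the logarithm this share falls to 75%, so differences within the bulk of the data carry more weight in the dissimilarity matrix.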

The first example is the semi-automated output for the variable NPITOTAL (neuropsychiatric inventory total score) in the ADNI cohort, which represents the total score of the neuropsychiatric inventory exam. Examinations are shown in

The variable UPDRS Part 3 measures the score for the integrated motor-related condition of PD patients.

Even after transformation, the distributions of the variables are not necessarily perfectly symmetrical, but their skewness decreased substantially.

Although the selected transformations relied only on symmetry as the desirable target, they demonstrate how helpful these can be in achieving other goals as well.

Analysis of variance is most frequently used to check for differences in the means between three or more groups. It is therefore a useful tool when screening for variables that have some marginal association with categorical variables such as the diagnosis groups.

Screenshot of semi-automated transformation workflow application of Alzheimer’s Disease Neuroimaging Initiative (ADNI) patients' data variable neuropsychiatric inventory total score distribution.

Unified Parkinson's Disease Rating Scale (UPDRS) Part 3 (motor part): variable raw data distribution, variable after sqrt transformation.

EcogPtTotal (everyday cognition participant total score) variable raw data over Alzheimer’s Disease Neuroimaging Initiative (ADNI)'s five diagnosis categories before and after Logit transformation.

The linearity of relationships is a desirable property, allowing simpler models and better predictions outside the range of available data. These can later be extrapolated and easily explained. The relationship between EcogPtTotal and geriatric depression total score (GDTotal; score for the integrated geriatric depression scale in the ADNI data) is explored and shown in

Increase of the stability of variability following the transformation is beneficial as well. The distance between the two quartile lines (grayed area) increases with the progression in GDTotal value (

Both linearity of relationships and increase of stability of variability can be used to demonstrate the practical importance of simplicity of a relationship. Suppose we want to identify unusually low self-assessed deterioration, not explained merely by depression, which may indicate a more serious cognitive deterioration. The physician should insert the result of GDTotal into a computer program (that expresses the nonlinear relationship), get the typical EcogPtTotal (standard deviation 2), calculated for that specific GDTotal value, and check whether the patient’s value is above it. Instead, if the measurement is reported in the transformed way, the physician can immediately see whether the patient is above this threshold and by how much. The stability of the calculation and the possibility to extend the relationship into the region where only a few measurements exist are substantially improved.

Recall that a simulated dataset of the clinical situation described above was generated. First, we performed a 2-step analysis of the original symmetric data: (1) select a subset of the clinical measurements using stepwise selection with false discovery rate control for q=0.05 [

Relationship between EcogPtTotal (everyday cognition participant total score) and GDTotal (score for the integrated geriatric depression scale) raw values and after transformations.

Comparison of cluster analysis results^{a}.

True cluster | Assigned membership using symmetric data^{b} | | | | | | Assigned membership using skewed data^{c} | | | | | |
Assigned cluster | 1 | 2 | 3 | 4 | 5 | 6 | 1 | 3 | 2 | 4 | 6 | 5 |
0 | 200 | 0 | 0 | 0 | 0 | 0 | 194 | 0 | 6 | 0 | 0 | 0 |
1.1 | 0 | 200 | 0 | 0 | 0 | 0 | 0 | 200 | 0 | 0 | 0 | 0 |
1.2 | 0 | 0 | 200 | 0 | 0 | 0 | 0 | 63 | 137 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 200 | 0 | 0 | 0 | 0 | 0 | 140 | 1 | 59 |
3 | 0 | 0 | 0 | 0 | 199 | 1 | 2 | 0 | 0 | 2 | 196 | 0 |
23 | 0 | 0 | 0 | 0 | 6 | 194 | 0 | 1 | 0 | 91 | 7 | 101 |

^{a}Confusion matrix of real group membership versus assigned cluster using K-medoids algorithm and Manhattan distance matrix used on symmetric (left) and asymmetric (right) features.

^{b}Adjusted Rand=0.986.

^{c}Adjusted Rand=0.681.
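The adjusted Rand values in the footnotes can be recomputed from the counts in the table. The sketch below uses the standard pair-counting form of the adjusted Rand index applied to a contingency table; the function name is hypothetical, and the code only re-derives the index from the published counts (it does not rerun the K-medoids clustering). The index does not depend on the order of the cluster labels, so the column order of the table is immaterial.

```python
from math import comb


def adjusted_rand_index(table):
    """Adjusted Rand index from a contingency table (rows: truth, columns: clusters)."""
    n = sum(sum(row) for row in table)
    sum_cells = sum(comb(x, 2) for row in table for x in row)
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    maximum = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (maximum - expected)


# Counts copied from the table above.
symmetric = [
    [200, 0, 0, 0, 0, 0],
    [0, 200, 0, 0, 0, 0],
    [0, 0, 200, 0, 0, 0],
    [0, 0, 0, 200, 0, 0],
    [0, 0, 0, 0, 199, 1],
    [0, 0, 0, 0, 6, 194],
]
skewed = [
    [194, 0, 6, 0, 0, 0],
    [0, 200, 0, 0, 0, 0],
    [0, 63, 137, 0, 0, 0],
    [0, 0, 0, 140, 1, 59],
    [2, 0, 0, 2, 196, 0],
    [0, 1, 0, 91, 7, 101],
]

ari_symmetric = round(adjusted_rand_index(symmetric), 3)
ari_skewed = round(adjusted_rand_index(skewed), 3)
```

Rounded to three decimals, the two results are 0.986 and 0.681, in agreement with footnotes b and c. The sketch does not guard against degenerate tables in which the denominator is zero (for example, a single cluster).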

The importance of data transformation in the preprocessing stage for any big data analysis, especially in medicine, is presented in this paper. We focused on the use of data transformation as the lesser known part of data preparation. Reviewing articles in medical informatics, we noticed that the preprocessing methods are not always mentioned. Even when they are performed, in many cases the process does not include variable transformations (

As seen in the examples from the large and robust databases of clinical and genetics information of AD and PD patients, the use of transformations results in simpler, linear, additive models with homogeneous variability. This enhances the ability to extrapolate and produce more accurate predictions, with better distance calculations and reduced complexity facilitating the integration into systems and devices. As shown, performing data analysis without transformations is possible but may require complex nonlinear models to explain the data.

The purpose of data analysis is to discover new insights. In this sense, the interpretation is an essential factor in the success of a data analysis process. The described methods increase the translational ability of the results for clinicians. They provide the possibility to apply simpler and more explainable models. The variables can be reverted to the original values before presenting and communicating the results. Methods that serve as a “black-box” have difficulties gaining the trust of physicians. This, in part, is due to the lack of transparent explanations of the process leading to the results.

Moreover, there are patterns in the data that only a human can notice. An example is a case where one transformation had the smallest skewness value, but the proper transformation was actually another (

This paper is based on our experience in using the methodology and the application. As use of the application grows, we expect to improve it with user feedback. The use of the methodology requires understanding of the data and metadata characteristics; this might be time consuming, but intimate familiarity with the data is key to successful data analysis.

For clustering, the benefit of nonlinear transformations was demonstrated by simulation only, as it is difficult to find a medical example of clusters of patients, well defined by a known set of variables, that can serve as a vehicle for such a demonstration.

The use of nonlinear transformations as part of the preprocessing is important and affects the quality of the results. Symmetry-targeted transformations contribute significantly to other aspects of data analysis, enabling simpler and more translational models.

Articles of data mining and machine-learning algorithms used for classification of clinical data.

Transformations extended details.

AD: Alzheimer’s disease

ADNI: Alzheimer’s Disease Neuroimaging Initiative

EcogPtTotal: everyday cognition participant total score

GDTotal: geriatric depression total score

PD: Parkinson’s disease

UPDRS: Unified Parkinson’s Disease Rating Scale

Data collection and sharing for this project was funded by ADNI (National Institutes of Health Grant U01 AG024904) and Department of Defense ADNI (DOD award #W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from many private pharmaceutical companies and nonprofit organizations. Data used in preparation of this article were obtained from the ADNI database. As such, the investigators within ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found on the ADNI website. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement #604102 (Human Brain Project). We thank Shiri Diskin, PhD, for her editorial assistance in manuscript preparation.

None declared.