This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
A lifelogs-based wellness index (LWI) is a function for calculating wellness scores based on health behavior lifelogs (eg, daily walking steps and sleep times collected via a smartwatch). A wellness score intuitively shows the users of smart wellness services the overall condition of their health behaviors. LWI development includes estimation (ie, estimating coefficients in LWI with data). A panel data set comprising health behavior lifelogs allows LWI estimation to control for unobserved variables, thereby resulting in less bias. However, these data sets typically have missing data due to events that occur in daily life (eg, smart devices stop collecting data when batteries are depleted), which can introduce biases into LWI coefficients. Thus, the appropriate choice of method to handle missing data is important for reducing biases in LWI estimations with panel data. However, there is a lack of research in this area.
This study aims to identify a suitable missing-data handling method for LWI estimation with panel data.
Listwise deletion, mean imputation, expectation maximization–based multiple imputation, predictive-mean matching–based multiple imputation, k-nearest neighbors–based imputation, and low-rank approximation–based imputation were comparatively evaluated by simulating an existing case of LWI development. A panel data set comprising health behavior lifelogs of 41 college students over 4 weeks was transformed into a reference data set without any missing data. Then, 200 simulated data sets were generated for each missingness proportion from 1% to 80% by randomly introducing missing data into the reference data set. Each missing-data handling method was applied to transform the simulated data sets into complete data sets, and coefficients in a linear LWI were estimated for each complete data set. For each method and each proportion, a bias measure was calculated by comparing the estimated coefficient values with values estimated from the reference data set.
Methods performed differently depending on the proportion of missing data. For 1% to 30% proportions, low-rank approximation–based imputation, predictive-mean matching–based multiple imputation, and expectation maximization–based multiple imputation were superior. For 31% to 60% proportions, low-rank approximation–based imputation and predictive-mean matching–based multiple imputation performed best. For over 60% proportions, only low-rank approximation–based imputation performed acceptably.
Low-rank approximation–based imputation was the best of the 6 data-handling methods regardless of the proportion of missing data. This superiority is generalizable to other panel data sets comprising health behavior lifelogs given their verified low-rank nature, for which low-rank approximation–based imputation is known to perform effectively. This result will guide missing-data handling in reducing coefficient biases in new development cases of linear LWIs with panel data.
Smart wellness services are designed to help individuals monitor their own wellness through smart devices, including smartphones and smartwatches [
Smart wellness services can collect various health behavior lifelogs through the aid of smart devices [
Existing smart wellness services utilize health behavior lifelogs to provide users with detailed records about health behaviors [
A lifelogs-based wellness index (LWI), a function that transforms health behavior lifelogs into wellness scores for smart wellness service users, resolves this limitation [
An LWI can be developed through 3 key phases: definition, estimation, and assessment [
LWI estimation can lead to the reduction of coefficient biases through a panel data set of health behavior lifelogs. A panel data set follows a given sample of participants over time, thus providing multiple observations for each participant. Existing panel data analysis methods (eg, 1-way random effects regression) can only be applied to panel data sets. These methods can reduce biases in the coefficients by controlling for heterogeneity across participants, which is caused by unobserved variables [
A panel data set comprising health behavior lifelogs will likely contain large proportions of missing data. Such a data set is collected based on everyday user activities and is therefore exposed to various random events that result in missing data. For example, users may forget to wear smart devices or to record health behavior lifelogs, and the smart devices themselves will no longer record health behavior lifelogs when batteries are depleted. These random events often lead to large proportions of missing data. For example, missing data accounted for 18% of a panel data set in an LWI development case [
Missing data can lead to 2 severe problems when attempting to estimate LWI coefficients. First, it can introduce biases to the coefficients [
This study identified a suitable method for LWI estimation with panel data based on an examination of 6 representative missing-data handling methods: listwise deletion, mean imputation, expectation maximization–based multiple imputation, predictive-mean matching–based multiple imputation, k-nearest neighbors–based imputation, and low-rank approximation–based imputation. These were selected from common missing-data handling methods from previous studies, specifically because they represented possible missing-data handling approaches in the context of LWI estimation.
The 6 above-mentioned missing-data handling methods were comparatively evaluated for various missingness proportions of a panel data set by simulating an LWI development case originally presented by Kim et al [
Missing-data handling can be divided into 4 approaches: complete case analysis, single imputation, multiple imputation, and joint model-based imputation (
When selecting among these 4 approaches, previous studies have used the missingness proportion and missingness mechanism of a data set as the major criteria for choosing an adequate approach for that data set [
Existing recommendations for missing data handling.
A panel data set of health behavior lifelogs is likely to contain 5% or more of incomplete observations with a missingness mechanism similar to missing completely at random. This property is attributed to a variety of random daily events that result in missing data. For example, the LWI development case presented by Kim et al [
The 6 missing-data handling methods presented in
However, few previous studies have recommended which of the 6 missing-data handling methods are suitable for reducing coefficient biases according to the missingness proportion of a panel data set composed of health behavior lifelogs. This study filled that gap in the literature by comparatively evaluating the LWI coefficient biases of the 6 missing-data handling methods according to the missingness proportion of exactly such a panel data set.
Representative missing-data handling methods applicable for LWI estimation.
Approach and method  Description
Complete case analysis
Listwise deletion [  Excludes all observations with missing values from the analysis
Single imputation
Mean imputation [  Imputes each missing value of a variable with the mean of the observed values of that variable
k-nearest neighbor–based imputation [  Imputes each missing value of a variable based on the observed values of the k-nearest neighbors
Low-rank approximation–based imputation [  Predicts missing values as a linear combination of a small set of singular vectors
Multiple imputation
Expectation maximization–based multiple imputation [  Draws imputed values from the multivariate normal distribution of the data set estimated by expectation–maximization; multiple imputed data sets are generated by repeating the imputation and are analyzed separately; the analysis results are pooled into the final result
Predictive-mean matching–based multiple imputation [  Substitutes a missing value with a value drawn randomly from the complete observations whose regression-predicted values are closest to the regression-predicted value for the missing value under the simulated regression model; multiple imputed data sets are generated by repeating the imputation and are analyzed separately; the analysis results are pooled into the final result
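The simpler methods in the table above can be illustrated with a short Python sketch. It is not the study's R implementation: the `None` marker for missing cells, the function names, the toy data, and the unweighted Euclidean-distance variant of k-nearest neighbor imputation are all assumptions made for illustration.

```python
import math
from statistics import mean

MISSING = None  # marker for a missing cell (an assumption of this sketch)


def listwise_deletion(rows):
    """Keep only the observations (rows) that contain no missing value."""
    return [list(row) for row in rows if all(v is not MISSING for v in row)]


def mean_imputation(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = [
        mean([row[j] for row in rows if row[j] is not MISSING])
        for j in range(n_cols)
    ]
    return [
        [col_means[j] if v is MISSING else v for j, v in enumerate(row)]
        for row in rows
    ]


def knn_imputation(rows, k=2):
    """Fill each missing value with the column average over the k nearest complete
    rows; distance uses only the columns that the incomplete row has observed."""
    donors = listwise_deletion(rows)
    completed = []
    for row in rows:
        seen = [j for j, v in enumerate(row) if v is not MISSING]

        def distance(donor):
            return math.sqrt(sum((row[j] - donor[j]) ** 2 for j in seen))

        nearest = sorted(donors, key=distance)[:k]
        completed.append([
            mean([d[j] for d in nearest]) if v is MISSING else v
            for j, v in enumerate(row)
        ])
    return completed


# Toy data on a 0-1 scale; the second observation has one missing value.
data = [
    [0.2, 0.4, 0.6],
    [0.8, None, 0.4],
    [0.4, 0.6, 0.8],
    [0.9, 0.1, 0.5],
]
print(listwise_deletion(data))
print(mean_imputation(data)[1])
print(knn_imputation(data, k=1)[1])
```

Listwise deletion shrinks the data set, whereas the two imputation functions keep every observation and differ only in how the replacement value is chosen.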
We previously developed an LWI for college students [
Variable descriptions.
Category and variable  Description (value meaning)
Behavior variables
Breakfast (or Lunch or Dinner)  Student’s self-rating of the day’s breakfast (or lunch or dinner) based on nutrition (0: skip, 33: low, 66: medium, 100: high)
Exercise  Whether the student exercised or worked out for more than 30 minutes during the day (0: no exercising, 100: exercising)
Step achievement  Percentage of the 10,000-step goal reached by the total number of walking steps in the day
Sleep duration achievement  Percentage of the 7-hour goal reached by the student’s sleep duration between 6 PM of the previous day and 6 PM of the current day
Golden time achievement  Percentage of the golden time (10 PM of the previous day to 2 AM of the current day) during which the student slept
Proxy variable
Perceived score  Score that the student assigns by evaluating the overall condition of their critical health behaviors over the day
To establish an intuitive scoring system, all behavior variables and the proxy variable were set to range from 0 (worst) to 100 (best) [
A 1-way random effects regression model was used to estimate the index coefficients:
where
This regression model was selected for 2 reasons. First, the index is a linear function. Second, the regression model was set to control for the unobserved student-specific random effects on the perceived score. Unobserved (or unmeasured) student-specific heterogeneity could exist in the regression model and thus influence the perceived score. For example, students may have different levels of interest in wellness, but these are unobserved in the regression model. However, those who are more interested in wellness may have higher standards for health behaviors, thus resulting in lower perceived scores. As the failure to control for such unobserved student-specific effects may produce misleading results [
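For orientation, a generic 1-way random effects specification consistent with this description can be written as follows. The symbols are illustrative assumptions and are not necessarily the notation used in Eq (1).

```latex
y_{it} = \beta_0 + \sum_{k=1}^{7} \beta_k x_{kit} + u_i + \varepsilon_{it}
```

Here, y_{it} would denote the perceived score of student i on day t, x_{kit} the k-th behavior variable, \beta_0 the intercept, \beta_k the coefficient of the k-th behavior variable, u_i the unobserved student-specific random effect, and \varepsilon_{it} the remaining error term.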
The data set used to estimate the regression model was compiled by collecting data on the daily life activities of 41 students, including 21 undergraduate (15 males and 6 females) and 20 graduate students (15 males and 5 females), all of whom were attending a university in Korea. Their age statistics were as follows: average of 24.7, maximum of 30, minimum of 19, and a standard deviation of 2.8. A total of 1148 observations were thus collected over a 28-day period (November 3-30, 2015). An observation consisted of 1 student’s 1-day data for the 8 variables in the regression model.
Data preprocessing excluded the 264 observations that included missing or abnormal values. Notably, students reported that these observations were affected by data collection problems (eg, forgetting to wear smartwatches, neglecting to enter data through the smartphone app, or depleting their smartwatch batteries). These observations therefore did not accurately reflect the students’ actual daily health behaviors. After these observations were excluded, the panel data set comprised 884 complete observations from 41 students.
The LWI coefficients were estimated by fitting Eq (1) to the data set. Based on the estimated coefficients, the LWI was defined as a linear function consisting of the 7 following behavior variables: 0.151 × Breakfast + 0.163 × Lunch + 0.135 × Dinner + 0.135 × Exercise + 0.095 × Step achievement + 0.219 × Sleep duration achievement + 0.102 × Golden time achievement.
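As a worked example of this linear function, the following Python sketch applies the coefficients reported above to one day of behavior values. The dictionary keys and the example day are hypothetical and are used only for illustration.

```python
# Coefficients of the LWI as reported in the text (they sum to 1).
LWI_COEFFICIENTS = {
    "breakfast": 0.151,
    "lunch": 0.163,
    "dinner": 0.135,
    "exercise": 0.135,
    "step_achievement": 0.095,
    "sleep_duration_achievement": 0.219,
    "golden_time_achievement": 0.102,
}


def wellness_score(behaviors):
    """Weighted sum of the 7 behavior variables, each on a 0-100 scale."""
    return sum(
        coefficient * behaviors[name]
        for name, coefficient in LWI_COEFFICIENTS.items()
    )


# Hypothetical day: skipped breakfast, medium lunch, high dinner, no exercise,
# 80% of the step goal, full sleep duration, 25% of the golden time.
example_day = {
    "breakfast": 0,
    "lunch": 66,
    "dinner": 100,
    "exercise": 0,
    "step_achievement": 80,
    "sleep_duration_achievement": 100,
    "golden_time_achievement": 25,
}
print(round(wellness_score(example_day), 1))
```

For this hypothetical day the score is about 56.3. Because the coefficients sum to 1, a day with every behavior variable at 100 yields a score of 100.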
This study simulated the aforementioned LWI development case to evaluate the biases in the regression coefficients produced by each of the 6 missing-data handling methods, as follows: the data set of the LWI development case was transformed into a reference data set that did not include any missing data; incomplete data sets were simulated by introducing missing data into the reference data set at various missingness proportions; each missing-data handling method converted the simulated data sets into complete data sets by handling their missing data; regression coefficients were estimated by fitting Eq (1) to the complete data sets; and a bias measure for each missing-data handling method was calculated by comparing the estimated coefficient values with coefficient reference values. The coefficient reference values were estimated by fitting Eq (1) to the reference data set.
In this study, we conducted a simulation to calculate a bias measure at incremental missingness proportions for each of the 6 methods. The bias measure is referred to as the grand-mean of absolute biases (GAB). For each missingness proportion, the GAB was used to compare the coefficient biases and thus determine which missing-data handling method was superior.
Simulation steps are shown in
Research process.
Step 0 was performed to generate a reference data set from the data set used in [
This normalization is generally recommended as preprocessing for data-mining algorithms, including missing-data handling methods [
Descriptive statistics of the data set for developing the LWI for college students and regression results for the reference data set.
Variable  Descriptive statistics: mean (SD)  Descriptive statistics: range  Regression results: estimate (SE)  Regression results: P value
Perceived score  63.4 (15.9)  0-100  N/A^{a}  N/A
Breakfast  24.2 (36.2)  0-100  0.097 (0.014)  <.001
Lunch  63.5 (32.3)  0-100  0.105 (0.013)  <.001
Dinner  75.5 (27.5)  0-100  0.088 (0.015)  <.001
Exercise  5.3 (22.4)  0-100  0.087 (0.019)  <.001
Step achievement  74.6 (28.6)  0-100  0.061 (0.015)  <.001
Sleep duration achievement  86.0 (19.3)  6.7-100  0.131 (0.021)  <.001
Golden time achievement  14.2 (25.1)  0-100  0.066 (0.018)  <.001
(Intercept)  N/A  N/A  0.305 (0.029)  <.001
^{a}N/A: not applicable.
The reference data set also included 40 dummy variables and a time variable. Here, the dummy variables coded the 41 students, while the value of the time variable was determined based on the dates the data were collected, that is, between the first and last days of the data collection period (November 3-30, 2015):
The resulting reference data set was 884×49 in dimension, as it contained all 884 observations mentioned above. Each observation included values for the 40 dummy variables, time variable, 7 behavior variables, and perceived score variable for a particular student on a given day. All variables ranged from 0 to 1.
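The layout of one row of the reference data set can be sketched in Python as follows. Several details are assumptions that this text does not confirm: min-max scaling with the observed minimum and maximum, a time variable that maps the first day to 0 and the last day to 1, student 1 as the all-zero baseline for the dummy variables, and all function names.

```python
from datetime import date

FIRST_DAY = date(2015, 11, 3)   # first day of data collection
LAST_DAY = date(2015, 11, 30)   # last day of data collection
N_STUDENTS = 41


def min_max(values):
    """Scale a variable to the 0-1 range (assumed normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def student_dummies(student_id):
    """40 dummy variables for student ids 1..41; student 1 is the assumed baseline."""
    return [1 if student_id == k else 0 for k in range(2, N_STUDENTS + 1)]


def time_variable(day):
    """Position of the day within the collection period, from 0 to 1 (assumed form)."""
    return (day - FIRST_DAY).days / (LAST_DAY - FIRST_DAY).days


def build_row(student_id, day, scaled_values):
    """40 dummies + 1 time variable + 7 behavior variables + 1 perceived score."""
    return student_dummies(student_id) + [time_variable(day)] + list(scaled_values)


# Hypothetical observation: student 5 on November 12 with already scaled values.
row = build_row(5, date(2015, 11, 12), [0.0, 0.66, 1.0, 0.0, 0.8, 1.0, 0.25, 0.6])
print(len(row))
```

Each row built this way has 49 values, which matches the 884×49 dimension stated above.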
In Step 1, the missingness proportion was selected to evaluate the 6 missing-data handling methods. The missingness proportion was increased from 1% to 80% in increments of 1%. An increment of 1% was sufficiently small to observe how the performance of each method changed according to the missingness proportion. Previous studies [
We used a range up to 80% because one method continued to show outstanding performance for proportions above 60% and because a missingness proportion of 80% was too high to estimate coefficients with low biases. If a data set has such a high missingness proportion in practice, it may be preferable to collect another data set instead of using the initial one.
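The generation of simulated data sets over a grid of missingness proportions can be sketched in Python as follows. This is a minimal illustration under stated assumptions: missing values are introduced cell by cell, completely at random, and only in the listed columns. The function names, the `None` marker, and the toy reference data are hypothetical.

```python
import random


def introduce_missing(rows, proportion, columns, rng):
    """Return a copy of rows in which the given proportion of the cells in
    `columns` is set to None, with the cells chosen completely at random."""
    cells = [(i, j) for i in range(len(rows)) for j in columns]
    n_missing = round(proportion * len(cells))
    masked = [list(row) for row in rows]
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked


def simulate(reference, columns, proportions, n_sets, seed=0):
    """Generate n_sets simulated data sets for every missingness proportion."""
    rng = random.Random(seed)
    return {
        p: [introduce_missing(reference, p, columns, rng) for _ in range(n_sets)]
        for p in proportions
    }


# Toy reference data set: 20 complete observations of 3 variables on a 0-1 scale.
reference = [[i / 20, 0.5, 1 - i / 20] for i in range(20)]
simulated = simulate(reference, columns=[0, 1, 2], proportions=[0.10, 0.50], n_sets=3)
for p, data_sets in simulated.items():
    print(p, [sum(v is None for row in ds for v in row) for ds in data_sets])
```

In the setting described in this article, the proportions would run from 0.01 to 0.80 in steps of 0.01, with 200 simulated data sets per proportion.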
As shown in
For proportion
In Step 3, each of the 6 missing-data handling methods was applied to each of the 200 simulated data sets using R software (version 3.6.0). Listwise deletion and mean imputation were each implemented in a few lines of R code that, respectively, deleted incomplete observations and replaced each missing value of a variable with the mean of its observed values. k-nearest neighbor–based imputation used the knnImputation function in the DMwR package [
As a result of this step, listwise deletion, mean imputation, k-nearest neighbor–based imputation, and low-rank approximation–based imputation each produced 1 complete data set for each simulated data set. Expectation maximization–based and predictive-mean matching–based multiple imputation each produced 5 complete data sets.
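For the 2 multiple imputation methods, the 5 complete data sets yield 5 sets of estimates that have to be combined into one result. A common way to do this is Rubin's rules, sketched below in Python for a single coefficient. Whether the study pooled its results in exactly this way is an assumption, and the input numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total ** 0.5


# Hypothetical estimates of one coefficient from 5 imputed data sets.
estimates = [0.10, 0.09, 0.11, 0.10, 0.10]
std_errors = [0.015] * 5
pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print(round(pooled_estimate, 3), round(pooled_se, 4))
```

The pooled standard error exceeds the average within-imputation standard error whenever the estimates differ across imputations, which is how multiple imputation carries the uncertainty due to missing values into the final result.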
Eq (1) was fitted to each complete data set resulting from Step 3 using the plm package [
Step 5 was performed to calculate a bias measure for each coefficient value set. Because a coefficient could have a certain amount of bias, each coefficient value set contained a total of 8 coefficient biases. The mean of absolute biases (MAB) was defined as a bias measure to calculate the average amount of the 8 coefficient biases for a given coefficient value set:
where
We combined the 200 MABs for each method to create a bias measure that represented the average of its coefficient biases over the 200 simulated data sets of missingness proportion
A low GAB indicated that the missing-data handling method led to small coefficient biases across the 200 simulated data sets of the missingness proportion. The GAB was used as the criterion for evaluating method performance.
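The 2 bias measures can be sketched in Python as follows. The sketch assumes that each absolute bias is the absolute difference between an estimated coefficient and its reference value, and that the GAB is the plain average of the MABs. The reference values are the regression estimates reported for the reference data set, whereas the estimated coefficient value sets are hypothetical.

```python
from statistics import mean

# Coefficient reference values: intercept followed by the 7 behavior variables.
REFERENCE = [0.305, 0.097, 0.105, 0.088, 0.087, 0.061, 0.131, 0.066]


def mab(estimated, reference=REFERENCE):
    """Mean of absolute biases over the coefficients of one coefficient value set."""
    return mean([abs(e - r) for e, r in zip(estimated, reference)])


def gab(coefficient_value_sets, reference=REFERENCE):
    """Grand-mean of absolute biases: average MAB over all simulated data sets."""
    return mean([mab(estimated, reference) for estimated in coefficient_value_sets])


# Two hypothetical coefficient value sets for one method and one proportion.
unbiased = list(REFERENCE)
shifted = [r + 0.01 for r in REFERENCE]
print(round(mab(shifted), 4), round(gab([unbiased, shifted]), 4))
```

In the simulation described above, `gab` would receive 200 coefficient value sets for each method and each missingness proportion.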
GAB results.
Pairwise multiple comparison tests were conducted to statistically compare relative superiority among the 6 missing-data handling methods for each missingness proportion. The tests were conducted using Dunnett-modified Tukey-Kramer pairwise multiple comparison at the .05 significance level [
Number of pairwise comparisons with statistically small GAB differences.
Different missing-data handling methods were shown to be superior depending on the missingness proportion. As shown in
Sum of pairwise comparison times with statistically small GAB for each missing-data handling method and missingness proportion range.
Missingness proportion range  Listwise deletion  Mean imputation  k-nearest neighbor  Expectation–maximization  Predictive-mean matching  Low-rank approximation
1%-30%  15  53  2  84^{a}  91^{a}  99^{a}
31%-60%  0  9  0  34  74^{a}  75^{a}
61%-80%  0  7  0  0  24  46^{a}
^{a}These methods had the best performance for the missingness proportion range.
The low-rank approximation–based imputation showed superior performance for 1% to 80% missingness proportions and has previously shown excellent performance with low-rank data sets [
Low-rank approximation–based imputation is also expected to perform well with other panel data sets comprising health behavior lifelogs, as previous studies [
Both the expectation maximization–based and predictive-mean matching–based multiple imputations showed larger biases than the low-rank approximation–based imputation as the missingness proportion increased. Larger proportions increase the loss of information through missing values, which in turn increases uncertainty. Multiple imputation reflects such uncertainty in the standard errors of the estimates [
In summary, the low-rank approximation–based imputation was the superior method for handling missing data when estimating a linear LWI with a panel data set comprising health behavior lifelogs, regardless of the missingness proportion.
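To make the idea behind low-rank approximation–based imputation concrete, the following Python sketch completes a small matrix with a rank-1 model fitted by alternating least squares. It is a simplified stand-in chosen for brevity and is not the singular vector–based implementation used in the study. The function name and toy data are hypothetical, and the sketch assumes that every row and every column has at least one observed, nonzero value.

```python
def rank1_impute(rows, n_iter=200):
    """Fill None cells with u[i] * v[j], where u and v are fitted to the
    observed cells by alternating least squares (a rank-1 approximation)."""
    n, m = len(rows), len(rows[0])
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Update each row factor using only the cells observed in that row.
        for i in range(n):
            obs = [j for j in range(m) if rows[i][j] is not None]
            u[i] = sum(rows[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        # Update each column factor using only the cells observed in that column.
        for j in range(m):
            obs = [i for i in range(n) if rows[i][j] is not None]
            v[j] = sum(rows[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    return [
        [u[i] * v[j] if rows[i][j] is None else rows[i][j] for j in range(m)]
        for i in range(n)
    ]


# Toy matrix that is exactly rank 1 (rows are multiples of [1, 0.5, 2]);
# the true values of the two missing cells are 2 and 1.5.
incomplete = [
    [1.0, 0.5, None],
    [2.0, 1.0, 4.0],
    [3.0, None, 6.0],
]
completed = rank1_impute(incomplete)
print([round(x, 3) for x in completed[0]], [round(x, 3) for x in completed[2]])
```

Because the toy matrix has an exact low-rank structure, the missing cells can be recovered from the observed ones, which illustrates why this family of methods suits data sets of a low-rank nature.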
Three future research issues can improve and expand on this research. The first involves validating generalizability of the current research to nonlinear LWIs (eg, functions with polynomial or interaction variables and logistic functions). New LWI development cases can aim to develop nonlinear LWIs that this study did not cover. Thus, additional research is needed to establish the validity of our findings in regard to nonlinear LWIs.
The second issue involves the need to identify which health behavior-related covariates (eg, age, gender, and BMI) can enhance the performance of missing-data handling for LWI estimation. While previous studies have already suggested several such covariates [
The third issue concerns the need to develop guidelines for predicting the size of bias in LWI coefficients for a certain missingness proportion of a given panel data set. In
A panel data set comprising health behavior lifelogs will likely contain a large amount of missing data due to various events. These missing data can result in LWI coefficient biases. While there are various methods for handling missing data, few previous studies have set out to determine which are the most effective for reducing LWI coefficient biases. This study comparatively evaluated 6 representative missing-data handling methods by simulating an existing LWI development case. Results suggested that low-rank approximation–based imputation was superior for reducing biases when estimating a linear LWI with a panel data set composed of health behavior lifelogs. This finding is expected to contribute to the reduction of coefficient biases in new development cases where linear LWIs are estimated with panel data.
coefficient value set
GAB: grand-mean of absolute biases
LWI: lifelogs-based wellness index
MAB: mean of absolute biases
This work was supported by the National Research Foundation of Korea grant funded by the Korean government (Ministry of Science and ICT; no. 2020R1C1C1014312).
None declared.