This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Medical writing styles can affect the understandability of health educational resources. Despite the growth of research on web-based health information, there is little research-based evidence on what constitutes best practice in developing web-based health resources for children’s health promotion and education.
Using authoritative and highly influential web-based children’s health educational resources from the Nemours Foundation, the largest not-for-profit organization promoting children’s health and well-being, together with a variety of health educational resources aimed at the general public, we aimed to develop machine learning algorithms that discriminate and predict the writing styles of health educational resources on children’s versus adults’ health promotion.
Natural language features were selected as predictor variables of the algorithms through initial automatic feature selection, using a ridge classifier, a support vector machine, an extreme gradient boosting tree, and recursive feature elimination, followed by revision by education experts. We compared algorithms using the automatically selected (n=19) and linguistically enhanced (n=20) feature sets, with the initial feature set (n=115) as the baseline.
Using five-fold cross-validation, compared with the baseline (115 features), the Gaussian Naive Bayes model (20 features) achieved statistically significantly higher mean sensitivity, specificity, area under the receiver operating characteristic curve, and macro F1 scores (all P<.05).
We developed new evaluation tools for discriminating and predicting the writing styles of web-based health resources for children’s health education and promotion among parents and caregivers of children. User-adaptive automatic assessment of web-based health content holds great promise for distance and remote health education among young readers. Our study leveraged the precision and adaptability of machine learning algorithms, together with insights from health linguistics, to advance this significant yet understudied area of research.
Web-based health education and promotion has become increasingly popular among all age groups [
Much of the current research has focused on exploring these assessment dimensions separately using long-standing readability tools [
The core aim of our study was to develop machine learning models that discriminate and predict what constitutes a suitable writing style for web-based health resources on children’s health promotion and education. Research-based evidence is needed to inform and improve the current practice of developing web-based health educational resources on issues related to children’s health and well-being for readers such as parents, caregivers of children, and teenagers. Our study assessed the writing styles of these resources through an integrated, holistic approach: developing machine learning models that evaluate whether the content and writing style of a piece of web-based health educational material is oriented more toward children’s health promotion and education or toward the general public. The underlying hypothesis is that the content and writing style of high-quality web-based health educational resources vary with the intended readership, in line with the principles of clinically developed guidelines for health educational resource assessment such as PEMAT [
The Nemours Foundation is the world’s largest nonprofit organization dedicated to improving the health and well-being of children, and the website of the Foundation has high-quality health education resources developed by medical experts and experienced health educators purposefully for different readerships including parents, children (aged ≤13 years) and teenagers (aged 13-20 years) [
For the selection of health information for the general public, the main screening criteria were that the websites must have been certified by the Health on the Net Foundation, an international accreditation authority of web-based health information, and they must have been developed by health authorities to provide accurate health educational information. These included governmental health organizations, accredited nonprofit health organizations engaged in health promotion and education, or national or regional associations of specific disease prevention and control. We carefully screened a total of 200 children’s health readings from the website of Nemours KidsHealth [
We annotated the health texts using the semantic tagging system developed by the University of Lancaster, United Kingdom [
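As a minimal, hypothetical sketch of this step (the annotator’s actual output format and tag inventory differ; the `(token, tag)` pairs and the `tag_frequencies` helper below are illustrative assumptions), per-text feature values can be computed as relative semantic tag frequencies:

```python
from collections import Counter

def tag_frequencies(tagged_tokens, per=1000):
    """Relative USAS tag frequencies (occurrences per `per` tokens)."""
    counts = Counter(tag for _, tag in tagged_tokens)
    n = max(len(tagged_tokens), 1)
    return {tag: per * c / n for tag, c in counts.items()}

# Hypothetical (token, tag) pairs; the real annotator output is richer.
sample = [("mum", "S4"), ("feels", "X3"), ("better", "A5"), ("today", "T1")]
print(tag_frequencies(sample))  # each tag: 1 of 4 tokens -> 250.0 per 1,000
```

Normalizing by text length keeps the features comparable across documents of different sizes.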
Our study chose USAS purposefully, as we aimed to select linguistic and semantic features that may be used for developing machine learning algorithms to predict the semantic relevance and suitability of web-based health information among children. The semantic features described earlier are more suitable for analyzing and modeling the content relevance of health information. Many current studies use grammatical or syntactic features to develop machine learning algorithms for health information evaluation. However, grammatical, syntactic, morphological, or other types of structural or functional linguistic features cannot be used to study the contents of health information. The relevance of health information content for specific populations is largely underexplored in current health informatics using natural language processing and machine learning. Our study took advantage of the extensive English semantic coverage of USAS and developed algorithms using a small number of semantic features (20 from the original 115 semantic features) that measured diverse dimensions of the relevance and suitability of web-based health contents for English-speaking young children: approaches to medical knowledge acquisition; assessment of health situations; describing efforts; complexity of actions; attention, stress, or emphasis on key points; and finally, communicative interactivity. All these dimensions of health information relevance and suitability for young readers are supported and represented by semantic features incorporated in the comprehensive annotation system of USAS.
A number of semantic features were identified as characteristic of adult-oriented health resources: semantic features that had large negative Cohen d values
Textual features that were statistically significant in children-oriented health texts reflected the different cognitive processing of health information and health communication styles between children and adults. Semantic features that had a large Cohen d
There were two semantic categories related to emphasis, stress, and attention: A14 focusing subjuncts that draw attention to or focus on (
The large number of semantic features of statistical significance (
Machine learning algorithms are known for their lack of interpretability compared with statistical models. Through the successive permutation of the predictor features in the final algorithm (Gaussian Naive Bayes [GNB]), we calculated the impact of individual features on the performance of the algorithm, that is, its sensitivity and specificity. Two sets of semantic features were identified as significant contributors to the prediction of children- versus adult-oriented health educational resources. Each set of features that emerged in the process of algorithm development represented a balanced combination of semantic classes, which were statistically significant features in children- or adult-oriented materials.
Semantic features of health educational texts.
Semantic features | Children-oriented, mean (SD) | Adult-oriented, mean (SD) | Mann-Whitney U | P valuea | Effect size (Cohen d)
A5: evaluation: good or bad | 5.65 (7.267) | 4.1 (4.994) | 67510.0 | .17 | 0.340 |
A15: safety or danger | 0.230 (1.020) | 1.560 (3.950) | 56287.0 | <.001 | −0.543 |
B2: health and disease | 7.910 (13.792) | 22.45 (30.619) | 41001.0 | <.001 | −0.802 |
B3: medicine and medical treatment | 4.360 (8.392) | 12.46 (17.280) | 46443.5 | <.001 | −0.800 |
F1: food | 10.30 (25.407) | 3.490 (13.801) | 51368.0 | <.001 | 0.491 |
M1: moving, coming, going | 5.27 (7.399) | 2.92 (5.259) | 52775.0 | <.001 | 0.547 |
S1: social actions, states, and processes | 1.850 (2.738) | 3.820 (6.090) | 54876.5 | <.001 | −0.547 |
S2: people | 12.42 (15.635) | 10.22 (16.519) | 65131.5 | .04 | 0.207 |
S4: kin | 2.860 (4.221) | 1.070 (3.247) | 52886.5 | <.001 | 0.713 |
S5: groups and affiliation | 1.500 (3.672) | 2.520 (4.771) | 58355.0 | <.001 | −0.356 |
S8: helping or hindering | 5.140 (6.315) | 6.920 (9.634) | 62823.5 | .007 | −0.318 |
S9: religion and the supernatural | 0.140 (0.738) | 0.440 (1.587) | 64677.0 | .001 | −0.324 |
T1: time | 11.3 (12.639) | 12.94 (15.022) | 67348.0 | .16 | −0.181 |
X3: sensory | 4.920 (7.606) | 2.020 (4.618) | 50469.0 | <.001 | 0.684 |
X9: ability | 1.88 (3.619) | 1.83 (3.612) | 69246.0 | .37 | 0.019 |
Z2: geographical names | 0.550 (1.496) | 3.120 (6.184) | 45505.5 | <.001 | −0.674 |
Z6: negative | 5.840 (5.958) | 3.080 (4.861) | 51392.0 | <.001 | 0.764 |
Z8: pronoun | 59.79 (53.287) | 31.56 (38.830) | 46155.0 | <.001 | 0.907 |
Z99: unmatched expressions | 13.74 (17.037) | 37.58 (49.684) | 39069.0 | <.001 | −0.776 |
aAsymptotic significance (two-tailed).
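The statistics reported in the table, a two-tailed Mann-Whitney U test with a Cohen d effect size, can be reproduced along the following lines; the arrays are toy values, not the study’s data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def cohens_d(x, y):
    """Cohen d using the pooled sample standard deviation (ddof=1)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1)
                  + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Toy per-text tag frequencies for the two corpora (not the study's data).
children = np.array([5.6, 7.3, 4.1, 6.0, 5.2, 6.8])
adults = np.array([3.9, 4.2, 5.1, 3.5, 4.4, 4.0])

u, p = mannwhitneyu(children, adults, alternative="two-sided")
print(u, p, cohens_d(children, adults))
```

The nonparametric U test is the appropriate choice here because per-text tag frequencies are typically skewed rather than normally distributed.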
We applied machine learning algorithms to learn the important features for detecting the writing styles of web-based health educational resources on children’s health promotion and education. Recursive feature elimination (RFE), ridge classifier (RC), extreme gradient boosting (XGBoost) [
For the RC and RFE algorithms, scikit-learn has built-in cross-validation variants (RidgeClassifierCV and RFECV).
We applied RFE_SVM and RFE_XGB to evaluate the cross-validation score when increasing the number of selected features. The automatic tuning results of the number of features selected by cross-validation are shown in
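A sketch of this tuning step with scikit-learn’s RFECV, using synthetic stand-in data rather than the study’s corpus (matrix shapes and parameter choices here are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Stand-in for the corpus: 200 texts x 115 USAS tag frequencies.
X, y = make_classification(n_samples=200, n_features=115, n_informative=20,
                           random_state=0)

# RFECV removes the weakest features step by step and keeps the subset size
# with the best cross-validated score (here scored by a linear SVM on AUC).
selector = RFECV(SVC(kernel="linear"), step=1, cv=StratifiedKFold(5),
                 scoring="roc_auc", min_features_to_select=10)
selector.fit(X, y)
print(selector.n_features_)               # tuned number of features
print(np.flatnonzero(selector.support_))  # indices of the retained features
```

Swapping the estimator for an XGBoost classifier gives the RFE_XGB variant described above.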
Automatic tuning of the number of features selected with cross-validation of RFE_XGB.
Automatic tuning of the number of features selected with cross-validation of RFE_SVM.
Automatic feature importance ranking using extreme gradient boost tree. XGBoost: extreme gradient boosting.
Automatic feature importance ranking using the ridge classifier.
RC: Z99, B3, S1, T1, A2, B2, A6, K4, A12, W5, A15, N1, S9, S5, Y1, S8, X4, Z2, X9, Y2, S3, A14, F3, X2, X8, T2, K3, A5, G3, B5, O2, X3, M1, X5, F1, S2, O1, S4, Z6, Z8
XGB Tree: X2, E2, X9, T2, X5, A2, E3, S2, A15, A5, Z7, N6, M7, O4, A13, G1, X3, Z5, E5, K5, Y2, S4, O1, L3, Q3, Z99, B3, T1, S8, B2, L1, Z8, Z2, Z3, I1, X8, Q4, F1, Z6, A12
RFE using SVM as the feature scoring algorithm: A2, A3, A4, A5, A6, A7, A9, A10, A11, A12, A13, A14, A15, B1, B2, B3, B5, C1, E4, E5, E6, F1, F3, F4, G1, G2, G3, H2, H3, H4, H5, I1, I2, I3, I4, K1, K2, K3, K4, K5, L1, L2, L3, M1, M3, M4, M5, M6, M8, N1, N2, N3, N4, N5, N6, O2, O3, O4, P1, Q1, Q2, Q3, Q4, S1, S2, S3, S4, S5, S6, S7, S8, S9, T1, T2, T4, W1, W2, W4, W5, X2, X3, X4, X5, X6, X7, X9, Y1, Y2, Z1, Z2, Z3, Z4, Z6, Z8, Z99
RFE using XGB as the feature scoring algorithm: A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, B1, B2, B3, B4, B5, C1, E1, E2, E3, E4, E6, F1, F2, G1, H3, H4, I1, I2, I3, I4, K3, K4, K5, K6, L2, L3, M1, M2, M3, M4, M5, M6, M7, N1, N3, N4, N5, N6, O1, O2, O3, O4, Q1, Q2, Q3, Q4, S1, S2, S3, S4, S5, S6, S7, S8, S9, T1, T2, T3, T4, W1, W3, W4, X2, X3, X5, X6, X7, X8, X9, Y1, Y2, Z0, Z1, Z2, Z3, Z4, Z5, Z6, Z7, Z8, Z9, Z99
The common 19 features of the four feature selection algorithms are as follows: Z8, S2, S8, F1, A5, S4, X3, M1, T1, S5, S9, Z99, A15, S1, X9, Z6, B2, Z2, B3.
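Taking the features shared by the selection algorithms is a simple set intersection; a toy illustration with abbreviated, hypothetical feature sets:

```python
# Hypothetical abbreviated outputs of the four selectors.
rc = {"Z8", "B2", "B3", "S4", "F1", "A2"}
xgb = {"Z8", "B2", "B3", "S4", "F1", "E2"}
rfe_svm = {"Z8", "B2", "B3", "S4", "F1", "A3"}
rfe_xgb = {"Z8", "B2", "B3", "S4", "F1", "A1"}

# Keep only the features returned by all four algorithms.
common = set.intersection(rc, xgb, rfe_svm, rfe_xgb)
print(sorted(common))  # ['B2', 'B3', 'F1', 'S4', 'Z8']
```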
Classifiers used for automatic feature selection.
Classifier and text class | Accuracy | Macro average F1a | Precision | Recall | F1 | ||||||
RC (overall) | 0.925 | 0.89 | | |
RC: adult-oriented readings | | | 0.99 | 0.91 | 0.95
RC: children-oriented readings | | | 0.74 | 0.97 | 0.84
SVMb (overall) | 0.93 | 0.89 | | |
SVMb: adult-oriented readings | | | 0.95 | 0.96 | 0.96
SVMb: children-oriented readings | | | 0.84 | 0.8 | 0.82
XGBc (overall) | 0.94 | 0.90 | | |
XGBc: adult-oriented readings | | | 0.95 | 0.98 | 0.96
XGBc: children-oriented readings | | | 0.91 | 0.78 | 0.84
aF1 = 2 × [(precision × recall) / (precision + recall)].
bSVM: support vector machine.
cXGB: extreme gradient boosting.
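Footnote a’s harmonic-mean formula can be checked directly; the precision and recall values below are taken from the adult- and children-oriented rows of the table above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (footnote a)."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.99, 0.91), 2))  # 0.95 (adult-oriented readings)
print(round(f1(0.74, 0.97), 2))  # 0.84 (children-oriented readings)
```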
Features that were deemed linguistically irrelevant or unexplainable were replaced by semantic features that are highly relevant and significant for health language studies. Incorporating insights from language studies into automatic feature selection helps in the development of adaptive and interpretable machine learning algorithms. The interpretability and practical usability of algorithms can thus be increased at the stage of the linguistic review of the automatically selected feature sets.
We eliminated S9, T1, S2, and Z2 and added X8, A12, A11, A13, and A14, semantic features that are highly relevant to health linguistics. X8 contains terms depicting the level of effort and resolution; this is a statistically significant feature of children’s educational resources (
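The revision is a plain set operation on the 19 automatically selected features, which also confirms the final count of 20:

```python
automatic = {"Z8", "S2", "S8", "F1", "A5", "S4", "X3", "M1", "T1", "S5",
             "S9", "Z99", "A15", "S1", "X9", "Z6", "B2", "Z2", "B3"}

# Linguistic review: drop 4 features judged hard to interpret, add 5 that
# are well motivated in health linguistics.
revised = (automatic - {"S9", "T1", "S2", "Z2"}) | {"X8", "A12", "A11", "A13", "A14"}
print(len(automatic), len(revised))  # 19 20
```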
Performance of classifiers using 115 (originally tagged) and 19 (automatically selected) features.
Classifier and feature sets | Sensitivity, mean (SD) | Specificity, mean (SD) | AUCa, mean (SD) | |
GNBb: all 115 features | 0.685 (0.125) | 0.771 (0.116) | 0.822 (0.062)
GNBb: automatically selected 19 features | 0.634 (0.074) | 0.903 (0.063) | 0.882 (0.054)
KNNc: all 115 features | 0.973 (0.013) | 0.526 (0.096) | 0.901 (0.032)
KNNc: automatically selected 19 features | 0.943 (0.028) | 0.703 (0.048) | 0.935 (0.023)
XGBd: all 115 features | 0.982 (0.01) | 0.766 (0.059) | 0.978 (0.012)
XGBd: automatically selected 19 features | 0.970 (0.019) | 0.737 (0.051) | 0.970 (0.016)
aAUC: area under the receiver operating characteristic curve.
bGNB: Gaussian Naive Bayes.
cKNN: K-nearest neighbor algorithm.
dXGB: extreme gradient boosting.
Children and adults also use different approaches to assess health events and situations: A5 words that evaluate events in terms of good or bad and false or true were more prevalent in children’s readings with typical words such as wrong, right, better, good, true, positive, improved, greater, ok, and best. In contrast, A15 words that assess health situations in terms of safety, risk, and harm were more prevalent in adult health readings with typical expressions that we found in the corpus: at-risk, safe, dangerous, exposures, hazard, safety, insurance, warning, alert, and alarming. X9 terms describing success and failure, gains and losses, and benefits and risks were also prevalent in adult health materials. This finding aligns well with the latest research on health communication using the Prospect Theory [
We also identified predictor features that are relevant to the social context of health issues [
The health communicative style is another key dimension of semantic features [
Revised linguistic evaluation framework with final 20 features.
Dimensions of linguistic analyses | Texts on children’s health | Texts on adults’ health | |
Scope of health knowledge | F1 (food); X3 (sensory: taste, sound, and touch) | B2 (medicine); B3 (medical treatment); Z99 (complex and out-of-dictionary words)
Assessment of situations | A5 (good or bad and true or false) | A15 (safety or danger); X9 (success or failure, gains or loss, and benefits or risks)
Describing efforts | X8 (level of efforts or resolution) | A12 (level of difficulty)
Complexity of actions | M1 (actions of movement) | S8 (level of help or hindrance)
Interpersonal relations | S4 (kin) | S5 (social groups and affiliation)
Socioeconomic context | N/Aa | S1 (terms related to participation, involvement, entitlement, eligibility; or describing personality traits such as strength, weakness, vulnerability, and disadvantaged)
Attention emphasis and stress | A13 (degree); A14 (particularizers) | A11 (importance)
Logical coherence | Z8 (pronouns); Z6 (negative) | N/A
aN/A: not applicable.
Predictor variables of binary logistic regression (children=0; adult=1).
Semantic features | β (SE) | Wald test | P value | ORa (95% CI)
Features predicting children-oriented texts (negative coefficients):
Z6 | −0.252 (0.061) | 16.966 | <.001 | 0.778 (0.690-0.876)
X8 | −0.228 (0.127) | 3.233 | .07 | 0.796 (0.621-1.021)
S4 | −0.195 (0.050) | 15.351 | <.001 | 0.823 (0.746-0.907)
X3 | −0.134 (0.033) | 16.715 | <.001 | 0.875 (0.820-0.933)
A5 | −0.104 (0.038) | 7.418 | .006 | 0.902 (0.837-0.971)
A14 | −0.063 (0.144) | 0.192 | .66 | 0.939 (0.707-1.246)
M1 | −0.054 (0.039) | 1.927 | .17 | 0.948 (0.878-1.022)
F1 | −0.038 (0.011) | 11.374 | .001 | 0.963 (0.942-0.984)
A13 | −0.036 (0.042) | 0.744 | .39 | 0.965 (0.889-1.047)
Z8 | −0.021 (0.008) | 7.589 | .006 | 0.979 (0.964-0.994)
Features predicting adult-oriented texts (positive coefficients):
A11 | 0.030 (0.086) | 0.124 | .73 | 1.031 (0.871-1.219)
B2 | 0.032 (0.013) | 6.397 | .01 | 1.032 (1.007-1.058)
B3 | 0.066 (0.019) | 12.425 | <.001 | 1.068 (1.030-1.108)
Z99 | 0.067 (0.011) | 35.849 | <.001 | 1.069 (1.046-1.093)
X9 | 0.118 (0.064) | 3.400 | .07 | 1.126 (0.993-1.277)
S8 | 0.162 (0.040) | 16.137 | <.001 | 1.176 (1.087-1.273)
S5 | 0.248 (0.057) | 19.056 | <.001 | 1.281 (1.146-1.432)
S1 | 0.279 (0.085) | 10.703 | .001 | 1.322 (1.118-1.562)
A12 | 0.297 (0.102) | 8.573 | .003 | 1.346 (1.103-1.642)
A15 | 0.665 (0.192) | 12.003 | .001 | 1.945 (1.335-2.833)
aOR: odds ratio.
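The odds ratios follow from the logistic coefficients as OR = exp(β), with Wald-type 95% CIs of exp(β ± 1.96 × SE); a quick check against the Z6 row (the small discrepancy comes from the rounded β):

```python
import math

def odds_ratio(beta, se, z=1.96):
    """OR and Wald-type 95% CI from a logistic coefficient and its SE."""
    return math.exp(beta), (math.exp(beta - z * se), math.exp(beta + z * se))

# Z6 (negative words) row: beta = -0.252, SE = 0.061.
or_, (lo, hi) = odds_ratio(-0.252, 0.061)
print(round(or_, 3), round(lo, 3), round(hi, 3))  # 0.777 0.69 0.876
```

This matches the table’s 0.778 (0.690-0.876) up to the rounding of β.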
Performance of machine learning models using different sets of features as predictors.
Feature set | AUCa, mean (SD) | Sensitivity, mean (SD) | Specificity, mean (SD) | Macro F1b, mean (SD)
115 features | 0.8224 (0.0617) | 0.6848 (0.1252) | 0.7714 (0.1161) | 0.6336 (0.080) |
19 features (automatic selection) | 0.8817 (0.0539) | 0.6339 (0.0743) | 0.9029 (0.0626) | 0.6333 (0.067) |
20 features (linguistic review) | 0.8888 (0.0315) | 0.7733 (0.076) | 0.8629 (0.0843) | 0.7248 (0.0451) |
aAUC: area under the receiver operating characteristic curve.
bF1 = 2 × [(precision × recall) / (precision + recall)].
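A sketch of this comparison, five-fold cross-validation of a GNB classifier scored on sensitivity, specificity, and AUC, using synthetic stand-in data (the sample sizes and class labels here are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Stand-in data: 400 texts x 20 features; 1 = adult-, 0 = children-oriented.
X, y = make_classification(n_samples=400, n_features=20, n_informative=12,
                           random_state=0)

scoring = {
    "sensitivity": make_scorer(recall_score, pos_label=1),
    "specificity": make_scorer(recall_score, pos_label=0),
    "auc": "roc_auc",
}
cv = cross_validate(GaussianNB(), X, y, cv=5, scoring=scoring)
for name in ("sensitivity", "specificity", "auc"):
    scores = cv[f"test_{name}"]
    print(f"{name}: {scores.mean():.3f} ({scores.std():.3f})")
```

Running the same loop with the 115-, 19-, and 20-feature matrices yields the per-fold scores that feed the pairwise tests below.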
Pairwise corrected resampled t tests of mean area under the curve (AUC) scores of models using different feature sets.
Pair | Description | Mean difference (%) | 95% CI of mean difference | |
1 | 19 features versus 115 features | 7.2096 | 0.0059 to 0.1126 | .008a |
2 | 20 features versus 115 features | 8.0729 | −0.0071 to 0.1399 | .02a |
3 | 20 features versus 19 features | 0.8052 | −0.0421 to 0.0563 | .56 |
a
Pairwise corrected resampled t tests of mean sensitivity scores of models using different feature sets.
Pair | Description | Mean difference (%) | 95% CI of the mean difference | |
1 | 19 features versus 115 features | −7.4336 | −0.1699 to 0.0681 | .13 |
2 | 20 features versus 115 features | 12.9204 | −0.016 to 0.1929 | .011a |
3 | 20 features versus 19 features | 21.9885 | 0.1048 to 0.174 | <.001a |
a
Pairwise corrected resampled t tests of mean specificity scores of models using different feature sets.
Pair | Description | Mean difference (%) | 95% CI of the mean difference | |
1 | 19 features versus 115 features | 17.037 | −0.1389 to 0.4017 | .10 |
2 | 20 features versus 115 features | 11.8519 | −0.0163 to 0.1991 | .01a |
3 | 20 features versus 19 features | −4.4304 | −0.2923 to 0.2123 | .53 |
a
Pairwise corrected resampled t tests of mean macro F1 scores of models using different feature sets.
Pair | Description | Mean difference (standardized; %) | 95% CI of the mean difference | |
1 | 19 features versus 115 features | −0.0539 | −0.0555 to 0.0548 | .98 |
2 | 20 features versus 115 features | 14.3813 | 0.0158 to 0.1665 | .006a |
3 | 20 features versus 19 features | 14.4430 | 0.0422 to 0.1407 | .001 |
a
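The corrected resampled t test applies the Nadeau-Bengio variance correction to per-fold score differences; a minimal sketch (the fold sizes and toy differences are illustrative, not the study’s values):

```python
import math
from scipy import stats

def corrected_resampled_ttest(diffs, n_train, n_test):
    """Nadeau-Bengio corrected t test on per-fold score differences."""
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    # The n_test/n_train term inflates the variance to account for the
    # overlap between training sets across cross-validation folds.
    t = mean / math.sqrt(var * (1 / k + n_test / n_train))
    p = 2 * stats.t.sf(abs(t), df=k - 1)
    return t, p

# Toy per-fold AUC differences from a 5-fold split (240 train / 60 test).
t_stat, p_val = corrected_resampled_ttest([0.05, 0.09, 0.07, 0.10, 0.06],
                                          n_train=240, n_test=60)
print(t_stat, p_val)
```

The correction makes the test more conservative than a naive paired t test, which is why it is preferred for comparing cross-validated models.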
We also tested the scalability and effectiveness of the 20 linguistically enhanced features (
Scalability and effectiveness of the 20 linguistically enhanced features. AS: automatically selected; AUC: area under the receiver operating characteristic curve; LE: linguistically enhanced; ROC: receiver operating characteristic curve.
Our study illustrated machine learning–assisted selection of textual features to develop new algorithms that predict the content and writing style of credible web-based resources for children’s health education and promotion among the parents and caregivers of young children. We used high-quality health educational resources developed by influential children’s health promotion and education organizations as training data. We showed that feature selection to reduce high-dimensional feature sets is an effective method for improving the efficiency of machine learning algorithms, as evidenced by the improved AUC of the model using the automatically selected features (n=19) as predictor variables over the originally tagged feature set (n=115;
Machine learning algorithms are known for their lack of interpretability. Through the successive permutation of the linguistically enhanced predictor variables in the developed GNB algorithm, we explored the individual impact of each feature on the model’s sensitivity and specificity. Two sets of semantic features emerged as large contributors to the model’s ability to predict the suitability of health educational resources for adults and children, respectively. We found the final algorithm interpretable using the linguistic profiling framework developed for the automatically selected features. For the prediction of adult-oriented health education readings, that is, features highly relevant to the sensitivity of the model, 11 semantic features were identified as large contributors, as indicated by the decrease in sensitivity in their absence: X3 (−9.4%; sensory words: taste, sound, touch, sight, smell, etc), S4 (−8.93%; kinships), Z99 (−8.78%; complex words), A14 (−7.99%; focusing subjuncts that draw attention to or focus on specific points), Z8 (−6.9%; pronouns), A11 (−6.11%; terms describing importance and priority), S1 (−5.96%; terms of participation, involvement, entitlement, and eligibility or describing personality traits such as strength, weakness, vulnerability, and disadvantaged), A5 (−5.94%; words evaluating good or bad or true or false), B3 (−5.33%; medical treatment), S8 (−4.86%; words describing levels of help, obstacles, and hindrance), and X9 (−0.31%; success or failure, gains or loss, or benefits or risks).
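The permutation analysis described here can be sketched with scikit-learn’s permutation_importance and a sensitivity (recall) scorer; the data are synthetic stand-ins for the 20-feature corpus:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Stand-in data: 400 texts x 20 features; 1 = adult-, 0 = children-oriented.
X, y = make_classification(n_samples=400, n_features=20, n_informative=12,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

model = GaussianNB().fit(X_tr, y_tr)
sensitivity = make_scorer(recall_score, pos_label=1)

# Shuffle each feature in turn; the drop in sensitivity estimates how much
# the model relies on that feature.
result = permutation_importance(model, X_te, y_te, scoring=sensitivity,
                                n_repeats=20, random_state=0)
top = np.argsort(result.importances_mean)[::-1][:5]
print(top, result.importances_mean[top])
```

Repeating the run with a specificity scorer (pos_label=0) yields the children-oriented feature ranking.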
For the prediction of health education readings on children’s health, that is, features highly relevant for the specificity of the model, 10 semantic features were identified as large contributors, as shown by the decrease in model specificity with the successive permutation of these features (
It is worth noting that features identified as key contributors to model sensitivity were not necessarily features that were statistically significant in adult-oriented health readings (
Impact of selected features on mean sensitivity.
Impact of selected features on mean specificity.
The size of the training data set was relatively small, with a couple of hundred children-oriented health texts. However, this reflects reality, as far fewer health educational resources are written for children than for adults. As a result, the model specificity was consistently lower than the model sensitivity. In addition, in the linguistic evaluation framework (
Our study has shown that children-oriented and adult-oriented health educational readings in English have distinct semantic features that can be effectively exploited to develop machine learning algorithms with proven discriminatory accuracy. Specifically, we identified three large sets of semantic features related to the varying cognitive approaches to health information acquisition, the social contexts of health issues, and user-adaptive health communication styles. Machine learning is known to lack interpretability. Our study developed algorithms that are interpretable from the perspective of linguistics and user-oriented health information assessment. Thus, our study shows that a more integrated approach to computerized health information assessment combining insights from fields such as linguistics and health education can help harness the power of machine learning to advance applied social and health research.
Health on the Net Foundation–certified websites used.
Core parameters of the hyper-parameter tuning of ridge classifier, extreme gradient boosting, support vector machine, and recursive feature elimination.
area under the receiver operating characteristic curve
Gaussian Naive Bayes
K-nearest neighbor
Patient Education Materials Assessment Tool
ridge classifier
recursive feature elimination
support vector machine
University of Lancaster Semantic Annotation System
extreme gradient boosting
None declared.