This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Preterm birth (PTB), a common pregnancy complication, is responsible for 35% of the 3.1 million neonatal deaths each year and significantly affects around 15 million children annually worldwide. Conventional approaches to predict PTB lack reliable predictive power, leaving >50% of cases undetected. Recently, machine learning (ML) models have shown potential as an appropriate complementary approach for PTB prediction using health records (HRs).
This study aimed to systematically review the literature concerned with PTB prediction using HR data and the ML approach.
This systematic review was conducted in accordance with the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement. A comprehensive search was performed in 7 bibliographic databases until May 15, 2021. The quality of the studies was assessed, and descriptive information, including descriptive characteristics of the data, ML modeling processes, and model performance, was extracted and reported.
A total of 732 papers were screened through title and abstract. Of these 732 studies, 23 (3.1%) were screened by full text, resulting in 13 (1.8%) papers that met the inclusion criteria. The sample size varied from a minimum value of 274 to a maximum of 13,150,017. The time length for which data were extracted varied from 1 to 11 years, and the oldest and newest data were related to 1988 and 2018, respectively. Population, data set, and ML models’ characteristics were assessed, and the performance of the model was often reported based on metrics such as accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve.
Various ML models used for different HR data indicated potential for PTB prediction. However, evaluation metrics, software and package used, data size and type, selected features, and importantly data management methods often remain unjustified, threatening the reliability, performance, and internal or external validity of the model. To clarify whether ML can cover the existing gap, future studies should also compare it with a conventional method on the same data set.
Preterm birth (PTB), a common pregnancy complication, is responsible for 1.085 million (35%) of the 3.1 million neonatal deaths each year and significantly affects approximately 15 million children annually worldwide [
Fortunately, health records (HRs) in most countries contain data on a person’s sociodemographic, obstetric, and medical history. This makes HRs appropriate data sets from which ML models can learn and eventually predict the intended outcome. A growing body of research applies ML to HR data to identify efficient predictive models for the early diagnosis of PTB. The few existing systematic or literature reviews, although informative, are not focused on PTB [
This systematic review was conducted in accordance with the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement. A comprehensive search was performed in bibliographic databases including PubMed, CINAHL, MEDLINE, Web of Science, Scopus, Engineering Village (Compendex and Inspec), and IEEE Computer Society Digital Library, until May 15, 2021, in collaboration with a medical librarian (Stephen L Clancy). The search terms included controlled and free-text terms. The search strategy and number of articles found from each database are shown in
Studies were included if they aimed to predict PTB risk by using HR data. The outcome variable was PTB occurrence, which is globally defined as any pregnancy termination between 20 and 37 weeks of gestation. Although in some studies PTB was defined differently in terms of age range, all definitions were aligned under 37 weeks of gestational age. The PTB definition serves to examine and establish model performance (ie, the ability of the intended model to distinguish PTB cases from non-PTB cases). The papers were required to include a statement of the ML domain or any of its synonyms. To identify any study that failed to include an ML statement in the title or abstract, an extensive list of commonly used ML model techniques was added to the search strategy.
Selected articles were peer reviewed in the Covidence web-based software [
The quality of studies was assessed using the criteria proposed by Qiao [
Quality assessment.
Study | Unmet need (existing gap) | Reproducibility: feature engineering | Reproducibility: platform package | Reproducibility: hyperparameters | Robustness: valid methods to overcome overfitting | Robustness: stability of results | Generalizability (external validation data) | Clinical significance: predictor explanation | Clinical significance: suggested clinical use |
Weber et al, 2018 [ | Yes | Yes | Yes | No | 5-fold CVa | Minimum and maximum values reported from the CV | No | Logistic regression coefficients and odds ratios | No |
Rawashdeh et al, 2020 [ | Yes | Yes | Yes | Number of neighbors for KNNb, number of hidden layers for ANNc, number of trees for RFd | Train-test split. Train size 237 with 19 positives. Test size 37 with 7 positives | No | No | No | Yes |
Gao et al, 2019 [ | Yes | Representing medical concepts as a bag of words and word embeddings, TF-IDFe, discretization of continuous features | No | No | Train-test split. Train size 17,607 with 132 positives. Test size 8082 with 85 positives | Minimum and maximum values and CIs | No | Feature importance, odds ratio | Yes |
Lee and Ahn, 2019 [ | Yes | No | Yes | Only neural network architecture described | Train-test split. Both train and test sets contained 298 participants | No | No | Feature importance (RF and ANN) | No |
Woolery and Grzymala-Busse, 1994 [ | Yes | No | Yes | No | A total of 3 different data sets used in isolation; 50-50 train-test split was used with each data set | No | No | No | No |
Grzymala-Busse and Woolery, 1994 [ | Yes | No | Yes | No | A total of 3 different data sets used in isolation; 50-50 train-test split was used with each data set | No | No | No | No |
Vovsha et al, 2014 [ | Yes | No | Yes | No | Data separated timewise to 3 data sets, and 80-20 train-test split was used with each data set; 5-fold CV to select models | No | No | Feature importance (linear SVMf) | No |
Esty et al, 2018 [ | Yes | No | Yes | No | No | No | No | No | No |
Frize et al, 2011 [ | Yes | No | Yes | No | Division into 3 data sets (parous and nulliparous). Train-test-verification splits | SDs of the metrics were reported | No | No | No |
Goodwin and Maher, 2000 [ | Yes | No | Yes | No | Train-test split (75%-25%) | No | No | Feature importance | No |
Tran et al, 2016 [ | Yes | Unigrams were created from free-text fields after removal of stop words | No | No | Train-test split (66%-33%) | No | No | Feature importance | Yes |
Koivu and Sairanen, 2020 [ | Yes | New features were created. Continuous features were standardized, and nominal features were one-hot encoded | Yes | All hyperparameters described | Data set partitioned into 4 parts (feature selection, training, validation, and test, with stratified splits of 10%-70%-10%-10%) | 95% CIs for metrics | Yes | Feature importance | Yes |
Khatibi et al, 2019 [ | Yes | Imputation with mode for categorical features and median for continuous features | No | No | Train-test split | No | No | Feature importance | No |
aCV: cross-validation.
bKNN: k-nearest neighbors.
cANN: artificial neural network.
dRF: random forest.
eTF-IDF: term frequency-inverse document frequency.
fSVM: support vector machine.
The reviewed studies were not homogeneous in terms of methodology and data set; thus, a meta-analysis was not possible. A narrative synthesis was chosen to bring together broad knowledge from various approaches. This type of synthesis is not the same as a narrative description that accompanies many reviews. To synthesize the literature, we applied a guideline from Popay et al [
After removing duplicates, 732 papers were screened through title and abstract. Of these 732 studies, 23 (3.1%) were screened by full text, resulting in 13 (1.8%) papers that met the inclusion criteria. Reasons for exclusion at this stage were recorded and are shown in the flow diagram in
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) chart.
All the studies were retrospective and used one or more data sets recorded in clinical settings. Of the 13 studies, 7 (54%) were conducted in or after 2018 and 9 (69%) originated from the United States. The time length for which data were extracted varied from 1 to 11 years, and the oldest and newest data were related to 1988 and 2018, respectively. Of the 13 studies, 6 (46%) did not report the ethnicity or race of the population whose data were modeled. Various data sets were used for the studies, and the number of data sets varied from 1 to 3 in each study. The types of information included in each data set varied, including demographic, obstetric history, medical background, and clinical and laboratory information. Demographic information was included in almost all of the data sets used in the included studies. The size of the population whose data were used for ML modeling varied from 274 to 13,150,017 people, and the number of features considered for modeling varied from 19 to 5000 depending on the data set used. PTB was defined differently from study to study. In 77% (10/13) of the studies, the cutoff between the study (PTB) and control (non-PTB) groups was the 37th week of gestational age, which matches the standard cutoff between term and preterm births. Of the 13 studies, 3 (23%) determined the PTB cutoff based on the frequency of newborn death [
Descriptive characteristics of studies and feature selection.
Study, country, and type of study | Population characteristics | Data source (number of features) | Population (birth) | Study (PTBa), control groups, and type of PTB | Feature selection process and gestational week for when selected features are related | Number of selected features | Date |
Weber et al, 2018 [ | Nulliparous women with a singleton birth (<32, ≥20, and ≥37 weeks); non-Hispanic Black (n=54,084) and White (n=282,130) | Birth certificate and hospital discharge records: >1000 features | 336,214 | PTB (early spontaneous): ≥20 and <32 weeks; control: ≥37 weeks | Factors with uncertain and ambiguous values were excluded, highly correlated features were collapsed, exclusion of features with no variation; —b | 20 | 2007 to 2011 |
Rawashdeh et al, 2020 [ | Australian; pregnancies with cervical cerclage | Data from a fetal medicine unit in a tertiary hospital in NSWc: 19 features | 274 | PTB (spontaneous): <26 weeks; control: >26 weeks | Unnecessary features (eg, medical record numbers) were excluded | 19 | 2003 to 2014 |
Gao et al, 2019 [ | Caucasian (>68%), Black (16%-21%), and other (10%-13%) | EHRd of Vanderbilt University Medical Center: 150 features | 25,689 | PTB: <28 weeks; control: ≥28 weeks; type of PTB was not distinguished | Features were arranged by their information gain and top 150 features were retained; — | 150 | 2005 to 2017 |
Lee and Ahn, 2019 [ | Korean; induced labors were excluded | Anam Hospital in Seoul | 596 | PTB (spontaneous): >20 and <37 weeks; control: ≥37 weeks | — | 14 | 2014 to 2018 |
Woolery and Grzymala-Busse, 1994 [ | — | 3 data sets: 214 features in total | 18,890 | PTB: <37 weeks; control: ≥37 weeks; type of PTB was not distinguished | — | Data set 1 (n=52), data set 2 (n=77), and data set 3 (n=85) | 1994 |
Grzymala-Busse and Woolery, 1994 [ | — | 3 data sets: 153 features in total | 9480 | PTB: <36 weeks; control: ≥36 weeks; type of PTB was not distinguished | — | Data set 1 (n=13), data set 2 (n=73), and data set 3 (n=67) | 1994 |
Vovsha et al, 2014 [ | — | NICHDe-MFMUf data set: >400 features | 2929 | PTB (spontaneous and induced): <32, <35, and <37 weeks; control: ≥37 weeks | Logistic regression with forward selection, stepwise selection, LASSOg, and elastic net; — | 24th week (n=50), 26th week (n=205), and 28th week (n=316) | 1992 to 1994 |
Esty et al, 2018 [ | — | BORNh and PRAMSi: 520 features | 782,000 | PTB: <37 weeks; control: ≥37 weeks; type of PTB was not distinguished | Features with >50% missing values were removed before missing value imputation; features come from before the 23rd gestational week | 520 | — |
Frize et al, 2011 [ | — | PRAMS: >300 features | >113,000 | PTB: <37 weeks; control: ≥37 weeks; type of PTB was not distinguished | Decision tree (to establish consistency between data sets, features specific to the United States were excluded, eg, Medicaid and Women Infants Children Program); features come from before the 23rd gestational week | 19 for parous and 16 for nulliparous | 2002 to 2004 |
Goodwin and Maher, 2000 [ | — | Duke University’s Medical Center TMR TM perinatal data: 4000−5000 features | 63,167 | PTB: <37 weeks; control: ≥37 weeks; type of PTB was not distinguished | Heuristic techniques (features related to week <37 were included); — | 32 demographic and 393 clinical | 1988 to 1997 |
Tran et al, 2016 [ | Australian | RNSj, NSW | 15,814 births | PTB (spontaneous and elective): <34 and <37 weeks; control: ≥37 weeks | Features kept based on their importance (top | 10 | 2011 to 2015 |
Koivu and Sairanen, 2020 [ | White, Black, American Indian or Alaskan native, and Asian or Pacific Island individuals | CDCk and NYCl data sets | 13,150,017 | PTB: <37 weeks; control: ≥37 weeks; type of PTB was not distinguished | Excluding highly correlated features with correlation analysis (Pearson); — | 26 | CDC: 2013 to 2016; NYC: 2014 to 2016 |
Khatibi et al, 2019 [ | Iranian | National maternal and neonatal records (IMaNm registry): 112 features | >1,400,000 | PTB (spontaneous and medically indicated): >28 and <37 weeks; control: ≥37 weeks | Parallel feature selection and classification methods including MR-PB-PFS (features with nonzero scores are selected as top features); — | 112 | 2016 to 2017 |
aPTB: preterm birth.
bNot reported in the study.
cNSW: New South Wales.
dEHR: electronic health record.
eNICHD: National Institute of Child Health and Human Development.
fMFMU: Maternal-Fetal Medicine Units Network.
gLASSO: least absolute shrinkage and selection operator.
hBORN: Better Outcomes Registry & Network.
iPRAMS: Pregnancy Risk Assessment Monitoring System.
jRNS: Royal North Shore.
kCDC: Centers for Disease Control and Prevention.
lNYC: New York City.
mIMaN: Iranian Maternal and Neonatal Network.
Of the 13 studies, 9 (69%) reported at least one piece of preprocessing information regarding the included data. The preprocessing steps included data mapping, missing data management, and class imbalance management. Of the 13 studies, 11 (85%) reported at least one method for the feature selection process. The number of features selected for final ML modeling varied from 10 to 520. On the basis of the literature surveyed, only 2 (15%) of the 13 studies used unsupervised feature selection. In addition, 3 (23%) of the 13 studies did not use feature selection, and some studies used heuristics instead. Owing to the divergence in feature selection approaches, we could not identify clear trends in how the approach used would affect model performance (see
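One of the feature selection approaches listed in the table above, excluding highly correlated features with Pearson correlation analysis (reported for Koivu and Sairanen), can be illustrated with a minimal Python sketch. This is not the code of any reviewed study: the function names, feature names, toy values, and the 0.9 threshold are assumptions made purely for illustration, and the sketch assumes that no feature is constant.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (neither constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: weight recorded twice in different units, plus age.
columns = {
    "weight_kg": [50, 60, 70, 80],
    "weight_lb": [110, 132, 154, 176],
    "maternal_age": [30, 22, 35, 27],
}
kept = drop_correlated(columns)
print(kept)  # prints ['weight_kg', 'maternal_age']
```

The greedy order-dependent choice (the first of two correlated features is kept) is a simplification; a study would need to justify which feature of a correlated pair is retained.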
Data processing and machine learning modeling.
Study | Missing data management (preprocessing) | Class imbalance management (preprocessing) | Model | Dominant model | Evaluation metrics | Analysis software and package | Findings |
Weber et al, 2018 [ | MICEa | —b | Super learning approach using logistic regression, random forest, | No difference between models | Sensitivity, specificity, PVPe, PVNf, and AUCg | Rstudio (version 3.3.2), SuperLearner package | AUC=0.67, sensitivity=0.61, specificity=0.64 |
Rawashdeh et al, 2020 [ | Instances with missing values were removed manually | SMOTEh | Locally weighted learning, Gaussian process, K-star classifier, linear regression, | Random forest | Accuracy, sensitivity, specificity, AUC, and G-means | WEKAi (version 3.9) | Random forest: G-mean=0.96, sensitivity=1.00, specificity=0.94, accuracy=0.95, AUC=0.98 (oversampling ratio of 200%) |
Gao et al, 2019 [ | — | Control group were undersampled | RNNsj, long short-term memory network, logistic regression, SVMk, Gradient boosting | RNN ensembled models on balanced data | Sensitivity, specificity, PVP, and AUC | — | AUC=0.827, sensitivity=0.965, specificity=0.698, PVP=0.033 |
Lee and Ahn, 2019 [ | — | — | ANNl, logistic regression, decision tree, naïve Bayes, random forest, SVM | No difference between models | Accuracy | Python (version 3.52) | No difference in accuracy between ANN (0.9115) with logistic regression and the random forest (0.9180 and 0.8918, respectively) |
Woolery and Grzymala-Busse, 1994 [ | — | — | LERSm | — | Accuracy | ID3n, LERS CONCLUS | Database 1: accuracy=88.8% accurate for both low-risk and high-risk pregnancy. Database 2: accuracy=59.2% in high-risk pregnant women. Database 3: accuracy=53.4% |
Grzymala-Busse and Woolery, 1994 [ | — | — | LERS based on the | — | Accuracy | LERS | Accuracy=68% to 90% |
Vovsha et al, 2014 [ | — | Oversampling techniques (Adasyn) | SVMs with linear and nonlinear kernels, LR (forward selection, stepwise selection, L1 LASSO regression, and elastic net regression) | — | Sensitivity, specificity, and G-means | Rstudio, glmnet package | SVM: sensitivity (0.404 to 0.594), specificity (0.621 to 0.84), G-mean (0.575 to 0.652); LR: sensitivity (0.502 to 0.591), specificity (0.587 to 0.731), G-mean (0.586 to 0.604) |
Esty et al, 2018 [ | Imputation with the | Not clear | Hybrid C5.0 decision tree−ANN classifier | — | Sensitivity, specificity, and ROCo | R software, missForest Package, FANNp library | Sensitivity: 84.1% to 93.4%, specificity: 70.6% to 76.9%, AUC: 78.5% to 89.4% |
Frize et al, 2011 [ | Decision tree | — | Hybrid decision tree–ANN | — | Sensitivity, specificity, ROC for Pq and NPr cases | See5, MATLAB Neural Ware tool | Training (P: sensitivity=66%, specificity=83%, AUC=0.81; NP: sensitivity=62.8%, specificity=71.7%, AUC=0.72), test (P: sensitivity=66.3%, specificity=83.9%, AUC=0.80; NP: sensitivity=65%, specificity=71.3%, AUC=0.73), and verification (P: sensitivity=61.4%, specificity=83.3%, AUC=0.79; NP: sensitivity=65.5%, specificity=71.1%, AUC=0.73) |
Goodwin and Maher, 2000 [ | PVRuleMinerl or FactMiner | — | Neural networks, LR, CARTs, and software programs called PVRuleMiner and FactMiner | No difference between models | ROC | Custom data mining software (Clinical Miner and PVRuleMiner, FactMiner) | No significant difference between techniques. Neural network (AUC=0.68), stepwise LR (AUC=0.66), CART (AUC=0.65), FactMiner (demographic features only; AUC=0.725), FactMiner (demographic plus other indicator features; AUC=0.757) |
Tran et al, 2016 [ | — | Undersampling of the majority class | SSLRt, RGBu | — | Sensitivity, specificity, NPVv, PVP, F-measure, and AUC | — | SSLR: sensitivity=0.698 to 0.734, specificity=0.643 to 0.732, F-measure=0.70 to 0.73, AUC=0.764 to 0.791, NPV=0.96 to 0.719, PVP=0.679, 0.731; RGB: sensitivity=0.621 to 0.720, specificity=0.74 to 0.841, F-measures=0.693 to 0.732, NPV=0.675 to 0.717, PVP=0.783 to 0.743, AUC=0.782 to 0.807 |
Koivu and Sairanen, 2020 [ | — | — | LR, ANN, LGBMw, deep neural network, SELUx network, average ensemble, and weighted average WAy ensemble | — | AUC | Rstudio (version 3.5.1) and Python (version 3.6.9) | AUC for classifiers: LR=0.62 to 0.64; deep neural network: 0.63 to 0.66; SELU network: 0.64 to 0.67; LGBM: 0.64 to 0.67; average ensemble: 0.63 to 0.67; WA ensemble: 0.63 to 0.67 |
Khatibi et al, 2019 [ | Map phase module | — | Decision trees, SVMs and random forests, ensemble classifiers | — | Accuracy and AUC | — | Accuracy=81% and AUC=68% |
aMICE: Multiple Imputation by Chained Equations.
bNot reported in the study.
cLR: linear regression.
dLASSO: least absolute shrinkage and selection operator.
ePVP: predictive value positive.
fPVN: predictive value negative.
gAUC: area under the ROC curve.
hSMOTE: Synthetic Minority Oversampling Technique.
iWEKA: Waikato Environment for Knowledge Analysis.
jRNN: recurrent neural network.
kSVM: support vector machine.
lANN: artificial neural network.
mLERS: learning from examples of rough sets.
nID3: iterative dichotomiser 3.
oROC: receiver operating characteristic.
pFANN: Fast Artificial Neural Network.
qP: parous.
rNP: nulliparous.
sCART: classification and regression tree.
tSSLR: stabilized sparse logistic regression.
uRGB: Randomized Gradient Boosting.
vNPV: negative predictive value.
wLGBM: Light Gradient Boosting Machine.
xSELU: scaled exponential linear unit.
yWA: weighted average.
Although the included features somewhat differed in the studies, some features were commonly used and considered potential risk factors that may predict PTB occurrence (
Frequency of potential risk factors in the studies (n=13).
Potential risk factors | Studies, n (%) |
Previous PTBa | 10 (77) |
Hypertensive disorders | 9 (69) |
Maternal age | 7 (54) |
Cervical or uterine disorders (cerclage, myoma, or incompetence) | 7 (54) |
Ethnicity and race | 6 (46) |
Diabetes (eg, gestational, mellitus) | 6 (46) |
Smoking or substance abuse | 5 (38) |
Multiple pregnancy | 5 (38) |
Education | 4 (31) |
Physical characteristics (BMI, weight, and height) | 4 (31) |
Parity | 4 (31) |
Marital status | 3 (23) |
Other chronic diseases (thyroid, asthma, systemic lupus erythematosus, or cardiovascular) | 3 (23) |
PTB symptoms (bleeding, contractions, premature rupture of membranes, etc) | 3 (23) |
Insurance | 2 (15) |
Income | 2 (15) |
In vitro fertilization | 2 (15) |
Stress or domestic violence | 2 (15) |
Infections (gonorrhea, syphilis, chlamydia, or hepatitis C) | 1 (8) |
Biopsy | 1 (8) |
aPTB: preterm birth.
Various basic and complex ML modeling approaches were used with different frequencies, including artificial neural network, logistic regression, decision tree, support vector machine (SVM) with linear and nonlinear kernels, linear regression (least absolute shrinkage and selection operator [LASSO], ridge, and elastic net), random forest, locally weighted learning, gradient boosting, learning from examples of rough sets, Gaussian process, K-star classifier, and naïve Bayes (
Although most studies reported the type of software applied for the ML analysis, only a few specified the package used. Several evaluation measures were used to assess the proposed models. In order of how frequently they were used in the studies, these were sensitivity, specificity, area under the receiver operating characteristic curve, accuracy, predictive value positive, predictive value negative (also reported as negative predictive value), G-mean, and F-measure. Owing to the divergent methodology used for outcome assessment and model processing, comparison between models was not possible. However, overall, studies with a cutoff gestational age of 37th week, regardless of the model used, often showed lower sensitivity (40%-69%), except for 1 study that showed a sensitivity of 93% [
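Because the studies reported different subsets of these measures, it may help to see how they relate to one another. The following Python sketch is illustrative only: the function names are assumptions, and the worked example assumes that the sensitivity and specificity reported for Gao et al were computed on the test set described in the quality assessment table (8082 records with 85 positives). Under that assumption, the very low predictive value positive follows directly from the low prevalence of the outcome.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is nonzero (at least one case, one control,
    one positive prediction, and one negative prediction).
    """
    sensitivity = tp / (tp + fn)   # share of PTB cases detected
    specificity = tn / (tn + fp)   # share of non-PTB cases correctly ruled out
    pvp = tp / (tp + fp)           # predictive value positive
    pvn = tn / (tn + fn)           # predictive value negative
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pvp": pvp,
        "pvn": pvn,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "g_mean": math.sqrt(sensitivity * specificity),
        "f_measure": 2 * pvp * sensitivity / (pvp + sensitivity),
    }

def pvp_from_rates(sensitivity, specificity, prevalence):
    """Predictive value positive implied by sensitivity, specificity, and prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)

# Values reported for Gao et al; the prevalence is taken from the test split.
gao_pvp = pvp_from_rates(0.965, 0.698, 85 / 8082)
print(round(gao_pvp, 3))  # prints 0.033
```

The result matches the reported predictive value positive of 0.033, which shows why accuracy or sensitivity alone can be misleading for a rare outcome.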
In general, reviewed studies had satisfactory quality (
Premature birth remains a public health concern worldwide. Survivors experience substantial lifetime morbidity and mortality. The conventional methods of PTB assessment used by clinicians seem to be insufficient to identify PTB risk in more than half of the cases. The conventional methods that rely on health record (HR) data are often statistical models in which, first, the input predictive factors are selected by a researcher and, second, the multifactorial nature of PTB is ignored. Thus, these methods are prone to selection bias and assume linear relationships. This linear view of HR data in conventional approaches is perhaps one of the major barriers to advancing our understanding of the nonlinear interaction dynamics between potential risk factors of multifactorial PTB. ML modeling, in contrast to statistical modeling, investigates the structure of the target phenomenon without prior assumptions about the data, and automatically and thoroughly explores possible nonlinear associations and higher-order interactions (more than 2-way) between the potential risk factors and the outcome [
Among the reviewed studies, the performance of various ML modeling indicated potential for predictive purposes. Owing to the different evaluation metrics used by studies, performance comparison across studies was not practical. On the basis of within-study synthesis, some studies compared nonlinear ML methods, such as deep neural networks, kernel SVMs, or random forests, to more basic linear models, such as logistic regression, LASSO, and elastic net. Of these 13 studies, 4 (31%) concluded that there was no significant difference between the predictive performances of the different applied methods [
An HR seems to be a useful data source, including the potential risk factors from which the ML model can learn the significant predictors as well as the nonlinear interaction among the identified risk factors.
A large sample size, as one of the distinct characteristics of HR data, is a double-edged sword that covers large populations but consumes time and requires advanced technology. A large data size can also be used to create validation sets. Most studies in this review had large sample sizes, including thousands of pregnant women. Although some studies performed internal validation, external validation was uncommon, and almost all studies validated the performance within the same HR. The lack of external validity assessment limits generalizability and may reduce the discrimination validity of the model when applied in other sites and HR systems. External validation of the model through its application in a distinct data set may be helpful in understanding its usefulness and generalizability in different geographical areas, periods, and settings [
A large data size (many recorded features) and data types that reflect each individual's characteristics are as important as a large sample size. HR data often appear insufficient to precisely identify risk factors, which decreases the accuracy of predictive ML models. Indeed, a small sample size and passive data that are limited to a few sociodemographic and medical history items seem insufficient to predict the multifactorial PTB. Enriched data that include more time-sensitive and dynamic characteristics of each individual (eg, life history, mental distress during various stages of pregnancy, and biomarker change) may increase the accuracy and integrity of the applied ML models. For example, being diagnosed with gestational diabetes is known to be a strong predictive factor for PTB among the features in ML models. However, owing to the dynamic nature of diabetes (glucose level), which can vary from moment to moment, particularly during pregnancy, applying a pool of data reflecting the dynamic glucose change in a person may be more accurate in predicting PTB in comparison with the presence or absence of diabetes. The difference in glucose change may also partially explain why some women with diabetes are at a higher risk of developing PTB. To achieve this accuracy in HR use, data should be enriched with more dynamic features, and ML models should be optimized to analyze the dynamic-natured potential risk factors that go beyond the clear-cut presence or absence of a feature [
In contrast, a small data size threatens the risk factor distinction for PTB prediction. There might be an indirect association between some predictive factors and PTB, falsifying the direct and actual associations. For example, smoking not only is introduced as a protective factor against mortality in low–birth weight and PTB infants but also is identified as a predictive factor for PTBs. In this case, PTB may not be the result of smoking directly itself but due to potential mediators, such as hypertension, which is triggered by smoking. Therefore, if there is no recorded information about blood pressure, the model may consider smoking as the actual risk factor. This highlights the importance of more possible health data to increase the ability of the ML model to distinguish between mediators and exposure features.
One of the major challenges in HR-based studies is the presence of missing data. Although missing data are an acknowledged challenge in HR studies, only a little more than half of the reviewed studies acknowledged the presence of missing data, and they used a variety of analytic approaches to manage it. Overall, despite its importance, there has been minimal work in this area, and it is unclear how such biased observations affect prediction models.
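As a simple illustration of one approach listed in the quality assessment table (imputation with the mode for categorical features and the median for continuous features, as reported for Khatibi et al), the following is a minimal Python sketch. It is not that study's code: the function names, field names, and toy records are hypothetical. The fill values are learned from the training rows only, so that no information from the test rows influences preprocessing.

```python
from statistics import median, mode

def fit_imputer(train_rows, continuous, categorical):
    """Learn fill values from training rows only (None marks a missing value)."""
    fill = {}
    for name in continuous:
        fill[name] = median([r[name] for r in train_rows if r[name] is not None])
    for name in categorical:
        fill[name] = mode([r[name] for r in train_rows if r[name] is not None])
    return fill

def apply_imputer(rows, fill):
    """Return copies of the rows with missing values replaced by the learned fills."""
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in row.items()}
        for row in rows
    ]

# Hypothetical toy records.
train_rows = [
    {"maternal_age": 24, "smoking": "no"},
    {"maternal_age": None, "smoking": "no"},
    {"maternal_age": 31, "smoking": None},
    {"maternal_age": 38, "smoking": "yes"},
]
test_rows = [{"maternal_age": None, "smoking": None}]

fill = fit_imputer(train_rows, continuous=["maternal_age"], categorical=["smoking"])
imputed_test = apply_imputer(test_rows, fill)
print(imputed_test)  # prints [{'maternal_age': 31, 'smoking': 'no'}]
```

Single-value imputation of this kind ignores the uncertainty of the filled values; multiple imputation (eg, the MICE approach reported for Weber et al) addresses that at the cost of more complexity.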
Another important challenge in HR-related models is unbalanced data between case and control groups. This imbalance arises because PTB occurs in approximately 10% of all births. Researchers have often applied oversampling techniques to handle unbalanced data. However, these techniques create artificial data that may not have much in common with actual observations. Oversampling techniques must be used carefully when validating models because if artificial instances end up in the test set (or test folds in cross-validation), one may obtain highly overoptimistic performance estimates.
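The safeguard described above amounts to splitting the data first and oversampling only the training portion. The following minimal Python sketch shows that order of operations. It uses plain random duplication of minority-class records as a stand-in for techniques such as SMOTE or ADASYN, which generate synthetic records instead; the function name, split fraction, seed, and toy labels are assumptions for illustration.

```python
import random

def split_then_oversample(labels, test_fraction=0.25, seed=0):
    """Split record indices into train and test, then oversample positives in train only."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx = indices[:n_test]      # left untouched: real records, real prevalence
    train_idx = indices[n_test:]

    positives = [i for i in train_idx if labels[i] == 1]
    negatives = [i for i in train_idx if labels[i] == 0]
    if not positives:
        raise ValueError("no positive records in the training split")
    # Duplicate minority-class records (with replacement) until the classes balance.
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    return train_idx + extra, test_idx

# Hypothetical toy labels with 10% PTB cases.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = split_then_oversample(labels)
print(len(test_idx), set(train_idx).isdisjoint(test_idx))  # prints 25 True
```

Performance metrics are then computed on `test_idx` only, so they reflect the original class distribution rather than the artificially balanced one.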
In addition, all reviewed studies approached PTB prediction as a classification problem. The reviewed studies often did not clearly distinguish abortion from PTB. If this ambiguity reflects a failure to distinguish abortion from PTB in the actual ML modeling, it may threaten the specificity of the model in predicting PTB. Moreover, as PTB and abortion have different leading causes, the findings of the studies may also be questionable. Classification also remains problematic within the defined PTB time window (20-37 gestational weeks). Neonates born at week ≤30 are considered to belong to the same class as those born at week 36 of pregnancy, although the former are associated with a much higher risk of adverse outcomes and require neonatal intensive care. Therefore, it could be more beneficial to approach PTB as a regression problem and try to predict the gestational age (in weeks or days) at childbirth. This approach could help identify PTB cases that have the greatest need for care.
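To illustrate how a regression output could still be translated into clinically meaningful groups, the following minimal Python sketch maps a predicted gestational age to the World Health Organization preterm subcategories (extremely preterm, <28 weeks; very preterm, 28 to <32 weeks; moderate to late preterm, 32 to <37 weeks). The function name and the predicted values are hypothetical, and the lower bound of the PTB window (20 weeks in the definition used in this review) is not handled here.

```python
def preterm_category(gestational_age_weeks):
    """Map a (predicted) gestational age at birth to a WHO preterm subcategory."""
    if gestational_age_weeks < 28:
        return "extremely preterm"
    if gestational_age_weeks < 32:
        return "very preterm"
    if gestational_age_weeks < 37:
        return "moderate to late preterm"
    return "term"

# Hypothetical outputs of a regression model, in weeks.
predicted_weeks = [29.5, 36.0, 39.2]
categories = [preterm_category(w) for w in predicted_weeks]
print(categories)  # prints ['very preterm', 'moderate to late preterm', 'term']
```

Unlike a single binary label, this keeps a birth predicted at 29.5 weeks distinguishable from one predicted at 36 weeks.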
Overall, ML modeling has been indicated to be a potentially useful approach in predicting PTB, although future studies should minimize the aforementioned limitations to achieve more accurate models. Importantly, ML’s ability to cover the existing gap in conventional statistical methods remains questionable. To achieve reliable conclusions, our study suggests some considerations for future studies. First, more studies are needed to compare ML modeling with existing conventional methods in the same data set with the same amount of data and population. Such comparison studies would reveal the potential superiority of one over the other. Second, the study population should be distinguished based on parity, particularly if previous pregnancy data were among the selected features. Otherwise, the model would probably rely on this strong predictive factor in multiparous women, leaving nulliparous women underserved and undetected. In addition, studies should be transparent about whether they use the same time frame for feature selection for case (PTB) and control (non-PTB) groups. For instance, assume that we have a cutoff point of 28 weeks before which we want our model to identify PTB cases. In this case, if the data included for the control group were recorded after the cutoff point, which most likely differ from those recorded before the cutoff point, the model may rely on the information after the cutoff point for PTB prediction. Thus, the model fails to detect the cases before the specified time point. Third, two cutoff points should be clarified in model development: (1) the gestational cutoff week before which the study aims to detect cases and (2) the gestational time point before which the features are collected. For example, Gao et al [
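The requirement that cases and controls share the same feature time frame can be made explicit in preprocessing. The following minimal Python sketch filters every record, regardless of outcome, to observations made before a chosen prediction time point. The record layout, field names, values, and the 2 week numbers are hypothetical and serve only to illustrate the 2 cutoff points described above.

```python
OUTCOME_CUTOFF_WEEK = 28   # hypothetical: PTB defined as birth before week 28
PREDICTION_WEEK = 24       # hypothetical: features must be recorded before week 24

def truncate_record(record, prediction_week):
    """Return a copy of the record with only observations made before prediction_week."""
    kept = [obs for obs in record["observations"] if obs["week"] < prediction_week]
    return {"label": record["label"], "observations": kept}

# Hypothetical toy records: one case (label 1) and one control (label 0).
records = [
    {"label": 1, "observations": [
        {"week": 12, "name": "systolic_bp", "value": 118},
        {"week": 26, "name": "systolic_bp", "value": 141},
    ]},
    {"label": 0, "observations": [
        {"week": 14, "name": "systolic_bp", "value": 112},
        {"week": 34, "name": "systolic_bp", "value": 120},
    ]},
]

# The same filter is applied to cases and controls alike.
model_inputs = [truncate_record(r, PREDICTION_WEEK) for r in records]
print([len(r["observations"]) for r in model_inputs])  # prints [1, 1]
```

Applying one filter to both groups prevents the model from learning from control-group information that would not yet exist at the time a prediction is needed.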
Enriched data size and optimized data type can also improve the usefulness of the ML model. Appropriate approaches for managing missing data and unbalanced control and case groups are also required to achieve more reliable and accurate results.
Search strategy.
Machine learning models’ frequency.
HR: health record
ML: machine learning
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
PTB: preterm birth
SVM: support vector machine
None declared.