This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information, must be included.
Machine learning (ML) models offer patients with diabetes mellitus (DM) more options for properly managing blood glucose (BG) levels. However, because numerous types of ML algorithms exist, choosing an appropriate model is vitally important.
In this systematic review and network meta-analysis, we aimed to comprehensively assess the performance of ML models in predicting BG levels. In addition, we assessed ML models used to detect and predict adverse BG (hypoglycemia) events by calculating pooled estimates of sensitivity and specificity.
PubMed, Embase, Web of Science, and Institute of Electrical and Electronics Engineers (IEEE) Xplore databases were systematically searched for studies on predicting BG levels and predicting or detecting adverse BG events using ML models, from inception to November 2022. Studies that assessed the performance of different ML models in predicting or detecting BG levels or adverse BG events of patients with DM were included. Studies with no derivation or performance metrics of ML models were excluded. The Quality Assessment of Diagnostic Accuracy Studies tool was applied to assess the quality of included studies. Primary outcomes were the relative ranking of ML models for predicting BG levels in different prediction horizons (PHs) and pooled estimates of the sensitivity and specificity of ML models in detecting or predicting adverse BG events.
In total, 46 eligible studies were included for meta-analysis. Regarding ML models for predicting BG levels, the means of the absolute root mean square error (RMSE) in a PH of 15, 30, 45, and 60 minutes were 18.88 (SD 19.71), 21.40 (SD 12.56), 21.27 (SD 5.17), and 30.01 (SD 7.23) mg/dL, respectively. The neural network model (NNM) showed the highest relative performance in different PHs. Furthermore, the pooled estimates of the positive likelihood ratio and the negative likelihood ratio of ML models were 8.3 (95% CI 5.7-12.0) and 0.31 (95% CI 0.22-0.44), respectively, for predicting hypoglycemia and 2.4 (95% CI 1.6-3.7) and 0.37 (95% CI 0.29-0.46), respectively, for detecting hypoglycemia.
Statistically significant high heterogeneity was detected in all subgroups, with different sources of heterogeneity. For predicting precise BG levels, the RMSE increases with a rise in the PH, and the NNM shows the highest relative performance among all the ML models. Meanwhile, current ML models have sufficient ability to predict adverse BG events, while their ability to detect adverse BG events needs to be enhanced.
PROSPERO CRD42022375250; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=375250
Diabetes mellitus (DM) has become one of the most serious health problems worldwide [
Machine learning (ML) models use statistical techniques to give computers the ability to complete tasks by learning from data without being explicitly programmed [
Recently, there has been an immense surge in using ML technologies for predicting DM complications. Regarding BG management, previous studies have developed different types of ML models, including random forest (RF) models, support vector machines (SVMs), neural network models (NNMs), and autoregression models (ARMs), using CGM data, electronic health records (EHRs), electrocardiograph (ECG), electroencephalograph (EEG), and other information (ie, biochemical indicators, insulin intake, exercise, and meals) [
Therefore, this meta-analysis aimed to comprehensively assess the performance of ML models in BG management in patients with DM.
The study protocol has been registered in the International Prospective Register of Systematic Reviews (PROSPERO; registration ID: CRD42022375250). Studies on BG levels or adverse BG event prediction or detection using ML models were eligible, with no restrictions on language, investigation design, or publication status. PubMed, Embase, Web of Science, and Institute of Electrical and Electronics Engineers (IEEE) Xplore databases were systematically searched from inception to November 2022. Keywords used for study repository searches were (“machine learning” OR “artificial intelligence” OR “logistic model” OR “support vector machine” OR “decision tree” OR “cluster analysis” OR “deep learning” OR “random forest”) AND (“hypoglycemia” OR “hyperglycemia” OR “adverse glycemic events”) AND (“prediction” OR “detection”). Details regarding the search strategies are summarized in
Inclusion criteria were as follows: (1) participants in the studies were diagnosed with DM; (2) study endpoints were hypoglycemia, hyperglycemia, or BG levels; (3) the studies established at least 2 types of ML models for the prediction of BG levels and at least 1 type of ML model for the prediction or detection of adverse BG events; (4) the studies reported the performance of ML models with statistical or clinical metrics; (5) the studies contained the development and validation of ML models; and (6) study outcomes were means (SDs) of performance metrics of test data for the prediction of BG levels and sensitivity and specificity of test data for the prediction or detection of adverse BG events.
Exclusion criteria were as follows: (1) studies did not report on the derivation of ML models; (2) studies were based only on physiological or control-oriented ML models; (3) studies could not reproduce true positives, true negatives, false negatives, and false positives for the prediction or detection of adverse BG events; (4) studies were reviews, systematic reviews, animal studies, or irretrievable and repetitive papers; and (5) studies had unavailable full text or outcome metrics.
Authors KL and LYL independently screened and selected studies based on the criteria mentioned above. Authors KL and YM extracted and recorded the data from the selected studies. Conflicts were resolved by consensus. The study strictly followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement (
Two reviewers independently carried out data extraction and quality assessment. If a single study included more than 1 extractable test result for the same ML model, the best result was extracted. If a single study included 2 or more models, the performance metrics of each model were extracted. For studies predicting BG levels, RMSEs based on different prediction horizons (PHs) were extracted. For studies predicting or detecting adverse BG events, the sensitivity, specificity, and precision required to reproduce the 2×2 contingency table were extracted.
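For event-based studies, the 2×2 contingency table can be rebuilt algebraically from the reported metrics. A minimal sketch of that reconstruction (a hypothetical helper, not the authors' extraction code), assuming the reported event count equals TP + FN:

```python
def reconstruct_2x2(sensitivity, specificity, precision, n_events):
    """Rebuild TP/FN/FP/TN counts from reported metrics.

    Assumes n_events = TP + FN (the number of true hypoglycemia episodes).
    """
    tp = sensitivity * n_events                  # sensitivity = TP / (TP + FN)
    fn = n_events - tp
    fp = tp * (1 - precision) / precision        # precision = TP / (TP + FP)
    tn = fp * specificity / (1 - specificity)    # specificity = TN / (TN + FP)
    return round(tp), round(fn), round(fp), round(tn)
```

For example, a study reporting sensitivity 0.80, specificity 0.90, and precision 0.50 over 100 events yields counts of 80, 20, 80, and 720, from which the pooled diagnostic estimates can be computed.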
Specifically, the following information was extracted:
General characteristics: first author, publication year, country, data source, and study purpose (ie, predicting or detecting hypoglycemia)
Experimental information: participants (type of DM, type 1 or 2), sample size (patients, data points, and hypoglycemia), demographic information, models, study place and time, model parameters (ie, input and PHs), model performance metrics, threshold of BG levels for hypoglycemia, and reference (ie, fingerstick)
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool was applied to assess the quality of included studies based on patient selection (5 items), index test (3 items), reference standard (4 items), and flow and timing (4 items). All 4 domains were used to assess the risk of bias, and the first 3 domains were used to assess concerns about applicability. Each domain includes 1 query on the risk of bias or applicability, comprising 7 questions [
The performance metrics of ML models used to predict BG levels, predict adverse BG events, and detect adverse BG events were assessed independently. The performance metrics were the RMSE of ML models in predicting BG levels and the sensitivity and specificity of ML models in predicting or detecting adverse BG events. For BG level–based studies, a network meta-analysis was conducted to assess global and local inconsistency between studies, and the surface under the cumulative ranking (SUCRA) curve of every model was plotted to calculate relative ranks. For event-based studies, pooled sensitivity, specificity, the positive likelihood ratio (PLR), and the negative likelihood ratio (NLR) with 95% CIs were calculated. Study heterogeneity was assessed by calculating I² values based on multivariate random-effects meta-regression that considered within- and between-study correlation and classifying them into quartiles (0% to <25% for low, 25% to <50% for low-to-moderate, 50% to <75% for moderate-to-high, and ≥75% for high heterogeneity) [
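The RMSE used for the BG level–based studies is simply the square root of the mean squared prediction error between predicted and observed BG values. A minimal illustration (not the authors' analysis code):

```python
import math

def rmse(predicted, observed):
    """Root mean square error between predicted and observed BG levels (mg/dL)."""
    assert len(predicted) == len(observed)
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Per-model RMSEs reported at each PH feed directly into the network meta-analysis described above.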
Furthermore, BG level–based studies were divided into 4 subgroups based on different PHs (15, 30, 45, and 60 minutes), and adverse event–based studies were analyzed by type of model (ie, NNM, RF, and SVM). A 2-sided
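The pooled PLR and NLR for event-based studies follow directly from sensitivity and specificity (the actual pooling used a bivariate random-effects model, so this point computation is only illustrative), and I² values map onto the quartile bands defined in the statistical analysis. A sketch of both, with hypothetical helper names:

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    plr = sensitivity / (1 - specificity)   # how much a positive result raises the odds
    nlr = (1 - sensitivity) / specificity   # how much a negative result lowers the odds
    return plr, nlr

def heterogeneity_band(i_squared):
    """Classify an I² value (in %) into the quartile bands used in this analysis."""
    if i_squared < 25:
        return "low"
    if i_squared < 50:
        return "low-to-moderate"
    if i_squared < 75:
        return "moderate-to-high"
    return "high"
```

For instance, a pooled sensitivity of 0.80 and specificity of 0.90 would give a PLR of 8.0 and an NLR of about 0.22.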
A total of 20,837 studies were identified through systematic searches of the predefined electronic databases; these included 21 studies found through reference tracking [
Flow diagram of identifying and including studies. IEEE: Institute of Electrical and Electronics Engineers.
As studies on hyperglycemia were insufficient for analysis, we selected studies on hypoglycemia to assess the ability of ML models to predict adverse BG events. In total, the 46 studies included 28,775 participants: n=428 (1.49%) for predicting BG levels, n=28,138 (97.79%) for predicting adverse BG events, and n=209 (0.72%) for detecting adverse BG events. Of the 46 studies, 10 (21.7%) [
Baseline characteristics of BG^{a} level–based studies (N=10).
First author (year), country  Data source  Sample size  Demographic information  Object; setting  Model; PH^{b} (minutes); input  Performance metrics  
Patients, n  Data points, n 


Pérez-Gandía (2010), Spain [
CGM^{c} device  15  728  —^{d}  T1DM^{e}; out  Models: NNM^{f}, ARM^{g} PH: 15, 30 Input: CGM data  RMSE^{h}, delay  
Prendin (2021), United States [
CGM device  Real (n=141)  350,000  Age  T1DM; out  ARM, autoregressive moving average (ARMA), autoregressive integrated moving average (ARIMA), SVM^{i}, RF^{j}, feedforward neural network (fNN), long short-term memory (LSTM) PH: 30 Input: CGM data  RMSE, coefficient of determination (COD), sensitivity, delay, precision

Zhu (2020), England [
Ohio T1DM, UVA/Padova T1D  Real (n=6), simulated (n=10)  1,036,800  —  T1DM; out  DRNN^{k}, NNM, SVM, ARM PH: 30 Input: BG level, meals, exercise, meal times  RMSE, mean absolute relative difference (MARD), time gain
D'Antoni (2020), Italy [ 
Ohio T1DM  6  —  Age, sex ratio  T1DM; out  ARJNN^{l}, RF, SVM, autoregression (AR), one symbolic model (SAX), recurrent neural network (RNN), one neural network model (NARX), jump neural network (JNN), delayed feedforward neural network model (DFFNN) PH: 15, 30 Input: CGM data  RMSE  
Amar (2020), Israel [ 
CGM device, insulin pump  141  1,592,506  Age, sex ratio, weight, BMI, duration of DM  T1DM; in  ARM, gradually connected neural network (GCN), fully connected (FC [neural network]), light gradient boosting machine (LGBM), RF PH: 30, 60 Input: CGM data  RMSE, Clarke error grid (CEG)
Li (2020), England [ 
UVA/Padova T1D  Simulated (n=10)  51,840  —  T1DM; out  GluNet, NNM, SVM, latent variable with exogenous input (LVX), ARM PH: 30, 60 Input: BG level, meals, exercise  RMSE, MARD, time lag  
Zecchin (2012), Italy [ 
UVA/Padova T1D, CGM device  Simulated (n=20), real (n=15)  —  —  T1DM; out  Neural network–linear prediction algorithm (NNLPA), NN, ARM PH: 30 Input: meals, insulin  RMSE, energy of second-order differences (ESOD), time gain, J index
Mohebbi (2020), Denmark [ 
Cornerstones4Care platform  Real (n=50)  —  —  T1DM; in  LSTM, ARIMA PH: 15, 30, 45, 60, 90  RMSE, MAE
Daniels (2022), England [ 
CGM device  Real (n=12)  —  Sex ratio  T1DM; out  Convolutional recurrent neural network (CRNN), SVM PH: 30, 45, 60, 90, 120 Input: BG level, insulin, meals, exercise  RMSE, MAE, CEG, time gain  
Alfian (2020), Korea [ 
CGM device  Real (n=12)  26,723  —  —  SVM, k-nearest neighbor (kNN), DT^{m}, RF, AdaBoost, XGBoost^{n}, NNM PH: 15, 30 Input: CGM data  RMSE, glucose-specific root mean square error (gRMSE), R2 score, mean absolute percentage error (MAPE)
^{a}BG: blood glucose.
^{b}PH: prediction horizon.
^{c}CGM: continuous glucose monitoring.
^{d}Not applicable.
^{e}T1DM: type 1 diabetes mellitus.
^{f}NNM: neural network model.
^{g}ARM: autoregression model.
^{h}RMSE: root mean square error.
^{i}SVM: support vector machine.
^{j}RF: random forest.
^{k}DRNN: dilated recurrent neural network.
^{l}ARJNN: ARTiDe jump neural network.
^{m}DT: decision tree.
^{n}XGBoost: Extreme Gradient Boosting.
Baseline characteristics of studies predicting adverse BG^{a} events (N=19).
First author (year), country  Data source  Sample size  Object; setting  Model  Time  Age (years), mean (SD)/range  Threshold  
Patients, n  Data points, n  Hypoglycemia, n  
Plis (2014), United States [
CGM^{b} device  2  2518  152  T1DM^{c}; out  SVM^{d}  All  —^{e}  3.9  
Seo (2019), Korea [ 
CGM device  104  7052  412  DM^{f}; out  RF^{g}, SVM, k-nearest neighbor (kNN), logistic regression (LR)  Postprandial  52  3.9  
Parcerisas (2022), Spain [ 
CGM device  10  67  22  T1DM; out  SVM  Nocturnal  31.8 (SD 16.8)  3.9  
Stuart (2017), Greece [ 
EHRs^{h}  9584  —  1327  DM; in  Multivariable logistic regression (MLR)  All  —  4  
Bertachi (2020), Spain [ 
CGM device  10  124  39  T1DM; out  SVM  Nocturnal  31.8 (SD 16.8)  3.9  
Elhadd (2020), Qatar [ 
—  13  3918  172  T2DM; out  XGBoost^{i}  All  35-63  —  
Mosquera-Lopez (2020), United States [
CGM device  10  117  17  T1DM; out  SVM  Nocturnal  33.7 (SD 5.8)  3.9  
Mosquera-Lopez (2020), United States [
CGM device  20  2706  258  T1DM; out  SVM  Nocturnal  —  3.9  
Ruan (2020), England [ 
EHRs  17,658  3276  703  T1DM; in  XGBoost, LR, stochastic gradient descent (SGD), k-nearest neighbor (kNN), DT^{j}, SVM, quadratic discriminant analysis (QDA), RF, extra tree (ET), linear discriminant analysis (LDA), AdaBoost, bagging  All  66 (SD 18)  4  
Güemes (2020), United States [ 
CGM device  6  55  6  T1DM; out  SVM  Nocturnal  40-60  3.9  
Jensen (2020), Denmark [ 
CGM device  463  921  79  T1DM; out  LDA  Nocturnal  43 (SD 15)  3  
Oviedo (2019), Spain [ 
CGM device  10  1447  420  T1DM; out  SVM  Postprandial  41 (SD 10)  3.9  
Toffanin (2019), Italy [ 
CGM device  20  7096  36  T1DM; out  Individual model-based  All  46  3.9  
Bertachi (2018), United States [ 
CGM device  6  51  6  T1DM; out  NNM^{k}  Nocturnal  40-60  3.9  
Eljil (2014), United Arab Emirates [ 
CGM device  10  667  100  T1DM; out  Bagging  All  25  3.3  
Dave (2021), United States [ 
CGM device  112  546,640  12,572  T1DM; out  RF  All  12.67 (SD 4.84)  3.9  
Marcus (2020), Israel [ 
CGM device  11  43,533  5264  T1DM; out  Kernel ridge regression (KRR)  All  18-39  3.9  
Reddy (2019), United States [ 
—  55  90  29  T1DM; out  RF  —  33 (SD 6)  3.9  
Sampath (2016), Australia [ 
—  34  150  40  T1DM; out  Ranking aggregation (RA)  Nocturnal  —  —  
Sudharsan (2015), United States [ 
—  —  839  428  T2DM; out  RF  All  —  3.9 
^{a}BG: blood glucose.
^{b}CGM: continuous glucose monitoring.
^{c}T1DM: type 1 diabetes mellitus.
^{d}SVM: support vector machine.
^{e}Not applicable.
^{f}DM: diabetes mellitus.
^{g}RF: random forest.
^{h}EHR: electronic health record.
^{i}XGBoost: Extreme Gradient Boosting.
^{j}DT: decision tree.
^{k}NNM: neural network model.
Baseline characteristics of studies detecting adverse BG^{a} events (N=17).
First author (year), country  Data source  Sample size  Object; setting  Model  Time  Age (years), mean (SD)/range  Threshold  
Patients, n  Data points, n  Hypoglycemia, n  
Jin (2019), United States [ 
EHRs^{b}  —^{c}  4104  132  T1DM^{d}; in  Linear discriminant analysis (LDA)  All  —  —  
Nguyen (2013), Australia [ 
EEG^{e}  5  144  76  T1DM; in  Levenberg-Marquardt (LM), genetic algorithm (GA)  All  12-18  3.3  
Chan (2011), Australia [ 
CGM^{f} device  16  100  52  T1DM; experimental  Feedforward neural network (fNN)  Nocturnal  14.6 (SD 1.5)  3.3  
Nguyen (2010), Australia [ 
EEG  6  79  27  T1DM; experimental  Block-based neural network (BRNN)  Nocturnal  12-18  3.3  
Rubega (2020), Italy [ 
EEG  34  2516  1258  T1DM; experimental  NNM^{g}  All  55 (SD 3)  3.9  
Chen (2019), United States [ 
EEG  —  300  11  DM^{h}; in  Logistic regression (LR)  All  —  —  
Jensen (2013), Denmark [ 
CGM device  10  1267  160  T1DM; experimental  SVM^{i}  All  44 (SD 15)  3.9  
Skladnev (2010), Australia [ 
CGM device  52  52  11  T1DM; in  fNN  Nocturnal  16.1 (SD 2.1)  3.9  
Iaione (2005), Brazil [ 
EEG  8  1990  995  T1DM; experimental  NNM  Morning  35 (SD 13.5)  3.3  
Nuryani (2012), Australia [ 
ECG  5  575  133  DM; in  SVM, linear multiple regression (LMR)  All  16 (SD 0.7)  3.0  
San (2013), Australia [ 
ECG  15  440  39  T1DM; in  Block-based neural network (BBNN), wavelet neural network (WNN), fNN, SVM  All  14.6 (SD 1.5)  3.3  
Ling (2012), Australia [ 
ECG  16  269  54  T1DM; in  Fuzzy reasoning model (FRM), fNN, multiple regression–fuzzy inference system (MRFIS)  Nocturnal  14.6 (SD 1.5)  3.3  
Ling (2016), Australia [ 
ECG  16  269  54  T1DM; in  Extreme learning machine–based neural network (ELMNN), particle swarm optimization–based neural network (PSONN), MRFIS, LMR, fuzzy inference system (FIS)  Nocturnal  14.6 (SD 1.5)  3.3  
Nguyen (2012), Australia [ 
EEG  5  44  20  T1DM; in  NNM  —  12-18  3.3  
Ngo (2020), Australia [ 
EEG  8  135  53  T1DM; in  BRNN  Nocturnal  12-18  3.9  
Ngo (2018), Australia [ 
EEG  8  54  26  T1DM; in  BRNN  Nocturnal  12-18  3.9  
Nuryani (2010), Australia [ 
ECG  5  27  8  T1DM; experimental  Fuzzy support vector machine (FSVM), SVM  Nocturnal  16 (SD 0.7)  3.3 
^{a}BG: blood glucose.
^{b}EHR: electronic health record.
^{c}Not applicable.
^{d}T1DM: type 1 diabetes mellitus.
^{e}EEG: electroencephalograph.
^{f}CGM: continuous glucose monitoring.
^{g}NNM: neural network model.
^{h}DM: diabetes mellitus.
^{i}SVM: support vector machine.
As shown in
The quality assessment results using the QUADAS-2 tool showed that more than half of all included studies did not report the patient selection criteria in detail, which led to low-quality patient selection (
Quality assessment of included studies. Risk of bias and applicability concerns graph (A) and risk of bias and applicability concerns summary (B).
Network meta-analysis was conducted to evaluate the performance of different ML models. For PH=30 minutes, 10 (21.7%) studies [
For PH=60 minutes, 4 (8.7%) studies [
For PH=15 minutes, 3 (6.5%) studies [
For PH=45 minutes, only 2 (4.3%) studies [
Network map of ML models for predicting BG levels in different PHs. PH=30 (A), 60 (B), 15 (C), and 45 minutes (D). ARIMA: autoregressive integrated moving average; ARM: autoregression model; ARMA: autoregressive moving average; ARJNN: ARTiDe jump neural network; BG: blood glucose; CRNNMTL: convolutional recurrent neural network multitask learning; CRNNMTLGV: convolutional recurrent neural network multitask learning glycemic variability; CRNNSTL: convolutional recurrent neural network singletask learning; CRNNTL: convolutional recurrent neural network transfer learning; DFFNN: delayed feedforward neural network; DRNN: dilated recurrent neural network; DT: decision tree; FC: fully connected (neural network); fNN: feedforward neural network; GCN: gradually connected neural network; JNN: jump neural network; kNN: k-nearest neighbor; LGBM: light gradient boosting machine; LSTM: long short-term memory; LVX: latent variable with exogenous input; ML: machine learning; NARX: one neural network model; NNLPA: neural network–linear prediction algorithm; NNM: neural network model; PH: prediction horizon; RF: random forest; RNN: recurrent neural network; SAX: one symbolic model; SVR: support vector regression.
Relative ranks of ML^{a} models for predicting BG^{b} levels in PH^{c}=30 minutes.
ML model  SUCRA^{d}  Relative rank 
NNM^{e}  52.0  14.4 
ARM^{f}  39.6  17.9 
ARJNN^{g}  79.5  6.8 
RF^{h}  6.9  27.1 
SVM^{i}  73.3  8.5 
One symbolic model (SAX)  0.4  28.9 
Recurrent neural network (RNN)  19.0  23.7 
One neural network model (NARX)  3.9  27.9 
Jump neural network (JNN)  36.0  18.9 
Delayed feedforward neural network model (DFFNN)  15.8  24.6 
Gradually connected neural network (GCN)  41.1  17.5 
Fully connected (FC [neural network])  58.1  12.7 
Light gradient boosting machine (LGBM)  69.3  9.6 
DRNN^{j}  99.1  1.2 
Autoregressive moving average (ARMA)  54.3  13.8 
Autoregressive integrated moving average (ARIMA)  46.6  16.0 
Feedforward neural network (fNN)  86.3  4.8 
Long short-term memory (LSTM)  69.1  9.7 
GluNet  96.4  2.0 
Latent variable with exogenous input (LVX)  75.2  7.9 
Neural network–linear prediction algorithm (NNLPA)  60.0  12.2 
Convolutional recurrent neural network multitask learning (CRNNMTL)  77.5  7.3 
Convolutional recurrent neural network multitask learning glycemic variability (CRNNMTLGV)  77.2  7.4 
Convolutional recurrent neural network transfer learning (CRNNTL)  71.8  8.9 
Convolutional recurrent neural network singletask learning (CRNNSTL)  52.0  14.4 
k-Nearest neighbor (kNN)  26.0  21.7 
DT^{k}  16.2  24.5 
AdaBoost  18.0  24.0 
XGBoost^{l}  29.2  20.8 
^{a}ML: machine learning.
^{b}BG: blood glucose.
^{c}PH: prediction horizon.
^{d}SUCRA: surface under the cumulative ranking.
^{e}NNM: neural network model.
^{f}ARM: autoregression model.
^{g}ARJNN: ARTiDe jump neural network.
^{h}RF: random forest.
^{i}SVM: support vector machine.
^{j}DRNN: dilated recurrent neural network.
^{k}DT: decision tree.
^{l}XGBoost: Extreme Gradient Boosting.
SUCRA curves of ML models for predicting BG levels in different PHs. PH=30 (A), 60 (B), 15 (C), and 45 minutes (D). ARIMA: autoregressive integrated moving average; ARM: autoregression model; ARMA: autoregressive moving average; ARJNN: ARTiDe jump neural network; BG: blood glucose; CRNNMTL: convolutional recurrent neural network multitask learning; CRNNMTLGV: convolutional recurrent neural network multitask learning glycemic variability; CRNNSTL: convolutional recurrent neural network singletask learning; CRNNTL: convolutional recurrent neural network transfer learning; DFFNN: delayed feedforward neural network; DRNN: dilated recurrent neural network; DT: decision tree; FC: fully connected (neural network); fNN: feedforward neural network; GCN: gradually connected neural network; JNN: jump neural network; kNN: k-nearest neighbor; LGBM: light gradient boosting machine; LSTM: long short-term memory; LVX: latent variable with exogenous input; ML: machine learning; NARX: one neural network model; NNLPA: neural network–linear prediction algorithm; NNM: neural network model; PH: prediction horizon; RF: random forest; RNN: recurrent neural network; SAX: one symbolic model; SVR: support vector regression.
Relative ranks of ML^{a} models for predicting BG^{b} levels in PH^{c}=60 minutes.
ML model  SUCRA^{d}  Relative rank 
ARM^{e}  41.0  10.4 
Gradually connected neural network (GCN)  14.2  14.7 
Fully connected (FC [neural network])  55.7  8.1 
Light gradient boosting machine (LGBM)  56.0  8.0 
RF^{f}  59.7  7.5 
GluNet  97.8  1.4 
NNM^{g}  59.9  7.4 
SVM^{h}  49.5  9.1 
Latent variable with exogenous input (LVX)  85.9  3.3 
Convolutional recurrent neural network multitask learning (CRNNMTL)  61.4  7.2 
Convolutional recurrent neural network multitask learning glycemic variability (CRNNMTLGV)  54.2  8.3 
Convolutional recurrent neural network transfer learning (CRNNTL)  44.5  9.9 
Convolutional recurrent neural network singletask learning (CRNNSTL)  32.5  11.8 
k-Nearest neighbor (kNN)  42.5  10.2 
DT^{i}  4.5  16.3 
AdaBoost  24.1  13.1 
XGBoost^{j}  66.5  6.4 
^{a}ML: machine learning.
^{b}BG: blood glucose.
^{c}PH: prediction horizon.
^{d}SUCRA: surface under the cumulative ranking.
^{e}ARM: autoregression model.
^{f}RF: random forest.
^{g}NNM: neural network model.
^{h}SVM: support vector machine.
^{i}DT: decision tree.
^{j}XGBoost: Extreme Gradient Boosting.
Relative ranks of ML^{a} models for predicting BG^{b} levels in PH^{c}=15 minutes.
ML model  SUCRA^{d}  Relative rank 
NNM^{e}  84.4  3.0 
ARM^{f}  86.8  2.7 
ARJNN^{g}  99.1  1.1 
RF^{h}  64.6  5.6 
SVM^{i}  20.9  11.3 
One symbolic model (SAX)  0.3  14.0 
Recurrent neural network (RNN)  45.9  8.0 
One neural network model (NARX)  11.8  12.5 
Jump neural network (JNN)  62.2  5.9 
Delayed feedforward neural network model (DFFNN)  39.6  8.9 
k-Nearest neighbor (kNN)  53.7  7.0 
DT^{j}  33.3  9.7 
AdaBoost  36.8  9.2 
XGBoost^{k}  60.8  6.1 
^{a}ML: machine learning.
^{b}BG: blood glucose.
^{c}PH: prediction horizon.
^{d}SUCRA: surface under the cumulative ranking.
^{e}NNM: neural network model.
^{f}ARM: autoregression model.
^{g}ARJNN: ARTiDe jump neural network.
^{h}RF: random forest.
^{i}SVM: support vector machine.
^{j}DT: decision tree.
^{k}XGBoost: Extreme Gradient Boosting.
Relative ranks of ML^{a} models for predicting BG^{b} levels in PH^{c}=45 minutes.
ML model  SUCRA^{d}  Relative rank 
Convolutional recurrent neural network multitask learning (CRNNMTL)  52.1  5.8 
Convolutional recurrent neural network multitask learning glycemic variability (CRNNMTLGV)  41.8  6.8 
Convolutional recurrent neural network transfer learning (CRNNTL)  31.6  7.8 
Convolutional recurrent neural network singletask learning (CRNNSTL)  27.5  8.2 
SVM^{e}  32.0  7.8 
k-Nearest neighbor (kNN)  61.4  4.9 
DT^{f}  26.3  8.4 
RF^{g}  70.3  4.0 
AdaBoost  34.1  7.6 
XGBoost^{h}  73.5  3.7 
NNM^{i}  99.4  1.1 
^{a}ML: machine learning.
^{b}BG: blood glucose.
^{c}PH: prediction horizon.
^{d}SUCRA: surface under the cumulative ranking.
^{e}SVM: support vector machine.
^{f}DT: decision tree.
^{g}RF: random forest.
^{h}XGBoost: Extreme Gradient Boosting.
^{i}NNM: neural network model.
ML models for predicting hypoglycemia (adverse BG events) involved 19 (41.3%) studies [
For the NNM, 3 (6.5%) studies [
For the RF, 5 (10.9%) studies [
Sensitivity and specificity forest plots of ML models for predicting adverse BG events. The horizontal lines indicate 95% CIs. The square markers represent the effect value of a single study, and the diamond marker represents the combined results of all studies. The vertical line shows the line of no effects. BG: blood glucose; ML: machine learning.
SROC curves of all ML algorithms (A), NNM algorithms (B), RF algorithms (C), SVM algorithms (D), and ensemble learning algorithms (E) for predicting adverse BG events. The hollow circles represent results of all studies, and the red diamonds represent the summary result of all studies. AUC: area under the curve; BG: blood glucose; ML: machine learning; NNM: neural network model; RF: random forest; SROC: summary receiver operating characteristic; SVM: support vector machine.
Sensitivity and specificity forest plots of NNM algorithms (A), RF models (B), SVM algorithms (C), and ensemble learning algorithms (D) for predicting adverse BG events. The horizontal lines indicate 95% CIs. The square markers represent the effect value of a single study, and the diamond marker represents the combined results of all studies. The vertical line shows the line of no effects. BG: blood glucose; NNM: neural network model; RF: random forest; SROC: summary receiver operating characteristic; SVM: support vector machine.
For the SVM, 8 (17.4%) studies [
For ensemble learning models (RF, XGBoost, bagging), 7 (15.2%) studies [
ML models for detecting hypoglycemia (adverse BG events) involved 17 (37%) studies [
For the NNM, 11 (23.9%) studies [
For the SVM, 4 (8.7%) studies [
Sensitivity and specificity forest plots of ML models for detecting adverse BG events. The horizontal lines indicate 95% CIs. The square markers represent the effect value of a single study, and the diamond marker represents the combined results of all studies. The vertical line shows the line of no effects. BG: blood glucose; ML: machine learning.
SROC curves of all ML algorithms (A), NNM algorithms (B), and SVM algorithms (C) for detecting adverse BG events. The hollow circles represent results of all studies, and the red diamonds represent the summary result of all studies. AUC: area under the curve; BG: blood glucose; ML: machine learning; NNM: neural network model; SROC: summary receiver operating characteristic; SVM: support vector machine.
Sensitivity and specificity forest plots of NNM algorithms (A) and SVM algorithms (B) for detecting adverse BG events. The horizontal lines indicate 95% CIs. The square markers represent the effect value of a single study, and the diamond marker represents the combined results of all studies. The vertical line shows the line of no effects. BG: blood glucose; NNM: neural network model; SVM: support vector machine.
This metaanalysis systematically assessed the performance of different ML models in enhancing BG management in patients with DM based on 46 eligible studies. Comprehensive evidence obtained via exhaustive searching allowed us to assess the overall ability of the ML models in different scenarios, including predicting BG levels, predicting adverse BG events, and detecting adverse BG events.
Notably, the RMSE of ML models for predicting BG levels increased as the PH increased from 15 to 60 minutes, which indicates that the longer the PH, the larger the prediction error. Based on the results of relative ranking, among all the ML models for predicting BG levels, neural network–based models, including the DRNN, GluNet, ARJNN, and NNM, achieved the minimum RMSE and the maximum SUCRA in different PHs, indicating the highest relative performance. In contrast, the DT achieved the maximum RMSE and the minimum SUCRA in a PH of 60 and 45 minutes, indicating the lowest relative performance. Thus, for predicting BG levels, neural network–based algorithms might be an appropriate choice. We found that time domain features combined with historical BG levels as input can further improve the performance of NNM algorithms [
Regarding ML models for predicting adverse BG events, the pooled sensitivity, specificity, PLR, and NLR were 0.71 (95% CI 0.61-0.80), 0.91 (95% CI 0.87-0.94), 8.3 (95% CI 5.7-12.0), and 0.31 (95% CI 0.22-0.44), respectively. According to the
Regarding ML models for detecting hypoglycemia, the pooled sensitivity, specificity, PLR, and NLR were 0.74 (95% CI 0.70-0.78), 0.70 (95% CI 0.56-0.81), 2.4 (95% CI 1.6-3.7), and 0.37 (95% CI 0.29-0.46), respectively, which indicates that the algorithms generate small changes in probability [
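The clinical meaning of these likelihood ratios can be illustrated with Bayes' theorem in odds form: a PLR multiplies the pre-test odds of hypoglycemia, and the resulting post-test probability shows how much a positive model output shifts the estimate. A minimal sketch with a hypothetical helper and illustrative numbers:

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert a pre-test probability into a post-test probability via a likelihood ratio."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)  # probability -> odds
    post_odds = pre_odds * likelihood_ratio          # Bayes' theorem in odds form
    return post_odds / (1 + post_odds)               # odds -> probability
```

With the pooled PLR of 8.3 for predicting hypoglycemia, an illustrative 20% pre-test risk rises to roughly 67% after a positive prediction, whereas the detection PLR of 2.4 moves the same risk to only about 37%, consistent with the small probability changes noted above.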
The study has several limitations. First, although we developed a comprehensive search strategy, some studies may still have been missed. To increase retrieval coverage, we searched the main medical databases, including PubMed, Embase, Web of Science, and IEEE Xplore, with a feasible search strategy, and references from relevant studies were also screened for eligibility to avoid omissions. Second, statistically significant high heterogeneity was detected in all subgroups, with different sources of heterogeneity among studies, including different types of DM, ML models, data sources, reference indexes, times and settings of data collection, and thresholds of hypoglycemia. To address this issue, hierarchical analysis and meta-regression analysis were carried out in different subgroups to explore the possible sources of heterogeneity. Furthermore, for several studies that provided no required outcome measures or had inconsistent outcome measures, relevant estimation methods were used to calculate the indicators, which might have introduced a certain amount of estimation error. However, the estimation error was small enough to be acceptable owing to the appropriate estimation methods, and the results of this study were further enriched. Future studies are encouraged to report all relevant outcome measures for further evaluation.
In the future, more accurate ML models may be applied to BG management, which would improve the quality of life of patients with DM and reduce the burden of adverse BG events. First, as mentioned before, current ML models have relatively sufficient ability to predict BG levels and hypoglycemia, and it should be emphasized that an extended PH is more beneficial because it increases the time available for patients and clinicians to respond [
In summary, in predicting precise BG levels, the RMSE increases with an increase in the PH, and the NNM shows the highest relative performance among all the ML models. Meanwhile, according to the PLR and NLR, current ML models have sufficient ability to predict adverse BG (hypoglycemia) events, while their ability to detect adverse BG events needs to be enhanced. Future studies are required to focus on improving model performance and applying ML models in clinical practice [
Supplemental plot 1: forest plot (RMSE, PH=30). PH: prediction horizon; RMSE: root mean square error.
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) checklist.
Supplemental plot 2: forest plot (RMSE, PH=60). PH: prediction horizon; RMSE: root mean square error.
Supplemental plot 3: forest plot (RMSE, PH=15). PH: prediction horizon; RMSE: root mean square error.
Supplemental plot 4: forest plot (RMSE, PH=45). PH: prediction horizon; RMSE: root mean square error.
Supplemental plot 5: meta-regression (pre-all).
Supplemental plot 5: meta-regression (pre-NN).
Supplemental plot 5: meta-regression (pre-SVM).
Supplemental plot 5: meta-regression (det-all).
Supplemental plot 5: meta-regression (det-NN).
Supplemental plot 5: meta-regression (det-SVM).
ARM: autoregression model
ARJNN: ARTiDe jump neural network
AUC: area under the curve
BG: blood glucose
CGM: continuous glucose monitoring
DM: diabetes mellitus
DRNN: dilated recurrent neural network
DT: decision tree
ECG: electrocardiograph
EEG: electroencephalograph
EHR: electronic health record
ML: machine learning
NLR: negative likelihood ratio
NNM: neural network model
PH: prediction horizon
PLR: positive likelihood ratio
QUADAS: Quality Assessment of Diagnostic Accuracy Studies
RF: random forest
RMSE: root mean square error
SROC: summary receiver operating characteristic
SUCRA: surface under the cumulative ranking
SVM: support vector machine
T1DM: type 1 diabetes mellitus
T2DM: type 2 diabetes mellitus
XGBoost: Extreme Gradient Boosting
The study was funded by the National Natural Science Foundation of China (grant no. 82073663) and the Shaanxi Provincial Research and Development Program Foundation (grant nos. 2017JM7008 and 2022SF245).
The data sets used and analyzed during the study are available from the corresponding author upon reasonable request.
YW and CC conceived and designed the study. KL and LL undertook the literature review and extracted data. KL, LL, and JJ interpreted the data. KL, YM, and SL wrote the first draft of the manuscript, with revision by YW, ZL, CP, and ZY. All authors have read and approved the final version of the manuscript and had final responsibility for submitting it for publication.
None declared.