This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Obesity is one of today’s most visible public health problems worldwide. Although modern bariatric surgery is generally considered safe, serious complications and mortality still occur in some patients.
This study aimed to explore whether serious postoperative complications of bariatric surgery recorded in a national quality registry can be predicted preoperatively using deep learning methods.
Patients who were registered in the Scandinavian Obesity Surgery Registry (SOReg) between 2010 and 2015 were included in this study. The patients who underwent a bariatric procedure between 2010 and 2014 were used as training data, and those who underwent a bariatric procedure in 2015 were used as test data. Postoperative complications were graded according to the Clavien-Dindo classification, and complications requiring intervention under general anesthesia or resulting in organ failure or death were considered serious. Three supervised deep learning neural networks were applied and compared in our study: multilayer perceptron (MLP), convolutional neural network (CNN), and recurrent neural network (RNN). The synthetic minority oversampling technique (SMOTE) was used to artificially augment the minority class of patients with serious complications. The performances of the neural networks were evaluated using accuracy, sensitivity, specificity, Matthews correlation coefficient, and area under the receiver operating characteristic curve.
In total, 37,811 and 6250 patients were used as the training data and test data, with incidence rates of serious complications of 3.2% (1220/37,811) and 3.0% (188/6250), respectively. When trained using the SMOTE data, the MLP appeared to have a desirable performance, with an area under curve (AUC) of 0.84 (95% CI 0.83-0.85). However, its performance was low for the test data, with an AUC of 0.54 (95% CI 0.53-0.55). The performance of CNN was similar to that of MLP. It generated AUCs of 0.79 (95% CI 0.78-0.80) and 0.57 (95% CI 0.59-0.61) for the SMOTE data and test data, respectively. Compared with the MLP and CNN, the RNN showed worse performance, with AUCs of 0.65 (95% CI 0.64-0.66) and 0.55 (95% CI 0.53-0.57) for the SMOTE data and test data, respectively.
MLP and CNN showed improved, but limited, ability for predicting the postoperative serious complications after bariatric surgery in the Scandinavian Obesity Surgery Registry data. However, the overfitting issue is still apparent and needs to be overcome by incorporating intra- and perioperative information.
Obesity is one of today’s most important public health problems worldwide. With no changes in the current trends, the estimated prevalence of severe obesity (BMI greater than 35 kg/m^{2}) will reach 9% for women and 6% for men within a few years [
Although modern bariatric surgery is generally considered safe, serious complications and mortality still occur in some patients [
In our previous studies, conventional statistical models [
Although there is increasing evidence that the use of ML methods can improve our understanding of postoperative progression of bariatric surgery [
The aim of this study was to examine whether serious postoperative complications of bariatric surgery can be predicted preoperatively using DLNNs based on the information available from a national quality registry. We used the data from the Scandinavian Obesity Surgery Registry (SOReg) to examine the performance of 3 widely used DLNNs.
The SOReg covers virtually all bariatric surgical procedures performed in Sweden since 2010 [
The Regional Ethics Committee in Stockholm approved the study (approval number: 2013/53531/5).
Three supervised DLNNs were applied and compared in our study, comprising multilayer perceptron (MLP), convolutional neural network (CNN), and recurrent neural network (RNN) models. For the MLP model, we used 4 dense layers and 2 dropout layers. The initial computation units for the dense layers were set to 15, 64, 64, and 128, and the dropout rate was set to 0.5 for the 2 dropout layers (
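The described MLP can be illustrated with a minimal NumPy forward pass. This is a sketch, not the study's actual Keras implementation: the dense-layer sizes follow the initial setting above, but the input dimensionality, the single sigmoid output unit, the dropout placement, and the weight initialization are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

in_dim = 16                    # assumed number of preoperative features
units = [15, 64, 64, 128, 1]   # the 4 dense layers plus an assumed sigmoid output unit
sizes = [in_dim] + units
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def mlp_forward(x, train=False, drop_rate=0.5):
    h = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        last = i == len(weights) - 1
        h = sigmoid(h @ w + b) if last else relu(h @ w + b)
        # 2 dropout layers, rate 0.5; their exact placement is assumed here
        if train and i in (1, 2):
            mask = rng.random(h.shape) >= drop_rate
            h = h * mask / (1.0 - drop_rate)   # inverted dropout scaling
    return h

x = rng.normal(size=(4, in_dim))   # 4 hypothetical patients
probs = mlp_forward(x)             # predicted complication risk per patient
print(probs.shape)                 # (4, 1)
```

During training the dropout mask randomly zeroes half of the hidden activations; at prediction time (`train=False`) the full network is used, which is why the mask is rescaled by `1/(1-drop_rate)`.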
In the initial CNN, we used a 7-layer model with 2 one-dimensional (1D) convolution layers (with 10 filters each), 2 1D max pooling layers, 1 flatten layer, and 2 dense layers (with 1000 computation units). The rectified linear unit (ReLU) activation function was used for the 2 1D convolution layers and the first dense layer, and the sigmoid activation function was used for the last dense layer. The binary cross-entropy loss function and the adaptive moment estimation (Adam) optimizer were used when compiling the model (
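The core 1D convolution and max pooling operations of such a CNN can be sketched in plain NumPy. The kernel size, input length, and channel count below are illustrative assumptions; only the 10 filters and ReLU activation come from the description above.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """Valid 1D convolution: x is (length, channels), kernels is (k, channels, filters)."""
    k, c, f = kernels.shape
    out_len = x.shape[0] - k + 1
    out = np.empty((out_len, f))
    for t in range(out_len):
        window = x[t:t + k]   # (k, channels) slice of the input
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)   # ReLU activation, as in the model above

def max_pool1d(x, pool=2):
    """Non-overlapping max pooling along the length axis."""
    trimmed = x[: (x.shape[0] // pool) * pool]
    return trimmed.reshape(-1, pool, x.shape[1]).max(axis=1)

rng = np.random.default_rng(1)
x = rng.normal(size=(16, 1))              # 16 features treated as a 1-channel sequence
kernels = rng.normal(0, 0.1, (3, 1, 10))  # assumed kernel size 3; 10 filters
feat = max_pool1d(conv1d(x, kernels, np.zeros(10)))
print(feat.shape)   # (7, 10): (16-3+1)=14 convolution steps, pooled by 2
```

The flatten layer in the model above would then reshape this `(7, 10)` feature map into a 70-dimensional vector before the dense layers.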
In view of the temporal feature of the data, we also used the RNN for prediction. To minimize computation time, the initial model only included 1 long short-term memory (LSTM) layer and 1 dense layer. The initial dimensionality of the LSTM layer was set to 32. To tackle overfitting, we randomly dropped out inputs and recurrent connections in the LSTM layer to break happenstance correlations in the training data that the layer was exposed to. The dropout rates for inputs and recurrent connections were set to 0.2. The activation functions for the input connection and recurrent connection were hyperbolic tangent and hard sigmoid, respectively. The activation function for the dense layer was sigmoid. The binary cross-entropy loss function and the Adam optimizer were used when compiling the model.
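A single LSTM step with the activations described above (tanh on the input connection, hard sigmoid on the recurrent gates) can be sketched as follows. The input dimensionality and sequence length are illustrative assumptions; the 32 units match the initial setting.

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear approximation of the sigmoid (Keras-style)."""
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W is (in_dim, 4*units), U is (units, 4*units)."""
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4, axis=-1)          # input, forget, cell, output gates
    i, f, o = hard_sigmoid(i), hard_sigmoid(f), hard_sigmoid(o)
    c = f * c_prev + i * np.tanh(g)               # tanh on the input connection
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(2)
units, in_dim, T = 32, 15, 5                      # in_dim and T are assumed
W = rng.normal(0, 0.1, (in_dim, 4 * units))
U = rng.normal(0, 0.1, (units, 4 * units))
b = np.zeros(4 * units)
h = c = np.zeros((1, units))
for t in range(T):                                # unroll over a short sequence
    h, c = lstm_step(rng.normal(size=(1, in_dim)), h, c, W, U, b)
print(h.shape)   # (1, 32): the final hidden state fed to the dense layer
```

The recurrent dropout mentioned above would zero entries of `h_prev` (rate 0.2) before the `h_prev @ U` product during training; it is omitted here for brevity.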
For the training data, the binary features were converted into dummy variables, and the continuous features were standardized to have mean 0 and SD 1 before entering the model. For the test data, the continuous features were standardized using the corresponding means and standard deviations from the training data. HbA_{1c} was log transformed before standardization because of its asymmetrical distribution. In sensitivity analysis, the normalizer and min-max scaler were also used to evaluate the influence of scalers on the models’ performance.
As the incidence rate of serious complications is very low (only 3.2%), the extreme imbalance would result in serious bias in the performance metrics [
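The core idea of SMOTE is to synthesize new minority-class samples by interpolating between a minority sample and one of its k nearest minority neighbors. A minimal NumPy sketch of that interpolation (the study used an existing implementation; this brute-force version is for illustration only):

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic samples by interpolating each seed point
    toward a randomly chosen one of its k nearest minority neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = np.empty((n_new, minority.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(minority))
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]          # skip the point itself
        nn = minority[rng.choice(neighbours)]
        gap = rng.random()                           # position along the segment
        synthetic[j] = minority[i] + gap * (nn - minority[i])
    return synthetic

rng = np.random.default_rng(4)
minority = rng.normal(size=(50, 3))   # e.g. patients with serious complications
new = smote(minority, n_new=200, rng=rng)
print(new.shape)   # (200, 3)
```

Because each synthetic point is a convex combination of two real minority points, the augmented samples stay within the minority class's feature region rather than being arbitrary noise.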
The performances of the three neural networks were evaluated using accuracy, sensitivity, specificity, Matthews correlation coefficient (MCC) [
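These metrics all derive from the confusion matrix (and, for AUC, from the ranking of predicted scores). A self-contained NumPy sketch with a small worked example:

```python
import numpy as np

def metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and MCC from binary labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sens, spec, mcc

def auc(y_true, score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = score[y_true == 1][:, None]
    neg = score[y_true == 0][None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.4, 0.35, 0.8])
acc, sens, spec, mcc = metrics(y, (p >= 0.5).astype(int))
print(acc, sens, spec, round(mcc, 3), auc(y, p))   # 0.75 0.5 1.0 0.577 0.75
```

Note why MCC matters here: with 97% of patients complication-free, a classifier that predicts "no complication" for everyone reaches 0.97 accuracy but has sensitivity 0 and MCC 0, exactly the pattern seen in the models trained on the original (non-SMOTE) data.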
To find optimal high-level parameters (such as the number, size, and type of layers in the networks) and lower-level parameters (such as the number of epochs, choice of loss function and activation function, and optimization procedure) in the DLNN models, the K-fold cross-validation method was used during the training phase. K-fold cross-validation is currently considered a minimum requirement for handling problems such as overfitting when only a single dataset is available in ML [
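K-fold cross-validation partitions the training data into K disjoint folds, training on K-1 of them and validating on the held-out fold in turn. A minimal index-splitting sketch (K=5 is an illustrative choice; the study's K is not specified here):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for K-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)   # shuffle once
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

n, k = 100, 5
for train, val in kfold_indices(n, k):
    # every sample appears exactly once across validation folds
    assert len(train) + len(val) == n
    assert set(train).isdisjoint(val)
print(k, "folds of", n // k, "validation samples each")
```

Hyperparameters are then chosen by the average validation score across the K folds, and the untouched test set (here, the 2015 patients) is used only once, for the final evaluation.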
The descriptive and inferential statistical analyses were performed using Stata 15.1 (StataCorp LLC, College Station). The DLNN models were implemented using the packages scikit-learn 0.19.1 and Keras 2.1.6 in Python 3.6 (Python Software Foundation). The 95% CI of AUC was calculated using the package pROC in R 3.6.1 (R Foundation for Statistical Computing).
All the computation was conducted using a computer with the 64-bit Windows 7 Enterprise operating system (Service Pack 1), an Intel Core i5-4210U CPU at 2.40 GHz, and 16.0 GB of installed random access memory.
The incidence of serious complications after bariatric surgery in our study was 3.2%, which is similar to other studies [
Performance metrics of the models.
Model   | Training data                                        | Test data
        | Accuracy  Specificity  Sensitivity  MCC^{a}  AUC^{b} | Accuracy  Specificity  Sensitivity  MCC   AUC
MLP^{c} | 0.97      1.00         0.00         0.00     0.60    | 0.97      1.00         0.00         0.00  0.57
MLP^{d} | 0.68      0.39         0.97         0.44     0.84    | 0.84      0.82         0.23         0.02  0.54
CNN^{e} | 0.97      1.00         0.00         0.00     0.58    | 0.97      1.00         0.00         0.00  0.55
CNN^{d} | 0.63      0.56         0.70         0.26     0.79    | 0.95      0.97         0.06         0.03  0.57
RNN^{f} | 0.97      1.00         0.00         0.00     0.58    | 0.97      1.00         0.00         0.00  0.56
RNN^{d} | 0.58      0.66         0.49         0.15     0.65    | 0.91      0.93         0.14         0.05  0.55
^{a}MCC: Matthews correlation coefficient.
^{b}AUC: area under curve.
^{c}MLP: multilayer perceptron.
^{d}Trained using synthetic minority oversampling technique data.
^{e}CNN: convolutional neural network.
^{f}RNN: recurrent neural network.
There were myriad combinations of high- and low-level parameters used during model training, and for most of them performance plateaued beyond certain values. Therefore, we only show the trend of the MLP model’s accuracy with the number of epochs for model training while keeping other parameters unchanged in
The performance of the MLP was not optimal for the original training data and test data. The AUCs were barely higher than a random guess, that is, 0.5, which were 0.60 (95% CI 0.59-0.61) and 0.57 (95% CI 0.55-0.59) for the training data and test data, respectively (
The performance of MLP was significantly influenced by the number of computation units in the SMOTE data but not in the test data. For example, when the computation units of the first layer ranged from 4 to 500, the AUC increased rapidly from 0.55 to 0.80. Within the range from 500 to 1000, the AUC increased slowly from 0.80 to 0.85 and kept fluctuating around 0.85 afterward (
Change of accuracy with the number of epochs in multilayer perceptron. MLP: multilayer perceptron; SMOTE: synthetic minority oversampling technique.
Area under curve of multilayer perceptron with initial setting. MLP: multilayer perceptron.
Performance of multilayer perceptron using the synthetic minority oversampling technique and test data with different numbers of computation units in the first hidden layer. MLP: multilayer perceptron; ROC: receiver operating characteristic; SMOTE: synthetic minority oversampling technique.
The performance of CNN appeared to be similar to that of MLP. The AUCs were 0.58 (95% CI 0.56-0.60) and 0.55 (95% CI 0.54-0.56) for the training data and test data, respectively (
The number of output filters in the convolution (or the dimensionality of the output space) had a significant influence on the CNN model’s performance in the SMOTE data but not in the training data and test data. The AUC of CNN increased rapidly from 0.63 to 0.80 when we increased the number of filters from 5 to 50. However, a larger number of filters contributed no further improvement (
Area under curve of convolutional neural network with initial setting. CNN: convolutional neural network.
Performance of convolutional neural network using the synthetic minority oversampling technique and test data with different numbers of filters. CNN: convolutional neural network; ROC: receiver operating characteristic; SMOTE: synthetic minority oversampling technique.
Compared with the MLP and CNN, the RNN showed even worse performance. AUCs of RNN for the original training data and test data were 0.58 (95% CI 0.57-0.59) and 0.56 (95% CI 0.55-0.57), respectively (
Area under curve of recurrent neural network with initial setting. RNN: recurrent neural network.
The performance of the RNN model was influenced by the dimensionality of the LSTM layer. The AUC changed from 0.50 to 0.60 rapidly when the dimensionality grew from 2 to 20 and kept fluctuating around 0.61 afterward (
Performance of recurrent neural network using the synthetic minority oversampling technique and test data with different dimensionalities of long shortterm memory. LSTM: long shortterm memory; ROC: receiver operating characteristic; SMOTE: synthetic minority oversampling technique.
In the sensitivity analysis, we tried different scalers and optimizers in data preparation and model compiling, and we tried thousands of combinations of hyperparameters for each model using the exhaustive grid search method [
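Exhaustive grid search simply evaluates every combination in a Cartesian product of hyperparameter values and keeps the best-scoring one. A toy sketch of the mechanism (the parameter names and values below are illustrative, not the study's actual grid, and the scoring function is a stand-in for training plus cross-validation):

```python
import itertools

# Hypothetical hyperparameter grid; names and values are illustrative only.
grid = {
    "units": [64, 128, 500],
    "epochs": [40, 80],
    "batch_size": [128, 256],
}

def evaluate(params):
    """Stand-in for 'train the model and return mean cross-validated AUC'."""
    return 1.0 / (1 + abs(params["units"] - 128))   # toy score peaking at 128 units

best = max(
    (dict(zip(grid, values)) for values in itertools.product(*grid.values())),
    key=evaluate,
)
print(best["units"])   # 128
```

The cost grows multiplicatively with each added parameter (here 3 x 2 x 2 = 12 fits, each multiplied by K cross-validation folds), which is why the CNN runs above took hours.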
The computing time for the models was largely dependent on the number of DLNN layers and hyperparameter settings of the layers, number of epochs and batch size for training, and obviously software and hardware used. In our study, with the model structures and hyperparameters described above, the running time ranged from 82 seconds for the MLP model (computational units=64, epochs=80, batch size=128, and trained by original data) to more than 10 hours for the CNN model (filters=400, epochs=100, batch size=128, and trained by SMOTE data with crossvalidation and grid search) on our computer.
Several studies have explored using ML methods to predict the risks after bariatric surgery. Razzaghi et al [
In this study, we applied and compared 3 DLNN models for predicting serious complications after bariatric surgery. MLP is the classical type of neural network, which consists of multiple layers of computational units. The layers are interconnected in a feedforward way, where the information moves only forward, that is, from input nodes, through hidden nodes and to output nodes, and the connections between the nodes do not form a cycle [
Although the results from the MLP and CNN models seem promising in the SMOTE training data, the overfitting problem still exists, which was reflected in the poor performance of the 3 models in the test data (see
We also noticed that the RNN performed worse than MLP and CNN for our data. The possible reason might be that the sequential pattern or temporal trend in our data cannot be represented by the features currently available, or there is no dependency between the patients or events in the time series. Even if a trend can be captured by the RNN, it might be weak, and the past status may have contributed noise rather than information to the current status.
Although increasing the number of computational units in the layers or adding more layers may increase a model’s capacity, the trade-off between computational cost and representational power is seen everywhere in ML. Limited by our computing power, we avoided complicated networks such as stacking multiple RNN layers or combining CNN and RNN layers; this deserves investigation in the future with data that have more variables and a clearer temporal trend.
Compared with previous studies, there are several advantages in our study. First, we used DLNNs rather than traditional ML techniques. The biggest advantage of DLNNs is that they try to learn high-level features from data in an incremental manner. They need less human domain expertise for manual feature extraction [
Compared with the results from our previous study using traditional ML algorithms to predict postoperative serious complications after bariatric surgery using SOReg data, the MLP and CNN showed improved, but limited, predictive ability, which deserves further investigation. The overfitting issue is still apparent and needs to be overcome by incorporating more patient features, for example, intra- and perioperative information, from other data resources.
Structure of the MLP model. MLP: multilayer perceptron.
Structure of the CNN model. CNN: convolutional neural network.
Performance of the first 175 MLP models with different computation units, epochs, and batch sizes. In general, performance of the models (measured as AUC) increased with more computation units and epochs, and decreased with larger batch sizes. Although the performance increased with the model’s complexity, the efficiency (measured as AUC divided by logarithmic computing time) decreased. MLP: multilayer perceptron; AUC: area under the curve.
1D: one-dimensional
Adam: adaptive moment estimation
AUC: area under curve
CAP: credit assignment path
CNN: convolutional neural network
DLNN: deep learning neural network
HbA_{1c}: hemoglobin A1c
LSTM: long short-term memory
MCC: Matthews correlation coefficient
ML: machine learning
MLP: multilayer perceptron
ReLU: rectified linear unit
RNN: recurrent neural network
ROC: receiver operating characteristic
SMOTE: synthetic minority oversampling technique
SOReg: the Scandinavian Obesity Surgery Registry
WC: waist circumference
YC’s work was supported by Örebro Region County Council (OLL864441). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
None declared.