This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
The absolute number of femoral neck fractures (FNFs) is increasing; however, the prediction of traumatic femoral head necrosis remains difficult. Machine learning algorithms have the potential to be superior to traditional prediction methods for the prediction of traumatic femoral head necrosis.
The aim of this study is to use machine learning to construct a model for the analysis of risk factors and prediction of osteonecrosis of the femoral head (ONFH) in patients with FNF after internal fixation.
We retrospectively collected preoperative, intraoperative, and postoperative clinical data of patients with FNF in 4 hospitals in Shanghai and followed up the patients for more than 2.5 years. A total of 259 patients with 43 variables were included in the study. The data were randomly divided into a training set (181/259, 69.8%) and a validation set (78/259, 30.1%). External data (n=376) were obtained from a retrospective cohort study of patients with FNF in 3 other hospitals. Least absolute shrinkage and selection operator regression and the support vector machine algorithm were used for variable selection. Logistic regression, random forest, support vector machine, and eXtreme Gradient Boosting (XGBoost) were used to develop the model on the training set. The validation set was used to tune the model hyperparameters to determine the final prediction model, and the external data were used to compare and evaluate the model performance. We compared the accuracy, discrimination, and calibration of the models to identify the best machine learning algorithm for predicting ONFH. Shapley additive explanations and local interpretable model-agnostic explanations were used to determine the interpretability of the black box model.
A total of 11 variables were selected for the models. The XGBoost model performed best on the validation set and external data. The accuracy, sensitivity, and area under the receiver operating characteristic curve of the model on the validation set were 0.987, 0.929, and 0.992, respectively. The accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve of the model on the external data were 0.907, 0.807, 0.935, and 0.933, respectively, and the log-loss was 0.279. The calibration curve demonstrated good agreement between the predicted probability and actual risk. The interpretability of the features and individual predictions was realized using the Shapley additive explanations and local interpretable model-agnostic explanations algorithms. In addition, the XGBoost model was translated into a self-made web-based risk calculator to estimate an individual's probability of ONFH.
Machine learning performs well in predicting ONFH after internal fixation of FNF. The 6-variable XGBoost model predicted the risk of ONFH well and had good generalization ability on the external data, which can be used for the clinical prediction of ONFH after internal fixation of FNF.
The incidence of hip fractures is changing worldwide. In most Western and Northern European countries, the incidence is decreasing, as it is in Singapore [
Published papers regarding ONFH prediction are primarily based on changes in the blood circulation of the femoral head by radiological investigations, such as single-photon emission computed tomography [
Prediction research using machine learning involves learning models from sample data and making predictions and decisions on new data. The support vector machine (SVM) algorithm exhibits good prediction performance and generalization ability when dealing with small-sample binary classification problems [
In machine learning, the black box describes models that cannot be understood by examining their parameters (eg, neural network and XGBoost) [
This study aims to explore and compare the application value of different machine learning algorithms for the prediction of ONFH after internal fixation of FNF and to develop a prediction model of ONFH based on a machine learning algorithm. In this study, the development and validation of the prediction model was guided by the Prediction Model Risk of Bias Assessment Tool (PROBAST) [
This multicenter retrospective study followed patients for at least 30 months after internal fixation of FNFs. The study population comprised patients with FNF with internal fixation who were discharged from Shanghai Ninth People's Hospital, Dongfang Hospital, and Yangpu District Central Hospital from January 1, 2015, to May 1, 2018, and the Tenth People's Hospital from January 1, 2017, to May 1, 2018. By searching the inpatient electronic medical record system, medical imaging information system, and laboratory information system, manually reading cases, and following up, we collected 47 clinical features from 316 patients with FNF. After excluding patients who were lost to follow-up or who died, 259 patients with FNF and the associated 43 variables were included in this study. The external data (n=376) were obtained from our previous retrospective cohort study [
The inclusion criteria were as follows: (1) patients with FNF aged 18 to 75 years treated with internal fixation, (2) patients with FNF with complete baseline data, (3) follow-up time ≥30 months, (4) American Society of Anesthesiologists anesthesia risk of approximately grade I to III, and (5) no moderate or severe pain or limitation of movement of the injured hip before the fracture. The exclusion criteria were as follows: (1) patients with FNF with pathological fractures or old fractures admitted >2 weeks after injury; (2) patients with failed internal fixation; (3) patients with a history of malignant tumors, nontraumatic fractures, ipsilateral lower limb fractures, or other fractures; (4) patients with a history of long-term diving, alcohol abuse, or fluoroquinolone, antiplatelet drug, or hormone use; (5) patients with multiple fractures at the same site, injuries on the opposite side, or fractures of both lower limbs in the past 6 months; (6) patients who experienced acute myocardial infarction, cerebrovascular accident, severe trauma, or major surgery within half a year; (7) patients who underwent vascular grafting or free fibula transplantation during internal fixation; and (8) patients with poor compliance. The diagnosis of ONFH was based on the updated version of the Association Research Circulation Osseous grading system [
The protocol for this research project was approved by the Medical and Life Science Ethics Committee of Tongji University (2019tjdx285; date June 18, 2019). Given the retrospective nature of this study, the requirement for informed consent was waived.
A total of 47 clinical features were collected, and features with missing values >20% were excluded. Overall, 43 candidate variables were included in the following categories: (1) demographic information: age, sex, smoking, drinking, and age-adjusted Charlson Comorbidity Index [
Outlier detection was performed on the raw data. Each outlier was checked twice against the medical history to determine whether the value was genuine, and errors caused by incorrect manual collection were rectified. In this study, the proportion of missing values per variable was <5%, and missing values were substituted with the mean. Original continuous variables, such as blood biochemical indices, were converted into low, normal, and high categorical variables according to clinical significance. According to the modeling requirements, categorical variables were transformed into dummy variables. Standardization of continuous variables was not a necessary preprocessing step; although few continuous variables were included in this study, we compared the effects of standardization and nonstandardization during modeling. Finally, the processed data were randomly divided into a training set and a validation set at a ratio of 7:3.
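The preprocessing steps above (dummy-variable encoding of categorical variables, then a random 7:3 split) can be sketched as follows; the toy data frame and its column names are hypothetical stand-ins for the study's 43 candidate variables:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the real study used 43 candidate variables.
df = pd.DataFrame({
    "garden": ["I", "II", "IV", "III", "IV", "II", "I", "III", "IV", "II"],
    "vas": [3, 5, 7, 4, 8, 2, 6, 3, 9, 4],
    "onfh": [0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Categorical variables are converted into dummy (one-hot) variables.
X = pd.get_dummies(df.drop(columns="onfh"), columns=["garden"])
y = df["onfh"]

# Random 7:3 split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=420
)
print(X_train.shape, X_val.shape)
```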
The ratio of ONFH to non-ONFH cases was 1:5, which is unbalanced. When unbalanced data are used to fit the model, the decision boundary is biased toward the majority class, resulting in low sensitivity and high specificity [
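The study balanced the training set with the SMOTE algorithm (via the imblearn library). The core idea — synthesizing new minority samples by interpolating between a minority sample and one of its nearest minority-class neighbors — can be illustrated with a simplified sketch; `smote_sketch` is an illustration of the idea, not the library implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sketch(X_min, n_new, k=3):
    """Simplified SMOTE: interpolate between a minority sample and
    one of its k nearest minority-class neighbours."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to all minority samples.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]      # skip the sample itself
        j = rng.choice(nn)
        lam = rng.random()               # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Toy minority class: synthesize 5 new samples from 5 existing ones.
X_min = rng.normal(size=(5, 3))
X_new = smote_sketch(X_min, n_new=5)
print(X_new.shape)
```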
A large amount of collected clinical data inevitably contains redundant features and noise, which can lead to overfitting and poor classification. Variable selection removes irrelevant and redundant features and, to a certain extent, reduces the impact of noisy data on classifier performance [
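A minimal sketch of how an L1 (LASSO-type) penalty performs variable selection, on synthetic data shaped like this study (43 candidate features); the regularization strength `C=0.1` is illustrative, not the study's tuned value:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in shaped like the study: 43 candidate variables.
X, y = make_classification(n_samples=259, n_features=43, n_informative=8,
                           random_state=420)

# The L1 penalty shrinks the coefficients of irrelevant or redundant
# features to exactly zero, removing them from the model.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)
selected = np.flatnonzero(lasso.coef_[0])
print(f"{len(selected)} of {X.shape[1]} features kept")
```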
Four classification algorithms (LR, RF, SVM, and XGBoost) were used to establish the models. The parameter learning curve, grid search, and cross-validation were used to tune the model parameters, and the best parameter combination was determined by checking the accuracy, sensitivity, and AUC of the model on the validation set. The parameter learning curve plots different parameter values on the abscissa against the model score under each value on the ordinate. From it, we can observe how the model evaluation index changes with the parameter value, obtain an initial narrow search interval, or directly select the value at which model performance peaks as the optimal parameter value. Grid search selects among all candidate parameters through loop traversal: the system tries every combination, and the best-performing parameters are kept as the final result. Cross-validation randomly divides the data set into k parts without replacement; k−1 parts are used to train the model, and the remaining part is used for performance evaluation. This process is repeated k times to obtain k models and k performance evaluation results.
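The grid search with cross-validation described above can be sketched with scikit-learn's `GridSearchCV` on synthetic data; the C grid mirrors the LR adjustment range reported in the hyperparameter table, but the data here are not the study's:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 11-variable training data.
X, y = make_classification(n_samples=200, n_features=11, random_state=420)

# Loop over every candidate value of C with 10-fold cross-validation;
# the best-performing value is kept as the final result.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1]},
    cv=10,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```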
The aim of parameter tuning is to minimize the generalization error of the model. In machine learning, the generalization error measures the accuracy of the model on unseen data. A model that is too simple or too complex will have a high generalization error: a model that is too complex will overfit, whereas a model that is too simple will underfit. By comparing the sample learning curves of the training and validation sets, we can observe how well the model fits. The sample learning curve plots the number of training samples on the abscissa against the accuracy of the training or validation set at that sample size on the ordinate. When the errors of the training and validation sets converge but the accuracy is low, the bias is high and the model is underfitted. When a large gap persists between the training and validation accuracies, the variance is high and the model is overfitted. A large bias or a large variance indicates a large generalization error.
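The sample learning curve described above can be produced with scikit-learn's `learning_curve`; a sketch on synthetic data (not the study's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=300, n_features=11, random_state=420)

# Accuracy on the training folds vs. the validation folds at
# increasing numbers of training samples.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)
# A persistent gap between the two mean curves suggests overfitting;
# two converged curves at low accuracy suggest underfitting.
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
print(sizes, np.round(gap, 3))
```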
The confusion matrix gives the counts of true outcomes and predictions under the different labels (ONFH or non-ONFH). A series of indicators can be calculated from the confusion matrix. The accuracy of the model is a key indicator for measuring model quality. According to PROBAST [
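These indicators follow directly from the four cells of the confusion matrix; a small sketch with hypothetical labels (1 = ONFH, 0 = non-ONFH):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = ONFH, 0 = non-ONFH.
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall of the ONFH class
specificity = tn / (tn + fp)
print(accuracy, sensitivity, specificity)
```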
Calibration was assessed with the log-loss and the calibration curve. Log-loss is the negative log-likelihood of the true labels given the predicted probabilities; a smaller value indicates more accurate predicted probabilities. In this study, all samples were reordered according to the predicted probability and divided into 10 equal groups. The calibration curve shows the distance between the predicted probabilities and the true incidence of ONFH in each group. A curve closer to the ideal line (y=x) indicates better calibration of the model.
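Both calibration measures can be sketched with scikit-learn (`log_loss`, and `calibration_curve` with 10 probability-ordered groups); the labels and predicted probabilities below are hypothetical:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import log_loss

rng = np.random.default_rng(420)
y_true = rng.integers(0, 2, size=200)
# Hypothetical predicted probabilities loosely tied to the labels.
y_prob = np.clip(y_true * 0.6 + rng.random(200) * 0.4, 0.01, 0.99)

# Smaller log-loss means the predicted probabilities match the labels better.
ll = log_loss(y_true, y_prob)

# 10 probability-ordered groups: observed event rate vs. mean predicted
# probability in each group, as described for the calibration curve.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10,
                                        strategy="quantile")
print(round(ll, 3), frac_pos, mean_pred)
```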
The black box model is explained through both global and local explanations. Shapley additive explanations (SHAP) is based on the theoretically optimal Shapley values [
Qualitative variables are expressed as rates or proportions. The Kolmogorov-Smirnov test was used to test the normality of the quantitative variables. Variables that followed a normal distribution are expressed as mean (SD), and variables that did not are expressed as median (25th percentile [P_{25}], 75th percentile [P_{75}]). The Kendall and Spearman correlation coefficients were used to describe the correlations between the qualitative and quantitative variables, respectively. A coefficient greater than 0.6 indicates a correlation between the 2 variables. LASSO regression was used to eliminate multicollinearity. Statistical analysis was performed using Python 3.7.4 (Anaconda 4.9.2). The main Python library and version information used for modeling are listed in
The flowchart of the study is shown in
Python library and function.
Library  Version  Function 
scikit-learn  0.24.1  Machine learning 
NumPy  1.16.5  Scientific computing 
pandas  0.25.1  Data analysis 
Matplotlib  3.3.4  Visualization 
imblearn  0.0  Imbalanced data set 
statsmodels  0.12.2  Statistical computations 
XGBoost^{a}  1.3.3  Gradient boosting framework 
SHAP^{b}  0.39.0  Explain the output of machine learning model 
LIME^{c}  0.2.0.1  Explain the output of machine learning model 
Flask  1.1.1  Web development 
Gunicorn  20.1.0  HTTP server 
^{a}XGBoost: eXtreme Gradient Boosting.
^{b}SHAP: Shapley additive explanations.
^{c}LIME: local interpretable model-agnostic explanations.
Flowchart of the study. LASSO: least absolute shrinkage and selection operator; LIME: local interpretable model-agnostic explanations; LR: logistic regression; ONFH: osteonecrosis of the femoral head; RF: random forest; SHAP: Shapley additive explanations; SMOTE: synthetic minority oversampling technique; SVM: support vector machine; XGBoost: eXtreme Gradient Boosting.
A total of 259 patients with FNF were included in this study, comprising 124 (47.8%) men and 135 (52.1%) women; the median (P_{25}, P_{75}) age was 57 (49, 62) years. A total of 43 patients experienced ONFH after internal fixation surgery, an ONFH incidence of 16.6%. All data were randomly divided into a training set (181/259, 69.8%) and a validation set (78/259, 30.1%) at a ratio of 7:3 (random_state=420). There were 29 patients with ONFH and 152 patients without ONFH in the training set. After using the SMOTE algorithm to oversample the femoral head necrosis group in the training set, the ONFH and non-ONFH groups reached a balance (152 cases in each group). There were 14 patients with ONFH and 64 patients without ONFH in the validation set. Patient characteristics in the 3 data sets are presented in
First, we used grid search with 10-fold cross-validation (GridSearchCV) to explore the LASSO regression regularization parameter α (
The process of exploring the optimal feature subset. (A) Grid search and 10-fold cross-validation estimation of the least absolute shrinkage and selection operator regression regularization parameter α. The y-axis represents the average and SD of 10 cross-validations. (B) Prediction results of the support vector machine (SVM) with different validation samples under 4 kernel functions: linear, polynomial, radial basis function (RBF), and sigmoid. (C) Best α that gives the SVM model the best accuracy, sensitivity, and area under the receiver operating characteristic curve performance on the validation set. (D) Comparison of standardized with nonstandardized results of continuous variables under different support vector machine parameters C (kernel=linear). AUC: area under the receiver operating characteristic curve; CV: cross-validation.
Therefore, to obtain a reliable α and identify the optimal feature subset, the feature subset under 10-fold cross-validation was first introduced into the SVM classifier for modeling. The performance of the SVM classifier on the validation set using different kernel functions was determined (other parameters used default values).
In addition, we compared the results of standardization with nonstandardization on the accuracy, sensitivity, and AUC of the validation set under different SVM parameters C (kernel=
After confirming the optimal feature subset, LR, RF, SVM, and XGBoost algorithms were selected to fit the models on the balanced training set.
Comparison of model performance on the validation set.
Model  Before tuning  After tuning  
Accuracy  Sensitivity  AUC^{a}  Accuracy  Sensitivity  AUC  
LR^{b}  0.962  0.929  0.982  0.962  0.929  0.984  
RF^{c}  0.962  0.857  0.985  0.974  0.929  0.991  
SVM^{d}  0.962  0.929  0.973  0.962  0.929  0.979  
XGBoost^{e}  0.962  0.857  0.989  0.987  0.929  0.992 
^{a}AUC: area under the receiver operating characteristic curve.
^{b}LR: logistic regression.
^{c}RF: random forest.
^{d}SVM: support vector machine.
^{e}XGBoost: eXtreme Gradient Boosting.
Hyperparameter configuration for algorithms.
Algorithm and parameter name  Initial value  Adjustment range  Result  
LR^{a}  
Penalty  L1  (L1, L2)  L2 
C  0.3  (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1)  0.5 
RF^{b}  
n_estimators  100  Range (0, 200, 10)  11 
max_depth  8  (1, 3, 5, 6, 7, 8, 9, 10, 15, 20)  8 
max_features  3  (2, 3, 4, 5, 6, 7, 8)  5 
min_samples_leaf  1  (1, 2, 3, 4)  2 
SVM^{c}  
Kernel  rbf  (linear, polynomial, rbf, sigmoid)  linear 
C  1  Range (0.01, 20, 20)  7.37 
XGBoost^{d}  
n_estimators  100  Range (0, 200, 10)  51 
max_depth  6  Range (1, 20, 1)  8 
min_child_weight  1  (2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20)  7 
learning_rate  0.3  (0.3, 0.31, 0.32, 0.33, 0.335, 0.36, 0.38, 0.4)  0.335 
gamma  0  (0, 1, 2, 3, 4)  1 
^{a}LR: logistic regression.
^{b}RF: random forest.
^{c}SVM: support vector machine.
^{d}XGBoost: eXtreme Gradient Boosting.
Receiver operating characteristic curves of the logistic regression, random forest, support vector machine (SVM), and eXtreme Gradient Boosting (XGBoost) prediction models on the training set and the validation set, which indicate discrimination ability. The closer the curve is to the upper left corner, the better. AUC: area under the receiver operating characteristic curve.
The learning curves of the logistic regression, random forest, support vector machine (SVM), and eXtreme Gradient Boosting (XGBoost) prediction models on the training and validation sets. The 2 curves of the logistic regression, support vector machine, and XGBoost models converge at a high accuracy level, indicating that the models fit the training data well. The 2 curves of the random forest do not converge well, indicating slight overfitting.
Confusion matrices of the prediction models of ONFH^{a}.
Model and actual  Predicted ONFH  Predicted non-ONFH  
LR^{b}  
ONFH  64  19 
Non-ONFH  21  272 
RF^{c}  
ONFH  64  19 
Non-ONFH  24  269 
SVM^{d}  
ONFH  61  22 
Non-ONFH  15  278 
XGBoost^{e}  
ONFH  67  16 
Non-ONFH  19  274 
^{a}ONFH: osteonecrosis of the femoral head.
^{b}LR: logistic regression.
^{c}RF: random forest.
^{d}SVM: support vector machine.
^{e}XGBoost: eXtreme Gradient Boosting.
Performance comparison on external data.
Model  Accuracy  Discrimination  Calibration  
Sensitivity  Specificity  AUC^{a}  F1 score  Log-loss  
LR^{b}  0.894  0.771  0.928  0.927  0.762  0.288  
RF^{c}  0.886  0.771  0.918  0.910  0.749  0.775  
SVM^{d}  0.901  0.735  0.949  0.904  0.767  0.327  
XGBoost^{e}  0.907  0.807  0.935  0.933  0.793  0.279 
^{a}AUC: area under the receiver operating characteristic curve.
^{b}LR: logistic regression.
^{c}RF: random forest.
^{d}SVM: support vector machine.
^{e}XGBoost: eXtreme Gradient Boosting.
(A) Comparison of receiver operating characteristic curves of the 4 models on external data. A curve closer to the upper left corner shows better overall discrimination ability. (B) Comparison of precision-recall curves of the 4 models on external data. A curve closer to the upper right corner shows a better balance of precision and sensitivity. AP: average precision; AUC: area under the receiver operating characteristic curve; ROC: receiver operating characteristic; SVM: support vector machine; XGBoost: eXtreme Gradient Boosting.
Comparison of calibration curves of the 4 models on external data. A calibration curve consistent with the ideal calibration line (y=x) indicates that the predicted values of the model are close to the actual probability of the outcome. SVM: support vector machine; XGBoost: eXtreme Gradient Boosting.
On the basis of the above comparisons, we determined that the XGBoost model was the best predictive model for ONFH. The importance of each feature was computed as the mean absolute SHAP value of that feature across samples. The predictor variables of the XGBoost model, ranked by importance, are as follows: reduction quality (1.759), VAS score (1.483), Garden classification (0.299), time to surgery (0.247), cause of injury (0.127), and fracture position (0.090).
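This importance computation can be reproduced from a matrix of SHAP values; the sketch below uses hypothetical SHAP values, not the study's, to show the mean-absolute-value step:

```python
import numpy as np

# Hypothetical SHAP value matrix: rows = samples, columns = the 6 predictors.
features = ["reduction_quality", "vas", "garden_IV",
            "time_to_surgery", "injury_cause", "fracture_position"]
shap_values = np.array([
    [ 1.9, -1.2,  0.3,  0.2, -0.1,  0.1],
    [-1.6,  1.7, -0.3,  0.3,  0.2, -0.1],
    [ 1.8,  1.5,  0.3, -0.2, -0.1,  0.1],
])

# Importance = mean absolute SHAP value per feature, as in the text.
importance = np.abs(shap_values).mean(axis=0)
for i in np.argsort(importance)[::-1]:
    print(features[i], round(float(importance[i]), 3))
```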
Global explanations of the eXtreme Gradient Boosting model based on Shapley additive explanations (SHAP) values. Summary of the SHAP values of each feature in each sample. The abscissa is the SHAP value (the impact on the model output), the ordinate lists the different features, each point represents a sample, and the color represents the feature value. The larger the feature value, the redder the color; the smaller the feature value, the bluer the color. VAS: visual analog scale.
Local explanations of the XGBoost model. (A) The true outcome is non-osteonecrosis of the femoral head (ONFH), and the predicted outcome is non-ONFH. (B) The true outcome is ONFH, and the predicted outcome is ONFH. A1 and B1 are local explanations produced by Shapley additive explanations: variables in blue push the sample toward the non-ONFH class, and variables in red push it toward the ONFH class. A2 and B2 are local explanations produced by local interpretable model-agnostic explanations (LIME). LIME gives the probability of each class and shows which variables push the sample toward non-ONFH (blue) and which toward ONFH (orange), specifically listing the sample's values on these features. VAS: visual analog scale.
In this study, we compared the application of different machine learning algorithms in the prediction of femoral head necrosis after internal fixation of FNFs and obtained a 6-variable XGBoost model that could be used for the clinical prediction of traumatic ONFH. This model was translated into a self-made web-based risk calculator to estimate an individual's probability of ONFH. The predictors included reduction quality, VAS score, Garden classification, time to surgery, cause of injury, and fracture position. This prediction model exhibited good discrimination and calibration and showed good generalization performance on external data. Performance on the internal validation set yielded an accuracy of 0.987, sensitivity of 0.929, and AUC of 0.992. Performance on external data revealed an accuracy of 0.907, sensitivity of 0.807, specificity of 0.935, AUC of 0.933, F1 score of 0.793, and log-loss of 0.279. The web-based risk calculator can be found on the Herokuapp website [
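A minimal sketch of how such a web-based risk calculator endpoint might look with Flask (which the study lists among its libraries); the route name, input fields, and the fixed-probability stand-in `predict_proba` are assumptions, and a real deployment would load the trained XGBoost model:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for the trained model's predicted probability;
# the real calculator would call the saved XGBoost model here.
def predict_proba(features):
    return 0.25

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()
    return jsonify({"onfh_probability": predict_proba(features)})

# Local smoke test without starting a server.
client = app.test_client()
resp = client.post("/predict", json={"vas": 5, "garden_IV": 1})
print(resp.get_json())
```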
While constructing the predictive model, we also explored the predictive variables of ONFH after internal fixation of FNF. In the design stage of the study, we made every effort to collect relevant injury and clinical information throughout the clinical course, such as preoperative coagulation indicators, preoperative routine blood tests, and other indicators that had not been analyzed in previous studies. However, these indicators did not pass variable selection. A new British study [
Among the 6 predictors in the XGBoost model, poor reduction, severe fracture displacement, and delay in operation time were clear risk factors for ONFH after internal fixation of FNF. The VAS pain score is widely used in clinical prognosis research and has high reliability and validity. After internal fixation, patients generally experience slight soreness when they get up and sit down and when the temperature suddenly drops. When osteocytes of the hip joint change histologically, patients may experience pain. Through finite element analysis based on biomechanics, Li et al [
It is worth noting that before the categorical variables entered the machine learning classifier to fit the model, they were converted into dummy variables according to category. The Garden classification is a 4-category variable, but after XGBoost modeling, only Garden classification IV remained as a predictor variable. Thus, Garden classification is no longer a 4-category variable but a binary variable, Garden classification_IV.
Obtaining a sufficient number of training samples is difficult and time-consuming for the prediction of femoral head necrosis after FNF. LR, RF, SVM, and XGBoost can learn effectively from a limited training set. As a powerful ensemble algorithm, XGBoost not only outperformed SVM and RF in our study but was also more accurate and reliable than traditional LR.
In addition, we opened the black box of machine learning with the help of post hoc interpretability techniques. Through the global interpretation based on SHAP, we can understand the relationship between the predictors and outcomes in the XGBoost model. The variables reduction quality_good and injury cause_low energy correlated negatively with the outcome and were protective factors; VAS score, Garden classification_IV, time to surgery, and fracture position_subcapital correlated positively with the outcome and were risk factors. Both SHAP and LIME can provide local explanations for a single sample. The explanatory plot produced by SHAP is close to the one generated by LIME in that it shows the names and contributions of the variables used in the explanation [
This study has several limitations. First, the number of ONFH cases was insufficient. According to the PROBAST requirements, the ratio of participants with the clinical event to the number of candidate predictors should be at least 10. There were 6 predictors in the model, and only 43 patients had ONFH. However, we used the SMOTE algorithm to balance the training set, increasing the number of ONFH cases to 152. Second, the sensitivity and F1 score on the external data were approximately 0.8, which is low compared with the other indicators. When using the LIME algorithm to explain individual predictions, we discovered that most samples used only 4 variables for prediction. Therefore, the reasons for the low sensitivity and F1 score may include the following: (1) the number of ONFH cases is insufficient and (2) there are still risk factors related to ONFH that have not been identified. In the future, we will conduct prospective validation based on this model, continue to explore important risk factors for ONFH, and modify the model to further improve the accuracy of the XGBoost prediction model.
The patients with FNF in this study were from 6 hospitals in Shanghai, making the sample more representative. We included a wider range of candidate variables. Instead of using traditional single-variable analysis for variable selection, LASSO was combined with the SVM as a new variable selection method. The performance of our model on the validation set was better than that of the naive Bayesian prediction model proposed by Cui et al [
Machine learning performs well in predicting ONFH after internal fixation of FNF. The 6-variable XGBoost model predicts the risk of ONFH well and has good generalization ability on external data, and it can be used for the clinical prediction of ONFH after internal fixation of FNF.
Variables and definitions.
Characteristics of 3 groups of patients with femoral neck fracture.
Correlation coefficient matrix heat maps.
Parameters of machine learning models.
average precision
area under the receiver operating characteristic curve
femoral neck fracture
false-positive rate
least absolute shrinkage and selection operator
local interpretable model-agnostic explanations
logistic regression
osteonecrosis of the femoral head
precision-recall
Prediction Model Risk of Bias Assessment Tool
random forest
receiver operating characteristic
Shapley additive explanations
synthetic minority oversampling technique
support vector machine
visual analog scale
eXtreme Gradient Boosting
This study was supported by the National Natural Science Foundation of China (grant number 81872718), Shanghai Municipal Health and Family Planning Commission (grant number 201840041), The Outstanding Clinical Discipline Project of Shanghai Pudong (grant number PWYgy201810), and Key Undergraduate Course Project of Shanghai Education Commission (201965).
None declared.