JMIR Medical Informatics (JMIR Med Inform), ISSN 2291-9694

JMIR Publications, Toronto, Canada

Article ID: v7i1e11728; PMID: 30747712; DOI: 10.2196/11728

Original Paper

Development of Prediction Models Using Machine Learning Algorithms for Girls with Suspected Central Precocious Puberty: Retrospective Study

Editor: Gunther Eysenbach. Reviewers: Pengshan Cai, Jinying Chen.

Liyan Pan, MSc (1), http://orcid.org/0000-0002-7306-4916
Guangjian Liu, PhD (1), http://orcid.org/0000-0001-9275-9214
Xiaojian Mao, PhD (2), http://orcid.org/0000-0002-4915-4903
Huixian Li, MSc (1), http://orcid.org/0000-0001-5722-8602
Jiexin Zhang, BD (1), http://orcid.org/0000-0003-1630-380X
Huiying Liang, PhD (1), http://orcid.org/0000-0002-9987-8002
Xiuzhen Li, PhD (2), http://orcid.org/0000-0001-5124-3051

1 Institute of Pediatrics, Guangzhou Women and Children’s Medical Center, Guangzhou Medical University, Guangzhou, China

2 Department of Genetics and Endocrinology, Guangzhou Women and Children’s Medical Center, Guangzhou Medical University, Guangzhou, China

Corresponding Author: Xiuzhen Li, PhD, Department of Genetics and Endocrinology, Guangzhou Women and Children’s Medical Center, Guangzhou Medical University, 9 Jinsui Road, Tianhe District, Guangzhou, 510623, China. Tel/Fax: 86 02038857692. Email: 13725100840@163.com

JMIR Med Inform 2019 (Jan-Mar);7(1):e11728. Published 12 February 2019. Article history dates: 30 July 2018, 6 September 2018, 10 October 2018, 9 December 2018.

©Liyan Pan, Guangjian Liu, Xiaojian Mao, Huixian Li, Jiexin Zhang, Huiying Liang, Xiuzhen Li. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 12.02.2019.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.

Background

Central precocious puberty (CPP) in girls seriously affects their physical and mental development in childhood. The method of diagnosis—gonadotropin-releasing hormone (GnRH)–stimulation test or GnRH analogue (GnRHa)–stimulation test—is expensive and makes patients uncomfortable due to the need for repeated blood sampling.

Objective

We aimed to combine multiple CPP–related features and construct machine learning models to predict response to the GnRHa-stimulation test.

Methods

In this retrospective study, we analyzed clinical and laboratory data of 1757 girls who underwent a GnRHa test in order to develop XGBoost and random forest classifiers for prediction of response to the GnRHa test. The local interpretable model-agnostic explanations (LIME) algorithm was used with the black-box classifiers to increase their interpretability. We measured the sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) of the models.

Results

Both the XGBoost and random forest models achieved good performance in distinguishing between positive and negative responses, with the AUC ranging from 0.88 to 0.90, sensitivity ranging from 77.91% to 77.94%, and specificity ranging from 84.32% to 87.66%. Basal serum luteinizing hormone, follicle-stimulating hormone, and insulin-like growth factor-I levels were found to be the three most important factors. In the interpretable models of LIME, the abovementioned variables made high contributions to the prediction probability.

Conclusions

The prediction models we developed can help diagnose CPP and may be used as a prescreening tool before the GnRHa-stimulation test.

Keywords: central precocious puberty; GnRHa-stimulation test; machine learning; prediction model

Introduction

Precocious puberty is defined as the development of secondary sexual characteristics in girls before the age of 8 years and in boys before the age of 9 years. In recent years, the age of puberty onset has shown a decreasing trend, and the timing of puberty is related to subsequent health outcomes such as breast cancer, diabetes, and behavioral disorders [1]. Central precocious puberty (CPP), also known as true precocious puberty, is caused by early activation of the hypothalamic-pituitary-gonadal axis with clinical pubertal symptoms. CPP can reduce final adult height and result in psychological problems, which may lead to inappropriate behaviors. It is therefore important for girls with suspected CPP to be evaluated and diagnosed in a timely manner.

The gold standard for confirming CPP is a positive gonadotropin response to a gonadotropin-releasing hormone (GnRH)–stimulation test. When GnRH is not available, GnRH analogues (GnRHa) are usually used instead [2]. However, the stimulation test is time consuming and expensive. In addition, the test is painful and makes patients uncomfortable because it requires repeated blood sampling at different time points. Therefore, a method that avoids the disadvantages of the GnRHa-stimulation test would be of great help in the diagnosis of CPP.

Several studies have focused on investigating a single gonadotropin biomarker to identify patients with CPP conveniently. Basal or peak serum luteinizing hormone (LH), follicle-stimulating hormone (FSH), and the LH/FSH ratio are the most commonly reported biomarkers [3-7]. However, the cut-off values of these single biomarkers depend on the assay used to measure the gonadotropin levels. As a result, cut-off points in previous studies differed widely. Moreover, both the Pasternak group [3] and the Mogensen group [8] reported that a single basal serum LH measurement could verify the presence of CPP but could not confirm its absence. Therefore, a single gonadotropin parameter alone may not be sufficient for the diagnosis of CPP, and clinical and laboratory factors that can predict response to the GnRHa-stimulation test should be combined [9,10]. Suh et al [10] found that accelerated growth rate, advanced bone age, and increased basal gonadotropin and insulin-like growth factor-I (IGF-I) levels were correlated with CPP. Traditional statistical analyses, including the t test and binary logistic regression, were used to select factors correlated with the GnRH test [11-14]. Although these studies made remarkable progress, their low sensitivity or specificity means that there is still a long way to go before they can be applied in clinics.

Considering the previous studies and the extensive application of machine learning algorithms in the medical field, we aimed to determine whether combining multiple variables with machine learning classifiers could improve prediction of the GnRHa-stimulation test and thus help diagnose CPP.

Methods

Population and Variables

We enrolled 1757 girls with CPP symptoms who visited the Pediatric Day Ward of the Endocrinology Department at Guangzhou Women and Children’s Medical Center from January 2012 to March 2018. All subjects had undergone the GnRHa-stimulation test. Girls with any other disorders or intracranial lesions were excluded from the study.

Girls fulfilling both of the following criteria were considered to have a positive response to the GnRHa-stimulation test and were diagnosed with CPP in our study: (1) peak LH level ≥ 10 IU/L, or peak LH level ≥ 5 IU/L combined with a peak LH to FSH ratio ≥ 0.6, and (2) onset of secondary sexual characteristics before the age of 8 years. Girls who did not satisfy both criteria were considered to have a negative response. In line with long-term clinical practice, a peak LH ≥ 10 IU/L is used as the diagnostic criterion in our hospital. A peak LH level ≥ 5 IU/L combined with a peak LH to FSH ratio ≥ 0.6 is widely used in China and some other countries for children undergoing the GnRH-stimulation test [15,16]. Since the stimulation effect of GnRHa is almost hundreds of times that of GnRH [17], a condition that affects sex hormone levels under GnRH stimulation would do the same under GnRHa stimulation. Our diagnostic criteria are an improved version of the existing criteria, adapted to our population.
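The labeling rule described above can be written as a small predicate. The following is a minimal sketch: the function and argument names are illustrative and do not come from the study's code, and the guard against a zero FSH value is an added assumption.

```python
def is_positive_response(peak_lh, peak_fsh, onset_age_years):
    """Label a GnRHa-stimulation test as positive under the criteria in the text.

    peak_lh and peak_fsh are peak levels in IU/L; onset_age_years is the age
    at onset of secondary sexual characteristics. Names are hypothetical.
    """
    # Criterion 1: peak LH >= 10 IU/L, or peak LH >= 5 IU/L together with
    # a peak LH/FSH ratio >= 0.6 (the FSH > 0 guard is an assumption).
    lh_criterion = peak_lh >= 10 or (
        peak_lh >= 5 and peak_fsh > 0 and peak_lh / peak_fsh >= 0.6
    )
    # Criterion 2: onset of secondary sexual characteristics before 8 years.
    early_onset = onset_age_years < 8
    return lh_criterion and early_onset
```

For example, a peak LH of 6 IU/L with a peak FSH of 8 IU/L (ratio 0.75) and onset at 7.5 years is labeled positive, whereas the same LH with a peak FSH of 12 IU/L (ratio 0.5) is labeled negative.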

Information such as chief complaints, development of secondary sexual characteristics, and abnormal duration of puberty was stored as free text in the clinical records of the electronic medical records system. Laboratory test values were reported as structured data in the laboratory information system. In total, 19 variables were extracted from the clinical records and laboratory results for all 1757 patients. Specifically, the 10 variables extracted from the clinical records were weight, height, body mass index (BMI), abnormal duration of puberty in records (History), menarche, core in breast (Core), pigmentation, development stage of pubes (Pubes), development stage of left breast, and development stage of right breast. Breast and pubic hair development were evaluated using Tanner staging (stages 1 to 5). The nine variables extracted from the laboratory results were age and the basal serum LH, FSH, estradiol, prolactin, testosterone, growth hormone, IGF-I, and IGF-binding protein-3 levels before the GnRHa test.

Among the 1757 patients, 436 girls had examination reports available, including pelvic ultrasonography (for development of the uterus and ovaries) and radiography of the left hand (for bone age). The six variables extracted from the examination reports were development of uterus, existence of follicle, uterine volume, left and right ovary volumes, and bone age. Bone age was measured by the Greulich and Pyle method [18]. The variables from the clinical records and the examination reports were first extracted with regular expression matching in Python [19] and then checked manually by two endocrinologists.

This study was approved by the Institutional Review Board of Guangzhou Women and Children’s Medical Center and conducted in accordance with the ethical guidelines of the Declaration of Helsinki of the World Medical Association. The requirement to obtain informed consent was waived because of the retrospective nature of the study. Data used in this study were anonymous, and no identifiable personal data of the patients were available for the analysis.

Data Preprocessing

Variables with more than 20% missing data, such as growth rate of height and weight and heights of parents, were excluded from this study. Missing values for continuous variables were replaced with mean values of all the samples grouped by age. Discrete variables like experience of menarche were filled with a value of 0. Discrete variables like Tanner stage for breast and pubes were filled with the least degree (stage 1).
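The imputation rules above can be sketched with pandas as follows. The column names and the choice to group by age in completed years are assumptions made for illustration; the text does not state how ages were binned.

```python
import pandas as pd


def impute_missing(df, continuous_cols, binary_cols, tanner_cols, age_col="age_years"):
    """Fill missing values following the rules described in the text.

    All column names are hypothetical. Grouping by completed years of age
    is an assumption; the age column itself must not contain missing values.
    """
    out = df.copy()
    age_group = out[age_col].astype(int)  # truncate to completed years
    for col in continuous_cols:
        # Continuous variables: mean of the samples in the same age group.
        group_mean = out.groupby(age_group)[col].transform("mean")
        out[col] = out[col].fillna(group_mean)
    # Yes/no variables such as menarche: filled with 0.
    out[binary_cols] = out[binary_cols].fillna(0)
    # Tanner stages: filled with the least degree (stage 1).
    out[tanner_cols] = out[tanner_cols].fillna(1)
    return out


# Tiny synthetic example (values are made up for illustration only).
records = pd.DataFrame({
    "age_years": [6.2, 6.8, 7.1, 7.9],
    "igf1": [200.0, None, 300.0, None],
    "menarche": [0, None, 1, 0],
    "breast_l": [2, None, 3, None],
})
filled = impute_missing(records, ["igf1"], ["menarche"], ["breast_l"])
```

A continuous value can stay missing if every sample in its age group lacks that variable, so a fallback (for example, the overall mean) may be needed on real data.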

Model Development and Assessment

Tree learning classifiers allow nonlinear interactions between features and have good interpretability. Considering this, we selected two tree-based ensemble binary classification algorithms, extreme gradient boosting (XGBoost) and random forest, to develop our models. We also used a linear support vector machine (SVM) and a decision tree for the classification to compare the performance of ensemble and nonensemble models. The models aimed to identify relationships between the input features and the output GnRHa test results, thereby determining whether a patient responds positively to the GnRHa test.

XGBoost is a scalable and effective tree boosting algorithm [20]. It trains a sequence of models, each of which is fitted to reduce the errors made by the existing models. The models in XGBoost are decision trees. Many data scientists have applied this algorithm to classification problems and achieved excellent results, and XGBoost has also been used successfully in medical studies [21,22]. As XGBoost is essentially a gradient boosting tree model, which is not based on distance, normalization is not required. Random forest is another classical ensemble learning algorithm that combines a large number of trees [23], which are trained in parallel on data sampled with replacement. It applies bootstrap aggregating to tree learners, which improves performance by decreasing variance. Random forest can handle nonlinear data and is robust to noise. In addition, parameter tuning for these two algorithms is less complex than for other ensemble learning algorithms. SVM is a binary classifier with a maximum margin hyperplane [24]. The decision tree classifier is a tree-like model used for classification [25].

In order to obtain robust assessments and prevent overfitting, we used nested cross-validation with an outer Monte Carlo cross-validation [26] (MCCV, repeated 20 times) and an inner k-fold cross-validation (k=5) for parameter tuning, yielding a total of 20 runs of five-fold cross-validation. In the outer MCCV loop, the whole data set is randomly divided into a training set (80%) and a test set (20%), and this split is repeated 20 times. For each training set, the inner stratified five-fold cross-validation loop is performed as follows. The training set is split into five subsets, of which four are used for training and one is used for testing. Parameter tuning is performed with grid search in the inner cross-validation. Finally, the model fitted on the training set with the parameters that have the best area under the curve (AUC) on the inner test sets is selected. The detailed training and test process is presented in Figure 1.
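The nested scheme can be sketched with scikit-learn as shown below. This is an illustrative outline under stated assumptions, not the study's code: it uses `RandomForestClassifier` only (XGBoost would be plugged in the same way), a synthetic data set, a small hypothetical parameter grid, and fewer outer repeats than the 20 used in the study so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit, StratifiedKFold


def nested_mccv(X, y, param_grid, n_repeats=20, test_size=0.2, seed=0):
    """Outer Monte Carlo CV with an inner stratified 5-fold grid search."""
    outer = ShuffleSplit(n_splits=n_repeats, test_size=test_size, random_state=seed)
    aucs = []
    for train_idx, test_idx in outer.split(X):
        inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        search = GridSearchCV(
            RandomForestClassifier(random_state=seed),
            param_grid,
            scoring="roc_auc",  # parameters are chosen by inner-fold AUC
            cv=inner,
        )
        # With the default refit=True, the best setting is refitted on the
        # whole outer training set before it is scored on the outer test set.
        search.fit(X[train_idx], y[train_idx])
        proba = search.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], proba))
    return np.array(aucs)


# Synthetic stand-in data; the grid values are hypothetical.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
demo_aucs = nested_mccv(
    X_demo, y_demo, {"n_estimators": [20], "max_depth": [2, 3]}, n_repeats=3
)
```

Reporting the mean and SD of the returned array corresponds to the "mean (SD)" summaries over test splits.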

Feature Importance

The XGBoost and random forest classifiers can both evaluate the importance of features. Feature importance is a weight assigned to each feature that represents its contribution to the prediction. In XGBoost, feature importance is computed as the number of times that the feature is selected as a tree node. In random forest, feature importance is calculated based on the out-of-bag (OOB) error. The OOB error is the mean prediction error for each training observation, computed using only the trees whose bootstrap sample did not contain that observation. After the values of a feature in the OOB samples are randomly perturbed, a larger increase in the OOB error indicates a more important feature. For both models, each feature obtained 20 feature importance values from the 20 repetitions of the MCCV. We summed all the feature importance values for each feature and ranked the features by this sum.
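The aggregation step (one importance vector per MCCV split, summed and then ranked) can be illustrated as follows. As an assumption made for this sketch, scikit-learn's `permutation_importance` on the held-out split stands in for the OOB-based perturbation measure described above; the two are related but not identical, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import ShuffleSplit


def rank_features(X, y, n_repeats=20, seed=0):
    """Sum per-split importances and return feature indices, best first."""
    totals = np.zeros(X.shape[1])
    splitter = ShuffleSplit(n_splits=n_repeats, test_size=0.2, random_state=seed)
    for train_idx, test_idx in splitter.split(X):
        model = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        # Stand-in for the OOB measure: mean drop in AUC when one feature's
        # values are shuffled in the held-out split.
        result = permutation_importance(
            model, X[test_idx], y[test_idx],
            scoring="roc_auc", n_repeats=5, random_state=seed,
        )
        totals += result.importances_mean
    order = np.argsort(totals)[::-1]
    return order, totals


# Synthetic data in which only columns 0 and 1 carry signal (shuffle=False).
X_imp, y_imp = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=0,
)
order, totals = rank_features(X_imp, y_imp, n_repeats=5)
```

Summing over splits, rather than taking a single fit, makes the ranking less sensitive to any one random partition of the data.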

Model Interpretation

One disadvantage of machine learning is that the model usually runs as a black box. However, it is necessary for a doctor to understand the reasons why a model makes such a prediction in the clinic, especially when timely detection is necessary. Tree-based models can provide feature importance at a global level but not in a specific case. The local interpretable model-agnostic explanations (LIME) algorithm is developed to identify an interpretable model that is locally faithful for each individual prediction [27,28]. It provides relative feature contributions for a single instance of the prediction result. LIME generates neighborhood data by randomly perturbing features from the instance. It then learns locally weighted linear models on this neighborhood data to explain each of the classes in an interpretable way. Parameters in LIME mainly include the maximum number of features in explanation, number of neighborhood samples to generate, and machine learning prediction function. We used the LIME library from the original authors for the model interpretation. The number of neighborhood samples is 5000 by default. The parameter num_features (maximum number of features) is the number of features shown in the explanation. The default value is 10. In our study, the value was 9 for a clear layout, as the contributions of features ranked after 9 were almost zero.
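To make the mechanism concrete, the sketch below implements the core idea from scratch: perturb the instance, query the black-box model, weight the neighbors by proximity, and fit a weighted linear surrogate. It is a simplified stand-in and not the `lime` package used in the study; the Gaussian perturbation, the kernel width, and the ridge penalty are assumptions chosen for illustration, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge


def explain_instance(predict_proba, x, feature_scale, num_samples=5000,
                     num_features=9, seed=0):
    """Return (feature index, local weight) pairs for the positive class."""
    rng = np.random.default_rng(seed)
    # Neighborhood data: Gaussian offsets in standardized units.
    offsets = rng.normal(size=(num_samples, x.size))
    offsets[0] = 0.0  # keep the unperturbed instance itself
    neighbors = x + offsets * feature_scale
    target = predict_proba(neighbors)[:, 1]
    # Exponential proximity kernel; the width is an assumed value.
    kernel_width = 0.75 * np.sqrt(x.size)
    weights = np.exp(-np.sum(offsets ** 2, axis=1) / kernel_width ** 2)
    # Locally weighted linear surrogate of the black-box probability.
    surrogate = Ridge(alpha=1.0).fit(offsets, target, sample_weight=weights)
    top = np.argsort(np.abs(surrogate.coef_))[::-1][:num_features]
    return [(int(j), float(surrogate.coef_[j])) for j in top]


# Synthetic demonstration with a small random forest as the black box.
X_lime, y_lime = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
black_box = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
black_box.fit(X_lime, y_lime)
explanation = explain_instance(
    black_box.predict_proba, X_lime[0], X_lime.std(axis=0),
    num_samples=500, num_features=3,
)
```

A positive weight means that increasing the feature near this instance raises the predicted probability of the positive class, and a negative weight lowers it. The actual library additionally discretizes continuous features, which is how threshold-style statements about a feature arise in its output.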

In our study, we used a submodular pick [27] instead of a random pick to select a diverse, representative set of samples from the test set for nonredundant explanations. For these samples, we then obtained the class probabilities, and each representative individual was assigned an average weight contributed by each feature to display how the classifier made its decision. Finally, we reviewed all these LIME results together with the endocrinologists to decide whether the results of the model could be trusted.

All computation and visualization were performed in Python [19] using packages like Scikit-learn, Pandas, Lime, and Matplotlib.

Figure 1

Training and validation process of prediction models. AUC: area under the receiver operating characteristic curve; XGBoost: extreme gradient boosting.

Results

Subject Characteristics

Among the 1757 girls included in our study, 966 were positive for the GnRHa-stimulation test and diagnosed with CPP; the remaining 791 girls showed a negative response to the test. As shown in Table 1, 16 of the 19 variables were significantly higher in the CPP group than in the non-CPP group (P<.05), whereas prolactin, BMI, and pigmentation were similar in both groups.

Evaluation for Models

First, we developed prediction models with the data of 19 clinical and laboratory variables (Table 1) from all 1757 patients. Two machine learning algorithms, XGBoost and random forest classifiers, were used, and the parameters with the best AUC were selected for each model. The performance and the selected parameters of the models are listed in Table 2, and their receiver operating characteristic (ROC) curves are plotted in Figure 2. Performance was evaluated on the 20 test sets split with MCCV. Both models had strong predictive power, with a specificity of ≥84.32%, a sensitivity of ≥77.91%, and an AUC of ≥0.88. The XGBoost classifier was slightly more effective than the random forest classifier, especially in terms of specificity (P<.01), whereas random forest was much more efficient in terms of computation speed, with less model complexity.
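The three reported metrics can be computed from predicted probabilities as in the sketch below. The 0.5 decision threshold is an assumption (the text does not state the cut-off used), and the labels and scores are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score


def evaluate(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity, and AUC for a binary classifier."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),  # true positives / all positives
        "specificity": tn / (tn + fp),  # true negatives / all negatives
        "auc": roc_auc_score(y_true, y_score),  # threshold independent
    }


# Illustrative labels (1 = positive response) and predicted probabilities.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4, 0.9, 0.7]
metrics = evaluate(labels, scores)
```

In this toy case one negative is scored above the threshold and one positive below it, so sensitivity and specificity are both 0.75, while the AUC is 15/16 because only one positive-negative pair is ranked in the wrong order.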

Table 1

Basic characteristics of girls who underwent the GnRHa-stimulation test.

Variables	Non-CPP^a (n=966), mean (SD)	CPP (n=791), mean (SD)	P value^b
Age (years)	7.07 (1.11)	7.52 (0.99)	<.001
LH^c (IU/L)	0.12 (0.23)	0.93 (1.28)	<.001
FSH^d (IU/L)	1.82 (1.30)	3.01 (1.62)	<.001
GH^e (ng/mL)	3.27 (3.26)	4.75 (4.69)	<.001
IGF-I^f (ng/mL)	231.35 (65.93)	317.87 (89.84)	<.001
IGFBP-3^g (μg/mL)	4.55 (0.52)	4.81 (0.55)	<.001
Estradiol (pmol/L)	102.56 (50.96)	125.81 (60.97)	<.001
Prolactin (ng/mL)	8.73 (5.39)	8.59 (5.61)	.52
Testosterone (nmol/L)	0.80 (0.39)	0.94 (0.49)	<.001
History^h (months)	7.67 (10.39)	9.27 (9.63)	<.001
Menstruation/menarche (yes, no)	N/Aⁱ	N/A	.03
Height^j (cm)	127.16 (8.61)	131.61 (8.42)	<.001
Weight^j (kg)	27.32 (5.32)	29.60 (4.95)	<.001
BMI^k (kg/m²)	16.73 (2.30)	16.91 (1.96)	.34
Breast core (yes, no)	N/A	N/A	.02
Pubes^l (1-5)	1.06 (0.27)	1.14 (0.44)	<.001
Pigmentation (yes, no)	N/A	N/A	.87
Left breast^l (1-5)	2.33 (0.84)	2.76 (0.92)	<.001
Right breast^l (1-5)	2.32 (0.84)	2.78 (0.92)	<.001

^aCPP: central precocious puberty.

^bDifferences between the groups were evaluated with the chi-square test or Student t test. P<.05 was considered significant.

^cLH: luteinizing hormone.

^dFSH: follicle-stimulating hormone.

^eGH: growth hormone.

^fIGF-I: insulin-like growth factor-I.

^gIGFBP-3: insulin-like growth factor binding protein-3.

^hAbnormal duration in records.

ⁱN/A: not applicable.

^jAt stimulation test.

^kBMI: body mass index.

^lTanner stage.

Table 2

Predictive performance of classifiers and the corresponding parameters. A paired t test was performed on specificity and sensitivity for comparison against XGBoost.

Algorithms/Variables		Specificity^a (%), mean (SD)	Sensitivity^b (%), mean (SD)	AUC^c, mean (SD)	Parameters
19 variables, 1757 patients
	XGBoost^d	85.39 (1.38)	77.94 (3.50)	0.89 (0.02)	Learning rate=0.01, max depth=3, number of trees=500
	Random forest	84.32 (1.88)^e	77.91 (3.59)^f	0.88 (0.02)	Max depth=3, criterion=gini, number of trees=20
	SVM^g	88.94 (1.76)^e	62.36 (4.12)^e	0.86 (0.04)	Kernel=linear, penalty coefficient=5
	Decision tree	75.90 (2.47)^e	71.71 (3.99)^e	0.74 (0.02)	Criterion=entropy
19 variables, 436 patients
	XGBoost	83.17 (5.29)	75.28 (6.43)	0.86 (0.04)	Learning rate=0.01, max depth=3, number of trees=500
	Random forest	83.46 (6.28)^f	74.72 (6.43)^f	0.85 (0.04)	Max depth=3, criterion=gini, number of trees=20
	SVM	88.94 (4.90)^e	62.36 (7.73)^e	0.86 (0.02)	Kernel=linear, penalty coefficient=5
	Decision tree	76.25 (7.07)^e	68.06 (7.12)^e	0.72 (0.04)	Criterion=entropy
25 variables, 436 patients
	XGBoost^d	87.66 (5.52)	76.64 (6.51)	0.90 (0.04)	Learning rate=0.01, max depth=4, number of trees=500
	Random forest	87.41 (4.22)^f	75.03 (7.91)^f	0.90 (0.05)	Max depth=3, criterion=entropy, number of trees=20
	SVM	89.81 (4.28)^f	66.53 (7.01)^e	0.86 (0.02)	Kernel=linear, penalty coefficient=5
	Decision tree	76.35 (5.51)^e	68.61 (7.16)^e	0.72 (0.05)	Criterion=entropy
1-3 variables, 1757 patients, XGBoost^d
	LH^h, IGF-Iⁱ, FSH^j	83.17 (1.62)	76.39 (3.57)	0.86 (0.02)	Learning rate=0.01, max depth=3, number of trees=500
	LH^h, IGF-Iⁱ	83.27 (1.62)	75.69 (3.61)	0.86 (0.02)	Learning rate=0.01, max depth=3, number of trees=500
	LH^h, FSH^j	83.56 (1.94)	75.83 (3.13)	0.84 (0.02)	Learning rate=0.01, max depth=3, number of trees=500
	LH^h	83.37 (2.00)	75.97 (3.74)	0.84 (0.02)	Learning rate=0.01, max depth=3, number of trees=500
	IGF-Iⁱ, FSH^j	80.77 (2.47)	57.08 (3.29)	0.77 (0.02)	Learning rate=0.01, max depth=3, number of trees=500
	IGF-Iⁱ	80.19 (3.14)	53.19 (4.55)	0.73 (0.02)	Learning rate=0.01, max depth=3, number of trees=500
	FSH^j	84.13 (3.87)	45.00 (5.34)	0.68 (0.02)	Learning rate=0.01, max depth=3, number of trees=500

^aSpecificity=number of true negatives/(number of true negatives+number of false positives).

^bSensitivity=number of true positives/(number of true positives+number of false negatives).

^cAUC: area under the receiver operating characteristic curve.

^dXGBoost: extreme gradient boosting.

^eP<.01

^fNot significant.

^gSVM: support vector machine.

^hLH: luteinizing hormone.

ⁱIGF-I: insulin-like growth factor-I.

^jFSH: follicle-stimulating hormone.

In the data set, 436 girls had additional examination reports, from which we extracted six variables (see Population and Variables subsection). To investigate whether adding image features could improve prediction, we combined these six variables with the 19 variables and trained and evaluated both the XGBoost and random forest models on the 436 samples, of which 180 patients belonged to the CPP group and 256 belonged to the non-CPP group. For ease of comparison, we retrained the 19-variable models on the same 436 samples. As shown in Table 2, the reduction in sample size lowered model performance, whereas the addition of the six image features improved it. Specifically, for XGBoost on the 436 samples, the specificity increased from 83.17% with 19 variables to 87.66% with 25 variables, the sensitivity increased from 75.28% to 76.64%, and the AUC increased from 0.86 to 0.90. For random forest on the 436 samples, the specificity increased from 83.46% to 87.41%, the sensitivity increased from 74.72% to 75.03%, and the AUC increased from 0.85 to 0.90. Similarly, as seen in the ROC curves shown in Figure 2, XGBoost performed slightly better than the random forest classifier.

To compare performance between ensemble models and nonensemble models, the SVM and decision tree classifiers were used to develop predictive models for the abovementioned settings. The SVM models achieved higher specificities, but their sensitivities were much lower than those of the ensemble models. The decision tree models performed significantly worse on almost all sensitivities, specificities, and AUCs. These results suggest that the ensemble models yield excellent performance while maintaining a good balance between sensitivity and specificity in the prediction of CPP.
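The significance markers in Table 2 come from paired t tests against XGBoost, which is a natural choice when both classifiers are scored on the same test splits. A minimal sketch with SciPy follows; the per-split values are randomly generated placeholders, not the study's results.

```python
import numpy as np
from scipy.stats import ttest_rel


def compare_paired(metric_a, metric_b):
    """Paired t test on per-split metrics from two classifiers."""
    a = np.asarray(metric_a, dtype=float)
    b = np.asarray(metric_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both classifiers must be scored on the same splits")
    statistic, p_value = ttest_rel(a, b)
    return float(statistic), float(p_value)


# Placeholder specificities for 20 splits (synthetic, for illustration only).
rng = np.random.default_rng(0)
spec_a = rng.normal(0.85, 0.02, size=20)
spec_b = spec_a - 0.01 + rng.normal(0.0, 0.005, size=20)
t_stat, p_val = compare_paired(spec_a, spec_b)
```

Pairing matters here because split-to-split variation affects both classifiers together; an unpaired test would treat that shared variation as noise and lose power.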

Feature Importance

We computed the feature importance score for all 19 variables to identify the important features used by the models. The importance of each feature calculated by the models is plotted in Figure 3. In both models, the most important predictive variable was LH level, followed by IGF-I and FSH levels. The fourth most important feature for the random forest model was height, which ranked fifth in the XGBoost model. Prolactin was the fourth most important feature in the XGBoost model, but it contributed little to the random forest model. These data suggest that different machine learning algorithms attach importance to different sets of variables, although they yield similar predictive performance.

In order to further verify the importance of the top 3 features, we constructed XGBoost models with these features individually or in combination (Table 2). As expected, the models using one, two, or three features had poorer performances than the models using all features. The results showed that the higher the feature ranked, the better the corresponding model performed. Interestingly, LH alone or together with IGF-I and FSH is sufficient to predict a response to the GnRHa test with a fairly good performance and an AUC between 0.84 and 0.86. These data support the results from the feature importance calculations.
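This kind of check can be sketched by enumerating every nonempty subset of the top-ranked features and scoring a model on each. In the sketch below, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, plain five-fold cross-validation replaces the nested scheme, and the data and feature names (`f0` to `f4`) are synthetic placeholders; all of these are assumptions made for brevity.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def subset_aucs(X, y, feature_names, top_features, seed=0):
    """Mean cross-validated AUC for every nonempty subset of top_features."""
    column = {name: i for i, name in enumerate(feature_names)}
    results = {}
    for size in range(1, len(top_features) + 1):
        for subset in combinations(top_features, size):
            idx = [column[name] for name in subset]
            model = GradientBoostingClassifier(
                n_estimators=50, max_depth=3, random_state=seed
            )
            scores = cross_val_score(model, X[:, idx], y, cv=5, scoring="roc_auc")
            results[subset] = float(scores.mean())
    return results


# Synthetic data with placeholder feature names.
X_sub, y_sub = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
names = ["f0", "f1", "f2", "f3", "f4"]
auc_by_subset = subset_aucs(X_sub, y_sub, names, ["f0", "f1", "f2"])
```

Three top features give seven subsets, matching the seven single-, two-, and three-feature rows in Table 2.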

Figure 2

ROC curves for classifiers with 19 variables for 1757 patients and 25 variables for 436 patients. ROC: receiver operating characteristic; AUC: area under the ROC curve.

Figure 3

Feature importance ranking for 19 variables in two classifiers calculated by the models. LH: luteinizing hormone; IGF-I: insulin-like growth factor-I; FSH: follicle-stimulating hormone; PRL: prolactin; GH: growth hormone; E2: estradiol; BMI: body mass index; TTE: testosterone; Rbreast: right breast; Lbreast: left breast; IGFBP-3: insulin-like growth factor binding protein-3; PMT: pigmentation; MST: menstruation.

Local Interpretable Model-Agnostic Explanations for Interpretation

A representative set of 200 samples, which accounted for more than 50% of the test set and was a sufficient budget of individual instances for understanding a model, was selected with the submodular pick method [27] for the 19-variable models. LIME was then applied to investigate feature contributions for each prediction. Results with the top 9 features are presented in Figure 4 for one positive sample and one negative sample (more representative samples can be seen in Multimedia Appendix 1). For the positive sample (Figure 4A), XGBoost predicts a positive response to the GnRHa test with a probability of 90%. Only growth hormone supports the negative prediction, whereas LH, prolactin, IGF-I, and information about body development support the positive prediction. This is consistent with the clinical diagnosis of CPP and indicates that we can trust our prediction models to a certain extent. For the negative sample (Figure 4C), left and right breast at Tanner stage 3, FSH level > 2 IU/L, and several other features support the positive prediction, which has a probability of 16%. An LH level of 0.07 IU/L, prolactin level > 9.53 ng/mL, IGF-I level > 220 ng/mL, and age < 7 years contribute to the negative prediction, which has a probability of 84%. Similar results are observed for the random forest classifier (Figure 4B and 4D).

Figure 4

Results of LIME with XGBoost and random forest classifiers applied to one positive (A, B) and one negative (C, D) instance. The left panels show XGBoost and the right panels show random forest. Blue is for the negative instance and orange is for the positive instance. The first column represents the prediction probabilities of negative and positive results obtained from the classifiers. The second column shows the features’ contributions to the probability. Only the top nine features are displayed for clarity. The third column displays the original data values. LIME: local interpretable model-agnostic explanations; XGBoost: extreme gradient boosting; LH: luteinizing hormone; IGF-I: insulin-like growth factor-I; FSH: follicle-stimulating hormone; PRL: prolactin; GH: growth hormone; E2: estradiol; BMI: body mass index; TTE: testosterone; Rbreast: right breast; Lbreast: left breast; IGFBP-3: insulin-like growth factor binding protein-3; PMT: pigmentation; MST: menstruation.

Discussion

Overview

CPP mimics pubertal development at an inappropriately early chronological age. It requires timely detection and treatment to limit its physical and psychological effects on girls. The GnRHa-stimulation test is expensive and time consuming and causes discomfort to patients. Here, we applied machine learning algorithms to multiple clinical variables and built two tree-based ensemble learning classifiers for the prediction of response to the GnRHa-stimulation test. Both the XGBoost and random forest models achieved good performance in distinguishing between positive and negative responses, with the AUC ranging from 0.88 to 0.90, the sensitivity ranging from 77.91% to 77.94%, and the specificity ranging from 84.32% to 87.66%.

Comparisons with Previous Models

Several previous models focused on determining optimal blood sampling time points or appropriate cut-off values to simplify the stimulation test. Kandemir et al [29] found that a single sample of LH tested at the 40th minute after stimulation with a cut-off of 5 IU/L could yield 98% sensitivity and 100% specificity in the diagnosis of CPP. Yazdani et al [30] showed that an LH concentration > 5 IU/L at 3 hours has optimal sensitivity (83%) and specificity (97%). In the study of Çatlı et al [7], 100% sensitivity and 84% specificity were obtained using a cut-off value > 0.24 for peak LH/FSH ratio in girls. Although these models performed better than our models, they had to be used after stimulation and therefore could not avoid the disadvantages of the GnRH/GnRHa test completely.

Some models used only basal sex hormone levels. Yazdani et al [30] found that a basal LH level of >0.1 IU/L, a basal LH/FSH ratio >1, and a basal estradiol level ≥1.5 ng/dL in girls have low sensitivity (10%-67%) but excellent specificity (94%-100%). Çatlı et al [7] also reported models with the basal FSH or LH levels and achieved a sensitivity of 71% and a specificity of 68% or 64%. Pasternak et al [3] reported that basal LH levels ≤ 0.1 IU/L were sufficient to rule out a positive response to the GnRH test with a specificity of 94% but a sensitivity of only 64% in girls. In another model [4], the basal LH level with a cut-off value of 0.35 IU/L was associated with a sensitivity of 63.96% and a specificity of 76.3% based on the ROC curve, with an AUC of 0.77. These results varied widely because of the different settings and sample sizes. In this study, our models showed better performance by using more prestimulation features and a larger homogeneous population, which is, to our knowledge, the largest population in such a study.

Predictive Features

Based on our machine learning models, basal LH, IGF-I, and FSH levels are the top-ranked predictive factors for feature importance in both models. Previous studies have demonstrated that the measurement of LH could be better than that of other sex hormones for initial evaluation of suspected precocious puberty [3,8]. In our study, the LH level ranked first and was much more important than the other variables. Besides LH, FSH, the other indicator monitored in the stimulation test, was selected by the models as the third most important variable. LH and FSH are important in CPP because they are biomarkers of hypothalamic-pituitary-gonadal axis activation, which is the essence of CPP. IGF-I, the second most important variable in our models, is reportedly involved in GnRH regulation [31,32] and is increasingly expressed in girls with CPP [5,9]. Animal studies showed that the IGF-I signaling pathways play important roles in the timing of puberty in females [32]. Although IGF-I has not been considered in previous models, our study suggests that IGF-I may be a valuable marker for diagnosing CPP.

Several studies [12,33,34] suggested that image reports such as pelvic ultrasound and radiography of the hand have adjunct diagnostic value in CPP diagnosis but do not provide reliable differentiation alone. Here, we found that adding features from the image reports improved the prediction results. The models built on 1757 samples performed better than those built on 436 samples, suggesting a sample size effect. Interestingly, with the six additional image variables, the models built on 436 samples offset this sample size effect. This suggests that more samples with image features will produce better results. Thus, medical image examinations such as bone age radiography should be considered before the GnRHa-stimulation test for girls with suspected CPP.

Interpretations of Models

We noticed that more variables were assigned a moderate feature importance value in the XGBoost model than in the random forest model. This is reasonable given the different algorithms the two models use for prediction and importance evaluation. In XGBoost, trees are built sequentially in a boosting manner to enhance the overall performance. Feature importance is estimated explicitly from the trained model as the frequency with which a feature is selected as a tree node. In contrast, in random forest, trees are trained in parallel on bootstrap samples and vote for the final decision. Feature importance is estimated implicitly by permuting the feature’s values and calculating the change in the model’s prediction error. Thus, XGBoost incorporates the contribution of each feature to each tree into the feature importance, whereas random forest evaluates each feature only globally, without tree-specific contributions. It should be noted that different combinations of variables may produce models with similar predictive accuracy, which relates to the uncertainty analysis of solutions in any decision-making problem [35-37]. This is not rare in machine learning models in medicine [38-40]. Moreover, the features that are most important in the clinic, such as basal LH, IGF-I, and FSH levels, were all identified by both models, demonstrating that both models are reliable and effective in predicting the response to the GnRHa test.
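The two importance measures contrasted above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the XGBoost or random forest implementation used in this study: the tree representation (nested dictionaries), the toy classifier, the feature names, and all data values are assumptions made for the example.

```python
import random
from collections import Counter


def split_frequency_importance(trees):
    """Share of split nodes that use each feature, across all trees."""
    counts = Counter()

    def visit(node):
        if "feature" not in node:  # leaf node
            return
        counts[node["feature"]] += 1
        visit(node["left"])
        visit(node["right"])

    for tree in trees:
        visit(tree)
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}


def error_rate(predict, rows, labels):
    wrong = sum(1 for row, label in zip(rows, labels) if predict(row) != label)
    return wrong / len(labels)


def permutation_importance(predict, rows, labels, feature, n_repeats=20, seed=0):
    """Mean increase in error rate after shuffling one feature's column."""
    rng = random.Random(seed)
    baseline = error_rate(predict, rows, labels)
    increases = []
    for _ in range(n_repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        permuted = [dict(row, **{feature: v}) for row, v in zip(rows, column)]
        increases.append(error_rate(predict, permuted, labels) - baseline)
    return sum(increases) / n_repeats


# Hypothetical two-tree ensemble: "lh" is split on twice, "igf1" once.
leaf = {"value": 0.0}
toy_trees = [
    {"feature": "lh", "left": leaf,
     "right": {"feature": "igf1", "left": leaf, "right": leaf}},
    {"feature": "lh", "left": leaf, "right": leaf},
]
frequency = split_frequency_importance(toy_trees)


# Hypothetical classifier that looks only at "lh" and ignores "bmi".
def toy_predict(row):
    return 1 if row["lh"] > 0.3 else 0


toy_rows = [{"lh": lh, "bmi": bmi} for lh, bmi in
            [(0.1, 15), (0.2, 18), (0.5, 16), (0.9, 17),
             (0.05, 19), (0.6, 14), (0.15, 16), (0.8, 18)]]
toy_labels = [toy_predict(row) for row in toy_rows]

lh_importance = permutation_importance(toy_predict, toy_rows, toy_labels, "lh")
bmi_importance = permutation_importance(toy_predict, toy_rows, toy_labels, "bmi")
```

In this sketch, the frequency measure credits every split in every tree, whereas the permutation measure assigns zero importance to a feature the model never uses, which mirrors the per-tree versus global distinction described above.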

To give endocrinologists trustworthy insight into the prediction models, we also used LIME to show how each feature contributes to the predicted probability for an individual patient. The most important features used by the models for individual predictions have been proven to be significant in the clinic [3-5,7,9,10], demonstrating that our models are credible. This greatly increases the interpretability of the machine learning models and makes them convenient for individualized diagnosis in the clinic.
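The idea behind a local explanation can be sketched as follows: perturb one instance, weight the perturbed samples by their proximity to it, and fit a weighted linear surrogate whose coefficients serve as per-feature contributions. This is a simplified stand-in for the general approach, not the LIME package or the configuration used in this study. The function names, the Gaussian perturbation, the kernel, and the toy black-box model are all assumptions made for illustration.

```python
import math
import random


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def explain_locally(predict, instance, n_samples=500, scale=1.0,
                    kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around `instance`.

    Returns (intercept, coefficients). Features are centered on the
    instance, so the intercept approximates the prediction at that point.
    """
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        sample = [v + rng.gauss(0.0, scale) for v in instance]
        offsets = [s - v for s, v in zip(sample, instance)]
        dist2 = sum(o * o for o in offsets)
        weights.append(math.exp(-dist2 / kernel_width ** 2))
        rows.append([1.0] + offsets)
        targets.append(predict(sample))
    k = len(instance) + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * y for w, r, y in zip(weights, rows, targets))
            for i in range(k)]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[0], beta[1:]


# Hypothetical black-box model; linear so the surrogate is easy to check.
def toy_model(x):
    return 1.0 + 2.0 * x[0] - 3.0 * x[1]


intercept, coefficients = explain_locally(toy_model, [1.0, 2.0])
```

Because the toy model is itself linear, the surrogate should recover its slopes. For a nonlinear classifier, the coefficients describe the model only in the neighborhood of the chosen instance.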

Limitations

There are some limitations to this study. First, growth velocity is closely related to physical development. Because height and weight growth rates were not available, we did not include growth velocity in our feature set. In further research, we will focus more on medical imaging and growth velocity to identify their diagnostic value for CPP. Second, our work included only girls with suspected CPP from a single center in China. The prediction models in this study may not be suitable for populations in other regions or countries. Third, manual inspection of values extracted from free text through regular expression matching could reduce errors and improve model performance. However, this adds a considerable amount of work and thus reduces the scalability of the model. We are improving the matching algorithm with the manually inspected data to increase the level of model automation. Finally, features generated from laboratory results are more complete than those extracted from free text, which may affect the ranking of feature importance. More effort is required to enhance the data quality of unstructured features in the future.
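As a sketch of the kind of regular expression matching mentioned in the third limitation, the example below extracts one numeric value from free text and flags implausible or missing values for manual inspection. The pattern, the English phrasing of the report, the feature chosen (uterine length), and the plausibility range are hypothetical and are not the rules used in this study.

```python
import re

# Hypothetical pattern: "uterine length", optional ":" or "=", a number, a unit.
UTERINE_LENGTH = re.compile(
    r"uterine\s+length\s*[:=]?\s*(\d+(?:\.\d+)?)\s*(cm|mm)",
    re.IGNORECASE,
)


def extract_uterine_length_cm(report_text):
    """Return the uterine length in cm, or None if no match is found."""
    match = UTERINE_LENGTH.search(report_text)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2).lower() == "mm":
        value /= 10.0  # convert millimeters to centimeters
    return value


def needs_manual_review(value_cm, low=1.0, high=10.0):
    """Flag missing values and values outside an assumed plausible range."""
    return value_cm is None or not (low <= value_cm <= high)


reports = [
    "Pelvic ultrasound: uterine length 3.2 cm, no dominant follicle.",
    "Uterine length: 35mm. Ovaries unremarkable.",
    "Uterus not clearly visualized.",
]
for text in reports:
    length = extract_uterine_length_cm(text)
    print(length, needs_manual_review(length))
```

Routing only the flagged records to a human reviewer is one way to limit the manual workload while still catching the most obvious extraction errors.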

Conclusions

Our study is the first to apply both machine learning algorithms and an explanation method to the diagnosis of CPP. Our models can predict the response to the stimulation test before GnRHa injection in girls with suspected CPP and thus may be used as a prescreening tool to help physicians make decisions in conjunction with the GnRHa-stimulation test.

Multimedia Appendix 1

Results of LIME with XGBoost and Random Forest classifiers applied to four positive and four negative instances. LIME: local interpretable model-agnostic explanations.

Abbreviations

AUC

area under the receiver operating characteristic curve

BMI

body mass index

CPP

central precocious puberty

FSH

follicle-stimulating hormone

GnRH

gonadotropin-releasing hormone

GnRHa

gonadotropin-releasing hormone analogues

IGF-I

insulin-like growth factor-I

IGFBP-3

insulin-like growth factor binding protein-3

LH

luteinizing hormone

LIME

local interpretable model-agnostic explanations

MCCV

Monte Carlo cross-validation

OOB

out-of-bag

ROC

receiver operating characteristic

SVM

support vector machine

XGBoost

extreme gradient boosting

Acknowledgments

This work was supported by Guangzhou Institute of Pediatrics, Guangzhou Women and Children’s Medical Center (Grant #KCP-2016-002).

Authors’ Contributions

HL and XL proposed the project and provided supervision. LP was responsible for data collection, project implementation, and manuscript drafting and revision. GL implemented the project and performed manuscript editing. XM provided clinical guidance and revised the manuscript. HL conducted part of the data collection and statistical analysis. JZ helped collect part of the data. All authors reviewed the manuscript in its final form.

Conflicts of Interest

None declared.

References

1. Abreu, Kaiser. Pubertal development and regulation. Lancet Diabetes Endocrinol 2016 Mar;4(3):254-264. doi: 10.1016/S2213-8587(15)00418-0. PMID: 26852256. pii: S2213-8587(15)00418-0. PMCID: PMC5192018.

2. Houk, Kunselman AR, Lee PA. The diagnostic value of a brief GnRH analogue stimulation test in girls with central precocious puberty: a single 30-minute post-stimulation LH sample is adequate. J Pediatr Endocrinol Metab 2008 Dec;21(12):1113-8. PMID: 19189683.

3. Pasternak, Friger, Loewenthal, Haim, Hershkovitz. The utility of basal serum LH in prediction of central precocious puberty in girls. Eur J Endocrinol 2012 Feb;166(2):295-9. doi: 10.1530/EJE-11-0720. PMID: 22084156. pii: EJE-11-0720.

4. Ding, Yang, Shen, Huang, Liu. Evaluation of basal sex hormone levels for activation of the hypothalamic-pituitary-gonadal axis. J Pediatr Endocrinol Metab 2018 Mar 28;31(3):323-329. doi: 10.1515/jpem-2017-0124. PMID: 29466239. pii: /j/jpem.ahead-of-print/jpem-2017-0124/jpem-2017-0124.xml.

5. Sørensen, Aksglaede, Petersen, Andersson, Juul. Serum IGF1 and insulin levels in girls with normal and precocious puberty. Eur J Endocrinol 2012 May;166(5):903-10. doi: 10.1530/EJE-12-0106. PMID: 22379117. pii: EJE-12-0106.

6. Houk, Kunselman, Lee. Adequacy of a single unstimulated luteinizing hormone level to diagnose central precocious puberty in girls. Pediatrics 2009 Jun;123(6):e1059-63. doi: 10.1542/peds.2008-1180. PMID: 19482738. pii: 123/6/e1059.

7. Çatlı, Erdem, Anık, Abacı, Böber. Clinical and laboratory findings in the differential diagnosis of central precocious puberty and premature thelarche. Turk Pediatri Ars 2015 Mar;50(1):20-6. doi: 10.5152/tpa.2015.2281. PMID: 26078693. pii: tpa-50-1-20. PMCID: PMC4462323.

8. Mogensen, Aksglaede, Mouritsen, Sørensen, Main, Gideon, Juul. Diagnostic work-up of 449 consecutive girls who were referred to be evaluated for precocious puberty. J Clin Endocrinol Metab 2011 May;96(5):1393-401. doi: 10.1210/jc.2010-2745. PMID: 21346077. pii: jc.2010-2745.

9. Nam, Rhie, Son, Park, Lee. Factors to predict positive results of gonadotropin releasing hormone stimulation test in girls with suspected precocious puberty. J Korean Med Sci 2012 Feb;27(2):194-9. doi: 10.3346/jkms.2012.27.2.194. PMID: 22323868. PMCID: PMC3271294.

10. Suh, Choi, Kwon, Kim, Jeong, Ahn, Chae, Kim. Factors that predict a positive response on gonadotropin-releasing hormone stimulation test for diagnosing central precocious puberty in girls. Ann Pediatr Endocrinol Metab 2013 Dec;18(4):202-7. doi: 10.6065/apem.2013.18.4.202. PMID: 24904878. PMCID: PMC4027085.

11. Advanced bone age as an indicator facilitates the diagnosis of precocious puberty. J Pediatr (Rio J) 2018;94(1):69-75. doi: 10.1016/j.jped.2017.03.010. PMID: 28866322. pii: S0021-7557(16)30375-8.

12. Lee, Joo, Lee, Jun, Kim. The Diagnostic Value of Pelvic Ultrasound in Girls with Central Precocious Puberty. Chonnam Med J 2016 Jan;52(1):70-4. doi: 10.4068/cmj.2016.52.1.70. PMID: 26866003. PMCID: PMC4742613.

13. Lee, Choi, Fan, Jang, Kim, Hwang. Evaluation of characteristics of the craniofacial complex and dental maturity in girls with central precocious puberty. Angle Orthod 2018 Apr 30. doi: 10.2319/112317-809.1. PMID: 29708396.

14. Baik, Choi, Kim. Predictive Value of Dental Maturity for a Positive Gonadotropin-Releasing Hormone Stimulation Test Result in Girls with Precocious Puberty. J Korean Med Sci 2017 Feb;32(2):296-302. doi: 10.3346/jkms.2017.32.2.296. PMID: 28049241. pii: 32.296. PMCID: PMC5219996.

15. Subspecialty Group of Endocrinologic, Hereditary and Metabolic Diseases, the Society of Pediatrics, Chinese Medical Association; Editorial Board, Chinese Journal of Pediatrics. [Consensus statement for the diagnosis and treatment of central precocious puberty (2015)]. Zhonghua Er Ke Za Zhi 2015 Jun;53(6):412-8. PMID: 26310550.

16. Carel, Eugster, Rogol, Ghizzoni, Palmert; ESPE-LWPES GnRH Analogs Consensus Conference Group (Antoniazzi, Berenbaum, Bourguignon, Chrousos, Coste, Deal, Foster, Heger, Holland, Jahnukainen, Juul, Kaplowitz, Lahlou, Lee, Merke, Neely, Oostdijk, Phillip, Rosenfield, Shulman, Styne, Tauber, Wit). Consensus statement on the use of gonadotropin-releasing hormone analogs in children. Pediatrics 2009 Apr;123(4):e752-62. doi: 10.1542/peds.2008-1783. PMID: 19332438. pii: peds.2008-1783.

17. Ibáñez, Potau, Zampolli, Virdis, Gussinyé, Carrascosa, Saenger, Vicens-Calvet. Use of leuprolide acetate response patterns in the early diagnosis of pubertal disorders: comparison with the gonadotropin-releasing hormone test. J Clin Endocrinol Metab 1994 Jan;78(1):30-5. doi: 10.1210/jcem.78.1.7507123. PMID: 7507123.

18. Bayley, Pinneau. Tables for predicting adult height from skeletal age: Revised for use with the Greulich-Pyle hand standards. The Journal of Pediatrics 1952 Apr;40(4):423-441. doi: 10.1016/S0022-3476(52)80205-7.

19. Python. URL: https://www.python.org/ [accessed 2019-01-22].

20. Chen, Guestrin. XGBoost: A Scalable Tree Boosting System. 2016 Aug 13 Presented at: KDD '16 Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13-17, 2016; San Francisco, CA, USA. doi: 10.1145/2939672.2939785.

21. Torlay, Perrone-Bertolotti, Thomas, Baciu. Machine learning-XGBoost analysis of language networks to classify patients with epilepsy. Brain Inform 2017 Sep;4(3):159-169. doi: 10.1007/s40708-017-0065-7. PMID: 28434153. PMCID: PMC5563301.

22. Yao, Calhoun, Sui. An ensemble learning system for a 4-way classification of Alzheimer's disease and mild cognitive impairment. J Neurosci Methods 2018 May 15;302:75-81. doi: 10.1016/j.jneumeth.2018.03.008. PMID: 29578038. pii: S0165-0270(18)30079-7.

23. Liaw, Wiener. Classification and regression by randomForest. R News 2002 Dec. URL: https://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf [accessed 2019-01-22].

24. Hearst, Dumais, Osuna, Platt, Scholkopf. Support vector machines. IEEE Intell Syst Their Appl 1998 Jul 10;13(4):18-28. doi: 10.1109/5254.708428. PMID: 21889629. pii: S0003-2670(11)00968-8.

25. Safavian, Landgrebe. A survey of decision tree classifier methodology. IEEE Trans Syst Man Cybern 1991;21(3):660-674. doi: 10.1109/21.97458.

26. Liang. Monte Carlo cross validation. Chemometrics and Intelligent Laboratory Systems 2001 Apr;56(1):1-11. doi: 10.1016/S0169-7439(00)00122-2.

27. Ribeiro, Singh, Guestrin. Why should I trust you? Explaining the predictions of any classifier. 2016 Aug Presented at: KDD '16 Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13-17, 2016; San Francisco, California, USA. doi: 10.1145/2939672.2939778.

28. Liu, Kumara, Reich. Explainable data-driven modeling of patient satisfaction survey data. 2017 Dec Presented at: 2017 IEEE International Conference on Big Data; December 11-14, 2017; Boston, MA, USA. IEEE. doi: 10.1109/BigData.2017.8258391.

29. Kandemir, Demirbilek, Özön, Gönç, Alikaşifoğlu. GnRH stimulation test in precocious puberty: single sample is adequate for diagnosis and dose adjustment. J Clin Res Pediatr Endocrinol 2011 Mar;3(1):12-7. doi: 10.4274/jcrpe.v3i1.03. PMID: 21448328. PMCID: PMC3065310.

30. Yazdani, Lin, Raman, Haymond. A single sample GnRHa stimulation test in the diagnosis of precocious puberty. Int J Pediatr Endocrinol 2012 Jul 18;2012(1):23. doi: 10.1186/1687-9856-2012-23. PMID: 22809285. pii: 1687-9856-2012-23. PMCID: PMC3438042.

31. Wolfe, Divall. The regulation of reproductive neuroendocrine function by insulin and insulin-like growth factor-1 (IGF-1). Front Neuroendocrinol 2014 Oct;35(4):558-72. doi: 10.1016/j.yfrne.2014.05.007. PMID: 24929098. pii: S0091-3022(14)00054-5. PMCID: PMC4175134.

32. Divall, Williams, Carver, Koch, Brüning, Kahn, Wondisford, Radovick, Wolfe. Divergent roles of growth factors in the GnRH regulation of puberty in mice. J Clin Invest 2010 Aug;120(8):2900-9. doi: 10.1172/JCI41069. PMID: 20628204. pii: 41069. PMCID: PMC2912185.

33. de Vries, Phillip. Role of pelvic ultrasound in girls with precocious puberty. Horm Res Paediatr 2011 Feb;75(2):148-52. doi: 10.1159/000323361. PMID: 21228561. pii: 000323361.

34. Eksioglu, Yilmaz, Cetinkaya, Cinar, Yildiz, Aycan. Value of pelvic sonography in the diagnosis of various forms of precocious puberty in girls. J Clin Ultrasound 2013 Feb;41(2):84-93. doi: 10.1002/jcu.22004. PMID: 23124596.

35. Fernández Martínez, Fernández Muñiz, Tompkins. On the topography of the cost functional in linear and nonlinear inverse problems. Geophysics 2012 Jan;77(1):W1-W15. doi: 10.1190/geo2011-0341.1.

36. Polikar. Ensemble based systems in decision making. IEEE Circuits Syst Mag 2006 Sep;6(3):21-45. doi: 10.1109/MCAS.2006.1688199.

37. Fernández-Martínez, Fernández-Muñiz, Pallero, Pedruelo-González. From Bayes to Tarantola: New insights to understand uncertainty in inverse problems. Journal of Applied Geophysics 2013 Nov;98:62-72. doi: 10.1016/j.jappgeo.2013.07.005.

38. deAndrés-Galiana, Fernández-Martínez, Luaces, Del CJJ, Huergo-Zapico, Acebes-Huerta, González, González-Rodríguez. Analysis of clinical prognostic variables for Chronic Lymphocytic Leukemia decision-making problems. J Biomed Inform 2016 Apr;60:342-51. doi: 10.1016/j.jbi.2016.02.017. PMID: 26956213. pii: S1532-0464(16)00039-3.

39. Kosmicki, Sochat, Duda, Wall. Searching for a minimal set of behaviors for autism detection through feature selection-based machine learning. Transl Psychiatry 2015 Feb 24;5:e514. doi: 10.1038/tp.2015.7. PMID: 25710120. pii: tp20157. PMCID: PMC4445756.

40. Liu, Wang, Zhuang, Liang, Lin, Pan, Zeng, Cao, Zhao, Xia. Developing a Machine Learning System for Identification of Severe Hand, Foot, and Mouth Disease from Electronic Medical Record Data. Sci Rep 2017 Nov 27;7(1):16341. doi: 10.1038/s41598-017-16521-z. PMID: 29180702. PMCID: PMC5703994.