This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.

Although shock wave lithotripsy (SWL) has become one of the most common treatment approaches for nephrolithiasis in recent decades, its treatment planning is often a trial-and-error process based on physicians’ subjective judgment. Physicians’ inexperience with this modality can lead to low-quality treatment and unnecessary risks to patients.

To improve the quality and consistency of shock wave lithotripsy treatment, we aimed to develop a deep learning model that generates the next treatment step from previous steps and preoperative patient characteristics, and to produce personalized SWL treatment plans in a step-by-step protocol based on the deep learning model.

We developed a deep learning model to generate the optimal power level, shock rate, and number of shocks in the next step, given previous treatment steps encoded by long short-term memory neural networks and preoperative patient characteristics. We constructed a next-step dataset (N=8583) from top practices of renal SWL treatments recorded in the International Stone Registry. Then, we trained the deep learning model and baseline models (linear regression, logistic regression, random forest, and support vector machine) with 90% of the samples and validated them with the remaining samples.

The deep learning models for generating the next treatment steps outperformed the baseline models (accuracy = 98.8%, F1 = 98.0% for power levels; accuracy = 98.1%, F1 = 96.0% for shock rates; root mean squared error = 207, mean absolute error = 121 for numbers of shocks). The hypothesis testing showed no significant difference between steps generated by our model and the top practices (

The high performance of our deep learning approach demonstrates a treatment planning capability on par with that of top physicians. To the best of our knowledge, our framework is the first effort to implement automated planning of SWL treatment via deep learning. It is a promising technique for assisting treatment planning and physician training at low cost.

Shock wave lithotripsy (SWL, or extracorporeal shock wave lithotripsy) has been considered a safe and effective noninvasive treatment option for nephrolithiasis since its introduction in the early 1980s [

Given such risks, previous studies have identified proper patient selection, modifications in treatment technique, and employment of adjunctive measures as elements to improve SWL outcomes [

Appropriate control over shock wave delivery has a strong impact on treatment success and the minimization of complications. A treatment plan for shock wave delivery is a series of shock wave delivery steps, each with a specified power level, shock rate, and number of shocks; a successful sample SWL treatment plan is shown in

A sample shock wave lithotripsy (SWL) treatment plan.

Shock wave delivery steps | Power level | Shock rate (per minute) | Number of shocks |

Step 1 | 1 | 120 | 100 |

Step 2 | 2 | 120 | 100 |

Step 3 | 3 | 120 | 100 |

Step 4 | 4 | 120 | 100 |

Step 5 | 5 | 120 | 100 |

Step 6 | 6 | 120 | 100 |

Step 7 | 7 | 120 | 100 |

Step 8 | 8 | 120 | 2300 |

Effective fragmentation leads to fewer shocks overall and therefore less damage to tissue [

Although the strength, rate, and total number of shock waves are identified as important factors in SWL treatment outcomes, there is no case-by-case guideline that helps physicians optimize shock wave delivery protocols while taking into account patient demographics and stone characteristics. The optimal energy delivery strategy remains controversial. In vitro and in vivo studies suggest that ramping up shock wave energy improves fragmentation and stone clearance and limits renal damage, but clinical results are discordant [

As a result, SWL success rates are significantly different among physicians.

Percentiles of treatment success rates.

Percentiles | Treatment success rates, % |

Minimum | 54.5 |

10th percentile | 74.8 |

20th percentile | 79.1 |

30th percentile | 82.6 |

40th percentile | 84.7 |

50th percentile | 86.6 |

60th percentile | 88.9 |

70th percentile | 91.4 |

80th percentile | 94.3 |

90th percentile | 100 |

Maximum | 100 |

Machine learning techniques have been applied in the planning process of high-quality personalized treatments, such as radiation therapies [

To train and evaluate our models, we used a dataset of renal treatments with Storz SLX-T from the International Stone Registry provided by Translational Analytics and Statistics. Each treatment consisted of PPC and several treatment steps (ie, ternaries of a power level, a shock rate, and number of shocks). The power level ranged from 1 to 9. The options for shock rates were 60, 90, 120, and 180 shocks per minute. The maximum number of shocks was typically set at 3000 for renal stones. The PPC in our dataset included patient gender, age, stone location (one-hot encoding), stone size, mean arterial pressure before treatment, anticoagulant use, sedation use, whether multiple stones existed, and whether strapping was applied.
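As an illustration of how one treatment step could be vectorized for the models, the sketch below one-hot encodes the power level (1-9) and shock rate (60, 90, 120, or 180 per minute) and normalizes the number of shocks by the typical 3000-shock cap. The encoding scheme and normalization constant are our assumptions for illustration, not the authors' exact implementation.

```python
# Hypothetical encoding of one (power level, shock rate, number of shocks)
# ternary; the one-hot scheme and MAX_SHOCKS normalization are assumptions.

POWER_LEVELS = list(range(1, 10))  # power levels 1..9
SHOCK_RATES = [60, 90, 120, 180]   # shocks per minute
MAX_SHOCKS = 3000                  # typical cap for renal stones

def encode_step(power_level, shock_rate, num_shocks):
    """Encode one treatment step as a flat feature vector."""
    power_onehot = [1.0 if p == power_level else 0.0 for p in POWER_LEVELS]
    rate_onehot = [1.0 if r == shock_rate else 0.0 for r in SHOCK_RATES]
    return power_onehot + rate_onehot + [num_shocks / MAX_SHOCKS]

# 9 power dims + 4 rate dims + 1 normalized shock count = 14 dims
vec = encode_step(3, 120, 100)
```

Categorical PPC fields (eg, stone location) would be one-hot encoded the same way before concatenation.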

Our deep learning models were trained with the best treatment plans to obtain the best planning capability. We selected 54 physicians in the top quartile of treatment success rates; these physicians had treatment success rates above 91.4%. Then, we selected their successful treatment cases with no reported complications, in which patients were stone free or had fragments ≤4 mm that typically pass on their own without further treatment. We identified 1216 cases in total and assumed these cases represent the best practices in SWL treatment planning.

We then built the step dataset from the identified successful cases to train and evaluate the step generation model. We identified steps by power level change or shock rate change and limited the number of shocks to 1000 for each step, a natural step length in previous literature [

Then, we exhaustively decomposed each case into samples by step for the step generation task, where the ternary of each step was generated by its previous steps and PPC. An
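The exhaustive decomposition described above can be sketched as follows: a case with n steps yields n-1 samples, each pairing a prefix of previous steps with the next step as the target. The list-of-tuples representation is an assumption for illustration.

```python
# Decompose one treatment case into (previous steps, next step) samples.
# Each step is a (power_level, shock_rate, num_shocks) ternary; PPC would
# be attached to every sample of the case.

def decompose_case(steps):
    """Return one sample per step that has at least one previous step."""
    samples = []
    for i in range(1, len(steps)):
        samples.append({"previous": steps[:i], "target": steps[i]})
    return samples

case = [(1, 120, 100), (2, 120, 100), (3, 120, 1000)]
samples = decompose_case(case)  # a 3-step case yields 2 samples
```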

In total, we constructed 8583 samples for step generation. We randomly chose 90% of the samples for model training and used the remaining samples for validation. In the data split, we ensured that samples from the same treatment case were contained only within the same split.
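A minimal sketch of such a case-level split, assuming each sample carries a hypothetical case_id key; whole cases are assigned to one split so that no treatment case leaks across the train/validation boundary.

```python
import random

def split_by_case(samples, train_frac=0.9, seed=0):
    """Assign whole cases to train or validation ('case_id' is an assumed key)."""
    case_ids = sorted({s["case_id"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(case_ids)
    n_train = int(len(case_ids) * train_frac)
    train_cases = set(case_ids[:n_train])
    train = [s for s in samples if s["case_id"] in train_cases]
    valid = [s for s in samples if s["case_id"] not in train_cases]
    return train, valid

# Toy data: 10 cases with 2 samples each.
samples = [{"case_id": i // 2, "step": i % 2} for i in range(20)]
train, valid = split_by_case(samples)
```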

We first built deep neural networks to separately generate power levels, shock rates, and numbers of shocks for the next steps, given previous steps and PPC. For a sample whose previous steps are s_1, s_2, ..., s_n, the steps were encoded sequentially by long short-term memory (LSTM) networks:

(h_i, c_i) = LSTM(s_i, h_{i-1}, c_{i-1}), i = 1, 2, ..., n

where the initial values h_0 and c_0 are zero vectors, s_i is the vector of the ith step, and h_i and c_i are the hidden and cell states of the LSTM after encoding s_i.

The framework for automated shock wave lithotripsy (SWL) treatment planning. LSTM: long short-term memory; PPC: preoperative patient characteristics; ReLU: rectified linear unit.

Then, the encoded previous steps were concatenated to the PPC vectors and fed to deep neural networks. In our implementation, we used 2 fully connected layers with rectified linear unit (ReLU) activation functions, because ReLU functions are nonsaturating and make the model less likely to overfit [

where h_n denotes the LSTM encoding of the n previous steps.
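The forward pass described above can be sketched in plain Python: an LSTM encodes the previous steps starting from zero states, the final hidden state is concatenated with the PPC vector, and two fully connected layers with ReLU produce the output. The dimensions and random placeholder weights are illustrative assumptions; a real implementation would use a deep learning framework with trained parameters.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, W):
    """One LSTM update; W maps the concatenation [x; h] to 4*n gate pre-activations."""
    n = len(h)
    xh = x + h
    pre = [sum(w * v for w, v in zip(row, xh)) for row in W]
    i = [sigmoid(p) for p in pre[:n]]           # input gate
    f = [sigmoid(p) for p in pre[n:2 * n]]      # forget gate
    o = [sigmoid(p) for p in pre[2 * n:3 * n]]  # output gate
    g = [math.tanh(p) for p in pre[3 * n:]]     # candidate cell state
    c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
    h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
    return h_new, c_new

def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, W):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(steps, ppc, W_lstm, W1, W2, hidden):
    h = [0.0] * hidden  # h_0: zero vector
    c = [0.0] * hidden  # c_0: zero vector
    for x in steps:     # encode the previous steps in order
        h, c = lstm_step(x, h, c, W_lstm)
    z = relu(linear(h + ppc, W1))  # fully connected layer 1 + ReLU
    return linear(z, W2)           # fully connected layer 2 (output head)

random.seed(0)
step_dim, ppc_dim, hidden, out_dim = 3, 5, 4, 9
rand = lambda rows, cols: [[random.uniform(-0.1, 0.1) for _ in range(cols)]
                           for _ in range(rows)]
W_lstm = rand(4 * hidden, step_dim + hidden)
W1 = rand(8, hidden + ppc_dim)
W2 = rand(out_dim, 8)
# Two (normalized) previous steps and a 5-dimensional PPC vector.
out = forward([[0.1, 0.6, 0.03], [0.2, 0.6, 0.03]], [0.0] * ppc_dim,
              W_lstm, W1, W2, hidden)
```

A 9-dimensional output head like this would yield logits over the 9 power levels; the shock rate and shock number heads would differ only in their output dimension.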

We hypothesized that the deep learning approach is comparable to the treatment practices of top physicians and that it outperforms machine learning models that do not take treatment sequences as inputs. Thus, we compared the performance of the deep learning model with that of conventional machine learning models.

Three classical machine learning approaches were selected as baselines for generating power level, shock rate, and number of shocks, respectively. We used logistic regression, random forest classifier (RFC), and support vector classifier (SVC) as the baseline models for power level generation and shock rate generation. We chose linear regression, random forest regression (RFR), and support vector regression (SVR) as the baseline models to generate the number of shocks. As these baseline models could not be fed with sequential data directly, the features for the baseline models were (1) the average power level, average shock rate, and average number of shocks in previous steps; (2) the power level, shock rate, and number of shocks in the last step; and (3) PPC.
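The flattened feature vector fed to these non-sequential baselines can be sketched as follows: the concatenation of the per-step averages, the last step, and the PPC vector.

```python
# Flatten a variable-length step sequence into fixed-size baseline features:
# (1) averages over previous steps, (2) the last step, (3) the PPC vector.

def baseline_features(previous_steps, ppc):
    """previous_steps: list of (power, rate, shocks) ternaries; ppc: floats."""
    n = len(previous_steps)
    avg = [sum(s[k] for s in previous_steps) / n for k in range(3)]
    last = list(previous_steps[-1])
    return avg + last + list(ppc)

feats = baseline_features([(1, 120, 100), (2, 120, 100)], [0.5, 1.0])
# → [1.5, 120.0, 100.0, 2, 120, 100, 0.5, 1.0]
```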

We trained the deep learning models and baseline models with 90% of the samples. Then, we validated them with the remaining samples and calculated evaluation metrics. In the multiclass tasks of power level generation and shock rate generation, we used accuracy, macro-averaged precision, macro-averaged recall, and macro-averaged F1 score as the evaluation metrics [, where c_{ij} denotes the number of samples in category i that are predicted as category j. The precision and recall of category i are

precision_i = c_{ii} / Σ_k c_{ki}, recall_i = c_{ii} / Σ_k c_{ik}

Macro-averaged precision and recall are the averages of the precisions and recalls over all categories. The F1 score of category i is the harmonic mean of its precision and recall,

F1_i = 2 × precision_i × recall_i / (precision_i + recall_i)

and the macro-averaged F1 score is defined as the average of the F1 scores over all categories.
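These classification metrics can be computed from a confusion matrix; a minimal sketch, where c[i][j] counts samples of category i predicted as category j:

```python
def macro_metrics(c):
    """Return (accuracy, macro precision, macro recall, macro F1) from
    confusion matrix c, where c[i][j] = samples of class i predicted as j."""
    k = len(c)
    precisions, recalls, f1s = [], [], []
    for i in range(k):
        pred_i = sum(c[j][i] for j in range(k))  # predicted as class i
        true_i = sum(c[i])                       # truly class i
        p = c[i][i] / pred_i if pred_i else 0.0
        r = c[i][i] / true_i if true_i else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    total = sum(sum(row) for row in c)
    accuracy = sum(c[i][i] for i in range(k)) / total
    return accuracy, sum(precisions) / k, sum(recalls) / k, sum(f1s) / k

# Toy 2-class example.
acc, prec, rec, f1 = macro_metrics([[8, 2], [1, 9]])
```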

Because the number of shocks is an integer, we used the root mean squared error (RMSE) and mean absolute error (MAE) as the metrics to evaluate the models generating the number of shocks and to measure the average magnitude of errors. Finally, we conducted paired t tests to examine whether the steps generated by each model differed significantly from the ground truth.
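The regression metrics and the paired t statistic can be sketched as follows (here applied to the number of shocks; the toy values are illustrative):

```python
import math

def rmse(pred, true):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def paired_t(pred, true):
    """t statistic of the paired differences (df = n - 1)."""
    d = [p - t for p, t in zip(pred, true)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

pred, true = [100, 200, 300], [110, 190, 310]
r, m, t = rmse(pred, true), mae(pred, true), paired_t(pred, true)
```

The P value would then be obtained from the t distribution with n-1 degrees of freedom (eg, via scipy.stats).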

The deep learning models generated high-quality treatment steps and outperformed the baselines, as summarized in

Model performance in power level generation.

Model | Accuracy | Precision | Recall | F1 | t value | P value

Deep learning | 0.988 | 0.980 | 0.980 | 0.980 | 0.707 | .480

Logistic regression | 0.974 | 0.964 | 0.964 | 0.964 | 1.257 | .209

RFC^{a} | 0.708 | 0.823 | 0.859 | 0.803 | 4.976 | <.001

SVC^{b} | 0.981 | 0.969 | 0.976 | 0.972 | 2.205 | .028

^{a}RFC: random forest classifier.

^{b}SVC: support vector classifier.

Model performance in shock rate generation.

Model | Accuracy | Precision | Recall | F1 | t value | P value

Deep learning | 0.981 | 0.963 | 0.957 | 0.960 | 0.277 | .782

Logistic regression | 0.978 | 0.932 | 0.960 | 0.945 | 2.331 | .020

RFC^{a} | 0.952 | 0.930 | 0.986 | 0.956 | 2.064 | .039

SVC^{b} | 0.976 | 0.926 | 0.956 | 0.939 | 2.510 | .012

^{a}RFC: random forest classifier.

^{b}SVC: support vector classifier.

Model performance in shock number generation.

Model | RMSE^{a} | MAE^{b} | t value | P value

Deep learning | 207 | 121 | 0.350 | .727

Linear regression | 265 | 206 | 0.917 | .359

RFR^{c} | 255 | 158 | 0.628 | .530

SVR^{d} | 350 | 173 | 9.427 | <.001

^{a}RMSE: root mean squared error.

^{b}MAE: mean absolute error.

^{c}RFR: random forest regression.

^{d}SVR: support vector regression.

The analysis also tested the difference between the generated steps and the ground truth. In the paired t tests, the steps generated by the deep learning models did not differ significantly from the ground truth.

Furthermore, we analyzed the performance of the deep learning models on samples of various treatment sequence lengths to gain a better understanding of how the treatment sequence information could aid decision making. We partitioned the validation dataset into 9 sets by the number of previous treatment steps and summarized the validation results in

Power level generation performance in samples containing different numbers of previous treatment steps.

Number of previous treatment steps | Accuracy | Precision | Recall | F1 |

1 | 1.000 | 1.000 | 1.000 | 1.000 |

2 | 1.000 | 1.000 | 1.000 | 1.000 |

3 | 1.000 | 1.000 | 1.000 | 1.000 |

4 | 1.000 | 1.000 | 1.000 | 1.000 |

5 | 1.000 | 1.000 | 1.000 | 1.000 |

6 | 0.983 | 0.980 | 0.980 | 0.980 |

7 | 0.926 | 0.915 | 0.939 | 0.925 |

8 | 0.889 | 0.873 | 0.914 | 0.888 |

9 | 0.875 | 0.875 | 0.500 | 0.933 |

Shock rate generation performance in samples containing different numbers of previous treatment steps.

Number of previous treatment steps | Accuracy | Precision | Recall | F1 |

1 | 1.000 | 1.000 | 1.000 | 1.000 |

2 | 0.992 | 0.997 | 0.972 | 0.984 |

3 | 1.000 | 1.000 | 1.000 | 1.000 |

4 | 1.000 | 1.000 | 1.000 | 1.000 |

5 | 0.992 | 0.997 | 0.974 | 0.985 |

6 | 0.975 | 0.976 | 0.802 | 0.861 |

7 | 0.889 | 0.857 | 0.631 | 0.888 |

8 | 0.917 | 0.864 | 0.642 | 0.902 |

9 | 1.000 | 1.000 | 1.000 | 1.000 |

Performance of the generation of the number of shocks in samples containing different numbers of previous treatment steps.

Number of previous treatment steps | RMSE^{a} | MAE^{b}

1 | 31 | 25 |

2 | 32 | 17 |

3 | 34 | 24 |

4 | 139 | 60 |

5 | 365 | 310 |

6 | 317 | 233 |

7 | 273 | 190 |

8 | 275 | 242 |

9 | 99 | 76 |

^{a}RMSE: root mean squared error.

^{b}MAE: mean absolute error.

The validation showed that the capability of the deep learning model for step generation is on par with that of top physicians. Based on the high-quality step generation, we generated treatment plans by iteratively generating steps with the trained models (
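Conceptually, the iterative plan generation can be sketched as a rollout loop: starting from the first step, the trained models repeatedly produce the next step until a stopping condition is met. The cumulative-shock stopping rule and the stand-in model below are illustrative assumptions, not the authors' exact procedure.

```python
# Iterative treatment plan generation; generate_next_step stands in for
# the trained deep learning models (hypothetical interface).

MAX_TOTAL_SHOCKS = 3000  # typical total-shock cap for renal stones

def generate_plan(first_step, ppc, generate_next_step):
    """Roll out a plan step by step until the shock budget is reached."""
    plan = [first_step]
    while sum(s[2] for s in plan) < MAX_TOTAL_SHOCKS:
        plan.append(generate_next_step(plan, ppc))
    return plan

# Toy stand-in model: ramp the power level, keep a 120/min rate, 500 shocks/step.
def toy_model(previous_steps, ppc):
    power = min(previous_steps[-1][0] + 1, 9)
    return (power, 120, 500)

plan = generate_plan((1, 120, 500), ppc=[], generate_next_step=toy_model)
```

With trained models in place of the toy stand-in, each generated step is appended to the sequence and re-encoded before generating the next one.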

Previous literature includes a series of works on standardizing SWL treatment [

The analysis results revealed that deep learning models for treatment step generation effectively learn from SWL treatment plans and achieve the step generation capability of top physicians. The performance comparison indicated that utilization of a previous treatment sequence in deep learning improves the quality of generated steps. By iteratively generating treatment steps, our automated planning framework can avoid human biases and generate personalized, high-quality, and consistent SWL treatment plans based on PPC, including patient demographics and stone characteristics. With the help of these automatically generated treatment plans, physicians can minimize the trial-and-error process and implement evidence-based personalized treatment. This framework can be generalized to different machine types, so physicians can easily adapt to new generations of SWL machines.

Our proposed model only learns and imitates the best practices but cannot perform better than them. Even the best physicians cannot plan successful SWL treatments for all cases, so successful difficult cases, including those requiring long treatment sequences, are rare in the training data. Therefore, our model may be good at planning easier cases but less adept at rare difficult cases, similar to physicians’ actual practice. As treatment cases, especially successful difficult cases, accumulate, our model is likely to gain expert-level planning capability for difficult cases.

Due to data limitations, we were only able to consider a small set of patient demographics and stone characteristics. However, our framework can be easily extended to utilize a larger set of parameters than has previously been used. Moreover, the data are retrospective. Therefore, clinical studies are warranted to confirm the effectiveness and efficiency of this framework.

To the best of our knowledge, our framework is the first effort to implement automated planning of SWL treatment via deep learning. Its assistance to inexperienced urologists in designing SWL treatment plans is useful in both SWL treatment planning and physician training. While the applications of machine learning in diagnosis are becoming more mature, few studies exist on automated treatment plan generation. Our approach is a step forward in realizing the potential of machine learning in medical sciences.

long short-term memory

mean absolute error

mean squared error

preoperative patient characteristics

rectified linear unit

random forest classifier

random forest regression

root mean squared error

recurrent neural network

support vector classifier

support vector regression

shock wave lithotripsy

All authors designed the study. RGNS provided the data. ZC and DDZ developed the deep learning model for treatment plan generation. ZC implemented and evaluated the deep learning model and drafted the manuscript. All authors revised the manuscript.

RGNS was an employee of Translational Analytics and Statistics. BDH is a consultant for NextMed Management Services. The remaining authors declare no conflicts of interest.