Peak Outpatient and Emergency Department Visit Forecasting for Patients With Chronic Respiratory Diseases Using Machine Learning Methods: Retrospective Cohort Study

Background: The overcrowding of hospital outpatient and emergency departments (OEDs) due to chronic respiratory diseases in certain weather or under certain environmental pollution conditions degrades the quality of medical care and can even limit its availability. Objective: To help OED managers schedule medical resource allocation during times of excessive health care demand after short-term fluctuations in air pollution and weather, we employed machine learning (ML) methods to predict the peak OED arrivals of patients with chronic respiratory diseases. Methods: We first identified 13,218 visits from patients with chronic respiratory diseases to OEDs in hospitals from January 1, 2016, to December 31, 2017. Then, we divided the data into three datasets: weather-based visits, air quality-based visits, and weather air quality-based visits. Finally, we developed ML methods to predict the peak events (peak demand days) of patients with chronic respiratory diseases (eg, asthma, respiratory infection, and chronic obstructive pulmonary disease) visiting OEDs on the three datasets in Guangzhou, China. Results: The adaptive boosting-based neural networks, tree bag, and random forest achieved the largest receiver operating characteristic area under the curve, 0.698, 0.714, and 0.809, on the air quality dataset, the weather dataset, and the weather air quality dataset, respectively. Overall, random forests reached the best classification prediction performance. Conclusions: The proposed ML methods may act as a useful tool to adapt medical services in advance by predicting the peak of OED arrivals. Further, the developed ML methods are generic enough to cope with similar medical scenarios, provided that the data are available. (JMIR Med Inform 2020;8(3):e13075) doi: 10.2196/13075


Introduction
Worldwide, one of the fundamental issues in hospital management is the sudden inflow of outpatient and emergency department (OED) patients [1]. Influenza season (epidemic period) is one of the causes of OED overcrowding and generates a large flow of patients [2]. In particular, weather and air quality are important factors that affect the health status of individuals and populations with chronic respiratory diseases [3]. Chronic respiratory diseases such as asthma and chronic obstructive pulmonary disease (COPD) often require regular OED medication as the condition changes, which can cause further OED overcrowding [4]. Nevertheless, the crowding could be alleviated considerably by forecasting levels of demand for OED care and giving health care staff an opportunity to prepare for this demand [5]. Efficient patient flow has been shown to potentially increase the capacity of the existing system, minimize patient care delays, and improve overall quality of health care [6-10].
There have been many attempts to predict daily patient volumes visiting emergency departments (EDs) using machine learning (ML) and deep learning models based on weather and air quality [11,12].
Bibi et al [13] created an artificial neural network (ANN) trained with backpropagation to predict volumes of ED visits by patients with asthma, COPD, or acute or chronic bronchitis 7 days in advance. The study used a dataset (1020 days of ED activity) extracted from an ED admittance database at the Barzilai Medical Center (Ashkelon, Israel). The model integrated 5 indicators (ie, temperature, relative humidity, barometric pressure, sulfur dioxide, and nitrogen oxide) and achieved an average prediction error of 12%. However, the indicators and the data collected were relatively limited.
Moustris et al [14] developed three different ANN models to forecast childhood asthma admissions 7 days in advance for the subgroups of 0 to 4 years of age and 5 to 14 years of age, as well as for the whole study population. The study used 6 indicators, including ozone, carbon monoxide, PM10 (particulate matter of 10 μm in diameter or smaller), PM2.5 (particulate matter less than 2.5 μm in diameter), and sulfur dioxide, from Athens, Greece to train the ANN models. The root mean square error (mean bias error) of the three ANN models was 6.8 (1.4), 3.2 (1.3), and 5.2 (0.3) for 0 to 4 years of age, 5 to 14 years of age, and the whole study population, respectively. However, the study only took into account air quality indicators and ignored weather factors.
Soyiri [15] explored base and reduced predictive quantile regression models (QRMs) to detect peak numbers of daily asthma admissions in London, with sensitivities of 76% and 62% and specificities of 66% and 76%, respectively. The research used 10 indicators (including air temperature, vapor pressure, humidity, ozone, carbon monoxide, nitrogen dioxide, nitrogen oxide, PM10, and formaldehyde) to build the QRMs. The findings also reaffirmed the known associations between asthma and temperature, ozone, and carbon monoxide levels. Nevertheless, the accuracy of the models was not very high.
Khatri et al [16] employed an ANN-based classifier using multilayer perceptrons with a backpropagation algorithm to predict peak events, that is, days of peak demand, for patients with respiratory diseases. The study used 8 predictors (ie, outdoor temperature, relative humidity, wind speed, carbon monoxide, ozone, sulphur dioxide, nitrogen dioxide, and PM2.5) to construct the model. The proposed ANN model achieved good forecasting performance, with an overall accuracy of 81.0%. Even so, the study population included only ED visits for respiratory diseases. Further, the research did not consider separating the data into weather and air pollution datasets.
Yucesan et al [17] developed a multi-method patient arrival forecasting outline for EDs using a private hospital ED case in Turkey. The methods used in this study include the single methods linear regression (LR), autoregressive integrated moving average (ARIMA), ANN, and exponential smoothing, and the hybrid methods ARIMA-ANN and ARIMA-LR. The ARIMA-ANN hybrid model was shown to outperform the other methods in forecasting accuracy. This study was a novel attempt to apply these methods to model ED patient arrivals and to make an overall assessment of them.
Muhammet et al [18] analyzed variations in annual, monthly, and daily ED arrivals based on regression and neural network models using data collected from a public hospital ED in Istanbul. Both methods proved to be useful and readily available tools for forecasting ED patient arrivals. The results show that ANN-based models have higher accuracy and lower absolute error in forecasting ED patient arrivals over the long and medium term. The standard error of regression for the ANN modeling, 30.022306, refers to the difference between the real and the forecasted ED patient arrivals per day across the three patient groups.
Although ED forecasting has attracted many researchers, we found few studies designed to predict OED visits of patients with chronic respiratory diseases using multiple ML methods. In real clinical settings, patients with chronic respiratory diseases often go to outpatient clinics. Therefore, it would be of great significance to forecast the peak OED visits for chronic respiratory diseases.
In this paper, we employed bagging [19], adaptive boosting [20], and random forest [21] algorithms to predict the peak arrival of chronic respiratory disease OED visits based on weather and air quality data. We also compared the ensemble models with the general linear model (GLM) [22] and the polynomial kernel support vector machine (SVM) [23].
The results show that ensemble models outperform the GLM and SVM. Further, we found that the predictive performance of ML algorithms gradually improves as the number of input features increases. With these ML approaches, OED managers can plan resources to meet the excessive demand from patients with respiratory diseases after short-term fluctuations in air pollution or weather.
For statistical purposes, days on which the daily volume was less than 24 were labeled as nonpeak events, and the rest were labeled as peak events. Table 1 describes the Pearson correlation coefficient between OED visit numbers and input indicators. OED visit numbers showed positive correlations with wind speed, atmospheric pressure, carbon monoxide, sulphur dioxide, nitrogen dioxide, and PM2.5, and negative correlations with outdoor temperature, relative humidity, and ozone. The weather and air quality data distribution of patients with acute exacerbations of COPD from peak and nonpeak groups is shown in Table 2.
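As an illustration of the labeling rule and the correlation analysis described here, the following minimal Python sketch labels days using the threshold of 24 stated in the text and computes a Pearson coefficient by hand. The function names and the toy values are hypothetical and are not taken from the study data.

```python
from math import sqrt

PEAK_THRESHOLD = 24  # from the text: daily volume below 24 is a nonpeak day


def label_peak(daily_visits, threshold=PEAK_THRESHOLD):
    """Return 1 for a peak day (volume >= threshold), 0 for a nonpeak day."""
    return [1 if v >= threshold else 0 for v in daily_visits]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data, not taken from the study.
visits = [18, 30, 22, 27, 24, 15]
temperature = [28.0, 12.5, 25.0, 14.0, 16.0, 30.0]

labels = label_peak(visits)
r = pearson(visits, temperature)
print(labels, round(r, 3))
```

In this toy series, visits rise as temperature falls, so the coefficient is negative, which mirrors the direction reported for outdoor temperature.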

Data Analysis
Since the effect of weather and air quality on respiratory conditions in humans is not instantaneous, representative lags were applied to variables based on previous work in this area [3,24-26]. To simplify the modeling of this delayed impact, we considered a 3-day lag for all variables.
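A minimal Python sketch of one way to apply such a lag is given below. It assumes the lag means pairing each day's visit count with the predictor values measured 3 days earlier; the text does not say whether the lag is a plain shift or an average over preceding days. The function name add_lag and the toy rows are hypothetical, and the rows are assumed to be consecutive calendar days.

```python
def add_lag(rows, lag=3):
    """Pair each day's outcome with predictor values observed `lag` days earlier.

    rows: list of (predictors, visits) tuples in chronological order, one per day.
    Returns a list of (lagged_predictors, visits). The first `lag` days are
    dropped because no earlier measurement exists for them.
    """
    return [(rows[i - lag][0], rows[i][1]) for i in range(lag, len(rows))]


# Hypothetical toy series: temperature rises by 1 per day, visits likewise.
rows = [({"temp": 20 + i}, 10 + i) for i in range(6)]
lagged = add_lag(rows, lag=3)

for predictors, visits in lagged:
    print(predictors, visits)
```

With six days of input and a 3-day lag, three usable samples remain, which is why a lag always costs a few rows at the start of the series.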
We removed weekend records with fewer than 10 people to eliminate weekend effects, bringing the total number of samples collected to 559. To create a meaningful feature vector for training and cross-validation, the date field was removed to obtain a pair (X, y), where X was a matrix with the dimensions (m × n = 559 × 9) representing values of variables, and y was a vector of length (m=559) representing the output class of the examples (ie, events). Analysis of the data suggested that the output class was highly imbalanced, with 413 examples of nonpeak events and 146 examples of peak events.
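The class counts reported here imply that accuracy alone can be misleading: a classifier that always predicts nonpeak would already be correct on most days. The following Python lines simply redo that arithmetic from the counts in the text; the variable names are illustrative.

```python
n_nonpeak = 413  # from the text
n_peak = 146     # from the text
n_total = n_nonpeak + n_peak

# Share of peak days and accuracy of a trivial "always nonpeak" classifier.
peak_share = n_peak / n_total
majority_baseline_accuracy = n_nonpeak / n_total

print(f"samples: {n_total}")
print(f"peak share: {peak_share:.3f}")
print(f"majority baseline accuracy: {majority_baseline_accuracy:.3f}")
```

This is one reason to report per-class precision, recall, and F measure alongside overall accuracy.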

Machine Learning Approaches
In this section the ML algorithms are presented and discussed; details of the updating and classification processes are described in the following algorithms.

Random Forest
1. Create a new training set from a sample of a training set
2. Repeat step 1 N times to get N new training sets, and train N trees on the training sets
3. Identify the optimal candidate node as the prediction space from the randomly selected m feature set when building the tree model

Boosting
1. Initialize the weight vector of the training data
2. Construct m weak classifiers
3. Calculate the classification error rate of the m weak classifiers
4. If one sample is misclassified, its weight will be increased, and the next weak classifier pays more attention to this sample; otherwise, its weight will be decreased.
5. After all the weak classifiers finish the training, the stronger classifier is constructed.
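The boosting steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: it uses the classic discrete AdaBoost update with single-feature threshold rules (decision stumps) as weak classifiers, whereas the study reports adaptive boosting-based neural networks. All function names and the toy data are hypothetical.

```python
from math import exp, log


def train_stump(X, y, w):
    """Find the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] >= t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best


def stump_predict(stump, row):
    _, j, t, sign = stump
    return sign if row[j] >= t else -sign


def update_weights(w, y, preds):
    """Steps 3-4: compute the weighted error, then reweight and renormalize."""
    err = sum(wi for wi, yi, p in zip(w, y, preds) if p != yi)
    err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
    alpha = 0.5 * log((1 - err) / err)     # vote strength of this weak classifier
    # Misclassified samples (yi * p == -1) gain weight; the others lose weight.
    new_w = [wi * exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
    total = sum(new_w)
    return alpha, [wi / total for wi in new_w]


def adaboost(X, y, rounds=5):
    """Train `rounds` weighted stumps; labels must be +1 (peak) or -1 (nonpeak)."""
    n = len(X)
    w = [1.0 / n] * n  # step 1: uniform initial weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)  # step 2: fit a weak classifier
        preds = [stump_predict(stump, row) for row in X]
        alpha, w = update_weights(w, y, preds)
        ensemble.append((alpha, stump))
    return ensemble


def predict(ensemble, row):
    """Step 5: the strong classifier is the alpha-weighted vote of the stumps."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: one feature, peak days (+1) at higher values.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [-1, -1, 1, 1]
model = adaboost(X, y)
print([predict(model, row) for row in X])
```

A property of this update worth noting is that, after renormalization, the misclassified samples together carry half of the total weight, which is what forces the next weak classifier to attend to them.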

Metrics
Precision, recall, and F measure are the metrics that are used to evaluate our proposed ML methods. Based on the classification of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), we have the following formulas.
We then define the F measure, a metric that balances precision and recall.
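The formulas themselves do not appear in this text. Assuming the standard definitions in terms of the TP, FP, and FN counts introduced above, and assuming the balanced (F1) form of the F measure, they would read:

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
F &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```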

Evaluation
We calculated the overall accuracy, precision, recall, and F measure for nonpeak events and peak events, respectively. The evaluation of the ML approaches on the weather and air quality data is shown in Table 3. The developed random forest gave the best predictive performance, mainly because the collected data were a better fit for the random forest. In addition, we used the receiver operating characteristic (ROC) curve to evaluate the multiple ML approaches on the same dataset (Table 4). We found that adaptive boosting neural networks achieved the largest ROC area under the curve on the air quality data, tree bag on the weather data, and random forest on the weather and air quality data. In general, the predictive performance of the ML approaches improved as the number of data variables increased.
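To make the evaluation procedure concrete, the following Python sketch computes per-class precision, recall, and F measure, together with the ROC area under the curve obtained from its rank interpretation. The labels and scores are hypothetical and do not reproduce Table 3 or Table 4; the function names and the 0.5 decision threshold are assumptions made for illustration.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F measure treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f


def roc_auc(y_true, scores):
    """AUC as the probability that a random peak day outscores a random nonpeak day.

    Ties count as half. This equals the area under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores (1 = peak, 0 = nonpeak), not study data.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.2, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
peak = per_class_metrics(y_true, y_pred, positive=1)
nonpeak = per_class_metrics(y_true, y_pred, positive=0)
auc = roc_auc(y_true, scores)
print(accuracy, peak, nonpeak, auc)
```

Unlike accuracy and the F measure, the area under the curve does not depend on the chosen cutoff, which makes it a convenient basis for comparing classifiers on an imbalanced outcome.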

Clinical Significance
Recent studies have shown that weather and air pollution are a major problem, leading to an increase in daily deaths and hospital admissions for chronic respiratory illnesses [3-5,27,28].
We examined the distribution of daily patient visits over 2 years (ie, 2016 and 2017) (Figure 2). It is worth noting that peak days are more frequent from October to March, which indicates that haze is a strong predictor, as these months are mostly colder in Guangzhou. Thus, it is important to recognize the peak OED visits for respiratory conditions. Previous studies mainly focused on forecasting peak events of ED visits for patients with one or more diseases. We expanded the study population to include outpatient visits for patients with chronic respiratory diseases. In fact, many patients with chronic respiratory diseases also seek treatment from outpatient departments. Thus, predicting the OED peak visits for chronic respiratory disease plays an important role in clinical management.
We developed a variety of learning methods to forecast the OED peak visits, from simple models to complex ensemble learning ones. In particular, the ensemble learning models achieved good prediction results. In terms of indicators, most of the previous studies used air pollution indicators to predict the peak events of ED visits; however, we used weather and air quality indicators to build a more complete set of features.

Limitations
There are a few limitations to this study. We used nine variables, namely, wind speed, atmospheric pressure, outdoor temperature, relative humidity, carbon monoxide, ozone, sulphur dioxide, nitrogen dioxide, and PM2.5, as these variables have been associated with exacerbation of respiratory diseases. However, other variables also contribute to the exacerbation of these diseases, such as formaldehyde and nitrogen oxide [29]. The Environmental Protection Agency of Guangzhou does not disclose the daily data for variables such as formaldehyde. Other pollutants were either not measured or had too many missing values. Therefore, we were not able to include these variables in our study.
In terms of weather, Guangzhou, as a coastal city in southern China, has a higher air humidity than northern cities. In terms of air pollution, some studies have shown that patients with lower economic and educational levels are more susceptible to air pollution [30]. Guangzhou has a significantly higher economic and educational level than the national average. However, haze pollution and harmful emissions in Guangzhou are also serious [31]. In particular, the level of lighter particulate matter is higher than in northern cities due to automobile exhaust and industrial emissions. Therefore, the prediction results of this study may not be directly applicable to other regions due to regional differences in climate and air pollution.

Conclusion
In this paper, we investigated ML methods to forecast the peak events of patients with chronic respiratory diseases visiting OEDs using nine weather and air quality predictors. Overall, random forest outperformed the other methods in accuracy, F measure, and ROC on the validation dataset. Compared with previous similar studies, we used more indicators and ML methods to study the subject and achieved good results. The ML methods may act as a useful tool to adapt medical services in advance by predicting the peak number of OED arrivals.

Conflicts of Interest
None declared.