Classification of the Disposition of Patients Hospitalized with COVID-19: Reading Discharge Summaries Using Natural Language Processing

Background: Medical notes are a rich source of patient data; however, the nature of unstructured text has largely precluded the use of these data for large retrospective analyses. Transforming clinical text into structured data can enable large-scale research studies with electronic health records (EHR) data. Natural language processing (NLP) can be used for text information retrieval, reducing the need for labor-intensive chart review. Here we present an application of NLP to large-scale analysis of medical records at 2 large hospitals for patients hospitalized with COVID-19. Objective: Our study goal was to develop an NLP pipeline to classify the discharge disposition (home, inpatient

To prepare for a possible second wave in Massachusetts, we set out to conduct a large-scale study of factors associated with outcomes in hospitalized patients at 2 large academic Boston hospitals. This effort required the significant task of reviewing medical records for over 1000 patients. For structured parts of the electronic health record (EHR), automated data extraction is straightforward. However, some essential information is exclusively or most reliably available only in semistructured or unstructured narrative medical notes, including patient-reported symptoms, examination findings, or social habits. Thus, developing automated approaches to EHR information extraction wherever possible is critical for more complete patient phenotyping.
Natural language processing (NLP) deals with automated analysis of unstructured text data. Recent advances in NLP machine learning have empowered computers to do several tasks such as machine translation, speech recognition, speech synthesis, semantic understanding, and text summarization [11,12]. NLP has the advantage of being much faster than human chart review of medical records [13-16].
Here we present an automated approach, using NLP, to extract a specific outcome from hospital discharge summaries: discharge destination or "disposition" (ie, anticipated location or status following discharge). Dispositions of interest included home, inpatient rehabilitation center, skilled nursing inpatient facility (SNIF), and death. Discharge disposition of patients with COVID-19 from health care facilities is important due to the high risk of transmission of the disease within nursing homes and hospitals when patients are discharged to locations other than home, and also because it represents an important measure closely related to functional outcome and level of disability following hospitalization, as well as overall costs of care. Furthermore, this information has the potential to aid health care facilities in resource planning to better prepare for the incoming flow of patients. Although our model is tailored for discharge disposition, the approach we developed is generalizable to other outcomes available in discharge summaries.

Study Overview
Data were extracted from the hospital electronic medical record under a research protocol approved for a waiver of informed consent by the Partners Healthcare Institutional Review Board. Clinical data were retrospectively analyzed for all adult patients who tested positive for SARS-CoV-2 infection between March 10 and June 30, 2020. A total of 1737 patients admitted to 2 major Boston hospitals, 1232 from Massachusetts General Hospital (MGH) and 505 from Brigham and Women's Hospital (BWH), were included. Only patients with a physician discharge summary and available known ground-truth discharge disposition were included.

Overview
Data consisted of discharge summaries, which are unstructured free-text notes written by physicians, and a ground-truth record of discharge disposition, used to assess the accuracy of the NLP results. The methodology for note preprocessing is shown in Figure 1. The upper part of the figure provides an overview of the text extraction for each field on the list of extraction fields depicted in Table 1. The lower part of the figure shows the methodology steps where the text extracted from all the fields is processed for modeling. The data were randomly stratified into train and test sets for modeling, which we address in the Model Development section.

Document Preprocessing
Admission, discharge, and birth dates were removed from the discharge summaries, as well as punctuation, special characters, blank spaces, and numerical digits. Notes were then subjected to lowercasing, tokenization, and correction using lemmatization, a procedure for obtaining the root form of the word, using vocabulary (dictionary importance of words) and morphological (word structure and grammar relations) analysis. WordNetLemmatizer from the NLTK library in Python (version 3.7; Python Software Foundation) was used, with the part-of-speech (POS) tag specified as a verb. Patients' names, addresses, health care facilities, and hospital unit names were removed, as well as single letters. Abbreviation expansion and spelling corrections were performed for a small list of frequently used clinical words (Table S1 in Multimedia Appendix 1). A list of commonly used and less informative stopwords was also removed from the notes (Table S2 in Multimedia Appendix 1).
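The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the abbreviation, stopword, and lemma tables are hypothetical stand-ins for the lists in Multimedia Appendix 1, the dictionary-based `lemmatize_verb` replaces a real lemmatizer so that the sketch has no external dependencies, and removal of dates and protected health information is omitted.

```python
import re

# Hypothetical miniature resources; the study's actual lists are in Multimedia Appendix 1.
ABBREVIATIONS = {"pt": "patient", "hx": "history"}
STOPWORDS = {"the", "a", "and", "to", "of", "with", "on"}
VERB_LEMMAS = {"tolerated": "tolerate", "intubated": "intubate", "discharged": "discharge"}


def lemmatize_verb(token):
    """Stand-in verb lemmatizer: look the token up in a tiny table."""
    return VERB_LEMMAS.get(token, token)


def preprocess(note, lemmatize=lemmatize_verb):
    """Lowercase, strip non-letters, tokenize, expand abbreviations, lemmatize, filter."""
    text = note.lower()
    # Replace digits, punctuation, and special characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()  # whitespace tokenization also collapses blank runs
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    tokens = [lemmatize(t) for t in tokens]
    # Drop single letters and stopwords.
    return [t for t in tokens if len(t) > 1 and t not in STOPWORDS]


print(preprocess("Pt intubated on 03/12 and tolerated the diet."))
```

With NLTK installed and the WordNet corpus downloaded, the stand-in could be replaced by passing `lambda t: WordNetLemmatizer().lemmatize(t, pos="v")` as the `lemmatize` argument.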

Processing of Specific Discharge Summary Fields
Discharge summaries at MGH and BWH are semistructured, with a series of named fields containing specific types of mostly free-text information (Table 1). We present an example of discharge summary notes with protected health information removed (Table S3 in Multimedia Appendix 1). Text fields were identified based on information extracted from the notes using regular expressions with search tokens (Table 1). The function "str.extractall" from the pandas library in Python was used to extract the 200 characters of text following each instance where the search token appeared. Some notes contained a "discharge disposition" field used to list the discharge disposition. We deleted this field to avoid an overly "easy" solution, because this field is not universally available, and because we wished to assess how well the approach performs when structured data are unavailable. In a field where more than one extraction was performed (ie, with more than one search token), the corresponding results were joined, and duplicated words were removed. For example, for the "Diet" field, using the regular expressions with search tokens "diet" and "nutrition," 200 characters were captured for each search token, for a total of 400 characters. Since there might be repeated information in the discharge summary regarding diet and nutrition recommendations, duplicated words were removed from the captured text. Where no data were captured with the search tokens, an indication of missingness was set with the name of the field and the suffix "_missing".
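The field extraction logic can be illustrated with the standard-library `re` module. This is a sketch under stated assumptions rather than the authors' implementation: the function name `extract_field` is hypothetical, the window is assumed to start immediately after the search token, and matching is assumed to be case-insensitive.

```python
import re

WINDOW = 200  # number of characters captured after each search token


def extract_field(note, field_name, search_tokens, window=WINDOW):
    """Return de-duplicated words captured after any search token,
    or '<field_name>_missing' when no token is found."""
    captured = []
    for token in search_tokens:
        # Capture up to `window` characters following every (non-overlapping) match.
        pattern = re.compile(
            re.escape(token) + r"(.{0,%d})" % window,
            re.IGNORECASE | re.DOTALL,
        )
        captured.extend(m.group(1) for m in pattern.finditer(note))
    if not captured:
        return f"{field_name}_missing"
    # Join the captures and keep only the first occurrence of each word.
    seen, words = set(), []
    for word in " ".join(captured).split():
        key = word.lower()
        if key not in seen:
            seen.add(key)
            words.append(word)
    return " ".join(words)


print(extract_field("Diet regular. Nutrition regular.", "diet", ["diet", "nutrition"]))
print(extract_field("Follow up in two weeks.", "diet", ["diet", "nutrition"]))
```

Removing duplicates after joining keeps the combined text from double counting content that appears under both search tokens of the same field.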
The texts extracted from all fields (depicted in Table 1) were joined to create a reduced version of the discharge summary, which was then subjected to tokenization, lemmatization, and abbreviation expansion, as described in the Document Preprocessing subsection. The vocabulary used for modeling was created based on these reduced versions of the discharge summaries contained in the training set. Documents were represented as a binary bag-of-words (BoW; ie, an ordered series of binary vectors indicating whether a given n-gram [word or sequence of 2 or 3 words] is present in the document, disregarding grammar and word order). CountVectorizer from the scikit-learn library in Python was used with its default parameters, except for the n-gram range, which was set to include unigrams (1 word), bigrams (2 consecutive words), and trigrams (3 consecutive words). As a first step to reduce dimensionality, only features present in at least 10% of the reduced versions of the discharge summary notes were considered. Multiclass logistic regression with the least absolute shrinkage and selection operator (LASSO) [17] was used to further sparsify the model.
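To make the binary n-gram representation concrete, the following dependency-free sketch builds a vocabulary of unigrams, bigrams, and trigrams and encodes each document as a 0/1 vector. It is an illustration only: the function names are hypothetical, and the document-frequency threshold is assumed here to apply to every n-gram, which may differ from the order of operations used in the study.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Yield all word n-grams of length n_min..n_max as space-joined strings."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def binary_bow(docs, min_doc_fraction=0.10):
    """Return (vocabulary, matrix); matrix[d][j] is 1 if n-gram j occurs in document d."""
    doc_grams = [set(ngrams(doc.split())) for doc in docs]
    # Document frequency: number of documents containing each n-gram.
    doc_freq = Counter(g for grams in doc_grams for g in grams)
    min_docs = min_doc_fraction * len(docs)
    vocab = sorted(g for g, df in doc_freq.items() if df >= min_docs)
    matrix = [[1 if g in grams else 0 for g in vocab] for grams in doc_grams]
    return vocab, matrix


docs = ["home health service", "home health", "rehab service"]
vocab, matrix = binary_bow(docs, min_doc_fraction=0.5)
print(vocab)
print(matrix)
```

In scikit-learn, a comparable configuration could be expressed through the `ngram_range`, `binary`, and `min_df` parameters of `CountVectorizer`; fitting the vocabulary on the training set only, as described above, avoids leaking information from the test set.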

Outcome Measure
The multiclass outcome measure was discharge disposition, composed of the classes: home, inpatient rehabilitation, SNIF, and death. "Home" included "home or self-care," "home-health care services," and patients who "left against medical advice." SNIF included "Skilled Nursing Facility" and "Custodial Care Facility."
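The grouping of recorded dispositions into the 4 outcome classes amounts to a lookup table, sketched below. The class identifiers and the raw labels for inpatient rehabilitation and death are assumptions made for illustration; only the home and SNIF groupings are spelled out in the text.

```python
# Raw disposition label (lowercased) -> outcome class used for modeling.
DISPOSITION_CLASSES = {
    "home or self-care": "home",
    "home-health care services": "home",
    "left against medical advice": "home",
    "skilled nursing facility": "snif",
    "custodial care facility": "snif",
    # Assumed raw labels for the remaining two classes (not listed in the text).
    "inpatient rehabilitation": "inpatient_rehabilitation",
    "death": "death",
}


def to_outcome_class(raw_label):
    """Map a recorded disposition to its outcome class; fail loudly if unmapped."""
    key = raw_label.strip().lower()
    if key not in DISPOSITION_CLASSES:
        raise ValueError(f"Unmapped discharge disposition: {raw_label!r}")
    return DISPOSITION_CLASSES[key]


print(to_outcome_class("Skilled Nursing Facility"))
```

Raising an error for unmapped labels, rather than silently assigning a default class, makes it easier to spot dispositions that fall outside the 4 categories of interest.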

Model Development
The training algorithm used the one-vs-rest scheme for multiclassification, where a binary problem was fitted for each class and the class weight was balanced. Logistic regression [18] with LASSO regularization was used as the classification model. The model estimator is depicted in equation 1 and the LASSO regularization objective can be written as in equation 2.
Here, $X \in \mathbb{R}^{n \times p}$ corresponds to the design input matrix and $y \in \mathbb{R}^{n}$ corresponds to the vector of observations, where n is the number of observations, in this case the number of discharge summaries or number of patients, and p the number of features in $X$. The vector of regression coefficients is given by $\beta \in \mathbb{R}^{p}$, $\lVert \beta \rVert_1$ corresponds to the L1 norm of this coefficients vector, and λ is the regularization parameter that controls the amount of shrinkage. The regularization adds a penalty on the weights to prevent overfitting [19]. The inverse of the regularization strength, C, was varied over the values {0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5}.
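For reference, a standard textbook form of L1-regularized (LASSO) logistic regression for a single one-vs-rest binary problem is sketched below, writing X for the design matrix, y for the vector of observations, and β for the coefficient vector. This is an assumed form and not necessarily identical to the authors' equations 1 and 2: $x_i$ denotes the i-th row of X, $y_i \in \{0, 1\}$ the corresponding binary label, and the intercept is omitted for brevity.

```latex
\begin{align}
\hat{p}_i &= \hat{p}(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\left(-x_i^{\top}\beta\right)} \\
\hat{\beta} &= \arg\min_{\beta \in \mathbb{R}^{p}}
  \left\{ -\sum_{i=1}^{n} \Big[ y_i \log \hat{p}_i + (1 - y_i) \log\left(1 - \hat{p}_i\right) \Big]
  + \lambda \lVert \beta \rVert_1 \right\},
\qquad \lVert \beta \rVert_1 = \sum_{j=1}^{p} \lvert \beta_j \rvert
\end{align}
```

Under this form, larger values of λ shrink more coefficients exactly to zero, and the inverse regularization strength C plays the role of 1/λ up to a constant scaling, so smaller values of C yield sparser models.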
Stratified random sampling was used to split the data set into a training set (70%) and a hold-out test set (30%). A randomized search was used for hyperparameter tuning during training with 100 iterations of 5-fold cross-validation. The solver was set to "liblinear" and the "warm start" hyperparameter was varied between true/false, where "true" corresponded to reusing the solution of the previous call to fit as initialization, and "false" corresponded to erasing the previous solution.
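Two of the ingredients described in this section, the stratified 70/30 split and balanced class weighting for one-vs-rest problems, can be sketched without external libraries. The function names, the random seed, and the toy label counts are illustrative assumptions; the weighting shown is the common "balanced" heuristic, n_samples / (n_classes × class count), which is assumed rather than confirmed to match the study's configuration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.30, seed=0):
    """Split indices so that each class keeps roughly the same train/test proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(test_fraction * len(indices))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def one_vs_rest_targets(labels, positive_class):
    """Binary targets for the one-vs-rest problem of a single class."""
    return [1 if label == positive_class else 0 for label in labels]


# Toy labels (illustrative counts only, not the study cohort).
labels = ["home"] * 60 + ["death"] * 20 + ["snif"] * 10 + ["rehab"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(balanced_class_weights(labels))
```

Stratifying by outcome keeps rare classes represented in both sets, and the balanced weights give errors on rare classes a proportionally larger penalty during fitting.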

Performance Measures
The R² coefficient of determination score was used in cross-validation scoring to select the best model configuration in the training data. The one standard error rule was used to select the regularization parameter. The simplest model whose mean R² score fell within 1 standard deviation of the maximum R² was selected.
To measure model performance on test data, the area under the receiver operating characteristic curve (AUROC) was calculated. The ROC curve is a function of recall (sensitivity) versus the false positive rate (FPR; ie, 1-specificity; Table S1 in Multimedia Appendix 1). The pair (Recall_k, FPR_k) is called an operating point for this curve, where k is a threshold that is varied to generate the ROC curve. The equations for these metrics are presented in Table S4 in Multimedia Appendix 1.
The area under the precision-recall curve (AUPRC), which is an important measure in the presence of class imbalance, was also calculated. The pair (Recall_k, Precision_k) is referred to as an operating point for this curve. Average precision (AP; Table S3 in Multimedia Appendix 1) summarizes this plot as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.
The F1-score (Table S4 in Multimedia Appendix 1) was also assessed as another performance metric commonly reported for data sets with imbalanced numbers across classes [20].
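The selection rule described above can be written as a short function. This is a sketch under one assumption that the text does not state explicitly: the "simplest" model is taken to be the one with the smallest C, that is, the strongest L1 penalty. The scores in the example are toy values, not results from the study.

```python
def one_standard_error_choice(c_values, mean_scores, std_scores):
    """Pick the smallest C whose mean cross-validation score is within
    1 standard deviation of the best mean score."""
    best = max(range(len(c_values)), key=lambda i: mean_scores[i])
    threshold = mean_scores[best] - std_scores[best]
    eligible = [c for c, m in zip(c_values, mean_scores) if m >= threshold]
    return min(eligible)


# Toy cross-validation summary (illustrative values only).
c_grid = [0.01, 0.1, 1, 5]
means = [0.60, 0.78, 0.80, 0.79]
stds = [0.03, 0.03, 0.03, 0.03]
print(one_standard_error_choice(c_grid, means, stds))
```

The rule trades a small, statistically insignificant loss in score for a sparser model with fewer nonzero coefficients.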
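The metrics described in this section can be computed for a single binary (one-vs-rest) problem as sketched below. The function names are hypothetical, both classes are assumed to be present, and the example labels and scores are toy values; the averaging across classes used in the study is not reproduced here.

```python
def operating_points(y_true, scores):
    """(threshold, recall, FPR, precision) at each distinct threshold, highest first."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for k in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= k and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= k and y == 0)
        points.append((k, tp / pos, fp / neg, tp / (tp + fp)))
    return points


def average_precision(y_true, scores):
    """Sum of precision at each threshold, weighted by the increase in recall."""
    ap, prev_recall = 0.0, 0.0
    for _, recall, _, precision in operating_points(y_true, scores):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def auroc(y_true, scores):
    """Trapezoidal area under the (FPR, recall) curve, starting from (0, 0)."""
    area, prev_fpr, prev_recall = 0.0, 0.0, 0.0
    for _, recall, fpr, _ in operating_points(y_true, scores):
        area += (fpr - prev_fpr) * (recall + prev_recall) / 2
        prev_fpr, prev_recall = fpr, recall
    return area


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for hard 0/1 predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auroc(y_true, scores), average_precision(y_true, scores))
```

AUROC and AP are computed from continuous scores across all thresholds, whereas the F1-score requires a single decision threshold to be fixed first.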
In total, 100 iterations of bootstrap random sampling with replacement were performed to calculate 95% CIs for performance metrics.
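A percentile bootstrap of this kind can be sketched as follows. The function names, the seed, and the simple index-based percentile are assumptions made for illustration; the exact percentile method used in the study is not stated.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(1 for y, p in zip(y_true, y_pred) if y == p) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_iter=100, alpha=0.05, seed=0):
    """Approximate (1 - alpha) CI of a metric from resampling with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_iter):
        # Draw n indices with replacement and recompute the metric on the resample.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_iter - 1))]
    upper = stats[int((1 - alpha / 2) * (n_iter - 1))]
    return lower, upper


y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
print(bootstrap_ci(y_true, y_pred, accuracy))
```

With only 100 resamples the interval endpoints are themselves somewhat noisy; a larger number of iterations would give more stable bounds at a higher computational cost.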

Summary of Patient Population
From 1917 patients' medical records, 1752 had a physician discharge summary and a discharge disposition within the categories of home, inpatient rehabilitation, SNIF, and death. Only adults (aged ≥18 years) were included in the analysis, leaving a study cohort of 1737 patients. The cohort was split into train and test sets using stratified random sampling according to outcome. Age in the train and test sets was balanced, with a median of 62 and 60 years old, respectively (Table 2). The majority of patients were White (n=774; 44.6%) and Black or African American (n=285; 16.4%). Most were discharged home (n=1052; 60.6%). Among all patients with COVID-19 in this sample, there were 243 (14.0%) nonsurvivors. The preprocessed data set for modeling was created based on the notes extracted in all fields except the "discharge disposition" and "code status" fields, as described in the Processing of Specific Discharge Summary Fields subsection. Before the dimensionality reduction step, in which only features present in at least 10% of the reduced versions of the discharge summary notes were retained, there were a total of 15,182 tokens (unigrams). After applying this dimensionality reduction step, we were left with 477 tokens. With this set of tokens, 3497 combinations of n-grams were generated, leading to a total of 1056 features with duplicates removed. Thus, the total number of candidate features in the training vocabulary was 1056, including 460 unigrams, 329 bigrams, and 267 trigrams.

Modeling Results
The best model configuration parameters and performance results in the hold-out test set are presented in Table 3 with 95% CIs. The corresponding confusion matrices normalized by precision and recall are presented in Figure 2. The performance discriminated by discharge outcome is presented in Table 4.
"Service" was the feature assigned the highest importance for classification of the discharge outcomes. For inpatient rehabilitation and SNIF dispositions, the coefficient values for this feature are positive, which indicates that this term will most likely appear in the preprocessed notes for both outcomes. "Home care," "healthcare home," and "home health" were assigned a positive coefficient value for home disposition. "Service healthcare home" was also assigned high importance for this outcome, suggesting that this feature is related to patients discharged home with home health care services provided. "Medicine" and "appointments specialty" were also important for this outcome. "Rehab" had positive coefficients for both inpatient rehabilitation and SNIF dispositions. "Intubate" and "ARDS" (acute respiratory distress syndrome) are important features for inpatient rehabilitation disposition. For death, "discharge" and "activity tolerate" presented negative coefficient values, indicating that these features are unlikely to appear in discharge summaries of deceased patients. "Brief assessment" and "brief" are assigned high coefficient values for this outcome. "Covid" was assigned a positive coefficient value for predicting death, while the term was given negative values for inpatient rehabilitation and SNIF.
Training performance is depicted in Figure S1 in Multimedia Appendix 2, with the curve corresponding to the R² scores for the different values of the inverse of the regularization strength. The top 15 features and their relative importance obtained with LASSO regularization are presented for each outcome (Figure S2 in Multimedia Appendix 2). Blue bars correspond to features with positive coefficient values and red bars to features with negative coefficient values. The areas under the ROC and precision-recall curves for the best model are also presented (Figure S3 in Multimedia Appendix 2). We also assessed how the model performance and the features selected as the most important in the train set varied with the size of the train set (Figure S4 in Multimedia Appendix 2). The hold-out test set for model evaluation was fixed and the train set size was varied from 10% to 100% of the original train set (1215 patients). We observed that the best performance was achieved with a higher number of patients in the train set (ie, the original train set of 100%). However, with 50% versus 100% of the original train set, the model achieved good performance for 1018 (versus 1056) vocabulary features (AUROC 0.97 versus 0.98 and AP 0.79 versus 0.81, respectively). We assessed the common features between each train set and the original train set (Figure S5 in Multimedia Appendix 2). Among the top 30 features, there were 10 common features between the 50% and the original train sets. A higher number of common features was found for the train set with 90% of the original train set, with a total of 17 common features. Finally, we observed that more than half of the features in the top 30 from the original train set were selected as top 30 in at least two train sets (Figure S6 in Multimedia Appendix 2).
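The comparison of top-ranked features across training sets can be sketched as a ranking by absolute coefficient followed by a set intersection. This is an illustration under the assumption that importance is measured by the magnitude of the fitted coefficient; the feature names and coefficient values below are placeholders, not results from the study.

```python
def top_features(coefs, k):
    """Names of the k features with the largest absolute coefficient."""
    return sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)[:k]


def common_top_features(coefs_a, coefs_b, k=30):
    """Features ranked in the top k by both models."""
    return set(top_features(coefs_a, k)) & set(top_features(coefs_b, k))


# Placeholder coefficients for two hypothetical models (toy values only).
coefs_full = {"feature_a": 0.9, "feature_b": -0.7, "feature_c": 0.2, "feature_d": -0.05}
coefs_half = {"feature_a": 0.8, "feature_c": -0.6, "feature_b": 0.1, "feature_d": 0.3}
print(top_features(coefs_full, 2))
print(common_top_features(coefs_full, coefs_half, k=2))
```

Ranking by absolute value treats strongly negative coefficients as equally informative as strongly positive ones, which matches the reading of positive and negative bars as indicating terms that are likely or unlikely to appear for a given outcome.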

Principal Findings
In this study, a machine learning-based NLP pipeline was developed to classify the discharge disposition of adult patients hospitalized with COVID-19. The model achieved near-perfect identification of patients with outcomes of home disposition or death. For the intermediate outcomes of inpatient rehabilitation or SNIF, performance was imperfect but also acceptable. Due to this classification task being relatively easy, more complex and time-consuming modeling approaches, such as recurrent neural networks or bidirectional encoder representations from transformers, were not considered. We acknowledge that for harder tasks, these approaches can improve performance. The final method is automated, thus enabling large-scale rapid processing of thousands of discharge summaries, a task that is infeasible when relying on manual chart review.

Limitations
The present analysis was limited to a cohort of patients with COVID-19, who may have specific medical symptoms related to the disease. Therefore, as future work, it is proposed to extend the model to other cohorts. Further, although results spanned 2 hospitals, they are located in the same geographic region (Boston, Massachusetts). Thus, our cohort may not be representative of other US and non-US populations. Moreover, decision making for discharge disposition may vary for different hospitals, according to the number of SNIFs or rehabilitation centers in the geographic area, which may affect the generalizability of the model. The models were developed with textual information from discharge summaries, while the addition of other clinical features (eg, physical or occupational therapy reports, social work or case manager notes) was not considered, which is a limitation of the study and can be pursued in future work.

Comparison With Prior Work
Extraction of information from clinical narratives is a growing application of NLP in health care. NLP has been used to extract information from hospital discharge notes about medical conditions such as postsurgical sepsis [21], pneumonia [22], or other potential medical problems [23], as well as to identify critical illness [24,25], detect adverse events [26], predict risk of rehospitalization [27], extract medication information [28], and risk stratify patients [29]. To the best of our knowledge, ours is the first work on classifying hospital discharge disposition based on discharge summary notes using machine learning and NLP.

Conclusions
This study shows that a supervised learning-based NLP approach can be used to accurately classify the discharge disposition of hospitalized patients with COVID-19 in an automated fashion. This model, and the NLP approach used to develop it, have the potential to accelerate and increase the scale of research that is possible with EHR data.

Figure 2. Confusion matrices for the best model evaluated in the hold-out test set normalized (A) by recall and (B) by precision. SNIF: skilled nursing inpatient facility.

Figure 1. Methodology steps for discharge summary notes preprocessing and modeling. The list of extraction fields is depicted in Table 1.

Table 1. Information captured from discharge summaries, grouped in fields, and respective search tokens used in the regular expression.

Table 2. Baseline characteristics of the study patient population stratified by train and test sets.

Table 4. Model performance in the hold-out test set by discharge outcome.