This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
The amount of data arriving in physicians’ offices is increasing, making it difficult to process information efficiently and accurately to maximize positive patient outcomes. Current manual processes of screening for individual terms within long free-text documents are tedious and error-prone. This paper explores the use of statistical methods and computer systems to assist clinical data management.
The objective of this study was to verify and validate the use of a naive Bayesian classifier as a means of properly prioritizing important clinical data, specifically that of free-text radiology reports.
One hundred reports were first used to train the algorithm, based on physicians’ categorization of clinical reports as high-priority or low-priority. The algorithm was then used to evaluate 354 reports. Additional preparation procedures such as section extraction, text preprocessing, and negation detection were performed.
The algorithm evaluated the 354 reports and discriminated between high-priority and low-priority reports, resulting in a bimodal probability distribution. In all scenarios tested, the false negative rates were below 1.1% and the recall rates ranged from 95.65% to 98.91%. In the case of 50% prior probability and 80% threshold probability, the accuracy of this Bayesian classifier was 93.50%, with a positive predictive value (precision) of 80.53%. It also showed a sensitivity (recall) of 98.91% and an F-measure of 88.78%.
The results showed that the algorithm could be trained to detect abnormal radiology results by accurately screening clinical reports. Such a technique can potentially be used to enable automatic flagging of critical results. In addition to accuracy, the algorithm was able to minimize false negatives, which is important for clinical applications. We conclude that a Bayesian statistical classifier, by flagging reports with abnormal findings, can assist a physician in reviewing radiology reports more efficiently. This higher level of prioritization allows physicians to address important radiologic findings in a timelier manner and may also aid in minimizing errors of omission.
In today’s environment with electronic medical records (EMR) gaining prevalence in hospitals, urgent care clinics, and specialist facilities, primary care physicians are receiving more clinical reports on a daily basis. These electronic systems typically generate more pages per report than in the past. A Brigham and Women’s Hospital study reports that full-time primary care physicians on average review 930 pieces of chemistry/hematology data and 60 pathology/radiology reports in a given week [
In fact, failure to review and follow up on an outpatient test result compromises patient safety and raises malpractice concerns in the order of billions of dollars annually [
Currently, patient clinical data are both structured and unstructured. Structured patient data are typically a laboratory test containing discrete numerical values. An example can be a patient’s potassium result, which could have a value of “4.2”. Discrete results can easily be identified and traced by automatic systems. Urgent or critical laboratory values can be detected by performing a simple numerical comparison. A problem exists, however, with free-text reports such as radiology results. These reports have to be read by the physician and important findings need to be noted and logged for proper tracking [
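The structured-data case described above can be handled with a simple numerical comparison. The following Python sketch is purely illustrative; the thresholds are hypothetical examples, not clinical reference values:

```python
# Hypothetical critical-value thresholds for a potassium result (mmol/L);
# actual reference ranges vary by laboratory.
POTASSIUM_CRITICAL_LOW = 2.5
POTASSIUM_CRITICAL_HIGH = 6.0

def is_critical_potassium(value):
    """Flag a discrete laboratory result by simple numerical comparison,
    the kind of automatic check that works for structured data but not
    for free-text reports."""
    return value < POTASSIUM_CRITICAL_LOW or value > POTASSIUM_CRITICAL_HIGH
```

No comparable one-line check exists for a free-text radiology report, which motivates the classifier described below.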
There are many existing applications that use natural language processing (NLP) to extract patient medical information from free-text reports for purposes of medical billing or populating a patient's health record. Several studies have also demonstrated the ability of NLP to extract clinical information, such as pneumonia cases, from radiology reports [
By performing an initial screen of incoming data and flagging reports as potentially low-priority or high-priority, a classifier can aid physicians in better prioritization of his or her stack of clinical data that is to be reviewed. The intent is not to replace the manual review and signing off of each report, but rather to assist the physician by providing a level of prioritization to the stack of unordered documents awaiting review, an additional safety net. The benefits of such a system would be, at the very least, quicker notification of results to a patient and fewer missed or overlooked findings, resulting in better patient outcomes and possibly even less malpractice exposure.
A statistical approach using the Bayes theorem was developed to classify free clinical reports as low- or high-priority. A naive Bayesian classifier is a probabilistic classifier based on Bayes’ theorem that makes a strong independence assumption. In context, the classifier assumes that all features of a document are independent of one another. The presence or absence of one feature is assumed to have no effect on the presence or absence of any other feature. When classifying text, each feature is an individual word in the text.
A supervised learning approach is used to enable a naive Bayes classifier to differentiate a document into different categories. In the case of classifying clinical reports, the classifier categorizes reports as low- or high-priority. The classifier is trained using a corpus of documents that is already categorized. The corpus of documents is tokenized, and each word from the documents is assigned a probability of appearing in a high-priority report. Each word, or feature, is represented by “fi”. The probability of a report being high-priority is given by the Bayes theorem shown here in generic form,
P(H | f1, f2, f3, ..., fi) = [P(f1, f2, f3, ..., fi | H) P(H)] /
[P(H) P(f1, f2, f3, ..., fi | H) + (1 - P(H)) (1 - P(f1, f2, f3, ..., fi | H))].
The term P(H) represents the prior probability or the probability of any given document being high-priority. P(fi | H) is the probability of the feature, fi, appearing in a document given that the document in question is high-priority. The denominator is the probability of fi appearing in any given document, or P(fi). Thus, the equation can be simplified into,
P(H | f1, f2, f3, ..., fi) = [P(f1, f2, f3, ..., fi | H) P(H)] /
[P(f1, f2, f3, ..., fi)].
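To make the formula concrete, the following Python sketch combines per-word probabilities under the naive independence assumption and applies the spam-filter form of Bayes’ theorem given above. The `p_word_given_high` table and the neutral default for unseen words are illustrative assumptions; the study’s actual implementation was in C#.

```python
def prob_high_priority(words, p_word_given_high, prior=0.5, default=0.4):
    """Combine per-word probabilities P(f_i | H) under the naive
    independence assumption, then apply the spam-filter form of
    Bayes' theorem from the text. Words unseen in training receive
    a neutral default probability (an assumption of this sketch)."""
    p_f_given_h = 1.0
    for word in words:
        p_f_given_h *= p_word_given_high.get(word, default)
    return (p_f_given_h * prior) / (
        prior * p_f_given_h + (1 - prior) * (1 - p_f_given_h))
```

Note how the prior Pp shifts the result: a word with P(f | H) = 0.9 yields a posterior of 0.9 under a 50% prior, but only 0.5 under a 10% prior.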
The experiment is based on the premise that there are patterns within clinical reports that influence a physician’s determination of a report’s severity, and that these patterns can be detected by a computer based on the relative presence of certain words in documents. If true, a computer could use principles of statistics and machine learning to prioritize free-text clinical reports.
Blue Hills Medical Associates is an internal medicine practice, consisting of 2 physicians and 1 nurse practitioner, situated within the catchment area of three community hospitals, each affiliated with a separate major Massachusetts health system. The practice sees over 60 patients daily and receives over 5000 pages of clinical reports each month in the form of faxes, paper mail, and electronic results via a health level-7 (HL7) interface. These reports include consult reports, laboratory results, hospital admission and discharge reports, as well as radiology reports. The focus of this study was on the management of radiology results. There were 2 primary care physicians who reviewed the reports.
There were two sets of data that were used in this study. Both sets of data were extracted from clinical reports stored in the EMR at the practice site used for this study. The first set, the training data, was used to train the Bayesian classifier to detect physicians’ definitions of what constitutes a high-priority report. The second set of data, the test data, was a set of documents independent from the training set used to test and validate the classifier against the physicians’ own categorization.
Radiology reports usually have an Impression section, in which the interpreting radiologist summarizes and prioritizes the report’s key findings. Focusing on this section allows for easier data processing, since its contents have already been prioritized by level of importance by the radiologist. The Impression section was extracted from each report for inclusion in the corpora based on the assumption that it contains the key information that distinguishes a low-priority report from a high-priority one. In the few cases where a report did not contain an Impression section or its equivalent, the entire report body was processed. This extraction limits the amount of extraneous data.
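The extraction step might be sketched as follows. The section-header pattern here is an assumption about how the reports are formatted, not the study’s actual routine:

```python
import re

def extract_impression(report_text):
    """Return the Impression section if present; otherwise fall back to
    the whole report body, as described in the text. The header regex
    (a label such as 'IMPRESSION:' ending at the next all-caps section
    header or end of document) is an assumed report format."""
    match = re.search(r"impression:?\s*(.*?)(?:\n[A-Z][A-Z ]+:|\Z)",
                      report_text, re.IGNORECASE | re.DOTALL)
    if match:
        return match.group(1).strip()
    return report_text.strip()
```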
One hundred reports (50 from each category), generated between 2011 and 2013, were selected from the EMR as representative of the types of low- and high-priority reports seen at the study site. These were then categorized into low- and high-priority by the physicians.
Deidentified high-priority (top) and low-priority (bottom) patient reports in text format.
Three hundred fifty-four radiology and diagnostic reports, ordered by the practice and generated between 2011 and 2013, were selected randomly out of 4800 reports to test the classifier trained by the training dataset. These reports include CAT scans of the head and abdomen; MRIs of the head and neck, cervical spine, lumbar spine, and abdomen; and plain X-ray films of the chest, abdomen, and various extremities (
Distribution of the types of reports used in the test dataset.
Type of report | Reports, n/N (%) |
Mammograms | 35/354 (9.9) |
CAT scans | 36/354 (10.2) |
Plain radiology films | 71/354 (20.1) |
Ultrasounds | 70/354 (19.8) |
MRIs | 142/354 (40.1) |
Reports flagged as high-priority are those reports that require further follow-up by the primary care physician. An example is a finding of a renal cyst, which may require a 6-month follow-up ultrasound. Another example is a lung nodule, which may require a 4-month follow-up CAT scan.
The next step was text processing and feature extraction, which began with cleanup routines such as conversion of all characters to lowercase type and the removal of stop words. Stop words are words that do not have any value in determining the priority level of a document. Examples of stop words include “the”, “it”, “of”, and “a”. Removing stop words in the preprocessing step is a common practice in artificial intelligence. Doing so minimizes the overall processing load and memory requirements, and results in a narrow set of clinically relevant terms [
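A minimal sketch of this cleanup step follows. The stop-word list is illustrative; the study’s full list is not reproduced here:

```python
# Illustrative stop-word list; the study's actual list is not shown here.
STOP_WORDS = {"the", "it", "of", "a", "is", "and", "in", "to"}

def preprocess(text):
    """Lowercase the text, tokenize on whitespace, and drop stop words,
    mirroring the cleanup routines described in the text."""
    return [token for token in text.lower().split()
            if token not in STOP_WORDS]
```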
In addition, clinical reports often contain common phrases such as “otherwise normal chest” that can distinguish a high-priority report from a low-priority report. A naive Bayes classifier only extracts individual words from documents and assumes that the probability of each word being in different document categories is independent of the probabilities of other words. However, if a document contains a phrase such as “otherwise normal chest”, the individual probabilities assigned to each word in the phrase are clinically dependent on each other. Thus, common phrases were identified, and white spaces contained within these phrases in clinical reports were removed to create a single term that the classifier could recognize. A list of common phrases used in this study is provided in
System architecture.
Examples of common phrases used in the data cleaning process.
Common phrases |
“No significant abnormality is identified” |
“No mammographic change or evidence of malignancy” |
“No acute cardiopulmonary process” |
“No acute pulmonary process” |
“Within normal limits” |
“Normal abdominal ultrasound” |
“No acute intracranial process” |
“Appropriate for age” |
“Routine annual screening mammogram” |
“No acute pathology” |
“Correlation recommended” |
“Biopsy should be performed” |
“Surgical consultation is suggested” |
“Appear significantly changed” |
Common phrase and white space removal depiction.
Common phrase before white space removal | Common phrase after white space removal |
“within normal limits” | “withinnormallimits” |
“Normal abdominal ultrasound” | “Normalabdominalultrasound” |
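The phrase-merging step shown above can be sketched as follows. Case-insensitive matching via lowercasing is an assumption about the original routine:

```python
# A few of the common phrases from the table above; the full list is longer.
COMMON_PHRASES = [
    "within normal limits",
    "no acute cardiopulmonary process",
    "normal abdominal ultrasound",
]

def merge_common_phrases(text):
    """Collapse each known multi-word phrase into a single token so the
    naive Bayes classifier treats the whole phrase as one feature."""
    merged = text.lower()
    for phrase in COMMON_PHRASES:
        merged = merged.replace(phrase, phrase.replace(" ", ""))
    return merged
```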
The remaining words were stored as a “bag of words”, which is a representation of text as an unordered collection of terms that disregards word order or grammar. The naive Bayesian classifier treats each term in this “bag of words” independently from the others. The average number of total unique words in the “bag of words” per report was 684.
For the implementation of the naive Bayesian classifier, an open source, C# implementation of a spam filter algorithm [
The performance of this Bayesian classifier implementation was evaluated using precision, recall, F-measure, and accuracy, standard performance measures for classification and machine learning tasks [
Precision = TP/(TP+FP),
where TP is true positive and FP is false positive.
Recall is the proportion of actual positives that are correctly identified as such,
Recall = TP/(TP+FN),
where FN is false negative.
F-Measure is the harmonic mean of the precision and recall measures,
F-Measure = (2 × precision × recall)/(precision + recall).
Accuracy is the proportion of true positives and true negatives out of the total number of reports processed,
Accuracy = (TP + TN)/(TP + TN + FP + FN).
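All four measures can be computed directly from the confusion matrix. The following sketch reproduces the values reported in the results tables when given their counts:

```python
def classifier_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F-measure, and accuracy (as percentages,
    rounded to 2 decimals) from a confusion matrix, per the definitions
    given in the text."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {name: round(value * 100, 2)
            for name, value in [("precision", precision),
                                ("recall", recall),
                                ("f_measure", f_measure),
                                ("accuracy", accuracy)]}
```

For example, the confusion matrix from the Pp = 50%, Pth = 80% condition (TP 91, FP 22, TN 240, FN 1) yields a precision of 80.53%, recall of 98.91%, F-measure of 88.78%, and accuracy of 93.50%.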
In this study, 354 radiology reports, randomly selected from the date range of 2011 to 2013, were tested to evaluate the performance of the Bayesian classifier in detecting high-priority reports. The classifier was trained on data, preclassified by physicians, whose interrater reliability was a Cohen’s kappa value of 0.86. This training set consisted of 50 low-priority and 50 high-priority radiology reports randomly selected from the same time range. The performance of the algorithm was tested under 2 independent conditions, the prior probability of a report being high-priority, or Pp, and the probability threshold, Pth, at which a report is classified as high-priority. Tests were run for 2 possible values for each of these variables, giving a total of 4 sets of results for analysis.
The probability of each report being high-priority was determined using Bayes formula as described in the Introduction. The distribution of the probabilities of each report being high-priority is shown below for each of the prior probabilities (Pp) (
The frequency distribution of the radiology reports for each calculated probability range shows a clustering of reports at both extremes, with the majority of reports having a probability of 0 or 1, and the fewest number of reports being in the mid probability ranges from 0.2000 to 0.6999.
Distribution of reports in each probability range. The x-axis represents probability and the y-axis represents number of reports from the test set.
Precision, recall, F-measure, and accuracy values for the classifier with Pp = 10%.
 | 50% Pth | 80% Pth |
10% Pp of high-priority report | TP 89, FP 22, TN 240, FN 3 | TP 88, FP 21, TN 241, FN 4 |
Precision, % | 80.18 | 80.73 |
Recall, % | 96.74 | 95.65 |
F-measure, % | 87.66 | 87.56 |
Accuracy, % | 92.94 | 92.94 |
As can be seen, the precision, recall, F-measure, and accuracy values were similar for both probability thresholds given a Pp of 10%. The accuracy rates for both thresholds are above 90%, but in both situations there are a few false negatives.
In the situation of Pp being 50%, the precision, recall, F-measure, and accuracy values are noticeably different. The number of false negatives was the lowest at Pp of 50% for both values of Pth. In fact, there was only 1 false negative in each threshold probability scenario resulting in a 0.28% false negative rate. This is also reflected in the recall rate, which decreases as the number of false negatives increases, as seen when comparing
Precision, recall, F-measure, and accuracy values for the classifier with Pp = 50%.
 | 50% Pth | 80% Pth |
50% Pp of report being high-priority | TP 91, FP 25, TN 237, FN 1 | TP 91, FP 22, TN 240, FN 1 |
Precision, % | 78.45 | 80.53 |
Recall, % | 98.91 | 98.91 |
F-measure, % | 87.50 | 88.78 |
Accuracy, % | 92.66 | 93.50 |
The results indicate that a naive Bayesian classifier works remarkably well in classifying radiology reports as low-priority or high-priority. The recall rate varied from 95.65% to 98.91%. This signifies that the classifier succeeded in accurately detecting high-priority reports, also known as true positives, while minimizing false negatives. The rate of false negatives was very low in this study, with the number of false negatives varying from 1 to 4. As indicated earlier, a lower false negative rate is desirable in clinical contexts and the current application. The precision rate, however, was lower, varying from 78.45% to 80.73%. In other words, the classifier had a higher rate of false positives. That is, it classified a greater number of documents as being high-priority, even though they were actually low-priority.
The actual magnitude of change in performance was not too dramatic for the 2 values of Pth (50% and 80%). The reason for this is seen by observing the nearly bimodal distribution of reports (
Similar observations can be made about varying the Pp. This is the prior probability used in the Bayes formula to calculate the probability that a report is high-priority. Increasing this value increases the likelihood that a report is high-priority. Choosing a higher Pp will have the effect of potentially increasing the false positive rate in the same way as it did when the threshold, Pth, was increased. The data also demonstrated this, but again, modestly. The bimodal distribution described earlier again shows why this was the case.
This study showed a clear distinction between low-priority and high-priority reports. Why was there such a clear distinction? Low-priority reports are the normal reports. They typically have text in the Impression section such as “no evidence of fracture” or “no acute disease of the chest”. Our negation algorithm removed all negated terms. So these normal reports were presented as empty text to the Bayesian classifier. The text from a high-priority report would typically have language such as “nodule identified”, “possible developing mass”, or “small infiltrate suggesting early pneumonia”. Since these terms are not seen in a low-priority report, the classifier assigns a very high probability to reports containing these terms. Furthermore, by removing negated terms, we greatly improved the scores of the training dataset. Removing stop words had minimal impact on the document scores, but it rendered a cleaner “bag of words” to study and debug.
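A minimal sketch of negation removal in the spirit described here follows. The trigger list and the sentence-boundary heuristic are assumptions; the study’s actual negation algorithm is not reproduced:

```python
import re

# Simple rule: drop everything from a negation trigger to the next
# sentence boundary (period or semicolon). Both the trigger list and
# the boundary heuristic are illustrative assumptions.
NEGATION_PATTERN = r"\b(?:no|without|denies|negative for|free of)\b[^.;]*"

def remove_negated_terms(text):
    """Delete negated findings so that normal reports contribute little
    or no text to the classifier, as described in the text."""
    return re.sub(NEGATION_PATTERN, "", text, flags=re.IGNORECASE).strip()
```

Under this rule a normal impression such as “no evidence of fracture.” is reduced to nearly empty text, while positive findings such as “nodule identified” survive intact.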
A closer review of the reports indicates the reason for a higher number of false positives. A false positive report for a mammogram read,
...there is no mammographic evidence of malignancy; routine follow-up mammogram in 1 year is recommended; bi-rads category 1. negative according to the nci model the patients lifetime risk of developing breast cancer is 3.6%...
Although this report should have been classified as low-priority, it contained language used by the radiologist to provide general guidance to the ordering physician.
More robust negation detection should be developed as a part of any future enhancements. Additionally, use of NLP and/or common phrase detection may enhance the ability of the classifier to better distinguish if terms mentioned are part of a patient’s report findings, as in the case described above. Additional statistical methods, in addition to Bayesian statistics, could also provide a stronger classification system.
Although this study had a relatively small number of documents to test and a small number of reviewers, the strong results obtained support the performance of future studies and make intuitive sense given the nature of the primary care setting. Patients in this setting are generally healthy and tend to have normal radiology reports. The distinction between a normal and an abnormal report is usually obvious due to the presence of key words, making it easy for a classifier to detect an abnormal report and denote it as high-priority.
The results of this study highlight the promise of using statistical classifiers, such as this Bayesian implementation, in prioritizing a primary care physician’s workload across electronic systems in real-time with an ability to be trained, a marked difference from the retrospective and static analyses done by many of the prior studies in the literature. Due to the EMR-agnostic design of this classifier, it is generalizable to any EMR system or patient data interface for that matter. The real-time incoming data feed to an EMR can consist of various entry points, such as HL7, faxes, scanned documents, and Web services. This classifier might also offset the chance of radiologists not electronically coding radiology reports as normal or abnormal, as these specialists are typically required by the American College of Radiology to electronically code radiology reports as normal or abnormal when communicating with primary care physicians [
While the scope of documents was limited in this study to radiology, we believe the classifier can be adapted to other verticals within health care. Implementing it on a greater number of radiology reports and testing it on other report types, such as pathology and microbiology reports, will further test the effectiveness of the Bayesian classifier in this study.
In conclusion, a Bayesian classifier can be used, in conjunction with other available methods, to detect high-priority radiology reports and improve primary care provider efficiency in addressing these reports. This novel study showed, for the first time to our knowledge, that the Bayesian system, used on this representative sample of free text, unstructured radiology reports received in a primary care setting, displayed a high rate of success in detecting true positives. Use of this type of technology has the potential to improve patient safety, as well as minimize physician malpractice exposure.
Future work may include studying the effectiveness of this classifier in a different practice setting, such as a specialist’s office. For example, in an oncology or cardiology practice, given the nature of each specialty, a greater number of patient reports are expected to be abnormal, and yet may be classified by the specialist as low-priority. It would be interesting to see how this classifier would perform. It is possible that more advanced techniques such as NLP, in combination with a statistical classifier, would be required in order to have a satisfactory rate of high-priority detection. Furthermore, search engine capabilities could be a future extension as specific terms within reports can be identified, leading to a more connected experience for the patient [
CAT: computerized axial tomography
EMR: electronic medical records
HL7: health level-7
MRI: magnetic resonance images
NLP: natural language processing
PHI: protected health information
The authors acknowledge the Blue Hills Medical Associates facility for serving as the testing site for this study and Drs Maggio and Chakrabarti for reviewing patient reports. Internal funding supported this work.
Authors MS and AM worked for Hermes Clinical, Inc during the writing of this paper. Hermes Clinical, Inc is a health care information technology company that utilizes some of the technologies described in this paper.