This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Automated medical history–taking systems that generate differential diagnosis lists have been suggested to contribute to improved diagnostic accuracy. However, the effect of these systems on diagnostic errors in clinical practice remains unknown.
This study aimed to assess the incidence of diagnostic errors in an outpatient department, where an artificial intelligence (AI)–driven automated medical history–taking system that generates differential diagnosis lists was implemented in clinical practice.
We conducted a retrospective observational study using data from a community hospital in Japan. We included patients aged 20 years and older who used an AI-driven, automated medical history–taking system that generates differential diagnosis lists in the outpatient department of internal medicine, for whom the index visit was between July 1, 2019, and June 30, 2020, followed by unplanned hospitalization within 14 days. The primary endpoint was the incidence of diagnostic errors, which were detected using the Revised Safer Dx Instrument by at least two independent reviewers. To evaluate the effect of the AI system's differential diagnosis lists on the incidence of diagnostic errors, we compared this incidence between the group of cases whose final diagnosis appeared in the AI-generated differential diagnosis list and the group whose final diagnosis did not; the Fisher exact test was used for this comparison. For cases with confirmed diagnostic errors, further review was conducted to identify the contributing factors of these errors via discussion among three reviewers, using the Safer Dx Process Breakdown Supplement as a reference.
A total of 146 patients were analyzed. A final diagnosis was confirmed for 138 patients, and it was included in the differential diagnosis list from the AI system for 69 patients. Diagnostic errors occurred in 16 out of 146 patients (11.0%, 95% CI 6.4%-17.2%). Although the difference was not statistically significant, the incidence of diagnostic errors was lower in cases where the final diagnosis was included in the differential diagnosis list from the AI system than in cases where it was not (7.2% vs 15.9%,
The incidence of diagnostic errors among patients in the outpatient department of internal medicine who used an automated medical history–taking system that generates differential diagnosis lists seemed to be lower than the previously reported incidence of diagnostic errors. This result suggests that the implementation of an automated medical history–taking system that generates differential diagnosis lists could be beneficial for diagnostic safety in the outpatient department of internal medicine.
Diagnostic error, defined as the failure to establish an accurate and timely explanation of the patient’s health problem or to communicate that explanation to the patient [
Diagnostic error–related paid malpractice claims occur more often among outpatients than among inpatients [
From this perspective, newly developed technologies, such as computerized automated history-taking systems and diagnostic decision support systems, can be leveraged to address this issue; these systems have a long history, having been introduced in the 1960s and 1970s [
However, these automated systems have raised concerns about negative effects on physicians' diagnostic accuracy. For instance, physicians may reject correct diagnoses or accept incorrect diagnoses generated by the systems [
We conducted a retrospective observational study using data from Nagano Chuo Hospital in Japan. The Research Ethics Committee of Nagano Chuo Hospital approved this study (serial number: NCR202104). The requirement to obtain written informed consent from patients was waived by the Research Ethics Committee under the condition that we used an opt-out method. We informed patients by showing the detailed information of the study on the official website of Nagano Chuo Hospital.
We included patients aged 20 years and older who used AI Monshin—an AI-based automated medical history–taking system—in the outpatient department of internal medicine for whom the index visit was between July 1, 2019, and June 30, 2020, followed by unplanned hospitalization within 14 days. A follow-up duration of 14 days was selected to improve the sensitivity to detect diagnostic errors [
The details of AI Monshin were presented in a previous report [
To identify whether diagnostic errors occurred in this study, we used the Revised Safer Dx Instrument [
Diagnostic errors were identified through the algorithm described in this section. In the first step, two reviewers (YH and SS) independently evaluated the diagnostic process of the included cases using the Revised Safer Dx Instrument by reviewing the medical records. The presence or absence of a diagnostic error in each case was judged based on the score of item 13 [
The final diagnoses of all cases were confirmed by two reviewers (YH and SS) based on the discharge summary. Disagreements were resolved by discussion among the three reviewers (YH, SS, and YN). Based on the confirmed final diagnoses, the other two reviewers (RK and SK), who were blinded to the evaluation of diagnostic errors, independently judged whether the final diagnosis of each case was included in the list of 10 differential diagnoses generated by AI Monshin. Disagreements were resolved by discussion between the two reviewers (RK and SK).
For cases with confirmed diagnostic errors, further review was conducted to identify the contributing factors of these errors via discussion among the three reviewers (YH, SS, and YN). The Safer Dx Process Breakdown Supplement was used as a reference to classify the contributing factors of diagnostic errors and outcomes in this study [
From the medical records, we extracted data on the age and sex of patients, chief complaints, and the experience of physicians who saw patients at the index visits (ie, resident: up to 5 years of experience after graduation; staff: more than 5 years of experience after graduation). The primary outcome was the incidence of diagnostic errors.
We calculated the required sample size to be 139 cases, assuming an incidence of diagnostic errors of 10.0% and a margin of error of 5.0%. We estimated that approximately 150 patients were eligible for this study between July 1, 2019, and June 30, 2020. Even allowing for the exclusion of approximately 5 to 10 cases, 150 cases was therefore a reasonable target for this study.
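The figure of 139 follows from the standard sample size formula for estimating a single proportion, n = z²p(1−p)/d²; a minimal sketch, assuming a two-sided 95% confidence level (not stated explicitly in the text):

```python
import math

# Sample size to estimate a proportion p with margin of error d:
# n = z^2 * p * (1 - p) / d^2, rounded up to the next whole case.
z = 1.96   # two-sided 95% confidence level (assumed)
p = 0.10   # anticipated incidence of diagnostic errors
d = 0.05   # desired margin of error
n = math.ceil(z ** 2 * p * (1 - p) / d ** 2)
print(n)  # 139
```

With these inputs the formula gives 138.3, which rounds up to the 139 cases stated above.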
Continuous data are presented as medians with the 25th and 75th percentiles. Categorical data are presented as counts and proportions (%). For the primary outcome, we calculated the incidence of diagnostic errors with 95% CI. To evaluate the baseline factors and the differential diagnosis list of AI Monshin with regard to the incidence of diagnostic errors, we compared the incidence of diagnostic errors between the groups of older adults (aged ≥65 years) and non–older adults (aged <65 years) [
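The exact method behind the reported 95% CI is not stated in this excerpt; assuming a Clopper-Pearson (exact binomial) interval, which reproduces the reported 6.4%-17.2% for 16 diagnostic errors among 146 patients, a pure-Python sketch:

```python
import math

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    # Exact (Clopper-Pearson) CI for a binomial proportion, found by bisection.
    def bisect(f):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid):
                lo = mid
            else:
                hi = mid
        return lo
    # Lower bound: largest p with P(X >= k | p) <= alpha/2
    lower = 0.0 if k == 0 else bisect(lambda q: binom_cdf(k - 1, n, q) > 1 - alpha / 2)
    # Upper bound: largest p with P(X <= k | p) > alpha/2
    upper = 1.0 if k == n else bisect(lambda q: binom_cdf(k, n, q) > alpha / 2)
    return lower, upper

# 16 diagnostic errors among 146 analyzed patients
low, high = clopper_pearson(16, 146)  # ≈ (0.064, 0.172), matching the reported CI
```

The bisection exploits the fact that both binomial tail probabilities are monotone in p, so no special functions are needed.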
A total of 150 patients were unexpectedly hospitalized within 14 days after an index visit to the outpatient department of internal medicine at which AI Monshin was used. Only 2 (1.3%) patients did not complete history-taking with AI Monshin: a woman in her 70s who complained of an uncomfortable feeling on her tongue, abdominal pain with distention, and appetite loss, and a man in his 70s who complained that his cold was not getting better. After excluding 4 (2.7%) cases in which AI Monshin did not generate 10 differential diagnoses (2 cases: incomplete history-taking; 2 cases: patients presenting for further investigation of abnormal test results), data from 146 cases were analyzed in this study. The median age of the patients was 71 (IQR 59-82) years, 72 (49.3%) were male, 71 (48.6%) were seen by residents at the index visit, and 103 (70.5%) were admitted to the hospital on the same day as the index visit.
The top three most common chief complaints were abdominal pain (37/146, 25.3%), fever (20/146, 13.7%), and melena or hematochezia (15/146, 10.3%). During follow-up outpatient visits or admission, the final diagnosis was confirmed for 138 patients (94.5%). The most common diagnosis was lower respiratory tract infection (15/138, 10.9%), followed by ischemic colitis (8/138, 5.8%), diverticular bleeding (8/138, 5.8%), and congestive heart failure (8/138, 5.8%). The final diagnosis was included in the differential diagnosis list from AI Monshin for 69 out of 138 patients (50.0%).
Flow of reviews for confirming diagnostic errors. AI: artificial intelligence.
The incidence of diagnostic errors was significantly higher in patients aged 65 years and older compared to those under 65 years of age (15/96, 16% vs 1/50, 2%; OR 9.1, 95% CI 1.2-70.8;
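The odds ratio and the Fisher exact test used for these group comparisons can be reproduced without statistical libraries. Below is a sketch using the conventional two-sided definition (summing the probabilities of all tables with the same margins whose hypergeometric probability does not exceed that of the observed table), applied to the age comparison above:

```python
import math

def fisher_exact_two_sided(a, b, c, d):
    # Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].
    n, r1, c1 = a + b + c + d, a + b, a + c
    def pmf(x):
        # Hypergeometric probability of x in the (1,1) cell, margins fixed
        return math.comb(c1, x) * math.comb(n - c1, r1 - x) / math.comb(n, r1)
    p_obs = pmf(a)
    support = range(max(0, r1 + c1 - n), min(r1, c1) + 1)
    # Small tolerance guards against floating-point ties with p_obs
    return sum(pmf(x) for x in support if pmf(x) <= p_obs * (1 + 1e-9))

# Diagnostic errors by age group: 15/96 (>=65 years) vs 1/50 (<65 years)
odds_ratio = (15 * 49) / (81 * 1)          # cross-product ratio, ~9.1
p_value = fisher_exact_two_sided(15, 81, 1, 49)
```

The same function applied to the on-list vs off-list comparison (5/69 vs 11/69) yields a p value above .05, consistent with the nonsignificant difference reported in the abstract.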
According to the Safer Dx Process Breakdown Supplement, the most common contributing factors for diagnostic errors in 16 cases were “problems ordering diagnostic tests for further workup” (n=13, 81%), followed by “problems with data integration and interpretation” (n=10, 63%), “problems with physical exam” (n=9, 56%), and “performed tests not interpreted correctly” (n=8, 50%;
Regarding the differential diagnosis lists for cases with diagnostic errors, AI Monshin listed the final diagnosis in 5 out of 16 cases (31%) and the initial diagnosis in 4 out of 16 cases (25%). In contrast, in cases without diagnostic errors, AI Monshin listed the final diagnosis in the differential list in 64 out of 122 cases (52.5%, excluding 8 cases where the final diagnosis was unknown). In summary, despite using AI Monshin, physicians failed to make the correct diagnosis even though it appeared in the differential diagnosis list in 5 of 69 cases (7% omission errors), whereas physicians made incorrect initial diagnoses that were themselves listed in the differential diagnosis list in 4 of 69 cases (6% commission errors). Regarding outcomes, no diagnostic error resulted in death or permanent harm. A total of 2 cases out of 16 (13%) were classified as Category C: “An error occurred that reached the patient but did not cause the patient harm.” Diagnostic errors resulted in some harm in 14 out of 16 cases (88%): 2 cases were classified as Category E, “An error occurred that may have contributed to or resulted in temporary harm to the patient and required intervention,” and 12 cases as Category F, “An error occurred that may have contributed to or resulted in temporary harm to the patient and required initial or prolonged hospitalization.” The median time between the index visit and the final diagnosis was 3 (IQR 2-6) days.
The details of 16 diagnostic error cases.
Case No.a | Age (y) | Sexb | Physician of first visit | Chief complaint | Initial diagnosis | Final diagnosis | Index visit to final diagnosis (days) | Outcome categoryc | Initial diagnosis was on listd | Final diagnosis was on listd |
1 | 95 | F | Resident | Fever | URIe | Cholangitis | 4 | F | No | No |
2 | 76 | M | Resident | Abdominal pain | GERDf | Cholecystitis | 2 | F | Yes; | No |
3 | 83 | M | Resident | Abdominal pain | Costochondritis | Pneumonia | 3 | F | No | No |
4 | 55 | M | Resident | Hematochezia | Infectious enteritis | Diverticular bleeding | 2 | F | Yes; | Yes; |
5 | 89 | F | Staff | Nausea | Unknown | Acute pyelonephritis | 3 | F | No | No |
6 | 75 | M | Staff | Cough | URI | Interstitial pneumonia | 3 | F | No | Yes; |
7 | 66 | M | Resident | Abdominal pain | Constipation | Intestinal obstruction | 6 | F | Yes; | No |
8 | 70 | F | Staff | Cough | Unknown | Heart failure | 3 | F | No | Yes; |
9 | 77 | F | Resident | Palpitation | Heart failure | Pulmonary embolism | 2 | E | Yes; | No |
10 | 82 | M | Staff | Fever | URI | Cholecystitis | 3 | F | No | No |
11 | 81 | F | Resident | Anorexia | Choledocholithiasis | Acute pyelonephritis | 2 | C | No | No |
12 | 72 | M | Staff | Headache, lightheadedness | Fatigue | Vestibular neuritis | 8 | E | No | No |
13 | 86 | M | Resident | Abdominal pain | Enteritis | Intestinal obstruction | 0g | F | No | Yes; |
14 | 78 | M | Staff | Abdominal pain | Hemorrhoid | Infectious enteritis | 9 | C | No | No |
15 | 91 | M | Staff | Fever, cough, back pain | URI | Acute pyelonephritis | 7 | F | No | Yes; |
16 | 72 | M | Resident | Dyspnea, cough, malaise | URI | Interstitial pneumonia | 11 | F | No | No |
aAll diagnoses were common. All cases had typical presentations except for case 2.
bFemale (F) or male (M).
cOutcome was classified, along with the Safer Dx Process Breakdown Supplement, as follows: Category C, “An error occurred that reached the patient but did not cause the patient harm”; Category E, “An error occurred that may have contributed to or resulted in temporary harm to the patient and required intervention”; Category F, “An error occurred that may have contributed to or resulted in temporary harm to the patient and required initial or prolonged hospitalization” [
dAI Monshin’s differential list; where a diagnosis was on the list, its rank on the list is indicated.
eURI: upper respiratory infection.
fGERD: gastroesophageal reflux disease.
gThe final diagnosis was made at the second visit, which was on the same day as the index visit.
Breakdown analysis of the contributing factors for diagnostic errors.
Contributing factors and details | Cases (N=16), n (%)
Delay in seeking care | 0 (0)
Lack of adherence to appointments | 0 (0)
Other | 0 (0)
Problems with history | 4 (25)
Problems with physical exam | 9 (56)
Problems ordering diagnostic tests for further workup | 13 (81)
Failure to review previous documentation | 4 (25)
Problems with data integration and interpretation | 10 (63)
Other | 0 (0)
Ordered test was not performed at all | 0 (0)
Ordered test was not performed correctly | 0 (0)
Performed test was not interpreted correctly | 8 (50)
Misidentification | 1 (6)
Other | 0 (0)
Problems with timely follow-up of abnormal diagnostic test results | 1 (6)
Problems with scheduling of appropriate and timely follow-up visits | 2 (13)
Problems with diagnostic specialties returning test results to clinicians | 2 (13)
Problems with clinicians reviewing test results | 0 (0)
Problems with clinicians documenting action or response to test results | 0 (0)
Problems with notifying patients of test results | 0 (0)
Problems with monitoring patients through follow-up | 0 (0)
Other | 0 (0)
Problems initiating referral | 1 (6)
Lack of appropriate actions on requested consultation | 0 (0)
Communication breakdown from consultant to referring provider | 0 (0)
Other | 0 (0)
Among 146 patients who used the AI-driven, automated history-taking system, which generated a list of the top 10 differential diagnoses, diagnostic errors occurred in 11.0% of cases. These patients' histories were collected at the index visit to the outpatient department of internal medicine, which was followed by unplanned hospitalization within 14 days. The incidence of diagnostic errors was significantly higher among older adult patients; however, the sex of the patients, the experience of the physicians, and the accuracy of the differential diagnosis list of the AI system were not statistically associated with the incidence of diagnostic errors. In all cases where diagnostic errors occurred, the final diagnoses were common diseases, as reported in a previous study conducted in primary care settings in the United States between 2006 and 2007 [
To the best of our knowledge, this is the first observational study to evaluate the effects of implementing an automated medical history–taking system with a differential diagnosis generator in routine clinical practice, using the validated Revised Safer Dx Instrument to detect diagnostic errors. However, this study had some limitations. First, it did not include patients who did not use the automated history-taking system with a differential diagnosis generator or those who were not admitted; therefore, the incidence of diagnostic errors should be interpreted with caution. Second, the exclusion of cases in which AI Monshin did not generate 10 differential diagnoses may have reduced the incidence of diagnostic errors in this study; because an inadequate or inappropriate history can itself contribute to diagnostic errors, excluding such cases may have led to an overly optimistic estimate of AI Monshin's performance. Third, because diagnostic errors were judged by retrospective chart review, some bias could not be avoided; however, as the review process was predefined and at least two reviewers independently assessed each case, we believe these biases were minimized as much as possible. Fourth, we are unsure of the effects of COVID-19 on diagnostic errors in the outpatient department. Future studies may compare the incidence of diagnostic errors between hospitals with and without an automated medical history–taking system with a diagnostic decision support function in a prospective design.
The incidence of diagnostic errors in this study was 11.0%, which was lower than that reported in previous studies (13.7% and 20.9%) that included cases similar to this study (ie, patients who were unexpectedly hospitalized within 14 days after their index visit) [
The quality of the clinical history documented by AI Monshin may be a key component of the results. There can be substantial discrepancies in the clinical history between patient reports and physician documentation [
In addition to producing high-quality documentation of the medical history, an automated medical history–taking system with a differential diagnosis generator seems to have some advantages. First, such a system can be integrated into routine diagnostic processes in clinical practice. Currently, one of the most important concerns regarding diagnostic decision support systems is their low usage rate. For example, in the case of Isabel, one of the best-known AI-driven diagnostic decision support systems, which generates a differential diagnosis list based on information entered by physicians, a previous study showed that only 7.9% of participants who were given open access to Isabel reported using it at least once a week, whereas the others never used it [
Furthermore, several limitations exist regarding the implementation of automated history-taking systems with differential diagnosis generators. First, at present, the accuracy of the differential diagnosis lists of AI systems is not high enough for physicians to trust the lists unconditionally. A previous study reported that the prevalence of the correct diagnosis in the top 10 list of differential diagnoses from diagnostic decision support systems in clinical practice settings was around 50% [
The incidence of diagnostic errors seems to be reduced by the implementation of an automated medical history–taking system with a diagnostic decision support function in the outpatient department. Although the accuracy of the differential diagnosis list from AI Monshin remains low, the negative effects of incorrect differential diagnosis lists from AI systems on the diagnostic accuracy of physicians could be counteracted by the high-quality clinical history taken by AI systems. Therefore, in total, the implementation of an automated history-taking system with diagnostic decision support may have more beneficial impacts than negative effects on diagnostic safety in the outpatient department.
Details of the histories written by AI Monshin in 16 diagnostic error cases. AI: artificial intelligence.
AI: artificial intelligence
KAKENHI: Grants-in-Aid for Scientific Research
OR: odds ratio
This work was supported by the Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research (KAKENHI) program (grant JP21K10355).
RK and YH were responsible for conceptualization of the study and for developing the study methodology. YH conducted the formal analysis, was responsible for securing resources, performed data curation, and was responsible for project administration and funding acquisition. RK, YH, SS, YN, and SK conducted the study investigation. RK was responsible for writing and preparing the original draft of the manuscript. YH and TS were responsible for reviewing and editing the manuscript. All authors have read and agreed to the published version of the manuscript.
None declared.