This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
In the context of the current refugee crisis, emergency services often have to deal with patients who have no language in common with the staff. As interpreters are not always available, especially in emergency settings, medical personnel rely on alternative solutions such as machine translation, which raises reliability and data confidentiality issues, or medical fixed-phrase translators, which sometimes lack usability. A collaboration between Geneva University Hospitals and Geneva University led to the development of BabelDr, a new type of speech-enabled fixed-phrase translator. Similar to other fixed-phrase translators (such as Medibabble or UniversalDoctor), it relies on a predefined list of pretranslated sentences, but instead of searching for sentences in this list, doctors can freely ask questions.
This study aimed to assess if a translation tool, such as BabelDr, can be used by doctors to perform diagnostic interviews under emergency conditions and to reach a correct diagnosis. In addition, we aimed to observe how doctors interact with the system using text and speech and to investigate if speech is a useful modality in this context.
We conducted a crossover study in December 2017 at Geneva University Hospitals with 12 French-speaking doctors (6 doctors working at the outpatient emergency service and 6 general practitioners who also regularly work in this service). They were asked to use the BabelDr tool to diagnose two standardized Arabic-speaking patients (one male and one female). The patients received an a priori list of symptoms for the condition they presented with and were instructed to provide a negative or noncommittal answer for all other symptoms during the diagnostic interview. The male patient was standardized for renal colic and the female patient, for cystitis. Doctors used BabelDr as the only means of communication with the patient and were asked to make their diagnosis at the end of the dialogue. The doctors also completed a satisfaction questionnaire.
All doctors were able to reach the correct diagnosis based on the information collected using BabelDr. They all agreed that the system helped them reach a conclusion, even if one-half felt constrained by the tool and some considered that they could not ask enough questions to reach a diagnosis. Overall, participants used more speech than text, thus confirming that speech is an important functionality in this type of tool. There was a negative association (
In emergency settings, when no interpreter is available, speech-enabled fixed-phrase translators can be a good alternative to reliably collect information from the patient.
In the context of the current refugee crisis, emergency services are increasingly confronted with patients who have no language in common with staff and may not share the same culture. For example, at Geneva University Hospitals (HUG), 52% of patients are foreigners and 10% speak no French at all. In 2017, the 10 languages for which interpretation services were the most solicited were Tigrinya, Tamil, Albanian, Farsi, Spanish, Somali, Syrian, Dari, Portuguese, and Arabic (North Africa). Taken together, these languages represent 75% of the interpreting hours at HUG (Geneva University Hospitals, personal communication, 2017).
This language barrier situation is known to pose many safety and ethical problems: It is responsible for increased risks for patients [
Different solutions are available for use in emergency settings to address these language barriers, but they all have their drawbacks. Phone-based interpreter services, which are the most common solution, are generally considered adequate, but they are expensive (3 Swiss francs/minute with AOZ Medios, a national interpreting service mandated by the Swiss Federal Office of Public Health), not always available for some languages, and less satisfactory than face-to-face interaction with a physically present interpreter [
For these reasons, we have developed a new type of speech-enabled fixed-phrase translation tool for medical dialogue (BabelDr [
This study is the first step in this direction. It aims to determine whether this type of restricted translation tool can be used by doctors to perform a diagnostic interview and reach a correct diagnosis and to quantify if speech adds value to fixed-phrase translators. Although different evaluations of medical devices have been conducted [
The BabelDr app can be characterized as a “phraselator” [
For speech recognition and matching, the system combines rule-based and robust methods, derived from the rules. When the doctor speaks, the system first recognizes what is said using both a grammar-based version of “Nuance” and a specialized statistical version (Nuance Communications Inc, Burlington, MA). It then maps the recognition results to the closest core sentence using both rules and robust matching techniques borrowed from information retrieval, described in detail elsewhere [
After translating a sentence to the patient (
Screenshot of the BabelDr app.
Example of the interface after the doctor translated “Avez-vous de la fièvre?” (Do you have fever?).
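The mapping from a recognition result to the closest core sentence can be illustrated with a minimal sketch. It is not the BabelDr matcher, which combines grammar rules with robust matching described elsewhere; it assumes a plain TF-IDF bag-of-words representation with cosine similarity as a stand-in for "techniques borrowed from information retrieval". The function names and all core sentences except “Avez-vous de la fièvre?” are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical core sentences; only the first one appears in the article.
CORE_SENTENCES = [
    "avez-vous de la fièvre ?",
    "avez-vous mal au dos ?",
    "avez-vous des brûlures en urinant ?",
    "la douleur est-elle constante ?",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters (accents included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def idf_weights(sentences):
    """Smoothed inverse document frequency over the core sentences."""
    n = len(sentences)
    doc_freq = Counter()
    for sentence in sentences:
        doc_freq.update(set(tokenize(sentence)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def vectorize(text, idf):
    """TF-IDF vector; words absent from the core sentences are ignored."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def closest_core_sentence(utterance, sentences=CORE_SENTENCES):
    """Return (similarity, core sentence) for the best-matching sentence."""
    idf = idf_weights(sentences)
    query = vectorize(utterance, idf)
    return max((cosine(query, vectorize(s, idf)), s) for s in sentences)


score, sentence = closest_core_sentence("est-ce que vous avez de la fièvre")
print(sentence, round(score, 2))
```

In a real tool, a low similarity score would presumably be treated as "no match" rather than translated; the threshold for that decision is not given in the source.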
This study aims (1) to determine whether a restricted translation tool like BabelDr can be used by doctors to perform a diagnostic interview and reach a correct diagnosis and (2) to quantify how doctors use text versus speech interactions in order to investigate if speech adds value to fixed-phrase translators. Our hypotheses were that this type of tool would demonstrate good functional suitability (doctors can collect all the information necessary to reach a diagnosis in an efficient way) and usability (doctors will use more speech to interact than text, as speaking should allow them to communicate more naturally, like when working with interpreters).
The study was conducted at the HUG research laboratory in December 2017. In this crossover trial, 12 French-speaking doctors were asked to use BabelDr to diagnose two standardized Arabic-speaking patients (one male and one female) whose main complaint was lower back pain. The male patient was standardized for renal colic and the female patient, for cystitis. These two diagnoses are among the 10 most frequent at HUG (Geneva University Hospitals, personal communication, 2018). Each of the 12 doctors carried out a diagnostic interview with both patients; half of the doctors began with the male patient and the other half with the female patient.
Before the diagnostic interviews, doctors were informed about the main patient complaint (pain in the lower back). At the beginning of the session, they received a short introduction to BabelDr and tested a few interactions. It was strongly suggested that they use complete sentences and ask yes/no questions, so that the patients could answer nonverbally.
Doctors only had access to the BabelDr tool. The diagnostic domain was set to “lower back pain” to match the patient complaint. In the context of this study, the other domains were not made available in order to simplify system usage. It was ascertained beforehand that all available questions potentially relevant to the patient complaint were included in this domain. The language pair was French to Arabic; the male or female patient was chosen depending on the case.
During the sessions, the doctors wrote down the information they were able to collect based on the patient’s responses. At the end of each session, the doctors wrote down their diagnoses. These data allowed us to answer the first question on whether the system enables doctors to reach a correct diagnosis.
All interactions with the system were logged. For each session, we collected audio recordings of each spoken interaction with the system as well as the corresponding recognition results. We also logged which recognition results or text examples the doctors chose to translate for the patients. Finally, the duration of each session was measured. These data were analyzed to provide a quantitative answer to our second research question, namely, whether speech interaction is useful in this type of tool.
At the end of each session, participants completed a satisfaction questionnaire that included a total of 23 questions. The questions were derived from the System Usability Scale questionnaire by Brooke [
Study participants were 12 French-speaking doctors: 6 from the emergency service at HUG and 6 general practitioners who also regularly work in this service. All work in French, but three were not native speakers (#6, #11, #12). Only one doctor (#6) had previously used a former version of BabelDr in another study [
Of the two Arabic-speaking standardized patients, one was a man from Syria and the other was a woman from Jordan. Both were refugees recruited from among master’s degree students at the Faculty of Translation and Interpreting. They had a high level of literacy, but no specific medical knowledge. Neither of the patients spoke French. One week before the experiment, both patients received an a priori list of symptoms for the condition they were to present, expressed in layman’s terms. They were instructed to provide a negative or noncommittal answer to questions relating to other symptoms during the diagnostic interview.
All participants received remuneration for their participation in the study.
The institutional ethics committee approved the study protocol (Req-2017-00996). Participation in the study was voluntary, with written agreement obtained from all doctors and patients. All data were anonymous and stored on a secure University of Geneva server.
Doctors were able to reach a correct diagnosis in all 24 sessions based on the information collected using BabelDr. For the renal colic scenario, four doctors proposed multiple related diagnoses (
Diagnoses made by the 12 doctors.
Doctor no. | Female patient (cystitis): diagnosis | Female patient (cystitis): other diagnoses | Male patient (renal colic): diagnosis | Male patient (renal colic): other diagnoses |
1 | Cystitis | No | Renal colic | Pyelonephritis |
2 | Cystitis | No | Renal colic | No |
3 | Cystitis | No | Renal colic | No |
4 | Cystitis | No | Renal colic | No |
5 | Cystitis | No | Renal colic | No |
6 | Cystitis | No | Renal colic | Lumbosciatica |
7 | Cystitis | No | Renal colic | No |
8 | Cystitis | No | Renal colic | No |
9 | Cystitis | No | Renal colic | Pyelonephritis, lumbosciatica |
10 | Cystitis | No | Renal colic | No |
11 | Cystitis | No | Renal colic | No |
12 | Cystitis | No | Renal colic | Pyelonephritis, lumbosciatica, appendicitis |
For each doctor, we measured the time to complete the dialogue, the number of speech interactions, the number of speech interactions resulting in a translation for the patient, and the number of text items directly translated from the list of sentences.
Time and number of interactions for both scenarios.
Variable | Female patient with cystitis, median (range) | Male patient with renal colic, median (range) |
Time to diagnosis (min:seconds) | 13:37 (4:09-35:37) | 16:37 (4:35-23:35) |
Speech interactions (n) | 28.5 (17-46) | 36 (20-66) |
Speech translated (n) | 19.5 (8-23) | 26.5 (13-51) |
Text translated (n) | 4.5 (0-36) | 10 (0-23) |
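The percentage of translated speech used in the analysis can be derived from these counts. The sketch below assumes that it denotes the share of speech interactions that resulted in a translation for the patient; this reading is inferred from the measures listed above and is not stated as a formula in the text. The function name and the session counts are hypothetical and are not taken from the study data.

```python
def percent_translated_speech(speech_interactions, speech_translated):
    """Share of spoken interactions that led to a translation, in percent."""
    if speech_interactions == 0:
        return 0.0  # no speech was used, so nothing could be translated
    return 100.0 * speech_translated / speech_interactions


# Hypothetical counts for one session (illustration only).
session = {"speech_interactions": 30, "speech_translated": 21, "text_translated": 6}

pct = percent_translated_speech(
    session["speech_interactions"], session["speech_translated"]
)
print(f"{pct:.1f}% of speech interactions were translated")
```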
The association between the percentage of translated speech and the number of translated texts was investigated using a linear regression model. Since each medical practitioner assessed two patients, data were clustered. Therefore, a regression model with mixed effects was used: A random effect was set on the intercept to account for between-practitioner variability. In addition, a multivariable analysis was conducted to adjust for the session and the scenario.
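One way to write such a random-intercept model is sketched below. The notation is an assumption, since the article gives no formula: $y_{ij}$ is taken to be the percentage of translated speech for doctor $i$ in session $j$, $x_{ij}$ the number of translated text items, and the session and scenario terms are the adjustment covariates of the multivariable analysis.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij}
       + \beta_2\,\mathrm{session}_{ij}
       + \beta_3\,\mathrm{scenario}_{ij}
       + u_i + \varepsilon_{ij},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),
\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $u_i$ is the random intercept that captures between-practitioner variability, so the two sessions of the same doctor are not treated as independent observations.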
Interactions by participant for the scenario with the female patient.
Interactions by participant for the scenario with the male patient.
The percentage of translated speech was negatively associated with the number of translated texts (
The percentage of translated speech was higher in the second session than in the first session (difference=4.3%; 95% CI 1.1-7.4;
Analyses by scenario showed that the proportion of translated speech was lower in the renal colic scenario than in the cystitis scenario (difference=–4.3%; 95% CI –7.6 to –1.1;
Association between the percentage of translated speech and the number of translated texts (A) and between French native speakers and the percentage of translated speech (B), system confidence score (C), and speech interaction (D). Circles represent each individual doctor's data; the black line represents the unadjusted regression line and black squares represent the mean values.
Examples of transcriptions and mapped core sentences.
Speech utterances | Core sentences | |
Associations between French native speakers and the percentage of translated speech, system confidence, and speech interaction were also investigated using a linear regression model with fixed effects. No association was found between French native speakers and the percentage of translated speech (
Results of the satisfaction questionnaire completed after the dialogue with the female patient. The numbers in circles represent the number of doctors.
Results of the satisfaction questionnaire completed after the dialogue with the male patient. The numbers in circles represent the number of doctors.
All participants were able to pose their questions to the patients and reach the correct diagnosis based on the information collected using BabelDr. However, although they believed that the system helped them to reach a conclusion, some felt constrained by the tool and considered that they could not ask enough questions to reach a diagnosis. Speech was the preferred modality, although all doctors also translated items from the text list, which shows that both modalities are useful. The use of text was statistically associated with the percentage of successful speech interactions and with the session (first use vs second use). Speech therefore seems to help in using the system, as participants can express themselves freely and see the most closely related core sentences.
Other studies have analyzed user satisfaction (of both patients and medical staff) [
A preliminary version of the tool was used in the study. The system coverage, that is, the questions available to the doctors, is being continually improved based on the collected data. It is possible that the perception of constraint reported by the users was at least partially caused by insufficient coverage for the scenarios selected for this study, rather than by the system itself.
For the cystitis scenario, doctors would have benefited from being able to switch to another domain (abdominal pain), which was not accessible for this study. In addition, the doctors were informed beforehand of the patient’s chief complaint. This matches the usual practice at HUG, where this information is collected from patients during admission, but another study without prior information would ascertain whether the subdivision into domains, as done in BabelDr, meets doctors’ requirements.
The two standardized patients had a higher education level and no difficulty understanding the Arabic translations provided by the system. In the case of less literate patients, misunderstandings might cause incorrect patient responses and thus lead to incorrect diagnoses. Although the BabelDr translations are aimed at simplicity, a study of the translation quality and accessibility is currently in progress to ascertain whether the translations are suited to patients of different ages, education levels, and cultural and geographic origins.
Because the patient narratives were rehearsed and based on the given lists of symptoms, rather than on the potentially vague or contradictory accounts of a real patient, it can be argued that diagnostic success would be lower with real patients. However, we suspect that the system’s restriction to yes/no questions might actually improve clarity by enforcing precise questions and unambiguous patient responses.
During this experiment, we observed very few user errors, such as doctors forgetting to shut off the microphone or asking questions that could not be answered nonverbally. Anecdotally, we have observed more such errors in real-use cases with real patients. It is possible that in the artificial setting of this study, doctors were more attentive to the system than they would be with a real patient, where the focus would be more on the patient; the proportion of successful interactions might therefore be lower in real use.
The number of dialogues per doctor (n=2) in this study was insufficient to measure a quantifiable learning effect, but a study is currently in progress at HUG, where BabelDr is used in real settings and the collected data will allow us to study its learnability.
Our results show that speech and text interaction are complementary in a tool such as BabelDr. Future developments of the system include an improved text-search module providing more flexibility than the current keyword search.
Development of a bidirectional version of the system is ongoing. In this new version, patients will have an interface where they are presented with a range of responses (eg, numeric values, colors, and pictograms). This will allow us to extend the questions available to the doctors by including open questions and will possibly reduce doctors’ feelings of being constrained by the system.
This study showed that a phraselator can be an alternative to machine translation and traditional fixed-phrase translators to reliably collect information from the patient in situations where no interpreter is available. Although doctors felt constrained by the system, they were able to confidently reach a diagnosis, and all believed they could use this type of system in everyday medical practice. The study also demonstrated the relevance of task-based evaluation for assessing the usefulness and usability of translation tools for the diagnosis task and confirmed the importance of reliability in this type of oral context. Doctors clearly appreciated the way in which speech recognition results were presented in the form of a back translation to French, which provided the exact meaning of the translation produced for the patient. Future studies with BabelDr will need to confirm these conclusions in real-life settings and investigate the proportion of cases that can be reliably diagnosed with such a tool.
HUG: Geneva University Hospitals
The study was funded by the Fondation privée des Hôpitaux Universitaires de Genève. We thank Nuance Communications, Inc, for making their software available free of charge for research purposes. We also thank Jessica Rochat, Rosemary Sudan, and Emmanuel Rayner for their contributions to this study.
None declared.