This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
In the field of medicine and medical informatics, the importance of comprehensive metadata has long been recognized, and the composition of metadata has become a field of professional practice and research in its own right. To ensure that sustainable and meaningful metadata are maintained, standards and guidelines such as the FAIR (Findability, Accessibility, Interoperability, Reusability) principles have been published. The compilation and maintenance of metadata is performed by field experts supported by metadata management apps. The usability of these apps, for example, in terms of ease of use, efficiency, and error tolerance, crucially determines their benefit to those interested in the data.
This study aims to provide a metadata management app with high usability that assists scientists in compiling and using rich metadata. We aim to evaluate our recently developed interactive web app for our collaborative metadata repository (CoMetaR). This study reflects how real users perceive the app by assessing usability scores and explicit usability issues.
We evaluated the CoMetaR web app by measuring the usability of 3 modules:
A total of 12 individuals participated in the study. We found that approximately 97% (85/88) of all the tasks were completed independently and successfully. We measured usability scores of 81, 81, and 72 for the 3 evaluated modules. The qualitative analysis resulted in 24 issues with the app.
A usability score of 81 implies very good usability for the 2 modules, whereas a usability score of 72 still indicates acceptable usability for the third module. We identified 24 issues that serve as starting points for further development. Our method proved to be effective and efficient in terms of effort and outcome. It can be adapted to evaluate apps within the medical informatics field and potentially beyond.
Raw data are useless without metadata that characterize and contextualize their content. A number is meaningless without information on the parameter it describes (eg, blood pressure), and a finding is of no use without its context (eg, sepsis as a comorbidity vs sepsis as cause of death). Metadata themselves always need context (eg, the concept they describe). In many cases, metadata are merely implied by the column headers of tabular databases and the implicit knowledge of the few people working with the database. Many information scientists have researched the field of metadata, for example, Wilkinson et al [
Particularly in the context of data integration within large research networks, comprehensive metadata are essential. “Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data” [
Software-driven data integration involves multiple technical components: various
From the user perspective, these components are managed and elaborated by the following roles:
To provide a data warehouse with comprehensive and accurate data, different roles need access to different classes of information residing in the described data integration system. We identified 3 cases in which access barriers prevent users from contributing their expertise [
(1) All users need access to the listing of all data elements represented in the data warehouse. Their annotations and context information can be derived from the metadata repository and must be visualized.
(2) Data managers and, in particular, data providers need full access to the mapping rules for data harmonization. These rules are available only in a formal language, which requires an information technology background that data providers usually do not have.
(3) Data coordinators need access to the provenance information of the metadata to be able to curate it. “Especially in collaborative metadata development, a comprehensive annotation about ‘who contributed what, when and why’ is essential” [
In most cases, barrier (1) is resolved through metadata browsers [
The German Center for Lung Research (German: Deutsches Zentrum für Lungenforschung [DZL]) implemented the collaborative metadata repository (CoMetaR), applying principles of collaborative metadata development and FAIR metadata warehousing [
This study evaluates the usability of 3 modules built for common tasks in the field of data integration and metadata maintenance.
The usability evaluation performed was a combination of (1) the think aloud method and (2) usability questionnaires. By combining both methods, we wanted to measure both observable and perceived usability. The execution consisted of two phases: (1) a screen sharing–supported training specific to the respective user’s roles and (2) solving of the given tasks by the participant with subsequent retrospection, including the completion of a usability questionnaire. All evaluations were performed by the same experimenter.
This method is commonly applied to the usability evaluations of web interfaces [
We decided not to record the participants but to make notes on their expressions as well as their app use behavior. These notes focused on usability, functional, and methodological issues. The advantage of this approach is a more comfortable setting for the user on the one hand and less effort for the experimenter on the other hand. The downside is the potential information loss because the experimenter already filters information.
As our interpretation model, we used the 7 categories described in ISO 9241-110 [
We used the System Usability Scale invented by Brooke in 1996 as a measurement tool for the usability of the app. This scale was introduced as a
The CoMetaR web app is divided into a concept tree navigation area and a module area. Modules can be selected in the module menu in the top-right corner, as shown in
The core module functionality of the CoMetaR web app (
As our metadata grow and develop over time with many participants involved, we decided to provide the provenance module, which enables users to track all changes. These changes may be additions, moves, or removals of concepts in the concept tree, but also modifications of their annotations. When selecting the provenance module (
Screenshot of the collaborative metadata repository (CoMetaR) web app core module. Left side: concept tree. Right side: module content (concept details). Top-right corner: module navigation. Top-left corner: home button, search panel, and help panel.
Screenshot of the collaborative metadata repository (CoMetaR) web app provenance module. Left side: concept tree with colorized annotations for added, moved, or removed and modified items. Light yellow box: information box for the item ATC Catalog on mouse-over. Right side: module content (upload history visualization). ATC: Anatomical Therapeutic Chemical.
Our data integration process is supported by the data integration module. The integration process for a single data source is divided into 4 parts: (1) the export of data from the source system, (2) the preparation of data for the integration software, (3) the configuration of the integration software, and (4) its execution. As the configuration file is written in a formal language to be interpreted by software, it is not accessible to people who lack the required technical background. To verify the configurations, the respective data providers must be able to access the formulated rules. For this task, they can upload the configuration file to the data integration module (
Screenshot of the collaborative metadata repository (CoMetaR) web app data integration module. Left side: concept tree. Light yellow boxes: corresponding mapping rules. Right side: module content (configuration file upload).
CoMetaR was designed to support data integration tasks. At the German Center for Lung Research, we have been practicing data integration since 2016 and have identified information that is of high interest to data integration experts. For example, to match and map elements of the source data to the integrated data, the person formulating the rules needs to know which elements are part of the integrated metadata, what their exact characteristics are (method of measurement, scale, classification, etc), and how they are uniquely identified. If these characteristics change, the mapping rules must be adjusted. For various processes, people often want the metadata to be available in Microsoft Excel format, which creates a need for corresponding export capabilities. For these and further scenarios, we defined 10 tasks to verify CoMetaR’s suitability in the field of lung research. The tasks were composed by 2 experts who have been internationally active in the field of data integration for >5 years. The composition process included brainstorming, discussion, and finally consensus. To assign modules to each participant, we considered their user roles as well as their everyday tasks. All users had to solve the core module tasks, all data coordinators had to solve the provenance module tasks, and all data managers who upload data had to solve the data integration module tasks.
The first 4 tasks aim at the use of the core module. They test the ability to search for and find specific thesaurus elements and their annotations as well as the capability to export data:
1. Indicate which of the parameters
2. Indicate code, datatype, and unit of the spirometry parameter Forced Expiratory Volume in 1 Second (
3. Regarding the last change of the concept
4. Describe in detail which individual steps you would take to print the subtree of
The following 2 tasks aim at the use of the provenance module. They test the ability to track changes within the thesaurus:
5. Indicate which concepts have been added, moved, or removed in the last month.
6. Pick one concept for which annotations have been changed in the last upload. Indicate who performed this change on which date.
The last 4 tasks aim at the use of the data integration module. They test the ability to verify individual upload client configurations:
7. Examine the configuration for falsely mapped concepts.
8. Examine the configuration for properly mapped concepts.
9. Examine the metadata for concepts that are not mapped in the configuration but you could provide.
10. Update your local configuration to meet changed concept references. Describe your approach.
Tasks 7, 8, and 9 must be seen as one task with 3 subtasks. The participants were asked to use their own configuration files designed for uploading the data they administered. Some configuration files comprise hundreds of mapping rules. Depending on the size and coverage of a data source, completing these tasks takes a considerable amount of time. During the live evaluation, the participants were therefore asked to work through an example of each of these 3 tasks so that they could fill out the System Usability Scale questionnaire. They then completed the tasks asynchronously and reported their results when they finished.
For 3 of the 4 data integration module tasks, we asked the participants to use their own configuration file for analysis. These comprise rules to define how local concepts are mapped to concepts in the central data warehouse. The file format is XML. The configuration files are used by a data transformation and upload client software. Configuration files do not contain any instance data. By using real configuration files instead of an artificial example, we were able to test our app in a realistic scenario and identify faulty mappings. In addition, this setup allowed participants to work with familiar information.
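The structural part of such a configuration check can be illustrated with plain set operations. The following is only a minimal sketch: it assumes that the concept identifiers of the metadata and the target identifiers of the mapping rules have already been extracted, and the function name and the example identifiers are hypothetical. It does not parse the XML configuration format, and it cannot detect semantically false mappings (task 7), which require expert judgment.

```python
def check_mappings(metadata_concepts, mapped_targets):
    """Compare metadata concept identifiers with the targets of mapping rules.

    Both arguments are iterables of identifier strings (hypothetical format).
    """
    metadata_concepts = set(metadata_concepts)
    mapped_targets = set(mapped_targets)
    return {
        # Concepts that exist in the metadata and are referenced by a rule.
        "mapped": sorted(metadata_concepts & mapped_targets),
        # Concepts without any rule: candidates to review for task 9.
        "unmapped": sorted(metadata_concepts - mapped_targets),
        # Rule targets unknown to the metadata: possibly outdated
        # concept references that need an update (task 10).
        "unknown_targets": sorted(mapped_targets - metadata_concepts),
    }


# Hypothetical identifiers, for illustration only.
report = check_mappings(
    metadata_concepts=["ex:FEV1", "ex:FVC", "ex:Height"],
    mapped_targets=["ex:FEV1", "ex:BodyHeight"],
)
print(report)
```

In this example, `ex:FVC` and `ex:Height` would be reported as unmapped, and `ex:BodyHeight` as a target that the metadata no longer contain.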
While following the evaluation procedure, the experimenter completed a notes sheet. It was structured to contain one row per participant and the following columns:
The questionnaires handed to the participants contained 10 usability questions defined in the System Usability Scale. They were put into a Microsoft Excel sheet with one row for each question and columns for values of 0 to 4. The final score for the 10 questions was calculated within the sheet. The participants were handed one sheet per evaluated module.
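The score calculation performed in such a sheet can be sketched as follows. This is a minimal sketch of the standard System Usability Scale scoring, not a reproduction of the actual spreadsheet. It assumes that the column values 0 to 4 run from strongly disagree to strongly agree and that the questions appear in the standard order, in which odd-numbered items are positively worded and even-numbered items are negatively worded.

```python
def sus_score(responses):
    """Return the System Usability Scale score (0-100) for one questionnaire.

    responses: 10 integers in 0..4, in questionnaire order
    (assumed coding: 0 = strongly disagree, 4 = strongly agree).
    """
    if len(responses) != 10:
        raise ValueError("expected exactly 10 responses")
    if any(r not in range(5) for r in responses):
        raise ValueError("each response must be an integer from 0 to 4")

    total = 0
    for index, value in enumerate(responses):
        if index % 2 == 0:
            # Questions 1, 3, 5, 7, 9 are positively worded.
            total += value
        else:
            # Questions 2, 4, 6, 8, 10 are negatively worded: invert.
            total += 4 - value
    # The sum of contributions (0-40) is scaled to the range 0-100.
    return total * 2.5


print(sus_score([4, 0, 4, 0, 4, 0, 4, 0, 4, 0]))  # best possible answers
print(sus_score([2] * 10))                        # neutral answers throughout
```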
A spreadsheet was used to collect the scores per participant and module to calculate the quantitative analysis parameters, that is,
Given an experience level from 1 to 5, the score weighted by experience differs by up to 16 points, which corresponds to previous findings [
To evaluate our web app, we decided to interact with the participants remotely (participants were not invited to a local test laboratory) and synchronously (the evaluator and participant executed the test session in real time). We made one exception for a very time-consuming task type, which certain participants completed asynchronously. This method appeared to be the most efficient in terms of preparation effort, travel time, and risk of SARS-CoV-2 infection. Its suitability was shown in a comprehensive study: Bastien [
As a communication platform, we used the GoToMeeting web conference software by LogMeIn [
The target audience of CoMetaR consists of experts who contribute to the task of data integration as data providers, data managers, or data coordinators. Our implementation of CoMetaR is dedicated to lung research. Therefore, in this evaluation, we included members of the German Center for Lung Research and collaborating organizations. The participants were selected to cover a wide range of roles and responsibilities. These characteristics determine the modules that they can work on effectively. For example, data managers who load data into a data warehouse have a data integration configuration file and can use the data integration module. The core module is relevant to all user roles. In contrast, the provenance module is mostly relevant for data coordinators and data managers, whereas the data integration module is mostly relevant for data managers and data providers. In addition to their user role, profession, age, and English level, we also asked for the participants’ experience with the app. English and experience levels were measured on a scale of 1 to 5.
Bastien [
All methods were performed in accordance with the relevant guidelines and regulations. This study was granted an exemption from requiring ethics approval by the ethics committee of the Faculty of Medicine at the Justus-Liebig-University in Giessen, Germany. Informed consent for participation in the study was obtained from all the participants.
All participant-related data were recorded in anonymized form. They cover age, profession, role, evaluated modules, English level, and experience with the app. The data were further coarsened using age classes of 10 years to prevent participant reidentification.
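The coarsening of ages into 10-year classes can be sketched as follows. This is a minimal sketch; the function name is hypothetical, and treating the lower bound of each class as inclusive is an assumption, because labels such as 30-40 and 40-50 do not state to which class a boundary age belongs.

```python
def age_class(age, width=10):
    """Map an exact age to a coarse class label such as '30-40'.

    Assumption: the lower bound is inclusive, so an age of 40
    falls into '40-50' rather than '30-40'.
    """
    if age < 0:
        raise ValueError("age must not be negative")
    lower = (age // width) * width
    return f"{lower}-{lower + width}"


print(age_class(34))
print(age_class(40))
```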
Before any evaluation, we performed a screen sharing–supported training specific to the respective user’s roles, regardless of previous experiences with the app. The goal of this training was to provide participants with equal basic knowledge about the web app’s structure and functionality. We asked for the participants’ previous experiences with the system, which may influence the evaluation outcome [
After each participant received introductory training on the app’s functionalities, they had the option to ask questions and clarify misunderstandings. Then, for each tested module, they were asked to complete the tasks one by one. The tasks were communicated verbally. The experimenter asked the participants to verbalize their thoughts during the evaluation and reminded them whenever they forgot. After the participant solved the tasks for a module, the experimenter asked them to fill out the usability questionnaire that we had previously sent them via email. Furthermore, they were invited to participate in a retrospective dialogue, during which the experimenter again noted the findings.
The experimenter played a passive role. During the evaluation, he was not supposed to speak other than to remind the participant to verbalize their thoughts. In cases where the participants were stuck, the experimenter gave hints leading to the information that had to be retrieved from the app. Meanwhile, the experimenter completed the structured notes sheet documenting the participants’ verbalized thoughts, spontaneous reactions, and app use behavior, focusing on the previously mentioned usability categories [
The traditional think aloud method requires recording the entire evaluation session and subsequently transcribing it. As mentioned in the study design, we did not record the sessions; instead, the experimenter transcribed relevant observations during each session.
For quantitative analysis, we calculated aggregated scores (
We conducted a thematic analysis of the information gathered during the evaluations to identify usability issue patterns and to present a descriptive account of users’ experiences. After familiarization with all notes, we went through all notes again and generated usability issue statements. We followed a latent approach, which means that we interpreted the data to create statements that were more meaningful. For example, task 2 asked the participants to indicate the properties of the spirometry parameter
For documentation and analysis, we used only Microsoft Excel and Microsoft Word.
The System Usability Scale questionnaire consists of 10 questions, 5 of which are worded positively and 5 of which are worded negatively with regard to usability. As some questions include negations, we anticipated possible misinterpretations. Therefore, we immediately checked each questionnaire for outliers and inquired when we identified potential misinterpretations. When inquiring, we again pointed out that we were not asking for better scores but for valid answers.
We wanted to ensure correct and comprehensive categorization, as well as unambiguous wording for qualitative analysis. A second person who was familiar with the study design and aspects of usability checked all categorizations. The resulting tables are the results of in-depth dialogues.
All participants in this evaluation currently work for or in collaboration with the German Center for Lung Research. Their operation areas and responsibilities vary, but all contribute to the data integration task.
Characteristics of the 12 participants including age, experience level, English level, profession, user roles, and tested modules.
Characteristics (participants A-L) | A | B | C | D | E | F | G | H | I | J | K | L |
Age (years) | 30-40 | 30-40 | 30-40 | 40-50 | 50-60 | 60-70 | 30-40 | 50-60 | 30-40 | 50-60 | 60-70 | 20-30 |
Experience level (1-5) | 3 | 3 | 4 | 2 | 4 | 3 | 3 | 3 | 3 | 1 | 2 | 4 |
English level (1-5) | 3 | 3 | 4 | 3 | 4 | 4 | 4 | 5 | 3 | 3 | 2 | 4 |
Profession | MDa | DMb | MIc | SCd | MD | GBe | MI | DM | DM | MD | MD | BIf |
Has role data manager | ✓g | ✓ | ✓ | ✓ | | | | ✓ | ✓ | ✓ | ✓ | ✓ |
Has role data provider | ✓ | ✓ | ✓ | | | ✓ | | | | | | |
Has role data coordinator | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | | | | | |
Tested core module | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Tested provenance module | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓ |
Tested data integration module | ✓ | ✓ | ✓ | ✓ | | | | | ✓ | | | |
aMD: medical documentalist.
bDM: data manager.
cMI: medical informatics specialist.
dSC: study coordinator.
eGB: graduate biologist.
fBI: bioinformatics specialist.
gCharacteristic present.
The training took between 10 and 30 minutes, depending on how many modules were presented and how many questions the participants had. After training, task completion took between 8 and 26 (average 14, SD 6) minutes for the core module, between 3 and 20 (average 9, SD 5) minutes for the provenance module, and between 21 and 51 (average 37, SD 12) minutes for the data integration module. For the data integration module, the time that participants spent asynchronously to complete the tasks is not included.
Each participant solved the tasks of one or more CoMetaR modules (core module n=12, provenance module n=10, data integration module n=5). Subsequently, they completed one System Usability Scale questionnaire separately for each module. According to Bangor et al [
Aggregated System Usability Scale scores.
Module and score type | Values, mean (SD; range) |
Core module | |
Usability score | 81.5 (9.1; 60.0-92.5) |
Weighted by experience | 73.8 (7.8; 60.0-84.5) |
Provenance module | |
Usability score | 72.3 (16.0; 37.5-90.0) |
Weighted by experience | 63.9 (15.2; 37.5-79.5) |
Data integration module | |
Usability score | 81.0 (9.9; 65.0-92.5) |
Weighted by experience | 73.0 (9.9; 57.0-84.5) |
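The relationship between the plain and the experience-weighted scores can be made tangible with a short sketch. The weighting formula itself is not reproduced in this text, so the linear deduction of 4 points per experience level above 1 used here is an assumption. It is chosen because it yields the stated maximum difference of 16 points and because, with the experience levels and tested modules from the participant characteristics table, it matches the differences between the reported means (7.7, 8.4, and 8.0 points).

```python
from statistics import mean

# Experience levels (1-5) per participant, from the participant table.
EXPERIENCE = {"A": 3, "B": 3, "C": 4, "D": 2, "E": 4, "F": 3,
              "G": 3, "H": 3, "I": 3, "J": 1, "K": 2, "L": 4}

# Participants who tested each module, from the same table.
TESTERS = {
    "core": "ABCDEFGHIJKL",
    "provenance": "ABCEFGHIJL",
    "data integration": "ABCDI",
}

# Assumption, not taken from the source: 4 points are deducted per
# experience level above 1, giving at most 4 * (5 - 1) = 16 points.
POINTS_PER_LEVEL = 4


def weighted_score(score, experience):
    """Estimate the score an inexperienced user would have given."""
    return score - POINTS_PER_LEVEL * (experience - 1)


def mean_deduction(module):
    """Average deduction over all participants who tested a module."""
    return mean(POINTS_PER_LEVEL * (EXPERIENCE[p] - 1) for p in TESTERS[module])


for module in TESTERS:
    print(module, round(mean_deduction(module), 1))
```

Under this assumption, a participant with experience level 3 who rated a module at 92.5 contributes a weighted score of 84.5.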
All the participants successfully solved all given tasks. In total, 12 participants solved 48 core module tasks, 10 participants solved 20 provenance module tasks, and 5 participants solved 20 data integration module tasks. In the case of task 2, 2 participants did not find the correct tree node and needed a hint. During the provenance module tasks, 1 participant lost track because he loaded too much information from multiple modules into the tree. He needed a hint to reset the app to solve task 5. In total, 97% (85/88) of all tasks were solved independently.
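The reported proportion follows directly from the participant and task counts. The sketch below only restates this arithmetic; the variable names are illustrative.

```python
# Participants and tasks per module, as reported above.
MODULES = {
    "core": {"participants": 12, "tasks": 4},
    "provenance": {"participants": 10, "tasks": 2},
    "data integration": {"participants": 5, "tasks": 4},
}
HINTS_GIVEN = 3  # 2 hints for task 2 and 1 hint for task 5

total_tasks = sum(m["participants"] * m["tasks"] for m in MODULES.values())
independent = total_tasks - HINTS_GIVEN
rate = independent / total_tasks

print(f"{independent}/{total_tasks} = {rate:.1%}")  # 85/88 = 96.6%
```

The exact value of 96.6% is what the text rounds to 97%.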
Our thematic analysis led to 24 usability issue themes, which covered all functional inadequacies and complications identified during the experiment. We grouped these themes into the 7 categories described in ISO 9241-110 (
Using the search function for
The help window does not help with task 2.
The mouse-over tooltip of upload bars sometimes distracts and overlays other bars.
Changing the selection of upload bars leads to changes in the concept tree. The system gives insufficient feedback that these changes were applied.
The search function only searches for fixed substrings and does not behave like a powerful web search engine. This might lead to incorrect conclusions about whether a concept is part of the metadata.
The users expected the fixed headings for the currently displayed subtree to be interactive.
The provenance module disappears when clicking a tree element, and the element’s core information is shown instead.
An element’s change history is part of the core module and not the provenance module.
Structural information for elements (added, moved, or removed) is not explicitly displayed in the element’s history (last changes).
The number of search matches is not the number of matched concepts but of all matched attributes.
Some annotations like
The structural annotations (added, moved, or removed) refer to the selected provenance timespan and not only to the selected uploads.
It is not intuitive that a moved element’s old and new concept tree positions are both selected when clicking one of them.
Many people search the code for
For some users, it is not intuitively clear that details for a tree node are shown when clicking it.
Symbols in the tree are not explained through a legend, but only through mouse-over tooltips.
The minimap or outline next to the scrollbar is not intuitive for users who are not familiar with such a feature.
The scroll bar is styled differently from a standard scroll bar and might not instantly be recognized as such.
For some users, it is not noticeable whether an upload was selected.
The function of the
The temporal order (left to right or right to left) of multiple uploads on the same day is not clear.
For elements with more than one configuration rule, it is not intuitive that the rules are applied in top-to-bottom order.
Activating multiple modules and searches leads to an overload of information in the concept tree.
Loading too much information into the tree and expanding many of the affected tree elements leads to high central processing unit (CPU) use.
In total, 12 participants took part in the evaluation of up to 3 modules of the CoMetaR web app, and each participant completed up to 10 tasks; 97% (85/88) of all tasks were solved independently and successfully. The core module and data integration module both obtained a mean usability score of 81, which indicates good and nearly excellent usability. For inexperienced users, we estimated a mean usability score of 73, which indicates good and acceptable usability. The provenance module has a mean usability score of approximately 72, which implies good and acceptable usability. For inexperienced provenance module users, we estimated a mean usability score of 63, which indicates unacceptable usability. We identified 24 issues with the app, which we grouped into 5 usability categories based on ISO 9241-110. From our point of view, 3 findings are of particular note. (1) Information displayed in the concept tree can be overwhelming, especially if information from multiple modules is shown at once. (2) For many users, the provenance module and its functionalities are not readily accessible. The number of options, such as filtering by timespan or upload package, demands an extensive introduction and learning period. (3) The search functionality can output far more hits than expected because all literal information about concepts is considered. Some sort of categorization or filtering may be useful.
The strength of our study design is the relationship between effort and outcome. Although we omitted the step of recording audio and video of each session, we found a considerable compilation of usability issues and clear quantitative categorization of our tested modules owing to the System Usability Scale questionnaire. All testing sessions were performed by a single experimenter. For thematic analysis, an additional scientist was consulted.
Retrospectively, we identified 4 problems regarding the evaluation methodology. First, the web conference software used in this evaluation was always visible and, in some cases, overlapped crucial information in the browser window. Second, one person tried to participate via an Apple product and was not able to establish screen sharing because of limited technical literacy. The third problem concerns communication logistics, specifically the fact that task instructions were communicated verbally by the evaluator. Some participants missed important aspects of the tasks because they were inattentive or started solving the tasks before the instruction was finished. Finally, some tasks were not formulated in sufficient detail. For example, for task 5, a participant thought it would be sufficient to read the respective upload description, but we expected them to list all changes explicitly in detail.
We did not record audio and video, which is why we probably missed individual verbalizations and observations. Thus, we cannot claim that our list of usability issues is 100% complete, which arguably is never the case. In addition, the experimenter already filtered information during the test sessions, which might have biased the qualitative analysis outcome. We still assume that we found most usability issues, especially the most severe ones, because the experimenter was able to follow every action throughout all sessions without difficulty.
As all tasks were performed in our production environment, the upload history and thus the collection of added, moved, removed, or modified concepts varied. This may have led to differing results among the participants. We assumed that these differences were negligible in the usability evaluation.
In 2009, considering 317 web apps, Bangor et al [
Regarding the think aloud method, it is usual to record and transcribe all user sessions. Other studies show that this consumes a considerable amount of time and labor and is often done by multiple scientists. In addition, we did not count code quantities within a transcript, as is often done in a thematic analysis. We adopted the highest-level themes from an ISO standard instead of creating them ourselves.
Having evaluated our app, we are able to improve it by addressing all usability issues found. This will primarily benefit lung research because the availability and accessibility of lung research–specific metadata will be improved. This app has already been considered by other German Centers for Health Research. We hope to be able to improve the field of health research more generally.
Second, we applied a methodology that allows the usability evaluation of metadata management apps with comparatively low effort in time and labor. In an adapted form, this method can be applied to similar apps. Although the first 4 tasks of our evaluation are specific to the field of lung research in terms of content, their content-agnostic intention is to check whether basic information can be retrieved from the app. This includes the existence and findability of concepts (task 1), identification of a concept’s annotations (task 2), its development over time (task 3), and the export of information about a unit of concepts (task 4). The application programming interface for the data integration module is specific to our data integration configuration file format, but the tasks represent the crucial steps to be taken to verify such a configuration file. The next step for this project could be the application of this evaluation method to comparable apps to confirm its reliability and to find common usability issues.
We also hope that the findings of our qualitative analysis raise other developers’ awareness of possible shortcomings in their own apps. For example, they might also plan to visually annotate concepts in the concept tree, in which case we highly recommend not displaying too much information at once.
A potential alternative or addition to the think aloud method with a thematic approach could be a heuristic evaluation performed by usability experts. The advantages and disadvantages of both methods were researched by Yen and Bakken [
We experienced issues with the web conference software, whose control panel sometimes overlapped crucial information on the user display. For further remotely and synchronously performed evaluations, we recommend ensuring that all relevant web app content is always visible, for example, by choosing different conference software.
We found that the assumed average usability score for inexperienced users was approximately 8 points lower than the original average score. This implies, on the one hand, that entry barriers exist within the app. On the other hand, these barriers can at least partly be overcome with experience. Measuring such a score might be of special interest for apps that provide a more efficient alternative to existing methods of information retrieval. Entry barriers may lead to rapid rejection of the entire software.
Our goal was to find usability issues of the CoMetaR web app and to measure its usability as perceived by real users. We identified 24 issues, which will be starting points for app improvement. On average, the app was assessed as good and in parts nearly excellent in terms of usability. Our method proved effective and efficient in terms of effort and outcome. Future research should improve our app and evaluate similar solutions. We invite other researchers interested in evaluating biomedical metadata repositories to adapt our methodology. All source code is publicly accessible on GitHub [
CoMetaR: collaborative metadata repository
DC: Dublin Core
DZL: Deutsches Zentrum für Lungenforschung
FAIR: Findability, Accessibility, Interoperability, Reusability
FEV1: Forced Expiratory Volume in 1 Second
RDF: Resource Description Framework
SKOS: Simple Knowledge Organization System
SPARQL: SPARQL Protocol and Resource Description Framework Query Language
The German Center for Lung Research (German: Deutsches Zentrum für Lungenforschung) is funded by the German Federal Ministry of Education and Research (German: Bundesministerium für Bildung und Forschung). Marc Griffiths proofread the paper as a native English speaker.
All data generated or analyzed during this study are included in this published paper.
MRS developed the collaborative metadata repository software, which was evaluated in this study. MRS and RWM elaborated on the study design, including the composition of tasks. MRS performed the interviews with all participants and interpreted the data. RWM and AG substantively revised the study during all steps.
None declared.