Given the costs of machine learning implementation, a systematic approach to prioritizing which models to implement in clinical practice may be valuable.
The primary objective was to determine the health care attributes respondents at 2 pediatric institutions rate as important when prioritizing machine learning model implementation. The secondary objective was to describe their perspectives on implementation using a qualitative approach.
In this mixed methods study, we distributed a survey to health system leaders, physicians, and data scientists at 2 pediatric institutions. We asked respondents to rank the following 5 attributes in terms of implementation usefulness: the clinical problem is common; the clinical problem causes substantial morbidity and mortality; risk stratification would lead to different actions that could reasonably improve patient outcomes; implementation could reduce physician workload; and implementation could save money. Important attributes were those ranked as first or second most important. Individual qualitative interviews were conducted with a subsample of respondents.
Among 613 eligible respondents, 275 (44.9%) responded. Qualitative interviews were conducted with 17 respondents. The most common important attributes were risk stratification leading to different actions (205/275, 74.5%) and clinical problem causing substantial morbidity or mortality (177/275, 64.4%). The attributes considered least important were reducing physician workload and saving money. In qualitative interviews, respondents consistently prioritized implementations that improved patient outcomes.
Respondents prioritized machine learning model implementation where risk stratification would lead to different actions and where the clinical problem caused substantial morbidity and mortality. Implementations that improved patient outcomes were prioritized. These results can help provide a framework for machine learning model implementation.
Machine learning has grown in popularity in clinical settings, in parallel with the widespread adoption of electronic health records.
An important consideration affecting utility is the choice of clinical setting and problem for which a machine learning model is to be implemented.
Given these costs, a systematic approach for determining which machine learning models should be prioritized for implementation into clinical practice may be valuable. In determining priorities, it would be important to involve key stakeholders at the institution in which deployment is planned. We chose to survey 2 pediatric centers, 1 in the United States with a more established biomedical informatics program, and 1 in Canada with a less established biomedical informatics program, to gain insight into whether experience and expertise affected preferences for machine learning model prioritization. Consequently, the primary objective was to determine the health care attributes respondents at 2 pediatric institutions rate as important when prioritizing machine learning model implementation. The secondary objective was to describe their perspectives on machine learning model implementation using a qualitative approach.
This was a mixed methods study that included a quantitative and a qualitative component. The institutions were The Hospital for Sick Children (SickKids) in Toronto, Ontario, Canada, and Lucile Packard Children’s Hospital in Palo Alto, California, United States.
We included health system leaders, physicians, and data scientists at SickKids and Lucile Packard Children’s Hospital at the time of survey distribution. We excluded trainees.
The survey was developed by the study team based on their impression of health care attributes respondents might consider important; the machine learning–focused questions are presented in the quantitative survey appended to this article.
We then asked about their knowledge of artificial intelligence on a 5-point Likert scale ranging from 1 (no knowledge at all) to 5 (a lot of knowledge). We asked them to rate their understanding of how machine learning models are built and interpreted, and how statistics are conducted and interpreted, using 5-point Likert scales ranging from 1 (no understanding) to 5 (fully understand). We asked if they had decision-making ability to implement artificial intelligence initiatives within their work environment, and how many machine learning models had been deployed at their institutions in the last 5 years.
The next section asked respondents to rank the following 5 clinical problem and implementation consequence attributes in terms of whether machine learning implementation would be useful: “the clinical problem being solved is common,” “the clinical problem causes substantial morbidity or mortality,” “risk stratification would lead to different clinical actions that could reasonably improve patient outcomes,” “implementing the model could reduce physician workload,” and “implementing the model could save money.” Important attributes were defined as those ranked as most important or second most important (rank of 1 or 2) by respondents. The survey then asked 2 open-ended questions focused on clinical areas where being able to accurately predict an outcome might be useful, and clinical areas in which prioritization or reorganization of waitlists might be useful. Finally, the survey asked whether they would be willing to participate in a qualitative interview.
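As an illustration of this scoring scheme, the sketch below (in R, the language used for our analyses) converts rank responses into the binary importance indicator; the data frame `ranks` and its values are hypothetical, not study data.

```r
# Hypothetical example (not study data): one row per respondent, one column
# per attribute, each cell a rank from 1 (most important) to 5 (least important).
ranks <- data.frame(
  problem_common      = c(3, 2, 3, 4),
  morbidity_mortality = c(2, 1, 2, 2),
  different_actions   = c(1, 3, 1, 1),
  reduce_workload     = c(4, 4, 5, 3),
  save_money          = c(5, 5, 4, 5)
)

# An attribute is "important" for a respondent if ranked 1 or 2.
important <- ranks <= 2

colSums(important)                # respondents rating each attribute as important
round(100 * colMeans(important))  # the same, as percentages
```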
For the qualitative aspect, we purposively sampled respondents to maximize variation by institution and self-rated understanding of machine learning. Semistructured interviews were conducted using Zoom (Zoom Video Communications, Inc.) or Microsoft Teams by a member of the SickKids team (EP) with expertise in the conduct of qualitative interviews. Respondents were asked to list 3 scenarios in which a machine learning model for risk stratification could be useful and then to state which scenario was the most important to implement first and the rationale for the choice. They were then asked how they would feel about using a machine learning model for risk stratification as opposed to their current approach, and to describe concerns they had about using a machine learning model to guide patient care. The interviews were recorded and transcribed verbatim.
Data from the quantitative survey at SickKids and Lucile Packard Children’s Hospital were compared using the Fisher exact test. Analyses were performed in R version 3.6.1 (R Core Team) using RStudio.
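A minimal sketch of this comparison follows, using for illustration the male gender counts from the demographics table (93 of 195 at SickKids vs 35 of 80 at Lucile Packard Children’s Hospital); it is illustrative rather than the study’s analysis script.

```r
# Fisher exact test comparing the 2 institutions on a 2x2 contingency table.
# Counts are from the male gender row of the demographics table.
gender <- matrix(
  c(93, 195 - 93,    # SickKids: male, not male
    35,  80 - 35),   # Lucile Packard: male, not male
  nrow = 2, byrow = TRUE,
  dimnames = list(site = c("SickKids", "Lucile Packard"),
                  male = c("yes", "no"))
)
fisher.test(gender)$p.value  # two-sided P value; the table reports .64
```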
The analysis of qualitative data was performed according to the principles of grounded theory methodology; data collection and analysis occurred concurrently. Qualitative transcripts were analyzed by 2 independent reviewers (NA and EP) using the constant comparative method to develop a theoretical framework for respondents’ perspectives on machine learning, grounded in their individual experiences and understandings. Sampling continued until saturation was reached, defined as the point at which no new themes emerged from the data.
The study was approved by the Research Ethics Board at SickKids. The need for Institutional Review Board approval was waived by Lucile Packard Children’s Hospital as the data collection was performed by SickKids personnel. For the quantitative survey, completion of the survey was considered implied consent to study participation. For the qualitative component, respondents provided verbal consent to participate.
The quantitative survey was distributed at SickKids between November 1, 2021, and January 6, 2022, and at Lucile Packard Children’s Hospital between March 15, 2022, and April 12, 2022. Among 613 eligible respondents, 275 (44.9%) responded.
CONSORT (Consolidated Standards of Reporting Trials) diagram of participant identification, selection, and participation.
Demographic characteristics of participants at 2 pediatric institutions (N=275).
| Characteristic | SickKids (n=195), n (%) | Lucile Packard Children’s Hospital (n=80), n (%) | P value |
| --- | --- | --- | --- |
| Male gender | 93 (47.7) | 35 (43.8) | .64 |
| Role^a | | | |
|     Physician | 165 (84.6) | 73 (91.3) | .20 |
|     Health system leader | 22 (11.3) | 17 (21.3) | .05 |
|     Data scientist | 15 (7.7) | 2 (2.5) | .18 |
| Specialty | | | <.001 |
|     Hematology oncology | 33 (16.9) | 14 (17.5) | |
|     General medicine | 21 (10.8) | 7 (8.8) | |
|     Critical care medicine | 11 (5.6) | 12 (15.0) | |
|     Emergency medicine | 14 (7.2) | 0 (0) | |
|     Cardiology | 9 (4.6) | 7 (8.8) | |
|     Neurology | 11 (5.6) | 3 (3.8) | |
|     Endocrinology and metabolism | 10 (5.1) | 6 (7.5) | |
|     Gastroenterology | 9 (4.6) | 0 (0) | |
|     Respirology | 4 (2.1) | 4 (5.0) | |
|     Infectious disease | 2 (1.0) | 5 (6.3) | |
|     Surgery | 0 (0) | 6 (7.5) | |
|     Adolescent medicine | 6 (3.1) | 0 (0) | |
|     Other | 20 (10.3) | 7 (8.8) | |
|     Not known | 45 (23.1) | 9 (11.3) | |
| Years in practice | | | .006 |
|     <1 | 6 (3.1) | 0 (0) | |
|     1-4 | 38 (19.5) | 5 (6.3) | |
|     5-10 | 38 (19.5) | 25 (31.3) | |
|     11+ | 113 (57.9) | 50 (62.5) | |
| Decision-making ability to implement artificial intelligence initiatives | 99 (50.8) | 41 (51.3) | >.99 |
| Machine learning models deployed at institution in the last 5 years | | | .43 |
|     None | 31 (15.9) | 11 (13.8) | |
|     1 | 7 (3.6) | 6 (7.5) | |
|     2-4 | 14 (7.2) | 9 (11.3) | |
|     5-10 | 2 (1.0) | 1 (1.3) | |
|     11+ | 4 (2.1) | 0 (0) | |
|     Do not know | 137 (70.3) | 53 (66.3) | |

^a Respondents could choose more than 1 option; thus, numbers do not sum to 100%.
Self-rating of knowledge of artificial intelligence and understanding of machine learning and statistics.
| Areas | SickKids (n=195), n (%) | Lucile Packard Children’s Hospital (n=80), n (%) | P value |
| --- | --- | --- | --- |
| Knowledge of artificial intelligence | | | .93 |
|     None | 10 (5.1) | 5 (6.3) | |
|     Very little | 67 (34.4) | 30 (37.5) | |
|     Some | 83 (42.6) | 31 (38.8) | |
|     Moderate | 30 (15.4) | 11 (13.8) | |
|     A lot | 5 (2.6) | 3 (3.8) | |
| Understanding of how machine learning models are built and interpreted | | | .72 |
|     None | 44 (22.6) | 18 (22.5) | |
|     Very little | 56 (28.7) | 28 (35.0) | |
|     Somewhat | 64 (32.8) | 25 (31.3) | |
|     Moderate | 24 (12.3) | 8 (10.0) | |
|     Fully | 7 (3.6) | 1 (1.3) | |
| Understanding of how statistics are conducted and interpreted | | | .19 |
|     None | 4 (2.1) | 1 (1.3) | |
|     Very little | 18 (9.2) | 7 (8.8) | |
|     Somewhat | 67 (34.4) | 38 (47.5) | |
|     Moderate | 78 (40.0) | 29 (36.3) | |
|     Fully | 28 (14.4) | 5 (6.3) | |
Attributes ranked as important^a by respondents for prioritization of machine learning.

| Attributes considered important | SickKids (n=195), n (%) | Lucile Packard Children’s Hospital (n=80), n (%) | P value | Median importance score (IQR)^b |
| --- | --- | --- | --- | --- |
| The clinical problem being solved is common | 66 (33.8) | 35 (43.8) | .16 | 3 (2-3) |
| The clinical problem causes substantial morbidity or mortality | 133 (68.2) | 44 (55.0) | .05 | 2 (2-3) |
| Risk stratification would lead to different clinical actions that could reasonably improve patient outcomes | 145 (74.4) | 60 (75.0) | >.99 | 1 (1-2) |
| Implementing the model could reduce physician workload | 29 (14.9) | 11 (13.8) | .96 | 4 (3-4) |
| Implementing the model could save money | 11 (5.6) | 2 (2.5) | .42 | 5 (4-5) |

^a Important defined as attributes ranked as most important or second most important (rank of 1 or 2) in terms of whether a machine learning model would be useful.
^b Across both institutions.
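For illustration, the median importance score (IQR) in the final column can be computed from rank data as follows; the ranks shown are hypothetical, not the study’s data.

```r
# Hypothetical ranks for one attribute (1 = most important, 5 = least important).
risk_strat_ranks <- c(1, 2, 1, 1, 3, 2, 1, 2)
median(risk_strat_ranks)                   # 1.5
quantile(risk_strat_ranks, c(0.25, 0.75))  # 1 and 2, that is, an IQR of 1-2
```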
Perspectives of machine learning implementation in pediatric medicine from qualitative interviews.
Benefits of machine learning implementation:

- Complex scenario
- Support less experienced clinicians
- Reduce cognitive load
- Reduce cognitive bias
- Standardize care
- More effective triage
- Facilitate precision medicine
- Freeing up time for physicians

Concerns about machine learning implementation:

- Algorithmic bias
- Lack of transparency and trust
- Not incorporating clinical expertise into decisions
- Need for outcome evaluation
- Data quality
- Challenges in workflow implementation
- Accountability
- Uncertainty in physician role
In this mixed methods study, we found that the attributes most commonly listed as important for machine learning model implementation were risk stratification leading to different actions that could reasonably improve patient outcomes and a clinical problem that causes substantial morbidity or mortality. Few respondents considered reducing physician workload and saving money as important. We also found that important attributes were similar at the 2 institutions despite different levels of biomedical informatics program establishment and different health care systems.
The wide range of recommended areas for machine learning model implementation highlights the need for prioritization given the likely limited capacity to develop, deploy, and monitor machine learning models, even at large institutions with mature biomedical informatics programs. This study is important as it provides a framework by which institutional leaders could make decisions about which machine learning models to prioritize for implementation. While we found that risk stratification that improves patient outcomes was the most common important attribute, additional considerations include the actions that would arise from high- and low-risk labels, evidence that differential actions will improve outcomes, and identifying ideal thresholds for risk categorization. Even once a model is deployed, ongoing monitoring of model performance and of the impact of model deployment on patient care and clinical workflows are additional postimplementation considerations.
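To make the threshold consideration concrete, one common approach (not necessarily the one an implementing team would choose) is to select the risk cutoff that maximizes Youden's J on validation data; a sketch using the pROC package with simulated data follows.

```r
# Illustrative threshold selection for risk categorization using simulated
# data (not from this study): choose the cutoff maximizing Youden's J
# (sensitivity + specificity - 1).
library(pROC)

set.seed(1)
outcome <- rbinom(500, 1, 0.2)           # simulated binary outcomes
score   <- ifelse(outcome == 1,
                  rnorm(500, 0.7, 0.15), # simulated risk scores: events
                  rnorm(500, 0.4, 0.15)) # simulated risk scores: nonevents

roc_obj <- roc(outcome, score, quiet = TRUE)
coords(roc_obj, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```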
While we evaluated attribute importance across respondent types, Wears and Berg have argued that clinical information systems succeed or fail according to how well they fit the sociotechnical realities of clinical work, underscoring the importance of engaging frontline users when setting implementation priorities.
We also found that across both institutions, respondents had greater confidence in their understanding of statistics and relatively lower confidence in their understanding of machine learning. These perceptions did not differ between the 2 institutions despite different levels of establishment of their biomedical informatics programs. Our results suggest that across pediatric medicine, more machine learning–focused education is required during training and continuing professional development.
Our results complement the work of others who have highlighted the requirements of clinical decision support, including support based on machine learning. Important considerations include avoiding black boxes, excessive time requirements, and undue complexity, in addition to ensuring relevance, respect, and scientific validity.
The strengths of this study include its mixed methods design and its inclusion of 2 pediatric institutions that differ by country and by establishment of their biomedical informatics programs. However, our results should be interpreted in light of their limitations. The response rate was relatively low, and respondents were likely biased toward an interest in machine learning; nonrespondents likely had lower familiarity with machine learning and weaker opinions about the attributes important for machine learning prioritization. We also had a greater proportion of physicians than health system leaders or data scientists; these groups may have different priorities or implementation concerns.
In conclusion, respondents prioritized machine learning model implementation where risk stratification would lead to different actions and where the clinical problem caused substantial morbidity and mortality. Implementations that improved patient outcomes were prioritized. These results can help provide a framework for prioritizing machine learning model implementation.
Quantitative survey administered.
Comparison of participants with artificial intelligence knowledge high versus not high (N=275).
Examples of recommendations of areas in pediatric care that should be prioritized for machine learning from quantitative survey.
LS is supported by the Canada Research Chair in Pediatric Oncology.
None declared.