Scalable Approach to Consumer Wearable Postmarket Surveillance: Development and Validation Study

Background: With the capability to render prediagnoses, consumer wearables have the potential to affect subsequent diagnoses and the level of care in the health care delivery setting. Despite this, postmarket surveillance of consumer wearables has been hindered by the lack of codified terms in electronic health records (EHRs) to capture wearable use. Objective: We sought to develop a weak supervision–based approach to demonstrate the feasibility and efficacy of EHR-based postmarket surveillance on consumer wearables that render atrial fibrillation (AF) prediagnoses. Methods: We applied data programming, where labeling heuristics are expressed as code-based labeling functions, to detect incidents of AF prediagnoses. A labeler model was then derived from the predictions of the labeling functions using the Snorkel framework. The labeler model was applied to clinical notes to probabilistically label them, and the labeled notes were then used as a training set to fine-tune a classifier called Clinical-Longformer. The resulting classifier identified patients with an AF prediagnosis. A retrospective cohort study was conducted, where the baseline characteristics and subsequent care patterns of patients identified by the classifier were compared against those who did not receive a prediagnosis. Results: The labeler model derived from the labeling functions showed high accuracy (0.92; F1-score=0.77) on the training set. The classifier trained on the probabilistically labeled notes accurately identified patients with an AF prediagnosis (0.95; F1-score=0.83).
The cohort study conducted using the constructed system carried enough statistical power to verify the key findings of the Apple Heart Study, which enrolled a much larger number of participants, where patients who received a prediagnosis tended to be older, male, and White with higher CHA2DS2-VASc (congestive heart failure, hypertension, age ≥75 years, diabetes, stroke, vascular disease, age 65-74 years, sex category) scores (P<.001). We also made a novel discovery that patients with a prediagnosis were more likely to warrant anticoagulation therapy (525/1037, 50.63% vs 5936/16,560, 35.85%) and have an eventual AF diagnosis (305/1037, 29.41% vs 262/16,560, 1.58%). At the index diagnosis, the existence of a prediagnosis did not distinguish patients based on clinical characteristics, but did correlate with anticoagulant prescription (P=.004 for apixaban and P=.01 for rivaroxaban). Conclusions: Our work establishes the feasibility and efficacy of an EHR-based surveillance system for consumer wearables that render AF prediagnoses. Further work is necessary to generalize these findings for patient populations at other sites.


Background
Consumer-facing devices such as the Apple Watch [1] and Fitbit [2] now have the capability to notify users with a prediagnosis such as atrial fibrillation (AF). As these notifications may incentivize patients to seek follow-up medical care, wearables have the potential to affect diagnosis rates and initiate cascades of medical care [3,4]. Although these devices undergo premarket validation to obtain Food and Drug Administration (FDA) clearance [5], limited information exists on their postmarket use and clinical utility.
To conduct postmarket surveillance on consumer wearables, electronic health records (EHRs) should capture wearable use, in particular those incidents where patients received prediagnosis notifications. However, EHRs are often built around medical diagnosis codes used for billing purposes [6,7], which do not contain terms for describing wearable use. Prescription wearables should have ordering information, but this does not capture how the wearables are used. Therefore, unstructured data such as clinical notes must be parsed to obtain the wearable use information.
Deep learning-based natural language processing (NLP) methods [8-10] have been shown to outperform traditional approaches on clinical note classification tasks [11,12]. However, these deep learning-based classifiers require large, hand-labeled training sets that are costly to generate. For EHR-based postmarket surveillance to be widely implemented, a scalable approach is necessary to reduce the labeling burden.

Objectives
We aimed to demonstrate the feasibility and efficacy of postmarket surveillance on consumer wearables that render AF prediagnoses. The first aim of this study was to evaluate the efficacy of a weakly supervised approach to heuristically generate labels for a training set. A labeler model derived from programmatically expressed heuristics probabilistically assigns labels to clinical notes regarding whether the note contains a mention of the patient receiving a prediagnosis from a wearable. The second aim was to evaluate the performance of a classifier fine-tuned on the training set labeled by the labeler model, which identifies mentions of an AF prediagnosis in a note. The third aim was to summarize the clinical characteristics of patients identified by the classifier and compare them to patients who were not alerted to a prediagnosis.

Cohort Identification
We used the Stanford Medicine Research Data Repository (STARR) data set [13], which contains EHR-derived records from the inpatient, outpatient, and emergency department visits at Stanford Health Care and the Lucile Packard Children's Hospital. We retrieved all clinical notes from the STARR data set that contain a mention of a wearable device (Textbox 1), resulting in 86,260 notes from 34,329 unique patients. Following the FDA guidance for pertinent cardiovascular algorithms [5], we excluded patients younger than 22 years of age when the note was written, leaving 78,323 notes from 30,133 unique patients. We further limited the data set to notes written on or after January 1, 2019, since the first consumer-facing AF detection feature became available in December 2018 [14]. The resulting cohort comprised 56,924 notes from 21,332 unique patients. These notes were then labeled independently by 2 data scientists, and differences were adjudicated by 2 physicians.
A clinical note was labeled as positive when the patient received an automated AF notification from the wearable, or when the patient initiated an on-demand measurement (eg, electrocardiogram strip) that resulted in an AF prediagnosis.There were no instances where the 2 physicians disagreed on the label.The resulting test set contained 105 positive notes (prevalence=0.18).
In addition to the test set, we prepared a development set of 600 notes that was used to aid the development of the labeler model. This set was manually labeled by a single data scientist, using a labeling guideline (Multimedia Appendix 1) that was developed as part of the test set generation. The development set contained 100 positive notes (prevalence=0.17).

Labeler Model Derivation
We then derived a labeler model that used weak supervision to probabilistically assign labels for the training set. Specifically, as shown in Figure 1, we used data programming [15], where labeling heuristics are expressed as code-based labeling functions. Using the encoded heuristics, the labeling functions make predictions as to which label a clinical note should be assigned. Predictions from these labeling functions are then combined to develop a generative labeler model. We used the Snorkel framework [16] to implement data programming. A preprocessing framework [17] was applied to our notes to split them into sentences using the spaCy [18] framework, with a specialized tokenizer to recognize terms specific to medical literature. The parsed grammatical information was then made available to the labeling functions as metadata.
We then used the development set to understand how the AF prediagnosis was described, and we expressed each pattern as a labeling function. The development process was iterative, where the Snorkel framework allowed us to observe the predictive values of the labeling functions on development set records. Each function could then be further optimized to reduce the differences between predictive values and actual labels, leading to overall performance improvement on the development set. Textbox 3 shows all the terms that were identified as denoting AF. Negations were properly handled.
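The paper does not reproduce the labeling functions themselves. As an illustration only, pattern-plus-negation heuristics of the kind described above might look like the following pure-Python sketch, which mimics Snorkel's voting conventions (each function votes POSITIVE, NEGATIVE, or ABSTAIN). The term lists and function names are hypothetical, not the actual terms from Textboxes 1 and 3.

```python
import re

# Snorkel-style label conventions: each labeling function votes
# POSITIVE, NEGATIVE, or abstains when its heuristic does not apply.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical term lists for illustration only.
AF_TERMS = r"(atrial fibrillation|afib|a-fib|irregular (heart )?rhythm)"
DEVICE_TERMS = r"(apple watch|fitbit|smartwatch|wearable)"

def lf_device_notification(sentence: str) -> int:
    """Vote POSITIVE when a wearable is described as notifying or alerting about AF."""
    s = sentence.lower()
    if (re.search(DEVICE_TERMS, s)
            and re.search(r"(notif|alert)\w*", s)
            and re.search(AF_TERMS, s)):
        return POSITIVE
    return ABSTAIN

def lf_negated_af(sentence: str) -> int:
    """Vote NEGATIVE when the AF mention is explicitly negated."""
    s = sentence.lower()
    if re.search(r"\b(no|denies|without)\b[^.]*" + AF_TERMS, s):
        return NEGATIVE
    return ABSTAIN

LABELING_FUNCTIONS = [lf_device_notification, lf_negated_af]

def apply_lfs(sentence: str) -> list:
    """Produce the per-function vote vector that a label model would consume."""
    return [lf(sentence) for lf in LABELING_FUNCTIONS]
```

Each function deliberately abstains outside its narrow pattern, which is why the individual functions trade recall for precision, as reported in the Results.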
Once developed, we applied the labeling functions on the samples and then instructed Snorkel to fit a generative model on the output. Specifically, we used 10-fold cross-validation on the test set and chose the labeler model with the best F1-score. This model was then applied to the entire corpus of 56,924 notes to probabilistically assign labels.
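Snorkel's generative label model learns a weight for each labeling function from the agreement structure of their votes. As a simplified, equal-weight stand-in for that combination step, a majority vote over the non-abstaining functions can be sketched as follows (the function names are illustrative and are not Snorkel's API):

```python
def majority_vote(votes, abstain=-1):
    """Combine labeling-function votes into one probabilistic label.

    Snorkel's generative label model weights each function by its
    estimated accuracy; this simplified stand-in weights all functions
    equally and returns P(positive) among the non-abstaining votes.
    """
    active = [v for v in votes if v != abstain]
    if not active:
        return 0.5  # no evidence either way
    return sum(active) / len(active)

def hard_label(votes, threshold=0.5):
    """Threshold the probabilistic label into a binary one."""
    return 1 if majority_vote(votes) > threshold else 0
```

The probabilistic output (rather than a hard 0/1 label) is what allows the downstream classifier to be trained with noise-aware weighting.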

Classifier Fine-Tuning
Notes that were probabilistically labeled by the labeler model were then used to fine-tune a large, NLP-based classifier: Clinical-Longformer [12] (Figure 2). The resulting classifier takes plain note text as the input and classifies the note as positive (ie, includes mention of a patient receiving an AF notification, or patient-initiated cardiac testing or electrocardiogram resulting in an AF prediagnosis) or negative. When a classifier is tuned on the labeler model output, it can generalize beyond the heuristics encoded in the labeling functions. Specifically, we fine-tuned the pretrained Clinical-Longformer for the sequence classification task, with varying training set sizes. For a single fine-tuning run, we chose the snapshot with the best F1-score on the test set as the representative. The Adam optimizer was used, with the learning rate ramping up to 1 × 10⁻⁵ followed by linear decay over 3 epochs. Clinical-Longformer has a maximum input length of 4096 subword tokens: 94% (53,509/56,924) of our notes fit this criterion, and notes with more tokens were trimmed. Fine-tuning other NLP-based classifiers (eg, ClinicalBERT [11], which accepts a smaller number of input tokens [512 or fewer]) resulted in poor performance (F1-score=0.21), hinting that they could not be properly fine-tuned on our lengthy clinical notes.
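The optimizer schedule described above (ramp to 1 × 10⁻⁵, then linear decay to zero over the 3 epochs) can be sketched as a plain function. The warmup_steps and total_steps parameters are hypothetical, since the paper does not report the warmup length:

```python
def learning_rate(step, total_steps, warmup_steps, peak_lr=1e-5):
    """Warmup-then-linear-decay schedule: the learning rate ramps
    linearly to peak_lr over warmup_steps, then decays linearly to
    zero by total_steps (3 epochs in the text)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))
```

In practice, Hugging Face's get_linear_schedule_with_warmup implements this same shape for the Adam optimizer.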
The test set was never presented to the classifier during the fine-tuning process. Since our data set was highly skewed toward negative samples, we stratified the training set to maintain a 1:2 ratio between the positive and negative notes. All samples were chosen randomly.
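A minimal sketch of the stratified random sampling described above, assuming the positive and negative notes are held in plain lists (the function name and seed are illustrative):

```python
import random

def stratify_training_set(positives, negatives, n_total, seed=42):
    """Randomly sample a training set that keeps the 1:2
    positive-to-negative ratio described in the text."""
    rng = random.Random(seed)
    n_pos = n_total // 3          # one third positive
    n_neg = n_total - n_pos       # two thirds negative
    sample = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(sample)
    return sample
```

This is also why the training set size was later capped at 15,000: with 5829 probabilistically positive notes, a larger set could not preserve the 1:2 ratio.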
The classifier with the best F 1 -score was then run across the entire set of 56,924 clinical notes to identify all incidents of AF prediagnoses.

Retrospective Cohort Study
Using the classifier, we identified patients who received an AF prediagnosis and performed 3 retrospective cohort studies comparing the characteristics of patients who received a prediagnosis to those who did not, using the same STARR data set.
First, we considered all the patients in the cohort regardless of their prior AF diagnosis. We compared the demographics, CHA2DS2-VASc (congestive heart failure, hypertension, age ≥75 years, diabetes, stroke, vascular disease, age 65-74 years, sex category) [19] score, and its related comorbidities on the date the index note was created. We defined the oldest note with a prediagnosis as the index note since it was the most likely to drive downstream medical intervention. When a patient had not received any prediagnosis, the oldest note with mention of a wearable was chosen as the index.
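The CHA2DS2-VASc score expands mechanically from the definition above: 2 points each for age ≥75 years and prior stroke/TIA, and 1 point for each remaining component. A sketch (the function signature is illustrative):

```python
def cha2ds2_vasc(age, is_female, chf, hypertension, diabetes,
                 stroke_or_tia, vascular_disease):
    """CHA2DS2-VASc score as defined in the text: congestive heart
    failure, hypertension, age >=75 years (2 points), diabetes, prior
    stroke/TIA (2 points), vascular disease, age 65-74 years, sex
    category (female)."""
    score = 0
    score += 1 if chf else 0
    score += 1 if hypertension else 0
    score += 2 if age >= 75 else (1 if 65 <= age <= 74 else 0)
    score += 1 if diabetes else 0
    score += 2 if stroke_or_tia else 0
    score += 1 if vascular_disease else 0
    score += 1 if is_female else 0
    return score
```

A score of 2 or higher is the threshold the study uses (citing [22]) as warranting anticoagulation therapy.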
Second, we focused on patients who did not have a prior AF diagnosis. Patients were excluded if they had received an AF diagnosis, defined as an ambulatory or inpatient encounter with SNOMED code 313217 or its descendants, prior to the index note. We then compared the same demographics and comorbidities between those who received a prediagnosis and those who did not, on the date the index note was created.
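A sketch of this exclusion step, assuming the SNOMED hierarchy is available as a parent-to-children mapping (in practice this would come from an ontology table; the mapping and encounter structures here are hypothetical, and only code 313217 is from the text):

```python
def descendants(root, children_of):
    """Collect a SNOMED concept and all of its descendants from a
    parent -> children mapping."""
    seen, stack = set(), [root]
    while stack:
        concept = stack.pop()
        if concept not in seen:
            seen.add(concept)
            stack.extend(children_of.get(concept, []))
    return seen

def has_prior_af(encounters, index_date, af_codes):
    """True when any encounter before the index note carries an AF
    code (SNOMED 313217 or a descendant). Dates are ISO-8601 strings,
    so lexicographic comparison matches chronological order."""
    return any(e["code"] in af_codes and e["date"] < index_date
               for e in encounters)
```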
Lastly, we further confined the analysis to patients who received a clinician-assigned AF diagnosis within 60 days from the index note. As before, we excluded patients who had a prior AF diagnosis before the index note. Patients were then grouped based on whether they had received an AF prediagnosis from a wearable and characterized on the date they received the index AF diagnosis. In addition to the demographics and comorbidities, we also compared anticoagulant medication (Textbox 4), rhythm management medication (Textbox 5), and cardioversion rates between the 2 groups. Only the index prescription and procedure that took place within 60 days from the index diagnosis were considered.

Statistical Analysis
When compiling patient race and ethnicity information, we used the 5 categories of race defined by the US Census and denoted Hispanic as a dedicated ethnicity.A total of 11.12% (2371/21,327) of the patients were missing race and ethnicity information, so we categorized them as belonging to the undisclosed category.
For hypothesis testing, we used the 1-tailed Welch t test for continuous variables and the χ2 test for categorical variables. One-tailed tests were chosen over 2-tailed tests since clinical contexts helped establish the comparison direction, providing for a stricter analysis. Statistical analysis was performed using Pandas.
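For the categorical comparisons, the Pearson χ2 statistic for a 2×2 table can be computed directly. As a worked example, the counts below are the CHA2DS2-VASc ≥2 proportions reported in the Results (525/1037 with a prediagnosis vs 5936/16,560 without); the resulting statistic far exceeds 10.83, the df=1 critical value for P=.001. In practice, scipy.stats.chi2_contingency and scipy.stats.ttest_ind(equal_var=False) provide the χ2 and Welch tests, respectively:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table
    [[a, b], [c, d]] (the test used for categorical variables)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Table 4 counts: rows = prediagnosis yes/no,
# columns = CHA2DS2-VASc >= 2 yes/no.
stat = chi_square_2x2(525, 1037 - 525, 5936, 16560 - 5936)
```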

Ethical Considerations
The STARR data set is derived from consented patients only. Patients were not compensated for participation. Data analyzed in this study were not deidentified, but the analysis was conducted in a HIPAA (Health Insurance Portability and Accountability Act)-compliant, high-security environment. The Stanford University Institutional Review Board approved this study (62865).

Labeler Model Performance
In total, 8 labeling functions were developed. Most (7/8, 88%) labeling functions used the grammatical information present in the metadata, whereas 1 (12%) used a simple dictionary-based lookup. Table 1 provides the performance of each labeling function, followed by the combined labeler model.
Since each labeling function was geared toward identifying positive samples that follow a specific pattern, each labeling function exhibited substantially higher precision than recall. By combining these labeling functions into 1 generative labeler model, we improved recall (0.72). The high labeler model accuracy (0.92) also showed that the model correctly classified negative samples. After running the labeler model on the set of 56,924 clinical notes, 5829 notes were flagged as positive, a substantial increase from the 105 positive notes identified through manual labeling.

Classifier Performance
Here, we report the performance of the classifier that was fine-tuned using the clinical notes labeled by the labeler model. Table 2 shows the average performance of the classifier on the test set, across varying training set sizes.
The training set size was capped at 15,000 to maintain the 1:2 positive-to-negative ratio (the labeler model labeled 5829 notes as positive). Regardless of the training set size, the test set was excluded from the input to the fine-tuning process. Table 2 demonstrates how classifier performance benefits from the weakly supervised approach. In particular, a training set size of 600 emulated the hypothetical scenario where the size of the training set is limited due to manual labeling overhead. Such a small data set was not enough to adequately fine-tune Clinical-Longformer (F1-score=0.48).
As the training set size increased, the classifier obtained better performance, reaching the best average F1-score of 0.83. When compared to the labeler model in Table 1 (recall=0.72), the classifier significantly improved recall (0.81), demonstrating that the classifier managed to generalize beyond the rules specified by the labeling functions. The receiver operating characteristic curve (Figure 3) shows that even the best classifier with a training set size of 600 performed worse than classifiers from larger data set sizes. In the precision-recall curve (Figure 4), the classifier lost substantial precision for small gains in recall, further hinting that the classifier was not properly trained. Across all training set sizes and runs, the best-performing classifier achieved an F1-score of 0.85 (accuracy=0.95).
Running this classifier on 56,924 clinical notes identified 6515 notes as containing an AF prediagnosis across 2279 unique patients.

Cohort Study: All Patients
Table 3 summarizes the characteristics of the entire cohort regardless of their prior AF diagnosis, reflecting the characteristics of generic patients that used wearables. In all, 5 patients were missing sex information and were not included in the analysis. Patients who received an AF prediagnosis from a wearable tended to be older, with more comorbidities except for diabetes mellitus. White and male individuals constituted a larger portion of patients with a prediagnosis, who also exhibited higher CHA2DS2-VASc scores.

Cohort Study: Patients Without a Prior AF Diagnosis
Table 4 then compares the characteristics of patients who had no AF diagnosis prior to the index note, highlighting the efficacy of wearables on the undiagnosed population. These patients exhibited similar characteristics to the overall cohort, where those who received an AF prediagnosis tended to be older, White, and male, with more comorbidities except for diabetes mellitus. In particular, 50.63% (525/1037) of the patients who received a prediagnosis had CHA2DS2-VASc scores of 2 or higher, warranting anticoagulation therapy [22]. In contrast, among the patients without a prediagnosis, only 35.85% (5936/16,560) had CHA2DS2-VASc scores of 2 or higher.

Cohort Study: Patients With a Clinician-Assigned AF Diagnosis
Among those patients who did not have a prior AF diagnosis, 29.41% (305/1037) of the patients with a wearable-assigned prediagnosis received a clinician-assigned AF diagnosis within 60 days from the index prediagnosis. The average duration from prediagnosis to diagnosis was 4.74 days. In contrast, only 1.58% (262/16,560) of those patients without a prediagnosis received a clinician-assigned AF diagnosis.
Table 5 compares the clinical characteristics of those patients who received an AF diagnosis, based on whether they had received a wearable-assigned prediagnosis prior to the diagnosis.
None of the patient characteristics reported in Table 5 differed significantly between those with an AF prediagnosis and those without (all P>.05). However, anticoagulant prescriptions differed based on AF prediagnoses, where more patients with a prediagnosis were prescribed apixaban and rivaroxaban.

Principal Findings
In this study, we applied a weak supervision-based approach to demonstrate the feasibility and efficacy of an EHR-based postmarket surveillance system for consumer wearables that render AF prediagnoses.
We first derived a labeler model from labeling heuristics expressed as labeling functions, which showed high accuracy (0.92; F1-score=0.77) on the test set. We then fine-tuned a classifier on the labeler model output to accurately identify AF prediagnoses (0.95; F1-score=0.83).
Further, using the classifier output, we identified patients who received an AF prediagnosis from a wearable and conducted a retrospective analysis to compare the baseline characteristics and subsequent clinical treatment of these patients against those who did not receive a prediagnosis. Across the entire cohort, patients with a prediagnosis were older with more comorbidities. The race and sex composition of these patients also differed from those who did not receive a prediagnosis (P<.001).
Focusing on the subgroup of patients without a prior AF diagnosis (Table 4), we observed that a higher percentage of patients (525/1037, 50.63% vs 5936/16,560, 35.85%) who received a wearable-assigned prediagnosis exhibited CHA2DS2-VASc scores that warranted a recommendation for anticoagulation therapy [22]. This increased likelihood for anticoagulation therapy could be attributed to an early prediagnosis from the wearable.
In the same subgroup, patients who received a prediagnosis were 18.61 times more likely to receive a clinician-assigned AF diagnosis than those who did not. The existence of a prediagnosis was not correlated with patient demographics, comorbidities, or AF subtype at the index diagnosis (Table 5) but did correlate with anticoagulant prescription, where patients with an AF prediagnosis were more frequently prescribed apixaban (P=.004) and rivaroxaban (P=.01).

Comparison With Prior Work
Given that more consumer wearables will be introduced with increasing prediagnostic capabilities, a surveillance framework for wearable devices is urgently needed to properly assess their impact on downstream health care [3,4].However, publications sponsored by wearable vendors focused mostly on ascertaining the accuracy of the prediagnostic algorithm itself [1,2].
On the other hand, publications that sought to conduct postmarket surveillance relied solely on manual chart review [3,4], which is hard to scale. In a prior study on wearable notifications, clinician review of 534 clinical notes yielded only 41 patients with an AF prediagnosis [3]. With a weakly supervised approach, our clinician review of 600 notes (ie, the test set) allowed the subsequent identification of 2279 patients with a prediagnosis. Such an improvement in recall enhanced the statistical power of our analysis. First, our cohort study finding that patients with an AF prediagnosis tended to be older, male, and White with higher CHA2DS2-VASc scores matches the key findings of the Apple Heart Study [1], which enrolled a much larger number of participants (n=419,297). Second, we were able to make a novel discovery in that a wearable-assigned prediagnosis increases the likelihood of patients receiving anticoagulation therapy and an eventual AF diagnosis, and we identified statistically meaningful anticoagulant prescription differences.
Prior work has applied various methods of weakly supervised learning to some form of medical surveillance [16,17,23-25]. Most relevantly, Callahan et al [23] implemented a surveillance framework for hip implants, and Sanyal et al [25] implemented one for insulin pumps. To the best of our knowledge, however, our work is the first to apply a weakly supervised approach to consumer wearable surveillance. Without prescription records, consumer wearable surveillance can be challenging to scale.

Limitations
We acknowledge that the STARR data set is confined to a small health care system in a single geographic region, which is known [13] to serve populations with higher percentages of male, White, and older individuals. We recommend that other institutions monitor their patient populations by developing their own surveillance frameworks using our weakly supervised methodology. In fact, work is already underway to adapt this approach for use at Palo Alto Veterans Affairs.
We could not establish causality between prediagnoses and patient characteristics. The fact that older, White, and male patients with more comorbidities had a higher likelihood of receiving an AF prediagnosis may very well reflect that they are more health conscious and use wearables more frequently.

Conclusions
By providing prediagnoses, consumer wearables have the potential to affect subsequent diagnoses and downstream health care. Postmarket surveillance of wearables is necessary to understand the impact but is hindered by the lack of codified terms in EHRs to capture wearable use. By applying a weakly supervised methodology to efficiently identify wearable-assigned AF prediagnoses from clinical notes, we demonstrate that such a surveillance system could be built.
The cohort study conducted using the constructed system carried enough statistical power to verify the key findings of the Apple Heart Study, which enrolled a much larger number of patients, where patients who received a prediagnosis tended to be older, male, and White with higher CHA2DS2-VASc scores. We also made a novel discovery in that a prediagnosis from a wearable increases the likelihood for anticoagulant prescription and an eventual AF diagnosis. At the index diagnosis, the existence of a prediagnosis from a wearable did not distinguish patients based on clinical characteristics but did correlate with anticoagulant prescription.
Our work establishes the feasibility and efficacy of an EHR-based surveillance system for consumer wearable devices. Further work is necessary to generalize these findings for patient populations at other sites.

Authors' Contributions
RMY, BTV, KNP, JAF, and NHS contributed to concept and design. JAF contributed to the acquisition of data. RMY, BTV, KNP, JAF, AZ, TP, and ND contributed to the analysis and interpretation of data. RMY and BTV contributed to the drafting of the manuscript. RMY, KNP, JAF, AZ, TP, ND, and NHS contributed to critical revision of the manuscript for important intellectual content. RMY contributed to statistical analysis. NHS contributed to the provision of patients or study materials, obtaining funding, and supervision. JAF and NHS contributed to administrative, technical, or logistic support.

Conflicts of Interest
KNP receives research grants from the American Heart Association and the American College of Cardiology and is a consultant for Evidently and 100Plus.JAF is a research consultant for Snorkel AI.

Multimedia Appendix 1
Labeling guideline developed as part of the test set generation.

Figure 1 .
Figure 1. Labeler model generation process. Labeling heuristics were expressed as code-based labeling functions. Snorkel [16] then applied the labeling functions to the sample clinical notes and fit a generative model on the predictions of the labeling functions. The resulting labeler model probabilistically assigns a label to a clinical note based on whether the note mentions the patient receiving an AF prediagnosis from the wearable device. AF: atrial fibrillation.

Figure 2 .
Figure 2. Classifier generation process. The labeler model was used to probabilistically assign labels for a large number of unlabeled clinical notes, which were then used to fine-tune a classifier to detect whether a patient received an AF prediagnosis from a wearable device. AF: atrial fibrillation; NLP: natural language processing.

Figures 3 and 4
Figures 3 and 4 show the comparisons of the best-performing (by F1-score) classifiers from each training set size.

Figure 3 .
Figure 3. Classifier receiver operating characteristic (ROC) curve across varying training set sizes. For each training set, the best-performing (by F1-score) run was chosen among 3 runs with different random seeds. For each run, the best-performing classifier snapshot was chosen.

Figure 4 .
Figure 4. Classifier precision-recall curve across varying training set sizes. For each training set, the best-performing (by F1-score) run was chosen among 3 runs with different random seeds. For each run, the best-performing classifier snapshot was chosen.

Textbox 4 .
Anticoagulant medications analyzed in this study.

Table 1 .
Labeling function (LF) and labeler model performance. Averages were taken from 10-fold cross-validation on the test set of 600 manually labeled notes. Italic numbers indicate the best observed performance for each metric.

Table 2 .
Classifier performance across varying training set sizes. For each training set, average values are reported across 3 runs with different random seeds. For each run, the classifier snapshot with the highest F1-score was chosen.

Table 4 .
Characteristics of patients without a prior atrial fibrillation diagnosis. Measured on the date of the index note. Statistical significance was assessed at α=.05. CHA2DS2-VASc: congestive heart failure, hypertension, age ≥75 years, diabetes, stroke, vascular disease, age 65-74 years, sex category.

Table 5 .
Characteristics of patients with a clinician-assigned atrial fibrillation diagnosis. Measured on the date of the index atrial fibrillation diagnosis. Medications that were not prescribed are omitted. CHA2DS2-VASc: congestive heart failure, hypertension, age ≥75 years, diabetes, stroke, vascular disease, age 65-74 years, sex category. Statistical significance was assessed at α=.05.