TY  - JOUR
AU  - Acitores Cortina, Jose Miguel
AU  - Fatapour, Yasaman
AU  - Brown, Kathleen LaRow
AU  - Gisladottir, Undina
AU  - Zietz, Michael
AU  - Bear Don't Walk IV, Oliver John
AU  - Peter, Danner
AU  - Berkowitz, Jacob S
AU  - Friedrich, Nadine A
AU  - Kivelson, Sophia
AU  - Kuchi, Aditi
AU  - Liu, Hongyu
AU  - Srinivasan, Apoorva
AU  - Tsang, Kevin K
AU  - Tatonetti, Nicholas P
PY  - 2025
DA  - 2025/03/27
TI  - Biases in Race and Ethnicity Introduced by Filtering Electronic Health Records for “Complete Data”: Observational Clinical Data Analysis
JO  - JMIR Med Inform
SP  - e67591
VL  - 13
KW  - health disparities
KW  - data quality
KW  - observational research
KW  - electronic health records
KW  - racial and ethnic biases
AB  - Background: Integrated clinical databases from national biobanks have advanced the capacity for disease research. Data quality and completeness filters are used when building clinical cohorts to address limitations of data missingness. However, these filters may unintentionally introduce systemic biases when they are correlated with race and ethnicity. Objective: In this study, we examined the race and ethnicity biases introduced by applying common filters to 4 clinical records databases. Specifically, we evaluated whether these filters introduce biases that disproportionately exclude minoritized groups. Methods: We applied 19 commonly used data filters to electronic health record datasets from 4 geographically varied locations comprising close to 12 million patients to understand how using these filters introduces sample bias along racial and ethnic groupings. These filters covered a range of information, including demographics, medication records, visit details, and observation periods. We observed the variation in sample drop-off between self-reported ethnic and racial groups for each site as we applied each filter individually. Results: Applying the observation period filter substantially reduced data availability across all races and ethnicities in all 4 datasets. However, among those examined, the availability of data in the white group remained consistently higher compared to other racial groups after applying each filter. Conversely, the Black or African American group was the most impacted by each filter on these 3 datasets: Cedars-Sinai dataset, UK Biobank, and Columbia University dataset. Among the 4 distinct datasets, only applying the filters to the All of Us dataset resulted in minimal deviation from the baseline, with most racial and ethnic groups following a similar pattern. Conclusions: Our findings underscore the importance of using only necessary filters, as they might disproportionally affect data availability of minoritized racial and ethnic populations. Researchers must consider these unintentional biases when performing data-driven research and explore techniques to minimize the impact of these filters, such as probabilistic methods or adjusted cohort selection methods. Additionally, we recommend disclosing sample sizes for racial and ethnic groups both before and after data filters are applied to aid the reader in understanding the generalizability of the results. Future work should focus on exploring the effects of filters on downstream analyses.
SN  - 2291-9694
UR  - https://medinform.jmir.org/2025/1/e67591
UR  - https://doi.org/10.2196/67591
DO  - 10.2196/67591
ID  - info:doi/10.2196/67591
ER  -