Abstract
Background: Considering sex and gender improves research quality, innovation, and social equity, while ignoring them leads to inaccuracies and inefficiency in study results. Despite increasing attention on sex- and gender-sensitive medicine, challenges remain with accurately representing gender due to its dynamic and context-specific nature.
Objective: This work aims to contribute to the implementation of a standard for collecting and assessing gender-specific data in German university hospitals and associated research facilities.
Methods: We carried out a review to identify and categorize state-of-the-art gender scores. We systematically assessed 22 publications regarding the applicability and practicability of their proposed gender scores. Specifically, we evaluated the use of these gender scores on German research data from routine clinical practice, using the Medical Informatics Initiative core dataset (MII CDS).
Results: Different methods for assessing gender have been proposed, but no standardized and validated gender score is available for health research. Most gender scores target epidemiological or public health research where questions about social aspects and life habits are already part of the questionnaires. However, it is challenging to apply concepts for gender scoring on clinical data. The MII CDS, for example, lacks all variables currently being recorded in gender scores. Although some of the required variables are indeed present in routine clinical data, they need to become part of the MII CDS.
Conclusions: To enable gender-specific retrospective analysis of routine clinical data, we recommend updating and expanding the MII CDS by including more gender-relevant information. For this purpose, we provide concrete action steps on how gender-related variables can be captured in routine clinical practice and represented in a machine-readable way.
International Registered Report Identifier (IRRID): RR2-10.2196/57669
doi:10.2196/74162
Introduction
Background
Taking sex and gender (see and ) into account improves the quality of research and care, and it supports social equity []. Considering sex and gender from the outset can lead to new discoveries and foster innovation. Ignoring sex and gender, however, leads to inaccuracies, inefficiencies, and difficulties in generalizing results. This is highlighted, for example, by a recent study showing that men with low femininity reported a significant decrease in anxiety during the COVID-19 pandemic, whereas women with low femininity reported a significant increase [].
Biological sex is defined through biological attributes, such as:
- Genetics (ie, chromosomes, gene expression)
- Hormone levels
- Anatomy (ie, internal and external sex organs/reproductive organs)
Deviations in hormonal or chromosomal attributes are possible. In such cases, the external appearance can be intersex, and a clear assignment may not be possible.
Gender refers to socially constructed roles, behaviors, expectations, and norms associated with being a woman, man, girl, or boy. It is formed through biological and psychosexual factors and an individual’s social biography, and it is shaped by one’s role in society. Gender influences how people perceive themselves and others and how they act and interact.
Many health care decisions are influenced by the patients’ sex and gender, through social norms and experiences. Sex- and gender-sensitive medicine is receiving increasing attention in international research []. However, several publications have shown that women are still not considered equally to men and are poorly represented due to biased study designs [-]. It is well known today that this practice leads to an increase in inequality in medical treatment due to inappropriate therapies. Moreover, sex- and gender-specific reporting is often deficient, which leads to a lack of reproducibility and reduced effectiveness of research studies [].
Even though the situation has improved, it can still be challenging to represent psychosocial gender aspects, as gender is a context-specific construct that is dynamic and multidimensional and can differ across time, geographical regions, and societies [].
Gender scores are developed and used to assess gender roles and gender identity. They can be created and applied retrospectively to extract frequently missing gender information from already existing data. For this purpose, gender-sensitive variables in the existing data are determined by expert knowledge or statistical algorithms. Gender scores can then be calculated to predict social gender roles. For example, Lacasse et al [] and Nauman et al [] each developed a retrospective gender score and applied it to specific datasets from population studies. Examples of the corresponding gender-sensitive variables are specific professions, such as those in health care or construction, and typical personality traits, such as agreeableness or neuroticism; loneliness and stress were also considered. In general, retrospective scores cannot capture participants’ current gender identity (see ).
Gender identity is determined through individual self-perception and the sexual identity of a person, which may differ from someone’s physical appearance or biological sex.
Prospective assessments can enhance the collection of gender-specific data from the outset of a study by incorporating further social data or information regarding gender identity. Fraser et al [] and Lagos and Compton [], for example, evaluated a prospective 2-step gender identity measure on data from large studies. The 2-step measure differentiates between a person’s current gender identity and birth-assigned sex (see and ). The given examples show that gender scores are mostly used in surveys and studies and hardly used on routine clinical data.
Birth-assigned sex (BAS) refers to the information that can be found on official documents, such as a birth certificate, and is assigned based on the child’s external anatomy or sex organs. BAS is often measured by a binary choice of response (ie, male or female). However, neither biological sex nor BAS is a strict dichotomy; diversity exists in both.
All 38 university hospitals in Germany are part of the Medical Informatics Initiative [] (MII), which serves to close the gap between research and routine care. A core dataset (CDS) is being developed by the MII consortia and the participating university hospitals to foster interoperability and allow a shared use of routine clinical data across Germany. Data Integration Centers collect and process routine clinical data and make them accessible to researchers for secondary use. If and how this established dataset considers gender and gender score applications are open questions.
It is important to note that the discussion about the applicability and state of gender scores does not only affect Germany. Critical assessments of sex and gender scores reveal that the problem affects different western cultures [-]. Ignoring gender differences in clinical research is a widespread global issue that undermines the validity and applicability of scientific findings across diverse populations []. A study conducted in Australia using the Bem Sex Role Inventory (BSRI) [] found that the majority of participants received different gender scores in at least 1 of the 3 consecutive years, highlighting the complexity and fluid nature of gender identity []. This is particularly concerning because the BSRI yielded inconsistent results even among a relatively stable population of patients older than 75 years — where minimal shifts in habits or identity would be expected over time. Despite being based on outdated social stereotypes, the BSRI is one of the most used gender scores, and other scores are partly based on it, as we later describe.
In this context, where classic gender scores are struggling, new tools are being considered to annotate medical data according to the gender of patients, thanks to the rise of machine learning methods []. The results of such methods can then be systematically used to mitigate biases in research, as done by Neufang et al [], who developed an artificial intelligence model to improve the fairness of attention deficit hyperactivity disorder diagnosis with respect to gender. It is important to understand whether these algorithms are also applicable to general clinical datasets, such as the MII CDS, to allow large-scale integration of gender debiasing in future research.
The issues outlined in the previous paragraphs indicate that more work needs to be done to update gender scores, taking into account the changes in commonly defined social rules that have occurred in the last century. In practice, outdated scores, such as the BSRI, are often used without considering proper validation and potential biases []. Moreover, a gold standard is still missing for how sex and gender should be considered in research with clinical research data. A practicable score should be balanced, comprehensive, and easily applicable. Interestingly, no gender score has so far been established as an international standard.
We therefore conducted a systematic evaluation of gender scores regarding their applicability in health research using German efforts on Data Integration in Medicine as a benchmark. Because Germany is the world’s third-largest economy and one of the World Health Organization’s top two donors, we expected the German MII CDS to be a good candidate for our evaluation. In this article, the term gender score refers to the assessment of gender in general (social gender and gender identity), and the term gender identity measure refers exclusively to gender identity.
Related Work
The focus of our scoping review protocol published in JMIR Research Protocols [] was the applicability and practicability of gender scores in health research. Related to the work described in this paper, Horstmann et al [] published a review on recently used instruments for the operationalization of sex and gender in health research, and the work by Miani et al [] summarizes epidemiological aspects in gender scores. The reviews included publications until 2020 and 2021, respectively. For this work, we identified the latest scores and therefore deliberately limited our search to publications between 2019 and 2024. Of the 18 articles from our primary result set, 13 were published after 2020.
In parallel to our work on the systematic review of recent gender scores, another systematic review about the operationalization of gender via composite gender scores in epidemiological studies was published in 2024 by Ballering et al []. The authors searched 3 databases (PubMed, Web of Science, and CINAHL) and identified 24 articles with a total of 26 gender scores developed in Europe or North America. They aggregated information on gender scores regarding author, year of publication, methodology, cohort information, and the included variables of the respective gender score and found that many variables overlapped across multiple studies.
Ballering et al [] criticized theory-driven approaches — developed solely based on expert knowledge — as experts are not free from bias, potentially enhancing sexism or other biases. The authors further claim that data-driven approaches — using statistical models to develop gender scores — only predict female or male sex through identified psychosocial variables and do not consider gender itself. Psychosocial variables that do not associate with sex are excluded, and the quality of the gender scores is strongly dependent on the quality of the dataset. Overall, Ballering et al [] concluded that data-driven approaches might be good to personalize health care, as gendered variables point out differences between women and men that could lead to improved treatment decisions and patient care.
The work by Ballering et al [] is a great foundation for our work, and we are able to contribute further insights on the applicability of gender scores on routine clinical data. We had an overlap of 13 articles on both sides, with 13 of 22 articles in our review and 13 of 24 articles in the review by Ballering et al [] (ie, there was an overlap of more than 50%: 59% ours and 54% theirs). We found further articles regarding gender identity that can be applied from the outset of a study or data collection, which could be of great use to implement gender information in routine clinical practice.
Aim of This Work
In the scope of this work, we identified and categorized state-of-the-art gender scores, systematically assessed their applicability and practicability, and evaluated their applicability on German research data from routine clinical practice (see the Applicability of Gender Scores on the MII CDS section).
The process of our work is illustrated in . This work aimed to contribute to the necessary implementation of a standard for collecting and assessing gender-specific data in German university hospitals and the respective research facilities. We therefore formulated 4 action steps for an extension in health research to enable gender-specific analyses (see ); they are shown at the bottom of the figure and described in detail in the Discussion section.

Methods
Scoping Review
We conducted a scoping review to identify articles that reported on the development and implementation of gender scores in health-related studies (see ). It resulted in 22 articles, which were identified through a title and abstract screening followed by a full-text evaluation. Moreover, we developed a data charting form to extract information on article type and applicability of the gender score in health research. The review protocol was published in JMIR Research Protocols [].
Assessment of Applicability and Practicability
To assess the applicability and practicability of gender scores, we followed a systematic approach (see ). In the scope of this work, we extended the data charting process to fully assess the applicability of gender scores on routine clinical data.
Social Gender and Gender Identity
First, we distinguished between scores for gender identity (see ) versus social gender (see and ). Gender identity (see ) is usually measured in a prospective way because it is a purely subjective perception and it is not possible to extract this information when it was not collected. However, social gender includes norms that are expected in society, stereotypes, and certain behaviors (see ), and it can be measured prospectively and retrospectively. Typical behavior (eg, specific hobbies, occupational segregation, doing housework or odd jobs) can be analyzed and interpreted after data collection. A typical female or male lifestyle or behavior will lead to different exposure to risk factors for accidents or disease. Social expectations will also influence the process of recovery.
| Author(s) (year) | Description | Questions and answer options |
| Lagos and Compton (2021) [] | 2-step gender identity measure | |
| Fraser et al (2020) [] | Open-ended gender measure | What is your gender? (open-ended item) |
| McGuire et al (2019) [] | Genderqueer Identity Scale | |
| Lombardi and Banik (2016) [] | 2-step gender identity measure | |
| Author(s) (year) | Gender score variables |
| Pohrt et al (2022) [] | |
| Tibubos et al (2022) [] | |
| Demuth et al (2021) [] | The questionnaire is not available. |
| Nielsen et al (2021) [] | |
| Pelletier et al (2015) [] | |
| Spence et al (1975) [] | |
| Bem (1974) [] | |
a Overlaps with the already published review by Ballering et al [].
b Variables that are potentially covered in routine clinical care in Germany.
c PAQ: Personal Attributes Questionnaire.
d BSRI: Bem Sex Role Inventory.
| Author(s) (year) | Gender score variables |
| Cipriani et al (2024) [] | |
| Gisinger et al (2023) [] | |
| Teterina et al (2023) [] | |
| Vader et al (2023) [] | |
| de Breij et al (2022) [] | |
| Wandschneider et al (2022) [] | |
| Nauman et al (2021) [] | |
| Yuan et al (2021) [] | |
| Ballering et al (2020) [] | |
| Lacasse et al (2020) [] | |
| Smith and Koehoorn (2016) [] | |
a Overlaps with the already published review by Ballering et al [].
b Variables that are potentially covered in routine clinical care in Germany.
c ICD-10-CA: International Statistical Classification of Diseases and Related Health Problems, Tenth Revision, Canada.
Prospective and Retrospective Scores
Second, we distinguished between prospective (see and ) and retrospective (see ) scores. It is advisable to include gender variables from the outset of a study; this results in a prospective gender score that is easy to apply to the study dataset. However, the large volume of existing data drives the development of retrospective gender scores, with which gender-specific information can be extracted in hindsight.
Cohort Type and Size
Third, it needs to be considered whether the gender scores themselves are representative. In practice, some gender scores are developed on restricted datasets, for instance, for specific patient groups or other samples that are not representative (eg, older age groups, workers, or specific diagnoses). This is underlined by the finding of Wandschneider et al [] who showed that gendered practices vary between eastern and western Germany, reflecting a different social development due to the historical political context. As gender norms, stereotypes, or behaviors vary between societies, the relevant gender variables can differ across social groups, countries, and regions or even between different societies within a country.
Furthermore, since retrospective scores depend on statistical modeling, we also considered the sample size on which the models were based. We further classified the background against which each score was developed: we recorded whether the research group used patient or population data to judge which gender score might be more appropriate for the MII CDS. We also considered the continent where the data collection took place, because gender-sensitive variables vary with society and geographical location.
Usability
Fourth, we examined the validity, usability, and practicability of the gender scores. Validity is the main issue because it is of high importance that the score is representative regarding gender. Usability refers to how well the score is established. If a new standard is to be introduced, it is of high importance to use an accepted model to ensure that the scores are usable, comparable, and applicable to large-scale studies and assessments. Practicability refers to how difficult a score would be to integrate into health research.
Implementation
Fifth, we assessed the level of implementation of the scores. A theoretical model is not sufficient to apply a score systematically to existing data; the executable code of the model is needed to integrate it directly. We therefore investigated the applicability of gender scores from a technical viewpoint to evaluate whether the models can realistically be implemented in a clinical setting. We used the following categories to describe this aspect:
- Code published: an executable model was published for the gender score.
- Data published: the underlying data are available, and the score could be reproduced.
- Statistical parameters published: all statistical parameters of the model were published, and the model could be constructed from them.
- Not published: not all relevant parameters of the model are available.
We considered “code published” the gold standard regarding applicability. “Data published” and “statistical parameters published,” on the other hand, might contain sufficient information to construct an executable model; in practice, however, limitations might remain if the description in the article is not precise enough to reproduce all steps. In the scope of this study, we considered code or data available “on request” as unavailable.
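The four implementation categories can be read as a simple decision rule. The following Python sketch is an illustration only: the `Availability` record, its field names, and the precedence order (code before data before parameters) are assumptions made for this example and are not taken from the data charting form.

```python
from dataclasses import dataclass
from enum import Enum


class ImplementationLevel(Enum):
    CODE_PUBLISHED = "code published"
    DATA_PUBLISHED = "data published"
    PARAMETERS_PUBLISHED = "statistical parameters published"
    NOT_PUBLISHED = "not published"


@dataclass(frozen=True)
class Availability:
    """Hypothetical charting record for one gender score article."""
    code: str = "none"  # "public", "on_request", or "none"
    data: str = "none"  # "public", "on_request", or "none"
    all_parameters_reported: bool = False


def classify(avail: Availability) -> ImplementationLevel:
    """Assign the highest category that applies (assumed precedence)."""
    # Material that is only available "on request" counts as unavailable,
    # so only the value "public" is accepted here.
    if avail.code == "public":
        return ImplementationLevel.CODE_PUBLISHED
    if avail.data == "public":
        return ImplementationLevel.DATA_PUBLISHED
    if avail.all_parameters_reported:
        return ImplementationLevel.PARAMETERS_PUBLISHED
    return ImplementationLevel.NOT_PUBLISHED
```

For example, `classify(Availability(code="on_request", data="public"))` yields `ImplementationLevel.DATA_PUBLISHED`, because code offered only on request is not counted.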
Interrelations
Based on the extracted data, we performed further analysis to better understand the currently available gender scores. We noticed early on that direct dependencies exist between gender scores. Therefore, we extracted and modeled different types of dependencies between gender scores (see ). This analysis allowed us to investigate how scores were developed and whether improvements were made to existing gender scores.

One relation between gender scores captures cases in which an existing gender score was used directly or adapted in another score (see B). For instance, the gender score from Pelletier et al [] includes the BSRI []. de Breij et al [] adapted the score from Smith and Koehoorn [] but changed “caring for children” to “informal caregiving” as the data were available in their dataset (see B).
Moreover, development relationships (see C) reflect when one gender score was developed based on the approach, for instance the statistical model, of another research group. For example, Gisinger et al [], with their statistical model of principal component analysis and logistic regression analysis, were inspired by the approach of Pelletier et al [].
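To make this kind of statistical model more concrete, the sketch below shows only the logistic regression step: a score between 0 and 1 is computed from a weighted sum of gender-related variables. The principal component analysis step is omitted. All variable names, the intercept, and the coefficients are hypothetical placeholders chosen for illustration; they are not values from any of the reviewed scores.

```python
import math
from typing import Mapping

# Hypothetical intercept and coefficients; real values would come from
# fitting a logistic regression model to a study dataset.
INTERCEPT = -0.5
COEFFICIENTS = {
    "primary_earner": -0.8,
    "hours_housework_per_week": 0.05,
    "perceived_stress": 0.3,
}


def gender_score(record: Mapping[str, float]) -> float:
    """Return the logistic function of the weighted variable sum, in (0, 1)."""
    missing = [name for name in COEFFICIENTS if name not in record]
    if missing:
        # A retrospective score cannot be computed if a variable was never collected.
        raise KeyError(f"missing variables: {missing}")
    linear = INTERCEPT + sum(
        weight * record[name] for name, weight in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))
```

The explicit check for missing variables mirrors the practical constraint of retrospective scores: the model can only be applied to a dataset that contains every variable it was built on.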
Applicability on Clinical Routine Data
To assess whether gender scores are applicable on existing data and whether they can be integrated in clinical routine data, we used the technical specification of the MII Core Data Set - Module “Person” as a representative sample against which to evaluate the gender scores (see ). The MII CDS [] (see the Background section) is being collected across all 38 university hospitals in Germany. It contains interoperable, consolidated data items with a high degree of standardization in line with international standards to make collaborative data analysis practicable and efficient.
The CDS [] consists of basic and extension modules that are subject to iterative refinement. The basic modules are generic, and the extension modules are relevant for specific medical fields and disciplines such as oncology or intensive care. We used the Person Module, one of the early basic modules, which contains demographic information such as name, date of birth, address, and administrative gender as stated in official documents. The CDS has a comprehensive description of how the data are structured and how the data can be accessed, which allows us to directly compare it against gender scores. The information is available and can be compared based on a sample dataset []. Overall, this makes the CDS a suitable choice for studying how gender scores can be applied in realistic clinical data collection.
For retrospective scores, we compared the variables contained in the gender scores against the variables available in routine clinical data. To this end, we systematically checked the variables included in the statistical models of the gender scores against all variables contained in the sample dataset of the MII CDS.
Ethical Considerations
We did not collect any personal data, so ethical approval was not needed.
Results
The full results of the review are presented in to maintain brevity in the main text.
Scoping Review
Our review results are categorized into prospective development (–) and retrospective development () and between gender scores (–) and gender identity measures (). Articles that overlapped with the already published review by Ballering et al [] are marked in the tables.
Assessment of Applicability and Practicability
Social Gender, Gender Identity, and Prospective and Retrospective Scores
We distinguished between articles targeting gender identity and articles targeting social gender. We further classified them into prospective and retrospective development and application.
The 4 identified gender identity measures (see ) are prospectively applicable and could be included in future studies for further data collection regarding gender identity. Participants need to be asked explicitly which gender they belong to; this information cannot be analyzed from existing data.
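As an illustration of how answers to a 2-step measure could be stored in a machine-readable way, the following Python sketch keeps the two items as separate fields and serializes them to JSON. The class, the field names, and the use of free-text values are assumptions for this example; the answer options of the published measures are not reproduced here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class TwoStepResponse:
    """One participant's answers to a 2-step gender identity measure."""
    current_gender_identity: Optional[str]  # self-reported; None if not answered
    birth_assigned_sex: Optional[str]       # self-reported; None if not answered

    def is_complete(self) -> bool:
        """True only if both items were answered."""
        return None not in (self.current_gender_identity, self.birth_assigned_sex)

    def to_json(self) -> str:
        """Serialize both items without merging them into one field."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping the two items separate, rather than deriving a single combined label at collection time, preserves what the participant actually reported and leaves any later grouping to the analysis.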
Moreover, we found 18 gender scores to measure social gender in different ways. Of these, 7 are prospective gender scores (see ) that could provide guidance on how to apply (social) gender to a study from the outset. Of the 18 scores, 11 are retrospective (see ) and could be applied on already existing data.
Cohort Type and Size
The background of the cohorts is of importance for the practicability. A score contains different variables, depending on whether it was developed on data from population studies or on patient data. The majority of datasets are based on population studies and thus on specific information that is typically collected in these studies. For a complete comparison of cohorts, see our results in .
Only 2 gender scores were developed exclusively based on patient data. Teterina et al [] focused on patients who entered the emergency department with traumatic brain injuries and published an innovative approach to distinguishing gender by diagnostic codes. These diagnostic codes also describe the background of the accident or incident, enabling conclusions to be drawn about gender. Cipriani et al [] focused on the effects of gender on emergency patients with psychological crisis. Therefore, they used variables assessing childhood trauma, education, sleep quality, and living situation.
Moreover, the article by McGuire et al [] assessed gender identity during transition through the collection of patient and population data.
However, all of these studies were restricted by the specific patient cohort group and focused on specific conditions. This hampers their applicability on the general public and their usefulness as broadly applied gender scores for routine clinical data.
In addition, the population cohorts used were neither comprehensive nor inclusive. Many studies used a cohort with a specific age group [,,,,,] or with a specific background like being retired [] or being a worker [,]. This is a major weakness for generalizing results, as a gender score based on a limited cohort is not representative of society as a whole and therefore lacks comprehensiveness.
Furthermore, the majority of cohorts used for developing gender scores are based in the same geographic region, primarily Europe and North America. Only one gender identity measure was used on a cohort in Oceania []. We could not find a single gender score publication with a cohort from Asia, South America, or Africa. As a result, the developed scores are not representative for every cohort group and are not universally applicable.
With respect to cohort size, which forms the basis for the statistical development of retrospective gender scores and influences the reliability of statistical models, some works used large cohorts or datasets with n>100,000 [,,], and the study by Smith and Koehoorn [] stood out with a cohort of approximately 700,000 individuals. Other gender scores were developed on smaller cohorts of n<1000 [,,].
Usability
We examined the validity of scores, meaning how representative the work is regarding gender. Gender can be different in each society, so it is a major task to implement a gold standard. Some research groups used sex as the dependent variable and, therefore, gender as a predictor for sex (eg, [-,]). This approach can be problematic since gender should be considered beyond biological sex.
Statistical regression approaches can be systematically validated by using a control set to test the generalization of the regression model (eg, by systematic evaluation on a dedicated validation set [,,,]). Researchers who developed a number of scores using the approach by Pelletier et al [] did not specify whether they performed a systematic validation of their developed scores [,,]. Nauman et al [] did not use a validation, and Yuan et al [] reported a performance value without describing the validation setting they used.
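Validation on a dedicated set can be sketched in a few lines of Python. This is a generic holdout scheme under stated assumptions, not the procedure of any reviewed study: the record layout (one predictor value and one binary label), the 25% validation fraction, and the function names are chosen only for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

Record = Tuple[float, int]  # (predictor value, observed binary label)


def holdout_split(
    records: Sequence[Record], validation_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[Record], List[Record]]:
    """Shuffle reproducibly and return (training set, validation set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def accuracy(model: Callable[[float], int], records: Sequence[Record]) -> float:
    """Share of records for which the model reproduces the observed label."""
    correct = sum(1 for value, label in records if model(value) == label)
    return correct / len(records)
```

A model would be fitted on the training set only, and `accuracy` would then be reported on the validation set, which the model has not seen during fitting.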
Exclusively theory-driven gender scores, where variable scaling is determined by human experts, were not evaluated since there is no gold standard for gender [,,]. Looking at usability, referring to how well established a score already is, some authors did not mention any practical application of their scores [], whereas other scores were often cited, adapted, and reapplied [,,,]. Although highly cited scores seem to be widely accepted and used, they might still not be well suited to represent gender. The most cited score was the BSRI, which is also part of other gender scores (eg, [,]; see ), and was often criticized and considered outdated [].
Regarding practicability, we considered the scope and complexity of questionnaires and how universally applicable variables are on existing data. Some variables were very specific to the cohort used, whereas others frequently appeared in similar studies. Most gender identity measures are well suited for clinical data collection, being concise and easy to understand, except for the score from McGuire et al [], which involves 23 questions and focuses on gender transition, making it less suitable. The BSRI [] and PAQ [] are lengthy prospective gender scores, as are other gender scores that contain them. Tibubos et al [] developed a shorter form of the PAQ with only 8 variables. We also listed different forms of data collection, from self-reporting questionnaires (eg, [,]) to interviews (eg, [,,]). The length of forms could be especially relevant if participants complete a self-reporting questionnaire. Interviews tend to be more time-consuming for the staff conducting them.
Regarding retrospective scores, the length does not have such a big impact, but some scores are much shorter than the majority [,]. A minimalistic score can be a trade-off between prediction quality and the number of necessary variables. This was, for instance, investigated by Ballering et al [], who compared the performance between gender scores including 9 and 85 predictive variables.
Implementation
Only 3 research groups made their analysis code publicly available (code published) and provided reproducible gender scores [,,].
Two studies published their data [] or used publicly available data [] (data published), which are necessary to evaluate their conclusion. All other works did not publish data, which might be justified by data protection regulations. However, anonymized data could still be helpful for recreating a study.
Statistical parameters were published in most retrieved papers (statistical parameters published), some more complete than others. However, in the majority of cases, our assessment showed that reconstructing a gender score from the provided parameters would be challenging.
We identified one paper that did not publish the gender score (not published) []. They adapted the scores from Nauman et al [] and Pelletier et al [] to form a new gender score, but they did not provide any specific information regarding their methodology.
Overall, many research groups provided too little material for their work to be repeated, which complicates the reuse of existing scores. A systematic overview of the reproducibility of each study is given by our results in .
Interrelations
We registered different dependencies and relationships between gender scores and visualized them (see ). Of the retrieved articles, 7 were developed with totally new approaches or methods (see A).
Other scores adapted or included content-related parts of already existing scores (see B). For example, Pelletier et al [] included the BSRI masculinity and femininity scores []. Spence et al [] were inspired by Rosenkrantz et al [] for the development of the PAQ in 1975, and thereupon, Tibubos et al [] developed a short form of the PAQ in 2022. It is notable that the BSRI [] and PAQ [] each form the basis of 2 other scores.
Sometimes research groups implemented the statistical or methodological approach of another team (see C). For instance, the statistical approach of Pelletier et al [] was frequently used as the basis for current gender scores, for example, by Wandschneider et al [], Lacasse et al [], Gisinger et al [], Pohrt et al [], Cipriani et al [], and Yuan et al [].
Applicability on Clinical Routine Data
Evaluation Overview
We tested all retrieved scores and their published variables for whether they are available in the MII CDS. Initially, we planned to only test retrospective scores developed on clinical data. However, we found that this would limit the number of possible scores too much.
The evaluation of the applicability on clinical routine data required a careful and considerate approach, as variable names are often ambiguous and it is usually not possible to match them directly. This can, for instance, be seen in variables related to care work, which include “responsibility for caring for other people in the household” [], “dinner is always prepared by someone else” [], “caregiver strain” [], “taking care of sick people” [], “informal caregiving” [], and “responsibility for caring for children” []. Even though some gender scores might be well described and could be used for other datasets, transferring the variables appropriately to a different dataset proved to be complex.
Applicability of Gender Scores on the MII CDS
In the comparison between the variables contained in gender scores and within the MII CDS, we used a two-step approach: We first tried to match the exact gender score variables against the MII CDS, and in case this remained unsuccessful, we extended the search to similar or generic terms.
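The two-step comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the review: the synonym map `GENERIC_TERMS`, the function names, and the dataset item names in the usage note are hypothetical and are not actual MII CDS element names.

```python
from typing import Optional

# Hypothetical map from a gender score variable to similar or generic terms.
# In practice, such a list would have to be curated by domain experts.
GENERIC_TERMS = {
    "informal caregiving": ["caregiving", "care work"],
    "level of education": ["education"],
}


def normalize(name: str) -> str:
    """Lower-case a name and collapse whitespace so comparisons are stable."""
    return " ".join(name.lower().split())


def match_variable(score_var: str, dataset_items: set[str]) -> Optional[tuple[str, str]]:
    """Return (matched item, step) or None if neither step finds a match."""
    items = {normalize(item) for item in dataset_items}
    key = normalize(score_var)

    # Step 1: look for the exact gender score variable in the dataset.
    if key in items:
        return key, "exact"

    # Step 2: extend the search to similar or generic terms.
    for term in GENERIC_TERMS.get(key, []):
        candidate = normalize(term)
        if candidate in items:
            return candidate, "generic"

    return None
```

With a hypothetical item set such as `{"Administrative gender", "Education"}`, the call `match_variable("level of education", ...)` would succeed only in the second step, and `match_variable("informal caregiving", ...)` would return `None`. Simple string matching of this kind cannot resolve the ambiguity of variable names noted earlier, so manual review of each candidate match remains necessary.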
We found that no variables of any gender score are covered in the MII CDS. This finding is surprising since a high number of variables covered in gender scores, such as occupation, level of education, civil status, children, household composition and size, and social support, are part of social history, which is routine clinical information.
Our principal finding regarding this question was that no gender score can be applied to the MII CDS. One exception is the score from Teterina et al [], which includes diagnostic codes and disease-specific data that could be matched against the MII CDS. This score, however, is highly specific and not usable as a generalized approach. Currently, the MII CDS only encodes the administrative gender as male, female, undifferentiated, or diverse []. Administrative gender refers to the gender that is recorded or recognized in official documents (eg, the ID card). It may differ from the birth-assigned sex.
Overall, most scores were applied on population studies or included in studies from the outset. These gender scores frequently include personality traits and psychological assessments (eg, stress, loneliness scales), which are not part of routine clinical practice and, therefore, also not part of the MII CDS.
To summarize, there are no fitting variables for the MII CDS, and it is impossible to apply an existing gender score to the dataset in order to include gender-specific analysis on clinical routine data.
Discussion
Principal Findings
This work identified and reviewed state-of-the-art gender scores and gender identity measures. These were systematically assessed regarding their applicability and practicability in health research. Last, we tested their applicability on routine clinical data using the MII CDS as a reference dataset.
We found that no gender score is applicable to the MII CDS because the variables required for the gender scores are not part of the CDS. However, many variables that are commonly included in gender scores are also assessed during a patient’s clinical stay. For instance, variables regarding personal background (eg, living with a partner or children, professional field, or self-employment) are often asked about during a clinic stay to verify whether specific therapies or treatments are feasible. Physicians need to know if patients have responsibilities at home, if they have support, and if they have a stable social environment. A person’s background is gender-dependent and influences not only clinical routine but also research data and therapy guidelines (eg, risk factors dependent on lifestyle). This is underlined by the information in , which presents examples of current social history documentation practices in a German hospital compared against typical variables used in gender scores. Several variables related to gender scores can be directly inferred from these documentations; however, they are currently not included in the MII CDS. We therefore argue that the current lack of applicability of gender scores to clinical routine data results from missing structured data collection in clinical information systems or missing methods for extracting such data from them. Since the MII CDS is an ongoing effort to standardize research data from routine clinical practice, we recommend that the Person Module of the MII CDS should be updated and expanded to create a sufficient knowledge base for gender health research. Specifically, we formulated 4 action steps that will enable gender-specific analyses (see ): analyze social history data, collect social history data in a machine-readable form, integrate a 2-step gender identity measure during patient admission, and develop custom gender score approaches for clinical data.
Recommended Action Steps
Analyze Social History Data
Social history is part of routine clinical data collected for many patients. However, the captured information is oftentimes not structured. It can be found in the doctor’s notes or additional study data if patients are included in clinical trials. As a first step toward evaluating what information is currently available, we suggest analyzing the full-text information in the patient records using natural language processing and large language models for data extraction. It will be interesting to learn what conclusions and predictions one could draw from the available information and how this information corresponds to the described gender scores. Typical social anamnesis texts are given in , showcasing the low information content. A structured analysis of social history data is still ongoing, but the preliminary check of the information content already indicates that the practice of documentation must be significantly improved.
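As a rough first impression of what such an extraction step involves, the following Python sketch uses simple keyword patterns. It is a deliberately naive baseline and not the natural language processing or large language model pipeline suggested above. The patterns, the variable names, and the English example note are all assumptions made for illustration; real clinical notes in German hospitals would require German-language patterns and far more robust methods.

```python
import re

# Hypothetical keyword patterns, one per social history variable.
# Each pattern captures the relevant phrase in its first group.
PATTERNS = {
    "civil_status": re.compile(r"\b(married|single|divorced|widowed)\b", re.I),
    "household": re.compile(r"\blives (alone|with [a-z ]+?)(?=[.,;]|$)", re.I),
    "occupation": re.compile(r"\bworks as an? ([a-z ]+?)(?=[.,;]|$)", re.I),
}


def extract_social_history(note: str) -> dict[str, str]:
    """Return the first keyword hit per variable found in a free-text note."""
    found = {}
    for variable, pattern in PATTERNS.items():
        match = pattern.search(note)
        if match:
            found[variable] = match.group(1).lower()
    return found


# Constructed example note (not real patient data).
note = "Patient is married, lives with her husband and two children. Works as a nurse."
print(extract_social_history(note))
```

A baseline like this mainly helps to quantify how often a variable is mentioned at all; negations, abbreviations, and implicit information are not handled.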
Collect Social History Data in a Machine-Readable Form
We recommend that key components of patients’ social histories — such as occupation, education, civil status, number of children, household composition and size, social support — be collected in a structured, machine-readable format. Instead of recording social history solely as narrative text in physician reports, clinical information systems should provide individual fields for each variable, allowing clinicians to select or enter information directly. This approach would reduce the effort required for data processing and aggregation, enable semantic annotation, and facilitate automated export and comparison across datasets.
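To make the idea of individual fields concrete, the following Python sketch models the listed components as a small typed record. The class name, field names, and value set are hypothetical and are not taken from the MII CDS or any existing clinical information system; they only illustrate what discrete, machine-readable capture could look like.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class CivilStatus(str, Enum):
    """Hypothetical value set; a real system would bind to a standard terminology."""
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"


@dataclass
class SocialHistory:
    """One discrete field per social history variable; None means not recorded."""
    occupation: Optional[str] = None
    education_level: Optional[str] = None
    civil_status: CivilStatus = CivilStatus.UNKNOWN
    number_of_children: Optional[int] = None
    household_size: Optional[int] = None
    has_social_support: Optional[bool] = None

    def __post_init__(self) -> None:
        # Reject impossible counts at entry time instead of during analysis.
        for name in ("number_of_children", "household_size"):
            value = getattr(self, name)
            if value is not None and value < 0:
                raise ValueError(f"{name} must not be negative")

    def to_json(self) -> str:
        """Serialize the record for export or aggregation."""
        data = asdict(self)
        data["civil_status"] = self.civil_status.value
        return json.dumps(data, sort_keys=True)


record = SocialHistory(occupation="nurse", civil_status=CivilStatus.MARRIED,
                       number_of_children=2, household_size=4)
print(record.to_json())
```

Keeping "not recorded" (`None`) distinct from explicit values matters for later analyses, because missing documentation should not be mistaken for, say, having no children.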
Several national-level initiatives already exemplify such practices. Scandinavian countries, for instance, collect structured social and demographic data that can be integrated with health care data [,]. Similarly, the Canadian Institute for Health Information promotes the standardized collection of social and behavioral data across health care settings, including variables such as family composition, living arrangements, social support, and socioeconomic status. Existing research has also demonstrated how such information can be effectively captured using discrete, machine-readable fields rather than free-text formats [].
A widely recognized standard that could be further adapted for this purpose is HL7 FHIR [], which provides a structured framework for encoding social history elements in clinical IT systems. Future work should build on these existing resources — as well as our overview for gender scores — to develop a system that supports the structured capture of variables necessary for calculating gender scores. Although it is important to build future work upon existing standards, ideally, the analysis of clinical data items determining gender should be conducted before designing structured data capture forms for patient social history.
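As an illustration of how a single social history item could be expressed in FHIR, the sketch below builds a plain Observation resource as a Python dictionary. It uses only base FHIR elements (`category` with the standard `social-history` code, `code`, `subject`, and a typed `value[x]`). The item code system URL, the item codes, and the helper name are hypothetical placeholders; this is neither an MII CDS profile nor a profile from the HL7 implementation guide mentioned above.

```python
import json

# Standard FHIR code system for observation categories.
OBSERVATION_CATEGORY_SYSTEM = "http://terminology.hl7.org/CodeSystem/observation-category"
# Hypothetical placeholder; a real implementation would use LOINC or SNOMED CT codes.
LOCAL_CODE_SYSTEM = "https://example.org/fhir/CodeSystem/social-history-items"


def social_history_observation(patient_id: str, item_code: str,
                               item_display: str, value) -> dict:
    """Build a minimal FHIR Observation for one social history item."""
    observation = {
        "resourceType": "Observation",
        "status": "final",
        "category": [{
            "coding": [{
                "system": OBSERVATION_CATEGORY_SYSTEM,
                "code": "social-history",
            }]
        }],
        "code": {
            "coding": [{
                "system": LOCAL_CODE_SYSTEM,
                "code": item_code,
                "display": item_display,
            }]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
    }
    # Pick the value[x] element by type; bool is checked first because
    # bool is a subclass of int in Python.
    if isinstance(value, bool):
        observation["valueBoolean"] = value
    elif isinstance(value, int):
        observation["valueInteger"] = value
    else:
        observation["valueString"] = str(value)
    return observation


obs = social_history_observation("example-1", "household-size", "Household size", 4)
print(json.dumps(obs, indent=2))
```

Representing each variable as its own Observation keeps the items individually queryable, which is what a later gender score calculation would need.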
Integrate a 2-Step Gender Identity Measure During Patient Admission
Moreover, to include as many gender dimensions as possible, it would be beneficial to collect more gender-specific data. A promising approach would be to implement the 2-step gender identity measure [] with open-ended answers during patient admission. Implementing a prospective approach, with minimal effort, would provide valuable insights into gender identity in everyday clinical practice. Although the DIVERGesTOOL [] was originally developed as an extension of the 2-step approach for the German research context, it may also be well suited for use in patient admission settings. It adds a third question regarding whether differences in sex development have ever been medically diagnosed.
However, collecting such data must be approached with care, because asking gender-related questions at patient admission can involve sensitive issues and carries potential risks. For transgender, nonbinary, or gender-nonconforming individuals, disclosing gender identity may raise concerns about discrimination or biased treatment. Patients frequently withhold this information due to previous negative experiences and a lack of trust in health care providers [,]. Inadequate staff training and poorly designed forms can lead to misgendering or exclusion. To ensure respectful and safe care, it is essential to provide staff with proper training, use inclusive language, and implement supportive documentation systems.
Develop Custom Gender Score Approaches for Clinical Data
Researchers should think beyond the common gender score approaches and develop a gender score that is suitable for clinical data and not only for population studies.
An innovative approach was carried out by Teterina et al [], who used diagnostic codes of patients with traumatic brain injuries in Canadian emergency departments. Researchers should try similar approaches on general — nondisease-specific — clinical data with diagnostic or medication variables. Furthermore, developing a minimalistic retrospective score could prove advantageous in balancing prediction quality with the number of required variables, making the scores applicable even when limited information is available (eg, []).
Even though gender is gaining importance in health research, it is still far from being implemented. Our investigation shows that a gold standard is missing that accounts for variations across social groups (eg, including geographical regions, age, time). Due to the lack of a gold standard, some research groups use gender as a predictor for sex when developing gender scores, which contradicts the purpose of introducing gender in research — to move beyond biological sex and gather more detailed information. Solely theory-driven approaches, where experts determine the used variables and their scaling, cannot be validated.
Last, repeatability is a major challenge when aiming at implementing existing gender scores. Due to inconsistent research group practices, existing gender scores are often described only theoretically without providing detailed variables, questions, or complete information on variable weighting. We therefore urge authors to provide complete documentation and executable models when developing gender scores.
To achieve the necessary progress, every inclusion of gender into health research is beneficial, even if it might not be comprehensive or sufficient to cover all aspects.
Limitations
Our collection of gender scores is based on a scoping review conducted to provide a broad overview of the topic. One limitation of this approach is that only one reviewer conducted the full-text screening and the categorization of the retrieved articles. However, during data extraction, more than 30% (7/22, 32%) of the publications were initially double-checked by a second domain expert []. The reviewers were in full agreement on all reviewed articles; thus, a single reviewer continued the data extraction. Moreover, our review does not claim to provide a complete list of all existing gender scores but does provide general statements on the current state of gender scores and their applicability in health research.
Moreover, the MII CDS might not be a fully representative dataset for testing population-based gender scores. Most gender scores were developed on population or specific patient data, while the MII CDS contains routine clinical data. However, it is important to include gender not only in population research but also in clinical research. This is why we selected the MII CDS as a target dataset for routine clinical data to test the practicability in this setting. As a result, we were able to show that gender scores are not yet a feasible instrument for routine clinical data and are able to make recommendations on how to update and extend the MII CDS in the future.
Comparison With Prior Work
Ballering et al [] highlighted that gender scores should be the minimum effort in epidemiological studies, but the community should go further than that because the retrospective application of a gender score implies a lack of gender consideration from the outset of the study (eg, the study design). We found several shortcomings in currently available retrospective gender scores. Therefore, our results suggest that self-reported gender assessments are more precise and better suited to assess gender, confirming the theory of Ballering et al [].
Moreover, we can confirm the findings of Ballering et al [], specifically that the most common variables in gender scores are related to occupation, income, education, civil status, caregiving and household responsibilities, and ways of spending (leisure) time. Furthermore, our results showed that, even though several of these variables are collected in everyday clinic work, they are not processed and stored in a machine-readable way.
Conclusions
Considering sex and gender enhances equity and research quality. Despite the high importance of the topic, it remains challenging to include gender in health research for several reasons. Ethical implications such as properly defining the concept of “gender” as opposed to “sex,” identifying appropriate ways of asking about a patient’s gender, and raising awareness of its importance among nurses and doctors are nontechnical reasons why the health gender gap remains. Technical implementation is hampered by the absence of a generally applicable clinical gender score. Structured, computer-processable data and metadata about gender-specific aspects in a patient’s social history are also lacking.
In this work, we identified, categorized, and systematically assessed state-of-the-art gender scores from epidemiology and clinical studies. We evaluated their applicability and practicability on a German national research dataset for routine clinical practice, the MII CDS. We found that gender cannot be predicted based on the MII CDS, even though several of the frequently used variables are part of routine clinical practice in Germany.
However, we see an urgent need to include gender-relevant information in the MII CDS in order to narrow the gap between routinely collected clinical data and available research data (see ). Therefore, further work is necessary to enable gender-specific analysis and to routinely collect more gender-specific data in clinics, for example, during patient admission.
Despite different approaches for assessing gender, no standardized and validated gender score that could be used retrospectively in clinical research exists.
Our study is limited to the German clinical landscape, and our evaluation of possible scores is based on a scoping review. Future investigations into the literature and common practices at clinics inside and outside Germany might give further insights and add to the action items suggested in this work.
Funding
This work was funded by the German Federal Ministry of Education and Research as part of the project “Inclusive Excellence in Medicine” (01FP23G10A, 01FP23G10B) and the “Network University Medicine” (01KXX2121).
Data Availability
Further data underlying our publication are available in .
Authors' Contributions
Conceptualization: LS, SS, EK, DW
Data curation: LS
Formal analysis: LS
Funding acquisition: SS, EK, DW
Investigation: LS
Methodology: LS, SS, EK, DW
Supervision: SS, EK, DW
Validation: HB, SS, EK, DW
Visualization: LS
Writing – original draft: LS, DW
Writing – review & editing: HB, DL, SS, EK, DW
Conflicts of Interest
None declared.
Multimedia Appendix 1
Detailed information from the assessment of the retrieved gender scores.
XLSX File, 18 KB
Multimedia Appendix 2
Three patients showcasing the current social history documentation in clinical practice in a German hospital. Footnotes and color coding indicate the comparison with typical variables used in gender scores. The text examples have been constructed for illustration.
PNG File, 303 KB
References
- Tannenbaum C, Ellis RP, Eyssel F, Zou J, Schiebinger L. Sex and gender analysis improves science and engineering. Nature. Nov 2019;575(7781):137-146. [CrossRef] [Medline]
- Arcand M, Bilodeau-Houle A, Juster RP, Marin MF. Sex and gender role differences on stress, depression, and anxiety symptoms in response to the COVID-19 pandemic over time. Front Psychol. 2023;14:1166154. [CrossRef] [Medline]
- Opinion: intersexuality. German Ethics Council. 2012. URL: https://www.ethikrat.org/fileadmin/Publikationen/Stellungnahmen/englisch/opinion-intersexuality.pdf [Accessed 2025-11-22]
- Gender and health. WHO. 2024. URL: https://www.who.int/health-topics/gender [Accessed 2024-09-13]
- Oertelt-Prigione S. Chapter 32 - the operationalization of gender in medicine. In: Legato MJ, editor. Principles of Gender-Specific Medicine. 2023:503-512. [CrossRef]
- Geller SE, Koch AR, Roesch P, Filut A, Hallgren E, Carnes M. The more things change, the more they stay the same: a study to evaluate compliance with inclusion and assessment of women and minorities in randomized controlled trials. Acad Med. Apr 2018;93(4):630-635. [CrossRef] [Medline]
- Potluri T, Engle K, Fink AL, Vom Steeg LG, Klein SL. Sex reporting in preclinical microbiological and immunological research. MBio. Nov 14, 2017;8(6):e01868-17. [CrossRef] [Medline]
- Shah K, McCormack CE, Bradbury NA. Do you know the sex of your cells? Am J Physiol Cell Physiol. Jan 1, 2014;306(1):C3-18. [CrossRef] [Medline]
- Wandschneider L, Sauzet O, Razum O, Miani C. Development of a gender score in a representative German population sample and its association with diverse social positions. Front Epidemiol. 2022;2:914819. [CrossRef] [Medline]
- Lacasse A, Pagé MG, Choinière M, et al. Conducting gender-based analysis of existing databases when self-reported gender data are unavailable: the GENDER Index in a working population. Can J Public Health. Apr 2020;111(2):155-168. [CrossRef] [Medline]
- Nauman AT, Behlouli H, Alexander N, et al. Gender score development in the Berlin Aging Study II: a retrospective approach. Biol Sex Differ. Jan 18, 2021;12(1):15. [CrossRef] [Medline]
- Fraser G, Bulbulia J, Greaves LM, Wilson MS, Sibley CG. Coding responses to an open-ended gender measure in a New Zealand national sample. J Sex Res. Oct 2020;57(8):979-986. [CrossRef] [Medline]
- Lagos D, Compton D. Evaluating the use of a two-step gender identity measure in the 2018 General Social Survey. Demography. Apr 1, 2021;58(2):763-772. [CrossRef] [Medline]
- Ammon D, Kurscheidt M, Buckow K, et al. Arbeitsgruppe Interoperabilität: Kerndatensatz und Informationssysteme für Integration und Austausch von Daten in der Medizininformatik-Initiative. Bundesgesundheitsbl. Jun 2024;67(6):656-667. [CrossRef]
- Clayton JA, Tannenbaum C. Reporting sex, gender, or both in clinical research? JAMA. Nov 8, 2016;316(18):1863-1864. [CrossRef] [Medline]
- Thibaut F. Gender does matter in clinical research. Eur Arch Psychiatry Clin Neurosci. Jun 2017;267(4):283-284. [CrossRef] [Medline]
- Nielsen MW, Stefanick ML, Peragine D, et al. Gender-related variables for health research. Biol Sex Differ. Feb 22, 2021;12(1):23. [CrossRef] [Medline]
- Morgan R, Yin A, Kalbarczyk A, et al. Reconsidering tools for measuring gender dimensions in biomedical research. Biol Sex Differ. Nov 25, 2024;15(1):96. [CrossRef] [Medline]
- Bem SL. The measurement of psychological androgyny. J Consult Clin Psychol. Apr 1974;42(2):155-162. [CrossRef] [Medline]
- Pelletier R, Ditto B, Pilote L. A composite measure of gender and its association with risk factors in patients with premature acute coronary syndrome. Psychosom Med. Jun 2015;77(5):517-526. [CrossRef] [Medline]
- Neufang S, Li F, Akhrif A, Beyan OD. Toward a fair, gender-debiased classifier for the diagnosis of attention deficit/hyperactivity disorder- a machine-learning based classification study. BMC Med Inform Decis Mak. Aug 5, 2025;25(1):290. [CrossRef] [Medline]
- Donnelly K, Twenge JM. Masculine and feminine traits on the Bem Sex-Role Inventory, 1993–2012: a cross-temporal meta-analysis. Sex Roles. May 2017;76(9-10):556-565. [CrossRef]
- Schindler L, Beelich H, Röll S, Katsari E, Stracke S, Waltemath D. Applicability of retrospective and prospective gender scores for clinical and health data: protocol for a scoping review. JMIR Res Protoc. Jan 20, 2025;14(1):e57669. [CrossRef] [Medline]
- Horstmann S, Schmechel C, Palm K, Oertelt-Prigione S, Bolte G. The operationalisation of sex and gender in quantitative health-related research: a scoping review. Int J Environ Res Public Health. Jun 18, 2022;19(12):7493. [CrossRef] [Medline]
- Miani C, Wandschneider L, Niemann J, Batram-Zantvoort S, Razum O. Measurement of gender as a social determinant of health in epidemiology-a scoping review. PLoS ONE. 2021;16(11):e0259223. [CrossRef] [Medline]
- Ballering AV, Olde Hartman TC, Rosmalen JGM. Gender scores in epidemiological research: methods, advantages and implications. Lancet Reg Health Eur. Aug 2024;43:100962. [CrossRef] [Medline]
- McGuire JK, Beek TF, Catalpa JM, Steensma TD. The Genderqueer Identity (GQI) scale: measurement and validation of four distinct subscales with trans and LGBQ clinical and community samples in two countries. Int J Transgend. 2019;20(2-3):289-304. [CrossRef] [Medline]
- Lombardi E, Banik S. The utility of the two-step gender measure within trans and cis populations. Sex Res Social Policy. Sep 2016;13(3):288-296. [CrossRef] [Medline]
- Pohrt A, Kendel F, Demuth I, et al. Differentiating sex and gender among older men and women. Psychosom Med. Apr 1, 2022;84(3):339-346. [CrossRef] [Medline]
- Tibubos AN, Otten D, Beutel ME, Brähler E. Validation of the Personal Attributes Questionnaire-8: gender expression and mental distress in the German population in 2006 and 2018. Int J Public Health. 2022;67:1604510. [CrossRef] [Medline]
- Demuth I, Banszerus V, Drewelies J, et al. Cohort profile: follow-up of a Berlin Aging Study II (BASE-II) subsample as part of the GendAge study. BMJ Open. Jun 23, 2021;11(6):e045576. [CrossRef] [Medline]
- Spence JT, Helmreich R, Stapp J. Ratings of self and peers on sex role attributes and their relation to self-esteem and conceptions of masculinity and femininity. J Pers Soc Psychol. Jul 1975;32(1):29-39. [CrossRef] [Medline]
- de Breij S, Huisman M, Boot CRL, Deeg DJH. Sex and gender differences in depressive symptoms in older workers: the role of working conditions. BMC Public Health. May 21, 2022;22(1):1023. [CrossRef] [Medline]
- Smith PM, Koehoorn M. Measuring gender when you don’t have a gender measure: constructing a gender index using survey data. Int J Equity Health. May 28, 2016;15(1):82. [CrossRef] [Medline]
- Gisinger T, Azizi Z, Alipour P, et al. Sex and gender aspects in diabetes mellitus: focus on access to health care and cardiovascular outcomes. Front Public Health. 2023;11:1090541. [CrossRef] [Medline]
- Cipriani E, Samson-Daoust E, Giguère CE, et al. A step-by-step and data-driven guide to index gender in psychiatry. PLoS ONE. 2024;19(1):e0296880. [CrossRef] [Medline]
- Teterina A, Zulbayar S, Mollayeva T, Chan V, Colantonio A, Escobar M. Gender versus sex in predicting outcomes of traumatic brain injury: a cohort study utilizing large administrative databases. Sci Rep. Oct 27, 2023;13(1):18453. [CrossRef] [Medline]
- Vader SS, Lewis SM, Verdonk P, Verschuren WMM, Picavet HSJ. Masculine gender affects sex differences in the prevalence of chronic health problems - The Doetinchem Cohort Study. Prev Med Rep. Jun 2023;33:102202. [CrossRef] [Medline]
- Yuan J, Sang S, Pham J, Kong WJ. Gender modifies the association of cognition with age-related hearing impairment in the Health and Retirement Study. Front Public Health. 2021;9:751828. [CrossRef] [Medline]
- Ballering AV, Bonvanie IJ, Olde Hartman TC, Monden R, Rosmalen JGM. Gender and sex independently associate with common somatic symptoms and lifetime prevalence of chronic disease. Soc Sci Med. May 2020;253:112968. [CrossRef]
- Rosenkrantz P, Bee H, Vogel S, Broverman I. Sex-role stereotypes and self-concepts in college students. J Consult Clin Psychol. Jun 1968;32(3):287-295. [CrossRef] [Medline]
- The Medical Informatics Initiative’s core data set. Medical Informatics Initiative Germany. URL: https://www.medizininformatik-initiative.de/en/medical-informatics-initiatives-core-data-set [Accessed 2025-03-03]
- MII core data set - datasets. Art Decor. URL: https://art-decor.org/art-decor/decor-datasets--mide [Accessed 2025-03-03]
- Ludvigsson JF, Almqvist C, Bonamy AKE, et al. Registers of the Swedish total population and their use in medical research. Eur J Epidemiol. Feb 2016;31(2):125-136. [CrossRef] [Medline]
- Pedersen CB. The Danish civil registration system. Scand J Public Health. Jul 2011;39(7 Suppl):22-25. [CrossRef] [Medline]
- Health inequalities data tool. Government of Canada. URL: https://health-infobase.canada.ca/health-inequalities/Indicat [Accessed 2025-08-08]
- HL7/fhir-sdoh-clinicalcare. GitHub. URL: https://github.com/HL7/fhir-sdoh-clinicalcare [Accessed 2025-08-08]
- Horstmann S, Schmechel C, Becher E, Oertelt-Prigione S, Palm K, Bolte G. DIVERGesTOOL – Entwicklung einer Toolbox zur Erfassung geschlechtlicher Vielfalt in der quantitativen Gesundheitsforschung. Bundesgesundheitsbl. Sep 2024;67(9):1054-1061. [CrossRef]
- Banerjee SC, Staley JM, Alexander K, Walters CB, Parker PA. Encouraging patients to disclose their lesbian, gay, bisexual, or transgender (LGBT) status: oncology health care providers’ perspectives. Transl Behav Med. Oct 8, 2020;10(4):918-927. [CrossRef] [Medline]
- Koch A, Rabins M, Messina J, Brennan-Cook J. Exploring the challenges of sexual orientation disclosure among lesbian, gay, bisexual, transgender, queer individuals. J Nurse Pract. Nov 2023;19(10):104765. [CrossRef]
Abbreviations
| BSRI: Bem Sex Role Inventory |
| CDS: core dataset |
| MII: Medical Informatics Initiative |
| PAQ: Personal Attributes Questionnaire |
Edited by Alexandre Castonguay; submitted 19.Mar.2025; peer-reviewed by Bob Hoyt, Heber Anandan; final revised version received 24.Aug.2025; accepted 02.Sep.2025; published 08.Jan.2026.
Copyright© Lea Schindler, Hilke Beelich, Elpiniki Katsari, Daniele Liprandi, Sylvia Stracke, Dagmar Waltemath. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 8.Jan.2026.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.

