Summary
Background
Previous studies in medical imaging have shown disparate abilities of artificial intelligence (AI) to detect a person’s race, yet there is no known correlation for race on medical imaging that would be obvious to human experts when interpreting the images. We aimed to conduct a comprehensive evaluation of the ability of AI to recognise a patient’s racial identity from medical images.
Methods
Using private (Emory CXR, Emory Chest CT, Emory Cervical Spine, and Emory Mammogram) and public (MIMIC-CXR, CheXpert, National Lung Cancer Screening Trial, RSNA Pulmonary Embolism CT, and Digital Hand Atlas) datasets, we first quantified the performance of deep learning models in detecting race from medical images, including the ability of these models to generalise to external environments and across multiple imaging modalities. Second, we assessed possible confounding by anatomic and phenotypic population features, both by assessing the ability of these hypothesised confounders to detect race in isolation using regression models, and by re-evaluating the deep learning models on datasets stratified by these hypothesised confounding variables. Last, by exploring the effect of image corruptions on model performance, we investigated the underlying mechanism by which AI models can recognise race.
Findings
In our study, we show that standard AI deep learning models can be trained to predict race from medical images with high performance across multiple imaging modalities, which was sustained under external validation conditions (x-ray imaging [area under the receiver operating characteristics curve (AUC) range 0·91–0·99], CT chest imaging [0·87–0·96], and mammography [0·81]). We also showed that this detection is not due to proxies or imaging-related surrogate covariates for race (eg, performance of possible confounders: body-mass index [AUC 0·55], disease distribution [0·61], and breast density [0·61]). Finally, we provide evidence to show that the ability of AI deep learning models persisted over all anatomical regions and frequency spectra of the images, suggesting that efforts to control this behaviour when it is undesirable will be challenging and will demand further study.
Interpretation
The results from our study emphasise that the ability of AI deep learning models to predict self-reported race is itself not the issue of importance. However, our finding that AI can accurately predict self-reported race, even from corrupted, cropped, and noised medical images, often when clinical experts cannot, creates an enormous risk for all model deployments in medical imaging.
Funding
National Institute of Biomedical Imaging and Bioengineering, MIDRC grant of National Institutes of Health, US National Science Foundation, National Library of Medicine of the National Institutes of Health, and Taiwan Ministry of Science and Technology
Introduction
Bias and discrimination in artificial intelligence (AI) systems have been studied in multiple domains (Bender and colleagues; Angwin and colleagues; Koenecke and colleagues; Buolamwini and Gebru), including in many health-care applications, such as detection of melanoma (Adamson and Smith; Navarrete-Dechent and colleagues), mortality prediction (Sarkar and colleagues), and algorithms that aid the prediction of health-care use (Obermeyer and colleagues), in which the performance of AI has been stratified by self-reported race on a variety of clinical tasks (Seyyed-Kalantari and colleagues).
Several studies have shown disparities in the performance of medical AI systems across race. For example, Seyyed-Kalantari and colleagues showed that AI models produce significant differences in the accuracy of automated chest x-ray diagnosis across racial and other demographic groups, even when the models only had access to the chest x-ray itself. Importantly, if used, such models would lead to more patients who are Black and female being incorrectly identified as healthy compared with patients who are White and male. Moreover, these racial disparities are not simply due to under-representation of these patient groups in the training data, as no statistically significant correlation has been found between group representation and the size of the disparities (Seyyed-Kalantari and colleagues).
In related work, several groups reported that AI algorithms can identify various demographic patient factors. One study (Yi and colleagues) found that an AI model could predict sex and distinguish between adult and paediatric patients from chest x-rays, while other studies (Eng and colleagues) reported reasonable accuracy at predicting the chronological age of patients from various imaging studies. In ophthalmology, retinal images have been used to predict sex, age, and cardiac markers, such as hypertension and smoking status (Rim and colleagues; Munk and colleagues; Poplin and colleagues).
These findings show that demographic factors that are strongly associated with disease outcomes (eg, age, sex, and racial identity) are also strongly associated with features of medical images, and might therefore induce bias in model results, mirroring what is known from over a century of clinical and epidemiological research on the importance of covariates and potential confounding (Greenland and Robins; Greenland, Pearl, and Robins).
Many published AI models have conceptually amounted to simple bivariate analyses (ie, image features and their ability to predict clinical outcomes). Although more recent AI models have begun to consider other risk factors, conceptually approaching the multivariate modelling that is the mainstay of clinical and epidemiological research, key demographic covariates (eg, age, sex, and racial identity) have been largely ignored by most deep learning research in medicine.
Research in context
Evidence before this study
We used three different search engines to conduct our review. For PubMed, we used the following search terms: “(((disparity OR bias OR fairness) AND (classification)) AND (x-ray OR mammography)) AND (machine learning [MeSH Terms]).” For IEEE Xplore, we used the following search terms: “((disparity OR bias OR fairness) AND (mammography OR x-ray) AND (machine learning))”. For ACM, we used the following search terms: “[Abstract: mammography x-ray] AND [Abstract: classification prediction] AND [All: disparity fairness]”. All queries were limited to dates between Jan 1, 2010, and Dec 31, 2020. We included any studies that were published in English, focused on medical images, and that were original research. We also reviewed commentaries and opinion articles. We excluded articles that were not written in English or that were outside of the medical imaging domain. To our knowledge, there is no published meta-analysis or systematic review on this topic. Most published papers focused on measuring disparities in tabular health data without much emphasis on imaging-based approaches.
Although previous work has shown the existence of racial disparities, the mechanism for these differences in medical imaging is, to the best of our knowledge, unexplored. Pierson and colleagues noted that an artificial intelligence (AI) model that was designed to predict severity of osteoarthritis using knee x-rays could not identify the race of the patients. Yi and colleagues conducted a forensics evaluation on chest x-rays and found that AI algorithms could predict sex, distinguish between adult and paediatric patients, and differentiate between US and Chinese patients. In ophthalmology, retinal scan images have been used to predict sex, age, and cardiac markers (eg, hypertension and smoking status). We found few published studies that explicitly targeted the recognition of racial identity from medical images, possibly because radiologists do not routinely have access to, nor rely on, demographic information (eg, race) for diagnostic tasks in clinical practice.
Added value of this study
In this study, we investigated a large number of publicly and privately available large-scale medical imaging datasets and found that self-reported race is accurately predictable by AI models trained with medical image pixel data alone as model inputs. First, we showed that AI models are able to predict race across multiple imaging modalities, various datasets, and diverse clinical tasks. This high level of performance persisted during external validation of these models across a range of academic centres and patient populations in the USA, as well as when the models were optimised to do clinically motivated tasks. Second, we conducted ablations that showed that this detection was not due to trivial proxies, such as body habitus, age, tissue density, or other potential imaging confounders for race (eg, underlying disease distribution in the population). Finally, we showed that the features learned appear to involve all regions of the image and frequency spectrum, suggesting that efforts to control this behaviour when it is undesirable will be challenging and will demand further study.
Implications of all the available evidence
In our study, we emphasise that the ability of AI to predict racial identity is itself not the issue of importance, but rather that this capability is readily learned and therefore is likely to be present in many medical image analysis models, providing a direct vector for the reproduction or exacerbation of the racial disparities that already exist in medical practice. This risk is compounded by the fact that human experts cannot similarly identify racial identity from medical images, meaning that human oversight of AI models is of limited use to recognise and mitigate this problem. This issue creates an enormous risk for all model deployments in medical imaging: if an AI model relies on its ability to detect racial identity to make medical decisions, but in doing so produced race-specific errors, clinical radiologists (who do not typically have access to racial demographic information) would not be able to tell, thereby possibly leading to errors in health-care decision processes.
These findings regarding possible confounding by racial identity in deep learning models suggest a possible mechanism for the racial disparities produced by AI models: that the models can directly recognise the race of a patient from medical images. However, this hypothesis is largely unexplored (Wawira Gichoya and colleagues) and, in contrast to other demographic factors (eg, age and sex), there is a widely held, but tacit, belief among radiologists that the identification of a patient’s race from medical images is almost impossible, and that most medical imaging tasks are essentially race agnostic (ie, the task is not affected by the patient’s race). Given the possibility for discriminatory harm in a key component of the medical system that is assumed to be race agnostic, understanding how race has a role in medical imaging models is of high importance (Tariq and colleagues), as many AI systems that use medical images as the primary inputs are being cleared by the US Food and Drug Administration and other regulatory agencies (FDA cleared AI algorithms; Benjamens and colleagues; Tadavarthi and colleagues).
In this study, we aimed to investigate how AI systems are able to detect a patient’s race, to differing degrees of accuracy across self-reported racial groups, from medical imaging. To do so, we examined large publicly and privately available medical imaging datasets to assess whether AI models are able to predict an individual’s race across multiple imaging modalities, various datasets, and diverse clinical tasks.
Methods
Definitions of race and racial identity
Race and racial identity can be difficult attributes to quantify and study in health-care research (Krieger) and are often incorrectly conflated with biological concepts such as genetic ancestry (Cooper and David).
In this modelling study, we defined race as a social, political, and legal construct that relates to the interaction between external perceptions (ie, “how do others see me?”) and self-identification, and specifically make use of self-reported race of patients in all of our experiments. We variously use the terms race and racial identity to refer to this construct throughout this study.
Datasets
We obtained public and private datasets (table 1, appendix p 2) that covered several imaging modalities and clinical scenarios. No single race was consistently dominant across the datasets (eg, the proportion of Black patients ranged from 6% to 72% across the datasets). For all datasets, ethical approval was obtained from the relevant institutional ethics boards.
Table 1 Summary of datasets used for race prediction experiments
CXP=CheXpert dataset. DHA=Digital Hand Atlas. EM-CS=Emory Cervical Spine radiograph dataset. EM-CT=Emory Chest CT dataset. EM-Mammo=Emory Mammogram dataset. EMX=Emory chest x-ray dataset. MXR=MIMIC-CXR dataset. NLST=National Lung Cancer Screening Trial dataset. RSPECT=RSNA Pulmonary Embolism CT dataset.
Investigation of possible mechanisms of race detection
We conducted three main groups of experiments to investigate the cause of previously established AI performance disparities by patient race. These experiments were: (1) to assess the ability of deep learning AI models to recognise race from medical images, including the ability of these models to generalise to new environments and across multiple imaging modalities; (2) to examine possible confounding anatomic and phenotypic population features as explanations for these performance scores; and (3) to investigate the underlying mechanisms by which AI models can recognise race. The full list of experiments is summarised in table 2 and the appendix (pp 22–23).
Table 2 Summary of experiments conducted to investigate mechanisms of race detection in Black patients
BMI=body-mass index. CXP=CheXpert dataset. DHA=Digital Hand Atlas. EM-CS=Emory Cervical Spine radiograph dataset. EM-CT=Emory Chest CT dataset. EM-Mammo=Emory Mammogram dataset. EMX=Emory chest x-ray dataset. MXR=MIMIC-CXR dataset. NLST=National Lung Cancer Screening Trial dataset. RSPECT=RSNA Pulmonary Embolism CT dataset.
We did not present measures of performance variance or null hypothesis tests because these data are uninformative given the large dataset sizes and the large effect sizes reported (ie, even in experiments in which a hypothesis could be defined, all p values were <0·001).
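As an illustration of the image-corruption approach used in experiment (3), the following is a minimal sketch of isolating the low-frequency or high-frequency content of an image before re-scoring a trained race classifier on the corrupted copies. The circular mask, filter radii, and function names are illustrative assumptions, not the study's exact implementation.

```python
# Hypothetical sketch of frequency-band corruption: keep only the low- (or
# high-) frequency content of a 2D grayscale image, then re-run a trained
# race classifier on each corrupted copy. Radii are assumed values.
import numpy as np

def band_filter(image: np.ndarray, radius: int, keep_low: bool = True) -> np.ndarray:
    """Return the image reconstructed from one frequency band only."""
    f = np.fft.fftshift(np.fft.fft2(image))            # centre the spectrum
    rows, cols = image.shape
    y, x = np.ogrid[:rows, :cols]
    dist = np.sqrt((y - rows / 2) ** 2 + (x - cols / 2) ** 2)
    mask = dist <= radius if keep_low else dist > radius
    filtered = np.fft.ifft2(np.fft.ifftshift(f * mask))
    return np.real(filtered)

# Example: progressively strip high-frequency detail from a chest x-ray.
xray = np.random.rand(224, 224)                        # stand-in for a real image
low_pass_versions = {r: band_filter(xray, r, keep_low=True) for r in (10, 25, 50, 100)}
```

If classifier performance on the corrupted copies stays near the uncorrupted baseline across a wide range of radii, the race-related signal is not confined to any single frequency band.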
Race detection in radiology imaging
To investigate the ability of deep learning systems to detect race from radiology images, first, we developed models for the detection of racial identity on three large chest x-ray datasets (MIMIC-CXR [MXR; Johnson and colleagues], CheXpert [CXP; Irvin and colleagues], and Emory chest x-ray [EMX]), with both internal validation (ie, testing the model on an unseen subset of the dataset used to train the model) and external validation (ie, testing the model on a completely different dataset than the one used to train the model) to establish baseline performance. Second, we trained racial identity detection models for non-chest x-ray images from multiple body locations, including digital radiography, mammograms, lateral cervical spine radiographs, and chest CTs, to evaluate whether the models' performance was limited to chest x-rays.
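As a concrete illustration of this baseline setup, the following is a minimal sketch of fine-tuning a standard ImageNet-pretrained convolutional network to classify self-reported race and scoring it with AUC on a held-out (internal or external) test set. The choice of DenseNet-121, the three-class output, and all hyperparameters are assumptions for illustration, not the study's reported configuration.

```python
# Hypothetical sketch: fine-tune a pretrained CNN to predict self-reported
# race from chest x-rays, then score one-vs-rest AUC on a test loader.
import torch
import torch.nn as nn
from torchvision import models
from sklearn.metrics import roc_auc_score

def build_model(n_classes: int = 3) -> nn.Module:
    # Assumed architecture: ImageNet-pretrained DenseNet-121 with the
    # classifier head replaced for race prediction.
    model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
    model.classifier = nn.Linear(model.classifier.in_features, n_classes)
    return model

def evaluate_auc(model: nn.Module, loader, device: str = "cpu") -> float:
    """One-vs-rest macro AUC over a test set (internal or external validation)."""
    model = model.to(device).eval()
    probs, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            logits = model(images.to(device))
            probs.append(torch.softmax(logits, dim=1).cpu())
            labels.append(targets)
    return roc_auc_score(torch.cat(labels).numpy(),
                         torch.cat(probs).numpy(), multi_class="ovr")
```

External validation then amounts to calling `evaluate_auc` with a loader built from a dataset the model never saw during training (eg, training on MXR and testing on CXP).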
After establishing that deep learning models could detect a patient’s race in medical imaging data, we generated a series of competing hypotheses to explain how this process might occur. First, we assessed differences in physical characteristics between patients of different racial groups (eg, body habitus [Wagner and Heyward] or breast density [del Carmen and colleagues]). Second, we assessed whether there was a difference in disease distribution among patients of different racial groups (eg, previous studies provide evidence that Black patients have a higher incidence of particular diseases, such as cardiac disease, than White patients [Office of Minority Health; Graham]).
Third, we assessed whether there were location-specific or tissue-specific differences between racial groups (eg, there is evidence that bone mineral density differs between Black and White patients).
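As a concrete illustration of the confounder-in-isolation analysis (eg, testing whether body-mass index alone can predict race, corresponding to the reported AUC of 0·55), the following is a minimal sketch using logistic regression. The file name, column names, and one-vs-rest framing are illustrative assumptions.

```python
# Hypothetical sketch: can a single candidate confounder (here BMI) predict
# race on its own? Compare its AUC against the image-based models' AUCs.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("cohort.csv")                        # assumed columns: bmi, race
X = df[["bmi"]].values
y = (df["race"] == "Black").astype(int).values        # one-vs-rest framing

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print(f"BMI-only AUC: {roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]):.2f}")
# An AUC near 0.55 would indicate that BMI alone carries little race signal,
# far below the deep learning models' image-based performance (0.81-0.99).
```

The same template applies to the other hypothesised confounders (disease labels, breast density, and so on) by swapping the feature column.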