A research team led by the University of California, San Francisco, found flaws in how the large language model assessed prevalence and made differential diagnoses.
Many people are pinning hopes on large language models, such as ChatGPT and GPT-4, to streamline healthcare and automate tasks in medical education and patient care.
Findings reported in The Lancet Digital Health this month might give some proponents pause, and skeptics ammunition: the study showed that GPT-4 tended to exhibit racial and gender biases.
In an accompanying editorial, Janna Hastings, Ph.D., of the University of Zurich, makes some suggestions for how bias could be addressed but says that for tasks related to subjective valuation of patient characteristics, such as patients’ subjective feelings of pain, “it might be altogether inappropriate to apply the technology at this stage of development to this type of task.”
When the researchers asked GPT-4 to describe a case of sarcoidosis, the model produced a vignette of a Black patient 97% of the time and a Black female patient 81% of the time.
“Although both women and individuals of African ancestry are at higher risk for this condition, the overrepresentation of this specific group could translate to overestimation of risk for Black women and underestimation in other demographic groups,” wrote lead authors Travis Zack, Ph.D., and Eric Lehman, M.Sc., of the University of California, San Francisco, and their colleagues.
Zack, Lehman and their co-authors also found that GPT-4 was significantly less likely to recommend advanced imaging (CT scans, MRIs, abdominal ultrasound) for Black patients than for White patients and was less likely to rate cardiac stress testing as highly important for female patients than for male patients. It rated angiography as of intermediate importance for both male and female patients but assigned a higher importance score to men than to women.
In the discussion section of the paper, the researchers said their findings provided evidence that GPT-4 “perpetuates stereotypes about demographic groups when providing diagnostic and treatment recommendations.” They said it was “troubling for equitable care” that the model prioritized panic disorder over pulmonary embolism in the differential diagnosis for female patients with shortness of breath (dyspnea), and that it prioritized “stigmatized sexually transmitted infections,” such as HIV and syphilis, among minority patients “even if some of these associations might be reflected in societal prevalence.”
Among their suggestions are “targeted fairness evaluations” of the large language models and “post-deployment bias monitoring and mitigation strategies.”
“Although GPT-4 has potential to improve healthcare delivery, its tendency to encode societal biases raises serious concerns for its use in clinical decision support,” Zack and Lehman concluded.
They used information from 19 cases from NEJM Healer, a medical education tool, to conduct their study of GPT-4, picking cases that would have similar differential diagnoses regardless of race or gender. They tested the model by asking it to process the NEJM Healer cases and return the 10 most likely diagnoses, the life-threatening diagnoses that must be considered, diagnostic steps and treatment steps.