Because personal information can be discovered rather easily through ‘reidentification,’ additional steps are needed to protect the privacy of people’s healthcare data.
All Jessilyn Dunn, Ph.D., wanted was a clear-cut answer.
As an assistant professor of biomedical engineering at Duke University, Dunn begins each project with a critical first step: contacting Duke’s information technology (IT) department for an assessment of the privacy risks associated with the data sets she plans to use. Such assessments are necessary to properly secure and store the data on which her research relies.
But the IT department’s responses were never fully satisfying.
“They couldn’t seem to give us a straight answer on what the risk level of the data was,” she recalls. “And so they would default to the highest risk level, which would mean it would be much more expensive to store the data.”
Frustrated, Dunn decided to take matters into her own hands. She resolved to undertake a thorough analysis of existing scientific literature to better understand the privacy risks associated with the use of healthcare data.
She found an answer. It wasn’t what she expected.
Reidentifying the deidentified
At the heart of Dunn’s question is the concept of data “deidentification”: decoupling health data from personal information, so the stripped-down data can be used and shared in scientific research without compromising the privacy of the individuals who generated the data. The concept is critically important in the age of digital health because millions of people around the globe now use wearable health devices, prescription digital therapeutics and other technologies that generate constant streams of very valuable — but also very personal — health data.
Optimally, such a massive amount of data could be used to improve everything from digital health algorithms to drug development to, ultimately, patient outcomes. However, most academic institutions and government agencies will only allow the use of such data if it does not risk patient privacy.
The Health Insurance Portability and Accountability Act (HIPAA) laid out 18 personal health information identifiers — name, birth date, address, phone number and so on — that could be used to link data to individuals. The standard practice in deidentifying information is to remove those identifiers from a given data set before utilizing or sharing the data. In that way, deidentifying data is pretty straightforward, Dunn says.
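In code terms, this kind of deidentification amounts to dropping the identifier fields from each record before the data are shared. A minimal sketch, using a handful of illustrative field names rather than the full HIPAA list:

```python
# Direct identifiers to strip (illustrative subset of HIPAA's 18).
IDENTIFIER_FIELDS = {"name", "birth_date", "address", "phone"}

def deidentify(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed."""
    return {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}

record = {
    "name": "Jane Doe",
    "birth_date": "1980-04-12",
    "phone": "555-0100",
    "daily_steps": 8421,
    "resting_heart_rate": 62,
}

print(deidentify(record))
# Only the health measurements remain in the shared record.
```

As Dunn notes, this technical step is simple; the question is whether the remaining data are truly anonymous.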
However, if one takes a broader view, things get more complicated. Deleting columns on a spreadsheet is simple enough, but Dunn and colleagues wondered whether doing so would actually achieve the goal of making it difficult or impossible to reidentify the people who generated the data.
“So I guess I would say the technical definition (of deidentified data) and the functional definition may be slightly different,” she says.
To find out if HIPAA-based deidentification actually works, Dunn and colleagues conducted a systematic review of literature pertaining to the reidentification of people based on health data sets, primarily of data generated from wearable devices. They found a total of 72 studies that met their inclusion criteria.
It did not take long to see a trend. Dunn says a former student, Lucy Chikwetu, M.Sc., did the data collection. “As she was going through the data, I would mention that it seems like there are a lot more reidentification possibilities than we initially thought,” she says.
Still, she knew she could not draw conclusions until all the data were compiled. Once collected, though, her fears were confirmed. The studies showed that reidentification from deidentified data was possible between 86% and 100% of the time, suggesting a high risk of reidentification. In some cases, just a few seconds of sensor data could be used to reidentify a person by matching the sensor data in a deidentified data set to the same data in another data set with identifying information. In an era when the generation of data is far outpacing privacy controls and regulation, finding matching data sets with identifying information is not as hard as it might seem, Dunn says.
Dunn and her colleagues laid out a scenario in which an employee participates in her company’s wellness program, which involves tracking daily step counts and heart rate. Separately, in this hypothetical, the employee participates in a stroke prevention study that also uses step and heart-rate data. Following study protocols, she discloses to the researchers that she has HIV, a fact she has not told her employer.
Although the study deidentifies the information before publishing the data, her employer could obtain the study data set, match it to their wellness program data and learn that the employee has HIV. All of that might be perfectly legal, Dunn says. However, it would also expose the employee to potential discrimination due to her HIV diagnosis.
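The attack in this scenario is a simple linkage: the employer matches the step-count sequence in the “anonymous” study record against its own identified wellness data. A toy sketch of that matching step, with invented data:

```python
def best_match(target_steps, named_records):
    """Find the named record whose step sequence is closest to the target."""
    def distance(steps):
        return sum(abs(a - b) for a, b in zip(target_steps, steps))
    return min(named_records, key=lambda r: distance(r["steps"]))

# Employer's wellness data: daily step counts tied to names.
wellness = [
    {"name": "Alice", "steps": [8421, 9050, 7512]},
    {"name": "Bob",   "steps": [4210, 3998, 5120]},
]

# A "deidentified" study record: no name, but the same step counts.
study_record = {"steps": [8421, 9050, 7512], "hiv_status": "positive"}

match = best_match(study_record["steps"], wellness)
print(match["name"])  # prints "Alice"
```

The step sequence acts as a fingerprint: no name or birth date is needed to make the link, which is why removing the 18 HIPAA identifiers alone does not prevent reidentification.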
Given the ubiquity of personal data on the internet and the wide array of data peddlers — legal and otherwise — the risks to patient privacy are quite high.
And although the hypothetical outlined in the study referred to step counts and heart-rate data, Dunn says a wide variety of data can be leveraged to identify people. For example, a 2019 study of 46 participants, published in the journal IEEE Access, found that just 2 seconds of electroencephalogram data could be used to correctly identify 95% of patients.
Another study, also published in IEEE Access, showed that accelerometer and gyroscope data gathered from daily activities like toothbrushing could be used to identify the patients who generated the data. Dunn and her colleagues reported their findings in February 2023 in The Lancet Digital Health.
Cause for concern, not retreat
The study has major implications for biomedical research. It means that a central premise upon which digital health innovation has been built — that big data can be used without risking participant privacy — is shaky. In publishing the analysis, Dunn says her goal is to shed light on privacy issues, not to bring digital health research to a halt. “We want to avoid a situation where we post data without understanding the consequences,” she says.
Improving data privacy
Instead of encouraging people to avoid sharing data, she says, she hopes to start a conversation about finding new ways to share data safely. A number of proposals are already being developed to improve the privacy of personal health data. One potential part of the solution, Dunn says, is so-called “secure enclaves,” hardware being rolled out that provides an extra layer of protection by cordoning off sensitive information from users or applications that do not need access to it.
Another notion is the use of federated learning systems. “So rather than bringing data to your algorithms, you bring your algorithms to data,” Dunn says. Such a system would allow algorithms to learn from and analyze data without needing to compromise the security of the data.
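One common flavor of federated learning is federated averaging: each site trains on its own data and shares only model parameters, which a central server combines. A toy sketch under that assumption, where the “model” is just a per-site mean and the data are invented:

```python
def local_update(site_data):
    """Each site computes a model parameter locally; raw data never leaves."""
    return sum(site_data) / len(site_data)

def federated_average(site_params, site_sizes):
    """Central server combines parameters, weighted by each site's data size."""
    total = sum(site_sizes)
    return sum(p * n for p, n in zip(site_params, site_sizes)) / total

# Three hospitals holding private heart-rate readings (invented values).
sites = [[60, 62, 64], [70, 72], [58, 59, 61, 62]]

params = [local_update(data) for data in sites]           # shared: 3 numbers
global_param = federated_average(params, [len(d) for d in sites])
print(round(global_param, 2))  # prints 63.11
```

Only three summary numbers cross site boundaries; the individual readings stay where they were generated, which is the privacy argument for bringing the algorithm to the data.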
Another important point, Dunn says, is for researchers to be more discerning and to think more critically about who needs access to which data. Rather than making an entire data set publicly available to anyone, they could offer tiers of access, making sure the right data get into the “right hands, for the right reasons,” she says.
Sharing data involves significant risks, she says, but it also brings the potential for monumental scientific advancement.
“And so we have to be able to think about both sides of that and make informed decisions,” Dunn comments. “There’s a lot we have
Jared Kaltwasser is a writer in Iowa and a frequent contributor to Managed Healthcare Executive.