There are plenty of studies about tracking diseases (such as influenza) using digital data sources, which is awesome! However, many of these studies focus solely on matching trends in the digital data (for example, searches on disease-related terms, or how frequently those terms are mentioned on social media over time) to data from official sources such as the Centers for Disease Control and Prevention. Although this approach tells us something about the possible utility of these data, it has several limitations. One of the main ones is the difficulty of distinguishing between data generated by healthy individuals and data generated by individuals who are actually sick. In other words, how can we tell whether someone who searches Google or Wikipedia for influenza is sick or just curious about the flu?
Researchers at Penn State University have developed a system that seeks to deal with this limitation. We spoke to the lead author, Todd Bodnar, about the study, titled “On the Ground Validation of Online Diagnosis with Twitter and Medical Records.” According to Bodnar, the study was born from a desire to “approach digital disease detection from a new angle” given the recent criticism of the reliability of Google Flu Trends (GFT). He and his co-authors were interested in whether individual people are sick or not, rather than in the average rate of influenza in the population.
In the paper, they present a novel approach for disease detection at the individual level using social media data. “We started with data from people that we knew were actually sick. We collaborated with our university's health services to find people that were diagnosed with influenza. From there, we took their twitter data and tried to develop an automated diagnosis system that matched the doctor's diagnosis,” writes Bodnar in an email correspondence.
This initial study included 104 Twitter “seed” accounts (meaning the initial or primary accounts being examined). The authors collected 37,599 tweets from these accounts, in addition to 30,950,958 tweets from accounts of individuals that were followed by or that followed the seed accounts. To classify individuals as “sick” or “not sick” using these data, they developed classification schemes based on the presence or absence of flu-related keywords (flu, influenza, sick, cough, cold, medicine, fever); manual labeling of tweets based on hints indicating illness (such as “another doctor’s appointment…”); the rate at which individuals tweet, since illness can change tweeting behavior; and analysis of tweets by individuals that were followed by or that followed the seed accounts. In total, the researchers used five methods for detecting whether an individual was sick.
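To make the keyword-based scheme concrete, here is a minimal Python sketch of what flagging tweets by flu-related keywords could look like. The keyword list is the one given above; everything else (the function names keyword_hits and mentions_illness, whole-word matching, ignoring case) is an assumption for illustration and not the authors' actual implementation.

```python
import re

# Keywords as listed in the article. The matching rules (whole words,
# case-insensitive) are assumptions, not details taken from the study.
FLU_KEYWORDS = ("flu", "influenza", "sick", "cough", "cold", "medicine", "fever")
_PATTERN = re.compile(r"\b(" + "|".join(FLU_KEYWORDS) + r")\b", re.IGNORECASE)


def keyword_hits(tweet):
    """Return the sorted, lowercased flu-related keywords found in one tweet."""
    return sorted({match.lower() for match in _PATTERN.findall(tweet)})


def mentions_illness(tweets):
    """Return True if any tweet in the list contains a flu-related keyword."""
    return any(keyword_hits(tweet) for tweet in tweets)


if __name__ == "__main__":
    sample = ["Stuck in bed with a fever and a cough", "Heading to a party tonight!"]
    for tweet in sample:
        print(tweet, "->", keyword_hits(tweet))
```

A check like this only captures people who talk about being sick, which is why a keyword match alone cannot separate the sick from the merely curious.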
So what did they find? Bodnar states that “about half of the active Twitter users we surveyed actually discussed being sick on Twitter. We were able to diagnose the other half accurately by data-mining more subtle clues from their Twitter stream. For example, if someone says that she's going to a party, she's probably not sick. On the other hand, a reduction of tweeting rates by more than one standard deviation results in a 28.54 percent increase in likelihood of illness.” Bodnar also says that “the system matched the professional diagnosis more than 99 percent of the time.” Basically, the authors show that by using social media data (specifically tweets), they can tell whether an individual is actually sick or not.
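The tweeting-rate signal Bodnar mentions can be sketched in a few lines of Python. This is only an illustration: the article does not say how the baseline was measured, so the use of daily tweet counts, the sample standard deviation, and the function name rate_drop_flag are all assumptions.

```python
from statistics import mean, stdev


def rate_drop_flag(baseline_counts, current_count):
    """Return True if current_count is more than one sample standard
    deviation below the mean of baseline_counts.

    baseline_counts: tweets per day over a reference period (assumed unit).
    current_count: tweets on the day being checked.
    """
    if len(baseline_counts) < 2:
        return False  # not enough history to estimate a spread
    spread = stdev(baseline_counts)
    if spread == 0:
        return False  # perfectly constant history; no meaningful threshold
    return current_count < mean(baseline_counts) - spread


if __name__ == "__main__":
    history = [10, 12, 11, 9, 13]  # hypothetical daily tweet counts
    print(rate_drop_flag(history, 8))
    print(rate_drop_flag(history, 10))
```

A flag like this would be one input among several, not a diagnosis on its own; in the study it is reported as raising the likelihood of illness rather than deciding it.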
Obviously there are several ethical and technical challenges to doing a study like this since it involves personal data, which can be very sensitive. (Check out this recent DDD article, One Researcher’s Take on Twitter, Research and Privacy, for a discussion of some of the ethical issues in the field.) The researchers had to submit their study proposal to what is called an IRB, or an institutional review board. Institutions that conduct research generally have a board of individuals who review research proposals and determine whether or not they are ethically sound.
“We're actually planning on applying this to HIV in the future. It's a more complex problem. We probably couldn't diagnose people that weren't aware that they're HIV-positive, but could use it as a stepping point for looking at other behaviors such as promiscuity or anti-retroviral usage,” writes Bodnar. The authors are also working on applying this method on a larger scale.
And for those of you who are students and interested in Digital Disease Detection, Bodnar advises: “Be interested in listening to whispers in the data, but at the same time, don't look at the moon and claim to see a face!”