Perhaps more than ever before, massive amounts of health data are available at the click of a mouse. HealthMap, for example, uses an automated process that runs around the clock to integrate and distribute online information about current and emerging infectious diseases. Similarly, Google Flu Trends aggregates the search engine’s data to estimate real-time flu activity across the globe. With large databases now analyzed routinely, the phrase “Big Data” has become increasingly common among researchers and others in academia. It refers to structured and unstructured data characterized by an exponential growth rate in volume.
Big Data certainly contributes to improvements in statistical and computational methods across a wide range of fields, but it also challenges the traditional scientific approach to data assessment. Instead of first generating hypotheses about what particular information may reveal, researchers sometimes examine raw data and apply statistical algorithms to find patterns and correlations that theory alone cannot predict. In fact, in a report titled “The Promise and Peril of Big Data,” Stefaan Verhulst, Chief of Research at the Markle Foundation, asserted that increased data collection does not necessarily yield more knowledge, because researchers can get caught up exploring the seemingly never-ending possibilities of analyzing big data. For this reason, when it comes to Big Data, there are instances when “less is more,” as it becomes challenging to identify which data points are actually needed to develop a theory or make decisions.
A research letter by David Scales et al. offers an instructive example of weighing a specific research project’s potential value against the big picture (no pun intended) that its results would help create for professionals and others seeking knowledge on a topic. The letter, published in the American Journal of Preventive Medicine, was motivated by the lack of comprehensive national public surveillance data for tuberculosis (TB) in the United States. Before the work by Scales and his colleagues, the Centers for Disease Control and Prevention (CDC) published TB data only at the state level. Scales and his team argued that a nationwide data set could be useful for examining TB trends across states, Metropolitan Statistical Areas, and counties. By persuading states to allow their data to be used and shared with the public, the researchers produced a county-level, interactive map of TB rates in the United States (accessible at healthmap.org/tb). The main takeaway from the incidence rates they calculated using five-year (2006-2010) county-level case counts was that more than 600 counties have TB rates above the 2011 national rate.
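The underlying arithmetic of a five-year incidence rate is straightforward. The sketch below illustrates the kind of calculation involved; the county names, case counts, populations, and threshold value are all invented for illustration and are not the study’s actual data or methods.

```python
# Hypothetical sketch of a five-year county-level TB incidence calculation.
# All figures below are invented; they do not come from Scales et al.

def incidence_per_100k(case_count, person_years):
    """Cases per 100,000 person-years over the observation window."""
    return 100_000 * case_count / person_years

# county -> (TB cases 2006-2010, average population over the period)
counties = {
    "County A": (52, 180_000),
    "County B": (3, 95_000),
    "County C": (210, 1_400_000),
}

NATIONAL_RATE = 3.4  # per 100,000; an illustrative comparison threshold

for name, (cases, avg_pop) in counties.items():
    # Five years of observation contribute 5 * avg_pop person-years.
    rate = incidence_per_100k(cases, 5 * avg_pop)
    flag = "above" if rate > NATIONAL_RATE else "at or below"
    print(f"{name}: {rate:.1f} per 100,000 ({flag} the comparison rate)")
```

Dividing by person-years rather than a single year’s population is one common way to average out population change over the window; the study’s actual denominator choices may differ.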
To get a better sense of differing perspectives on the usefulness of Big Data, I contacted Dr. Edward Nardell and Dr. Eric Rubin, both professors in the Department of Immunology and Infectious Diseases at the Harvard School of Public Health. I asked them how a county-level map of TB rates could affect case finding and treatment of TB in the short and long term, and their responses raised interesting points. Dr. Nardell, a former TB Control Officer with the Health Department in Boston, said that although the CDC does not provide TB case rates by county, its report “gives a pretty comprehensive picture of TB in the U.S.,” and in his experience, it was the “local state data that mattered most.” He added that when cross-state issues arose, the Department would discuss them with the states directly. Dr. Rubin felt that spatiotemporal data could aid TB control efforts in areas with under-reporting, but he doubted that a county-level map would be especially useful for the United States, mainly because people in the field are already familiar with the disease hotspots and with the patterns Scales and his colleagues found. As he put it, “rural cases are rare in the U.S. and happen along the border; conversely, cities with large populations of new immigrants from endemic areas have the highest rates. No surprises here.”
I interviewed Scales to learn more about why he and his colleagues strongly recommended that all states publish their county-level TB data online. He emphasized that because the spread of TB depends on proximity between infected and uninfected individuals, data at a more granular level is one step toward the ideal data set for TB surveillance. In his view, the Holy Grail would be records combining demographic and census tract data, and he maintains that his team’s research is not the “end of the story.” Rather, the availability of county-level TB data could lead to further research by making it feasible for investigators to update their data while saving considerable time. As he put it, “In the right hands, someone may be able to find potentially surprising correlations.” For example, the census records how many people in a county take public transportation, so one could investigate on a national level whether public transit riders are at higher risk of contracting TB. Therefore, “What’s novel about this paper is not [so much] the results. It’s the data” it provides to the public. When I asked Scales whether the quest for Big Data will ever end, he replied that it is finite because the possibilities are bounded by privacy concerns. Still, he stressed that there is much more data to obtain and much more work to do before hitting a patient confidentiality roadblock or another boundary that would end studies involving Big Data.
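The transit example Scales describes amounts to pairing each county’s TB rate with a census covariate and checking for association. A minimal sketch of that first step, with entirely invented numbers (and the usual caveat that a correlation at the county level says nothing about individual risk or causation):

```python
# Hypothetical sketch of a county-level correlation check, in the spirit of
# the analysis Scales describes. All data points are invented for illustration.

import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Each row: (percent of commuters using public transit, TB rate per 100,000)
county_data = [
    (2.1, 1.8), (4.5, 2.9), (12.0, 4.1), (31.5, 6.3), (0.8, 1.2),
]
transit_share = [row[0] for row in county_data]
tb_rate = [row[1] for row in county_data]

r = pearson_r(transit_share, tb_rate)
print(f"Pearson r = {r:.2f}")
```

A real analysis would need many more counties, confounder adjustment (population density, immigration patterns), and awareness of the ecological fallacy; this sketch shows only the mechanical pairing of surveillance data with a census variable.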