If you are one of the 255 million monthly active users of the social networking site Twitter, you may be surprised to learn that you are contributing to science. Researchers all over the world use tweets to study infectious diseases, politics, linguistics, natural disasters, and more. Digital data mining has been described as holding ‘unparalleled potential for epidemiology.’ Although tweets are a useful data source, some worry that these studies may violate privacy.
Unlike other social networking sites, Twitter has an application programming interface (API) that allows automatic collection of streaming data. As people send tweets, their messages, user information, and profile information are packaged up and sent through the API in real time, where they can be collected by anyone who has set up a program to listen.
Most studies that use data from Twitter simply count keywords, like the number of times ‘sneeze’ is tweeted each day. However, more elaborate study designs are possible: for example, Twitter’s optional geotagging feature appends the user’s exact latitude and longitude to the tweet. These coordinates can be used to identify a person’s home, work and school locations, or to construct activity schedules.
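The keyword counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records below are made up, the field names `text` and `created_at` and the timestamp format are a simplified stand-in rather than the exact payload Twitter's API delivers, and `daily_keyword_counts` is a hypothetical helper, not a method taken from any particular study. The records deliberately carry no usernames or coordinates, since counting in aggregate does not need them.

```python
import re
from collections import Counter
from datetime import datetime


def daily_keyword_counts(tweets, keyword):
    """Count, per calendar day, the tweets whose text contains `keyword` as a whole word."""
    # Whole-word, case-insensitive match: 'sneeze' matches 'SNEEZE' but not 'sneezing'.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    counts = Counter()
    for tweet in tweets:
        if pattern.search(tweet["text"]):
            # Assumed timestamp format for this sketch: 'YYYY-MM-DD HH:MM:SS'.
            day = datetime.strptime(tweet["created_at"], "%Y-%m-%d %H:%M:%S").date()
            counts[day.isoformat()] += 1
    return dict(counts)


# Made-up sample records, for illustration only.
sample_tweets = [
    {"text": "I can't stop sneezing today", "created_at": "2014-05-01 08:15:00"},
    {"text": "Sneeze count: 12 before lunch", "created_at": "2014-05-01 11:30:00"},
    {"text": "Every sneeze hurts", "created_at": "2014-05-01 19:00:00"},
    {"text": "That was one loud SNEEZE", "created_at": "2014-05-02 09:45:00"},
    {"text": "Nice weather this morning", "created_at": "2014-05-02 10:00:00"},
]

print(daily_keyword_counts(sample_tweets, "sneeze"))
```

Matching on whole words is a design choice: it keeps the count narrow and predictable, at the cost of missing variants such as ‘sneezing’ unless they are counted as separate keywords.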
The ethics of these practices is unclear. The Obama Administration has addressed similar problems in the corporate sector by conducting a Big Data and Privacy Review, and by issuing the Consumer Privacy Bill of Rights. Both are efforts to protect people from invasions of privacy by companies using their data for marketing and sales.
In the research world, however, the issue is far from settled. Some researchers argue that tweets are public by design, and that any use of that data is appropriate. Others feel that Twitter users might not know or understand that their tweets are being used for research purposes, or may not understand how personal information can be curated beyond what they intended to release.
The truth is that tweets are collected and scrutinized by researchers (and businesses) for a variety of purposes. Sometimes those purposes preserve privacy, and sometimes they don’t. There are numerous scientific papers that include usernames and tweet texts in their entirety, which can easily be entered into a search engine to find the original author. There are also efforts to infer a user’s age, sex, and location by analyzing tweet texts for clues. Tweet metadata can also sometimes be used to piece together the identity of Twitter users, which researchers could then use to look up additional information on Facebook, LinkedIn, or the White Pages. These practices are unusually invasive, and violate the privacy of the people providing the data.
Normally, research involving people is approved and monitored by Institutional Review Boards (IRBs). However, studies using publicly available data are not required to seek approval from an IRB, so they are currently unregulated. When IRB rules were formulated, platforms for users to unwittingly volunteer their non-anonymous data for research did not exist. The advent of social networking sites like Twitter necessitates an update of the rules governing human research.
My colleague and I have suggested a number of guidelines for ethically using Twitter for research, including not publishing identifiable information; never using Twitter data to find and curate data from other sources (i.e. ‘snowballing’); and coordinating with Institutional Review Boards before starting projects that require collecting data from specific people, rather than in aggregate.
As use of Twitter for research purposes increases, so too will privacy concerns. Twitter users who wish to protect their privacy may do so by refraining from posting identifiable information; ensuring that geotagging is turned off; or enabling the ‘protect my tweets’ setting.