Exploratory Analysis of Social Media Data Using Basic Natural Language Processing (NLP): A Case Study to Characterize Foodborne Illness

Ordun, Catherine

BACKGROUND: Social media data contains a variety of useful information characterizing how people communicate sickness, potential sources of illness, location of infection, and medical attention-seeking behavior. Enhancing understanding of these communications provides opportunities for improved detection and monitoring of illness and for the development of targeted public health interventions. Natural Language Processing (NLP) is often used to mine social media. Although it is a rich method for analyzing unstructured data, NLP requires specialized skill sets and can be costly and time-consuming.

METHODS: We developed a “basic NLP” technique to analyze a corpus of 127,013 tweets posted from 12/23/2011 to 8/5/2012. The technique applies stacked regular expressions in Excel and R, followed by Latent Dirichlet Allocation (LDA) topic modeling in Mallet. For users with limited time and budget, this technique stands on its own: it provides immediate, insightful, and actionable information to characterize social media data.
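
As an illustration of what "stacked" regular expressions can look like, the minimal sketch below applies a sequence of patterns in order, so that each stage filters the output of the previous one. It is written in Python rather than Excel or R purely for compactness. The patterns, stage labels, and sample tweets are hypothetical assumptions chosen for illustration; they are not the syntax forms or filters used in this study.

```python
import re

# Hypothetical filter stack: each stage is a (label, compiled pattern) pair.
# These patterns are illustrative assumptions, not the filters from the study.
FILTER_STACK = [
    ("mentions_food_poisoning", re.compile(r"\bfood\s+poisoning\b", re.IGNORECASE)),
    ("first_person", re.compile(r"\b(i|i'm|i've|my|me)\b", re.IGNORECASE)),
    ("not_retweet", re.compile(r"^(?!RT\b)")),
]


def apply_stack(tweets, stack=FILTER_STACK):
    """Apply each pattern in order, keeping only tweets that pass every stage so far.

    Returns the surviving tweets and the number remaining after each stage.
    """
    remaining = list(tweets)
    counts = []
    for label, pattern in stack:
        remaining = [t for t in remaining if pattern.search(t)]
        counts.append((label, len(remaining)))
    return remaining, counts


# Made-up sample tweets for demonstration only.
sample = [
    "I think I got food poisoning from that taco place",
    "RT @user: I got food poisoning last night",
    "Food poisoning outbreak reported downtown",
    "My lunch was great today",
]

kept, counts = apply_stack(sample)
for label, n in counts:
    print(f"{label}: {n} tweets remaining")
```

Recording the count after each stage shows how much of the corpus each filter removes, which is one way to decide whether a given pattern is too broad or too narrow before moving on to topic modeling.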

RESULTS:

We identified 5 standard forms of syntax used to express food poisoning complaints through Twitter.
We filtered 6 geolocation data categories commonly reported in Twitter data. Consistent with published research, we found that while 24% of tweets had geolocation enabled, only 1.2% of tweets were actually geotagged. Less than 1% of all tweets contained all 6 geolocation fields analyzed. Of these, a nominal number provided geographic context clues that allowed us to validate geolocation.
Topic modeling results implied that users tweeted more frequently about what may have caused their food poisoning (7%) than about medical attention-seeking behavior (2.2%) or geolocation/geotagged information (1.2%). Reports of medical attention-seeking behavior were infrequent: topic modeling implied that only 2.2% of tweets mentioned visiting a doctor or hospital, or taking drugs or other remedies.
Random sampling (95% CI) indicated that nearly 9% of tweets mentioned a restaurant implicated in food poisoning, followed by a food product (7.9%). Random sampling also indicated that users frequently mentioned symptoms (5.7%), whereas only 2.0% of tweets indicated visiting a hospital or medical specialist, and a nominal number mentioned taking drugs, medication, or other remedies.
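
For readers who want to see how a sampled proportion and its 95% confidence interval can be computed, the sketch below uses the normal-approximation (Wald) interval in Python. The abstract does not state the sample size or the interval method used, so the function name, the counts (34 of 383), and the choice of the Wald interval are all assumptions for illustration only.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a sample proportion.

    z=1.96 corresponds to a 95% confidence level.
    Returns (estimate, lower bound, upper bound), with bounds clipped to [0, 1].
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical counts: 34 of 383 randomly sampled tweets mention a restaurant.
# These numbers are assumptions, not figures reported in the study.
p, low, high = proportion_ci(34, 383)
print(f"estimate: {p:.1%}, 95% CI: {low:.1%} to {high:.1%}")
```

For small proportions or small samples, the Wald interval can be inaccurate, and an alternative such as the Wilson score interval may be preferable.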

CONCLUSIONS: These results allow users to identify knowledge gaps, develop use cases, and better understand social media data. This presentation will focus specifically on Twitter data and demonstrate our “basic NLP” techniques using Excel, R, and Mallet to provide immediate, insightful, and actionable information to characterize social media data.