Text analysis for public health
Another day in the global pandemic. Average Joes are busy tweeting about it, politicians give interviews on the latest plans, and newspapers publish article after article on vaccination levels, case counts, and the booster shot. That’s a ton of information. So much, in fact, that it would be pretty nice to have some computer-assisted help sorting through it.
Enter stage right: text analysis. Just what is it, and in the midst of COVID-19, how can it be used to advance public health?
Text analysis is a family of analytic techniques used to identify patterns and meaning in unstructured text, that is, text that a computer can’t readily understand. In other words, most qualitative data. And there is a lot of that sort of data floating around. We’re talking tweets, Reddit posts, and emails, but also electronic health records (EHRs), books, and even academic research. You’ll probably agree that in that list alone, there’s a lot of valuable data!
This post is meant to serve as an overview of text analysis techniques in public health. To do that, I’ll briefly discuss major text analysis approaches, highlight a few broad applications in the field, present two case studies for a deeper dive, and finally, identify some important limitations.
I should note this post is not: 1) a how-to guide for text analysis, 2) an in-depth discussion of the latest and greatest machine learning text-analysis techniques, or 3) a systematic review of public health text analysis applications. Rather, I hope this cracks the door open for newcomers to text analysis through tangible examples applied to the COVID-19 pandemic.
Common Approaches to Text Analysis
Text analysis comes in many flavors. Many analytic approaches exist for different goals, and many analysis techniques interact or build off of each other. I describe some common techniques below. Take a look at the recommended resources at the end of this post for more details.
N-grams: You may have seen word-cloud-style visuals created through N-gram analysis. An N-gram is a run of N consecutive words (two for a bigram, three for a trigram), and an N-gram approach identifies which words are common neighbors or pairings in order to pick out ideas frequently discussed together. Other text analysis techniques, like topic modeling, can build off these N-grams. Which brings us to…
Topic modeling: Aims to identify the primary themes of a body of text and facilitates analysis of how they relate to each other. This is done statistically, typically by measuring which words tend to appear together and treating that as a kind of “distance” or “similarity” between words.
Sentiment analysis: In its simplest form, sentiment analysis assesses positive, neutral, or negative feelings towards a subject. More complicated analysis identifies a broader variety of tones, like anger and happiness.
Named entity recognition: Allows speedy identification of the people, places, and organizations that come up again and again in a body of text. This gives the user some insight into the possible influence of these entities.
One note here: text analysis is not a replacement for traditional qualitative methods. Rather, it should augment content analysis, discourse analysis, and other qualitative approaches (and vice versa).
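To make two of these techniques a little more tangible, here is a minimal Python sketch of bigram counting and lexicon-based sentiment scoring. Everything in it is an assumption for illustration: the example posts are made up, the POSITIVE and NEGATIVE word lists are tiny hand-picked stand-ins for a real sentiment lexicon, and the helper names (tokenize, ngram_counts, sentiment) are hypothetical. A real project would lean on an established text analysis library instead.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(texts, n=2):
    """Count every run of n consecutive words across a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Tiny hand-made word lists, for illustration only (not a real lexicon).
POSITIVE = {"safe", "effective", "grateful", "relieved"}
NEGATIVE = {"worried", "scared", "dangerous", "angry"}


def sentiment(text):
    """Label a text by counting positive words minus negative words."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up example posts.
posts = [
    "Got my booster shot today, feeling relieved and grateful",
    "Worried about side effects from the booster shot",
    "The booster shot clinic opens at nine",
]

bigrams = ngram_counts(posts, n=2)
labels = [sentiment(post) for post in posts]

print(bigrams.most_common(1))
print(labels)
```

In this toy example the most common bigram is “booster shot”, and the three posts come out as positive, negative, and neutral. A word-counting approach like this has no notion of negation or sarcasm, so treat it as a starting point only.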
Applications in Public Health: Selected Highlights
Public health research and practice are already taking advantage of text analysis, but there’s room to mature and many new avenues continue to open.
Public health surveillance, for one, has dabbled in text analysis for over a decade. By surveillance, I mean collecting data on emerging disease trends and existing disease burdens to inform policy and programs. Some epidemiologists are using social media data for syndromic surveillance, or tracking of symptom patterns in the population prior to a firm diagnosis.1 Syndromic surveillance is essentially an early warning system that can sound the alarm on things like flu outbreaks or opioid use spikes.
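As a rough illustration of the early-warning idea, the Python sketch below counts posts per day that mention a symptom keyword and flags days that jump well above the recent baseline. It is a toy under stated assumptions, not how any particular surveillance system works: the keyword list, the example posts and counts, the seven-day baseline, and the three-standard-deviation threshold are all made up for illustration.

```python
import re
from collections import Counter
from statistics import mean, stdev

# Hypothetical keyword list; a real system would use a vetted vocabulary.
SYMPTOM_TERMS = {"fever", "cough", "chills"}


def daily_mentions(posts):
    """Count, per day, the posts that mention at least one symptom term.

    posts is an iterable of (day, text) pairs.
    """
    counts = Counter()
    for day, text in posts:
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & SYMPTOM_TERMS:
            counts[day] += 1
    return counts


def flag_spikes(series, baseline_days=7, threshold=3.0):
    """Flag days whose count is far above the preceding baseline window.

    series is a list of (day, count) pairs in date order.
    """
    flagged = []
    for i in range(baseline_days, len(series)):
        window = [count for _, count in series[i - baseline_days:i]]
        # Floor the spread at 1.0 (an arbitrary choice) so a very flat
        # baseline does not turn every small bump into an alarm.
        spread = max(stdev(window), 1.0)
        day, count = series[i]
        if count > mean(window) + threshold * spread:
            flagged.append(day)
    return flagged


# Made-up posts, just to show the counting step.
posts = [
    ("day 1", "Woke up with a fever and a cough."),
    ("day 1", "Great weather today"),
    ("day 2", "Chills all night, staying home"),
]
mentions = daily_mentions(posts)

# Made-up daily counts, just to show the flagging step.
daily_counts = [4, 5, 3, 4, 6, 5, 4, 5, 14, 5]
series = [(f"day {i + 1}", count) for i, count in enumerate(daily_counts)]
spikes = flag_spikes(series)

print(dict(mentions))
print(spikes)
```

With these invented numbers, only the ninth day (14 mentions against a baseline of roughly 4 to 5) gets flagged. Keyword matching this crude will also pick up jokes, news links, and figures of speech, so any alert would need a human look before anyone acts on it.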
Another major application is assessing public perceptions of a health policy or program. The public voice now available on social media provides a massive amount of data that can be used to gauge how, for example, city residents feel about the COVID-19 vaccine and what might be contributing to vaccine hesitancy.2 One of the biggest advantages is the ease of access to this sort of data. Rapid social media collection and analysis can provide insights that complement a survey of the same population, which will inevitably take longer to roll out and complete.
This is far from an exhaustive list, but hopefully it provides a taste of how we can use these tools to support public health’s mission. To see text analysis approaches in action, I provide two case studies from recent COVID-19 research below.
Case Study One: Assessing How Canadian Media Covers COVID-19
Early in the pandemic, Poirier et al. (2020) used text analysis to investigate how mainstream Canadian news media framed the COVID-19 emergency.3 Briefly, framing refers to how the media discusses a subject, in this case, the coronavirus: what topics are brought up in relation to the virus, how often, and in what tone. News media is a major influence on public interpretation of many subjects, including health.
This article uses a popular topic modeling technique (latent Dirichlet allocation) to identify topics, interpreted here as frames. The researchers ultimately pull out six major frames: “Chinese Outbreak”, “Economic Crisis”, “Health Crisis”, “Helping Canadians”, “Social impact”, and “Western Deterioration.” Then they compare how often these frames show up in each media outlet, and the changes in popularity of each framing over a period of months.
So, here, topic modeling provides insights into how the Canadian public may be thinking about COVID-19, something very useful for public health authorities to know as they shape public service announcements, press releases, and communications campaigns.
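The outlet comparison step can be sketched in a few lines of Python. This is not the authors’ code: it assumes each article has already been assigned a single dominant frame (a simplification, since topic models usually give each document a mix of topics), and the outlet names and article labels are invented for illustration.

```python
from collections import Counter, defaultdict


def frame_shares(articles):
    """Return, for each outlet, the share of its articles under each frame.

    articles is an iterable of (outlet, frame) pairs, one per article.
    """
    by_outlet = defaultdict(Counter)
    for outlet, frame in articles:
        by_outlet[outlet][frame] += 1

    shares = {}
    for outlet, counts in by_outlet.items():
        total = sum(counts.values())
        shares[outlet] = {frame: n / total for frame, n in counts.items()}
    return shares


# Made-up labels: hypothetical outlets, frame names borrowed from the study.
articles = [
    ("Outlet A", "Health Crisis"),
    ("Outlet A", "Health Crisis"),
    ("Outlet A", "Economic Crisis"),
    ("Outlet B", "Economic Crisis"),
    ("Outlet B", "Helping Canadians"),
]

shares = frame_shares(articles)
for outlet, frames in shares.items():
    print(outlet, frames)
```

Using shares rather than raw counts keeps outlets that publish very different volumes comparable. Adding a month to each pair and grouping on (outlet, month) would give the over-time view in the same way.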
Case Study Two: Analyzing US Congress Members’ Facebook Communications on COVID-19
An article published just two months ago investigates partisan differences and the impact of tone in U.S. Congress members’ Facebook posts about COVID-19.4 Political elites, like the media, play a major role in shaping public understandings of a health crisis and setting the framing around related policy measures.
Similar to Case Study One, this article uses topic modeling to identify the primary topics of the posts. The authors add sentiment analysis to assess the tone (positive, negative, or neutral) of each message and analyze how the two political parties talk about each topic. Finally, they assess the reaction to and spread of the posts. What they found was evidence of a polarized political response, greater reactions to more polarized content, and greater spread for more negative comments. They also identified significant differences in how each party discussed the topics and how frequently they discussed the various subjects.
Intuitively, this probably makes sense to you if you’ve been following American politics during the pandemic. What this article adds, in my opinion, is a bird’s-eye view of political framing around COVID-19. At a glance, you can get a sense of how the parties compare in their discussion of the pandemic. It’s a snapshot of the national political landscape in which public health professionals try to create effective pandemic policies, and as we’ve seen, that landscape has a serious impact on those policies’ success.
Limitations
As with any tool, there are some limitations. For one, computers don’t always get sarcasm and can struggle to account for humor. Social media’s informality is another challenge. Users often play a little fast and loose with spelling, throwing in abbreviations at will to meet Twitter’s character limits. And how do you interpret a random squid emoji? Even humans don’t know.
Learned bias is also an issue for the machine learning algorithms used in text analysis (and for machine learning algorithms in general). You’ve likely heard some talk of this, and for good reason, as biased algorithms can easily reproduce inequity in our society. In the case of text analysis, algorithms are usually trained on a steady diet of human-produced text. Humans have all sorts of implicit and explicit biases on a range of topics, including race, gender, and religion. They express those biases in their written work, so the ethical text analyst needs to account for such bias and try to limit its influence. For more, see this article by the Brookings Institution, or this one on Vox.
These are relatively long-standing issues, and some workarounds exist to address them. However, that doesn’t mean they’re fully resolved. Some say we’re in the infancy of machine learning, and as such we’re likely to see better techniques in the future. Maybe one day, someone will even figure out what the squid emoji means.
Clearly, text analysis can contribute to public health. Indeed, it already has. Whether in surveillance, public communication strategy, or political landscape analysis, there are a lot of public health needs where these techniques can (and do) play a role. As they become more accessible, maybe we will come to see text analysis in the standard toolbox of government epidemiology divisions, non-profits, and healthcare systems. Time will tell. In the meantime, I hope this article encourages you to dig into text analysis and see what it can do for you.
For more, check out the following websites.
- R for Humanities – a great guide to using text analysis for humanities research, from framing your research question down to the finer details of implementing these techniques in R. General approaches apply to other fields as well.
- Text Mining with R, Julia Silge and David Robinson
- Digital Humanities at Berkeley – Text Analysis Resources
Last but not least, D-Lab regularly hosts its very own Introduction to Text Analysis Workshop! Keep an eye on the calendar for upcoming sessions.
1. Eysenbach G. Infodemiology and Infoveillance: Framework for an Emerging Set of Public Health Informatics Methods to Analyze Search, Communication and Publication Behavior on the Internet. J Med Internet Res. 2009;11(1):e11. doi:10.2196/jmir.1157
2. Zencity, Goldsmith S, Bennett Midland. Sentiment Analysis as a Local Public Health Tool: Using Community Insights to Combat COVID-19 Vaccine Hesitancy. 2021.
3. Poirier W, Ouellet C, Rancourt M, Béchard J, Dufresne Y. (Un)Covering the COVID-19 Pandemic: Framing Analysis of the Crisis in Canada. Can J Polit Sci. 2020;53(2):365-371. doi:10.1017/S0008423920000372
4. Box-Steffensmeier JM, Moses L. Meaningful messaging: Sentiment in elite social media communication with the public on the COVID-19 pandemic. Sci Adv. 2021;7(29):eabg2898. doi:10.1126/sciadv.abg2898
5. Scales D. Opportunities and Challenges for Developing Syndromic Surveillance Systems for the Detection of Social Epidemics. Online J Public Health Inform. 2020;12(1):e6. doi:10.5210/ojphi.v12i1.10579