
Text Analysis

This guide provides a list of tools, resources, and a hands-on training activity for learning about text analysis.

Twitter Data for Text Analysis

Restrictions on Twitter Data

Below are a few options for extracting Twitter data for text analysis. For any method you choose, bear in mind some restrictions on what you can do with extracted data.

You can use the data for a text analysis and share your results online or in a written publication (e.g. write about or create a graphic describing different trends). 

You can't distribute scraped Tweets. In other words, don't post your Excel spreadsheet of Tweets on a research website or similar platform for others to view and download. This is to protect privacy. Read Twitter's policy on Content Redistribution to Third Parties to fully understand these rules, as well as potential workarounds if you're an academic researcher or educator.

You can distribute a dataset that lists only Tweet IDs. Twitter assigns unique ID numbers to accounts, Tweets, and direct messages; a Tweet ID identifies a single Tweet. With a list of Tweet IDs and a hydrator tool (e.g. DocNow's Tweet Hydrator), another person can reproduce your dataset. This is allowed under Twitter's policy and is a good option if you want your raw data (i.e. your Excel spreadsheet) to be available for others to review or reuse.
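As a rough illustration of preparing an ID-only dataset, the sketch below pulls the Tweet IDs out of a spreadsheet saved as CSV. The column name `id` and the sample rows are assumptions made for this example; your own export may use a different header.

```python
import csv
import io


def extract_tweet_ids(csv_file, id_column="id"):
    """Return the unique Tweet IDs in a CSV file object, in first-seen order.

    `id_column` is an assumed header name; change it to match your file.
    """
    seen = set()
    ids = []
    for row in csv.DictReader(csv_file):
        tweet_id = (row.get(id_column) or "").strip()
        # Keep IDs as strings: they are long numbers that spreadsheet
        # software may round if it treats them as numeric values.
        if tweet_id.isdigit() and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


# In-memory stand-in for a real export (made-up rows, not real Tweets).
sample = io.StringIO(
    "id,user,text\n"
    "1000000000000000001,example_user,first made-up tweet\n"
    "1000000000000000002,another_user,second made-up tweet\n"
    "1000000000000000001,example_user,first made-up tweet\n"
)
ids = extract_tweet_ids(sample)
print("\n".join(ids))  # one ID per line, duplicates removed
```

To use this on a real file, pass `open("my_tweets.csv", newline="", encoding="utf-8")` (a hypothetical filename) in place of `sample`, then write the resulting IDs to a plain text file, one per line, and share that file instead of the Tweets themselves.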

Making a Twitter Dataset using TweetSets 

TweetSets is a great option for beginners interested in using Twitter data, and it legally shares Tweets via ID. This website is managed by GW University Libraries. No coding experience is necessary. Because these are pre-made datasets, you will be limited to Tweets on certain subjects and from certain dates.

Choose a topic of interest (e.g. "climate change", "2016 election", etc.) and from there, narrow the results by particular keywords, dates, etc. In order to adhere to Twitter's policy on Content Redistribution to Third Parties, TweetSets will provide you with a file of Tweet IDs. Use DocNow's Tweet Hydrator to look up these IDs on Twitter and retrieve the full Tweets. The resulting spreadsheet lists usernames, tweet contents, hashtags, dates, etc.
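Once you have a hydrated spreadsheet, a first text analysis can be quite small. The sketch below counts hashtags in the tweet contents of a file saved as CSV. The column name `text` and the sample rows are assumptions for illustration; check the headers in your own file.

```python
import csv
import io
import re
from collections import Counter

# A hashtag is taken here to be "#" followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")


def count_hashtags(csv_file, text_column="text"):
    """Count hashtags (case-insensitively) in the given column of a CSV file object."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        text = row.get(text_column) or ""
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


# In-memory stand-in for a hydrated file (made-up rows, not real Tweets).
sample = io.StringIO(
    "id,text\n"
    "1,Made-up tweet about #Climate and #energy\n"
    "2,Another made-up tweet on #climate\n"
)
counts = count_hashtags(sample)
print(counts.most_common(2))  # [('climate', 2), ('energy', 1)]
```

Lowercasing the tags means `#Climate` and `#climate` are counted together, which is usually what you want when describing trends.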

Here is an excellent tutorial created by the Programming Historian.

Making a Twitter Dataset with a Web Scraper

When you rely on Tweet ID datasets posted online, your options are limited. You might have specific key terms or a timeframe that has not been captured by TweetSets or other similar resources. Web scrapers are tools (most often built in Python) that automatically search a website for your key term(s) and store the results in a format you specify. They can accommodate complex search strings that integrate hashtags, exclude certain phrases, or isolate Tweets from specific regions or within a particular timeframe.
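To give a sense of what a complex search string looks like, here is a minimal sketch that assembles one from keywords, hashtags, excluded phrases, and a date range. It assumes your scraper accepts the operators used in Twitter's advanced search (quoted phrases, `-` for exclusion, `since:` and `until:` for dates); the function `build_query` and its parameters are hypothetical names made up for this example, so check your tool's documentation before relying on the exact syntax.

```python
from datetime import date


def build_query(keywords, hashtags=(), exclude=(), since=None, until=None):
    """Assemble a search string; the operator syntax is an assumption (see above)."""
    parts = []
    for kw in keywords:
        # Quote multi-word keywords so they are matched as an exact phrase.
        parts.append(f'"{kw}"' if " " in kw else kw)
    # Accept hashtags written with or without the leading "#".
    parts.extend("#" + tag.lstrip("#") for tag in hashtags)
    for phrase in exclude:
        # A leading "-" excludes a word or quoted phrase.
        parts.append(f'-"{phrase}"' if " " in phrase else f"-{phrase}")
    if since:
        parts.append(f"since:{since.isoformat()}")
    if until:
        parts.append(f"until:{until.isoformat()}")
    return " ".join(parts)


query = build_query(
    keywords=["climate change"],
    hashtags=["COP26"],
    exclude=["hoax"],
    since=date(2021, 10, 1),
    until=date(2021, 11, 30),
)
print(query)  # "climate change" #COP26 -hoax since:2021-10-01 until:2021-11-30
```

Building the string in one place makes it easy to record the exact search you ran, which helps others reproduce your dataset.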

Web scrapers do require programming knowledge, even if you find a ready-made option on GitHub. Check out this example created by programmers at Microsoft. And remember: you can always share the results of your text analysis, but you need to read Twitter's policy on Content Redistribution to Third Parties before you consider sharing a downloadable version of the actual tweets.