Text Analysis

Finding, Cleaning, and Uploading Textual Data

Finding Data

Text analysis requires a digital text file. Your text can be anything: a novel, journal article, tweets, etc. Here are some things to keep in mind:

Journal Databases: Some tools have datasets embedded and ready for analysis. For example, Scopus pulls directly from Elsevier's databases. You don't need to upload anything!
Twitter: Tweets are a great way to assess public opinion. Downloading tweets requires some coding knowledge. Alternatively, there are some ready-made Twitter datasets. Check out this page for more information about Twitter data.
Scraping Websites: Websites like TripAdvisor and Yelp are another common data source. You can copy/paste reviews into an Excel sheet. A faster option is web scraping: writing code that automatically goes through the website, copies the text you're interested in, and pastes it into an Excel sheet. Web scraping tools are often written in Python, which you can learn at a library workshop. FYI: make sure you check the website's rules/limitations (for example, its terms of use) before you scrape. Many companies are OK with it if you're using the data for an educational purpose, but policies vary from site to site.
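
To give a feel for what a scraper does, here is a minimal Python sketch that pulls review text out of a page's HTML and writes it as spreadsheet-ready CSV. It uses only Python's standard library. The HTML fragment, the class name "review-text", and the function names are made up for illustration; a real website will use different markup, and downloading the page itself is not shown.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; a real site will use different tags and class names.
SAMPLE_HTML = """
<div class="review"><p class="review-text">Great view, friendly staff.</p></div>
<div class="ad"><p>Book now!</p></div>
<div class="review"><p class="review-text">Room was small &amp; noisy.</p></div>
"""


class ReviewParser(HTMLParser):
    """Collect the text inside every <p class="review-text"> element."""

    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "p" and ("class", "review-text") in attrs:
            self.in_review = True
            self.reviews.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_review = False

    def handle_data(self, data):
        # Only keep text that sits inside a review paragraph
        if self.in_review:
            self.reviews[-1] += data


def extract_reviews(html):
    """Return a list of review strings found in the HTML."""
    parser = ReviewParser()
    parser.feed(html)
    parser.close()
    return [review.strip() for review in parser.reviews]


def reviews_to_csv(reviews):
    """Return CSV text with one review per row under a 'review' heading."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["review"])
    for review in reviews:
        writer.writerow([review])
    return buffer.getvalue()


if __name__ == "__main__":
    print(reviews_to_csv(extract_reviews(SAMPLE_HTML)))
```

The CSV text can be saved to a file and opened in Excel. Notice that the advertisement paragraph is skipped because it does not carry the review class.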

Data Format

Every tool is different. Verify what data formats are acceptable.
PDF, Excel, and txt files are commonly accepted data formats. If you use a PDF, it needs to be a true (text-based) PDF, not a scanned image of a page. A good way to check is to try copying text from the PDF and pasting it into Word: if the text pastes correctly, the PDF is in true format.
If you have a scanned image of typed text, or handwriting, you will need to type it out! Optical character recognition (OCR) is a way for computers to recognize text in images, including handwriting, but that's a complex process and beyond the scope of this guide.

Be Flexible

Text analysis tools are all designed differently. You might need to experiment with a few, especially if you're interested in comparing across multiple authors/texts. Understand whether and how a tool can make that distinction.

Say I want to find out how many people tweeted negative statements about "climate change" in 2020. I decide to save 100 tweets in one Excel file, where every row records a different person's tweet. I decide to use Voyant and upload the file. I discover a problem: Voyant assumes one file = one author. If I want Voyant to identify and compare across different authors, I need to save every tweet (i.e. every row) in a separate Excel file and upload them all. Or, I could use a different tool, like Orange, which can distinguish multiple authors in one digital file.
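
Saving every row by hand gets tedious quickly. As a rough sketch of how the splitting could be automated, the Python below reads a spreadsheet that has been exported to CSV and writes each tweet to its own plain-text file. It assumes your tool accepts plain-text uploads; the column name "tweet", the output file names, and the function name are hypothetical.

```python
import csv
from pathlib import Path


def split_rows_to_files(csv_path, out_dir, text_column="tweet"):
    """Write the text of each CSV row to its own .txt file.

    Returns the list of files that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        # DictReader uses the first row of the file as column headings
        for i, row in enumerate(csv.DictReader(f), start=1):
            text = (row.get(text_column) or "").strip()
            if not text:
                continue  # skip rows with no tweet text
            path = out_dir / f"tweet_{i:03d}.txt"
            path.write_text(text, encoding="utf-8")
            written.append(path)
    return written
```

For example, split_rows_to_files("tweets.csv", "tweets_split") would create tweet_001.txt, tweet_002.txt, and so on inside a tweets_split folder, ready to upload as separate documents.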

Clean your Data

"Cleaning" means prepping the dataset -- removing extraneous rows and columns or information; deleting problematic icons and symbols; reformatting dates and times; adding column headings.

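If you are comfortable with a little Python, the cleaning steps above can be sketched roughly as follows. This is an illustration under assumptions, not a recipe: the column headings, the month/day/year input format, and the set of characters to keep are all hypothetical choices you would adjust for your own dataset.

```python
import re
from datetime import datetime

HEADINGS = ["date", "author", "text"]  # hypothetical column headings


def clean_text(text):
    """Remove icons and stray symbols, then tidy up the spacing."""
    # Keep letters, digits, underscores, whitespace, and some common punctuation;
    # everything else (emoji, icons, other symbols) is dropped.
    text = re.sub(r"[^\w\s.,!?'\"#@-]", "", text)
    # Collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()


def clean_date(raw, input_format="%m/%d/%Y"):
    """Reformat a date such as 3/7/2020 as 2020-03-07."""
    return datetime.strptime(raw.strip(), input_format).strftime("%Y-%m-%d")


def clean_rows(rows):
    """Return cleaned rows with a heading row added at the top."""
    cleaned = [HEADINGS]
    for row in rows:
        date, author, text = row[:3]  # ignore any extraneous columns
        text = clean_text(text)
        if not text:
            continue  # drop rows left empty after cleaning
        cleaned.append([clean_date(date), author.strip(), text])
    return cleaned
```

Each function handles one cleaning task, so you can rerun just the step that needs adjusting.
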
Not sure where to start? Try uploading your dataset to whatever text analysis tool you chose and look at the results. If the tool "analyzes" text you're not interested in (e.g. page numbers, words like "chapter", etc.), that tells you what still needs to be removed.
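
A quick word count can surface the same kind of noise before you upload anything. Here is a minimal Python sketch; the function name top_terms is made up for illustration, and a real text analysis tool will split text into words in its own way.

```python
import re
from collections import Counter


def top_terms(text, n=10):
    """Return the n most frequent tokens in the text with their counts."""
    # Lowercase the text and keep runs of letters, digits, and apostrophes
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens).most_common(n)
```

If words like "chapter" or bare numbers show up near the top of the list, they are good candidates for removal.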

The library can also help! Try an Excel workshop. There are also several librarians who can help you: Ford Fishman, Margarita Corral, and Natalie Susmann.

And remember -- text analysis is an iterative process. You will likely clean your data, upload it, and then discover more cleaning is needed. This happens to everyone!