Research Guides: *Linguistics and Computational Linguistics: Corpora & Text Mining

Linguistic Corpora

Child Language Data Exchange System (CHILDES)
Database of transcribed audio recordings of conversations with children.
Corpus of Contemporary American English (COCA)
Contains more than 560 million words of text (20 million words for each year from 1990 to 2017), divided equally among spoken, fiction, popular magazine, newspaper, and academic texts.
Corpus of Historical American English (COHA)
Contains more than 400 million words of text from the 1810s to the 2000s. It is the largest structured corpus of historical English.
Michigan Corpus of Academic Spoken English (MICASE)
An online, searchable collection of transcripts of academic speech events recorded at the University of Michigan.
Santa Barbara Corpus of Spoken American English
Corpus of spoken English. Includes transcriptions, audio, and timestamps which correlate transcription and audio at the level of individual intonation units.
Linguistic Data (re3data.org)
Data repository that links to external sites that have linguistic corpora and other data.

HathiTrust Research Center

HathiTrust Research Center
Provides tools for text mining the 16 million+ volumes in the HathiTrust. The collection spans the history of printed text, primarily in English, but also in German, French, Spanish, and Russian, among over 400 other languages. The collection contains both fiction, from early novels to present-day works, and nonfiction, including a robust government documents collection.

HathiTrust Digital Library
Digital repository of the immense collections of many universities and institutions, full text searchable.

JSTOR Data for Research

The JSTOR Data for Research service provides the public with data and text mining access to JSTOR content.

JSTOR Data for Research
Site offers faceted searching, topic modeling, and data visualization tools. Researchers can view and download document-level datasets that may include n-grams, metadata, word frequencies, citations, and full text. Datasets of up to 25,000 documents (metadata and/or n-grams only) can be created using the self-service option. Larger datasets and full-text datasets can be obtained by special request.
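
For readers new to the terms, word frequencies and n-grams are simple counts over the words of a text. The following is a minimal Python sketch of how such counts can be computed from plain text. It is an illustration only: it does not read or reproduce the JSTOR dataset format, and the function names and the sample sentence are made up for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sample text, not taken from any corpus.
sample = "The corpus is large. The corpus is searchable."
tokens = tokenize(sample)

unigrams = ngram_counts(tokens, 1)  # word frequencies
bigrams = ngram_counts(tokens, 2)   # two-word sequences

print(unigrams.most_common(3))
print(bigrams.most_common(2))
```

This simple tokenizer ignores sentence boundaries, so a bigram can span the end of one sentence and the start of the next.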

ScienceDirect & Scopus

Text mining access to content in our ScienceDirect and Scopus databases is available through an API.

Elsevier: Text and Data Mining
Site includes some overviews on text mining, video tutorials, and information on how to text mine content from Elsevier databases.

Chronicling America

The Library of Congress's Chronicling America collection provides access to information about historic newspapers and select digitized newspapers in the United States. The Library of Congress offers several different views of this data, all of which are publicly visible. Each uses common web protocols, and access is not restricted in any way: you do not need to apply for an API key to use them.

Library of Congress: Chronicling America API
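
Because no API key is needed, a search can be expressed as an ordinary URL. The Python sketch below shows one way this might look. The endpoint path, the parameter names (`andtext`, `format`, `page`, `rows`), and the JSON field names (`items`, `title`) are assumptions based on the service's historical search interface and should be checked against the current Library of Congress documentation. The sample response is hand-written for illustration and is not real data.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for full-text page search; verify against current documentation.
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(terms, page=1, rows=20):
    """Build a search URL that asks for JSON results (parameter names are assumptions)."""
    params = {"andtext": terms, "format": "json", "page": page, "rows": rows}
    return BASE_URL + "?" + urlencode(params)


def extract_titles(payload):
    """Return the 'title' of each result in a JSON response (field names are assumptions)."""
    data = json.loads(payload)
    return [item.get("title", "") for item in data.get("items", [])]


url = build_search_url("women suffrage")
print(url)

# Hand-written stand-in for a response body, used so the sketch runs offline.
fake_response = '{"items": [{"title": "Example Gazette"}, {"title": "Sample Herald"}]}'
print(extract_titles(fake_response))
```

To query the live service, the URL could be passed to `urllib.request.urlopen` and the response body handed to `extract_titles`.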

Text Creation Partnership Corpora

The Text Creation Partnership has created standardized and accurate XML/SGML-encoded editions of early printed books from ProQuest’s Early English Books Online, Gale Cengage’s Eighteenth Century Collections Online, and Readex’s Evans Early American Imprints.

Early English Books Online - TCP
The Text Creation Partnership has produced approximately 73,000 accurate, searchable, full-text transcriptions of early printed books from England, Ireland, Scotland, Wales, and British North America, along with works in English printed elsewhere, from 1473 to 1700.
Eighteenth Century Collections Online - TCP
English-language and foreign-language titles printed in the United Kingdom during the 18th century, along with works from the Americas. About 2% (3,000 titles) of the total ECCO collection is available for text mining.
Evans Early American Imprints Collection - TCP
About 5,000 texts are available.
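
Since the TCP editions are XML-encoded, a common first step in text mining them is to strip the markup and keep the running text. The Python sketch below shows this on a small hand-written snippet. The element names (`TEI`, `title`, `p`, `hi`) are assumptions modeled on TEI-style markup, and the snippet is not an actual TCP file.

```python
import xml.etree.ElementTree as ET

# Hand-written, TEI-style sample; real TCP files are larger and may differ in structure.
SAMPLE = """<TEI>
  <teiHeader><fileDesc><titleStmt><title>A Sample Title</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>First paragraph of the <hi>sample</hi> text.</p>
    <p>Second paragraph.</p>
  </body></text>
</TEI>"""


def extract_paragraphs(xml_string):
    """Return the plain text of each <p> element, with inline markup removed."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for p in root.iter("p"):
        # itertext() also yields text nested inside child elements such as <hi>.
        text = " ".join("".join(p.itertext()).split())
        paragraphs.append(text)
    return paragraphs


title = ET.fromstring(SAMPLE).findtext(".//title")
paragraphs = extract_paragraphs(SAMPLE)

print(title)
for paragraph in paragraphs:
    print(paragraph)
```

If a file declares an XML namespace, the tag names passed to `iter` and `findtext` need to be namespace-qualified, for example `{namespace-uri}p`.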