Chapter 8 – Working with Text
In this last chapter, we will build on the work we did in Chapter 7 and move from simple character vectors to more complex text data. Here, we will work with entire texts, from political speeches to works of literature. Working with full documents, as opposed to ready-made vectors, requires a lot more cleaning and pre-processing than we have done so far. We will use the tidytext package to help us get the text into a tidy format. We will also go beyond pattern matching.
We will also use data extracted from Project Gutenberg, a website well worth your time, where you can find the full text of literary works that are in the public domain.
For those of you in Sociology 1205, this chapter covers the unit 8 data wrangling module. You will find the tutorial and exercise scripts in your RStudio Cloud environment.
Learning Objectives
In this chapter, we will cover the following topics:
- pre-processing steps in text mining;
- turning a document into a corpus;
- tidying a corpus;
- creating a term document matrix;
- creating a wordcloud;
- finding works from Project Gutenberg;
- extracting and downloading works from Project Gutenberg;
- text mining works of literature.
Part 1 – Preprocessing steps to tidy text
As mentioned, tidying text data often involves multiple steps. In addition, preparing data in order to generate a word cloud requires even more preprocessing. This is described in the tutorial below:
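As a rough companion to the tutorial, here is a minimal sketch of the usual cleaning steps with the tm package. The two sentences in `speech` are made up for illustration (in the tutorial, the text would come from a file loaded with read_lines()), and the order of the steps is one common choice, not the only one.

```r
library(tm)

# Two toy sentences standing in for the text loaded in the tutorial (assumption)
speech <- c("We the People, in 2024, will build a better future!",
            "The future belongs to the people who prepare for it.")

# Turn the character vector into a corpus, one document per element
corpus <- Corpus(VectorSource(speech))

# Apply the cleaning steps one at a time with tm_map()
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case everything
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop common English words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse leftover spaces

# Pull the cleaned text back out as a character vector
cleaned <- content(corpus)
print(cleaned)
```

Depending on your version of tm, tm_map() may print a warning about dropped documents here; the cleaned text is still returned.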
Part 2 – Producing a word cloud
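A word cloud needs a data frame with one row per word and a column of frequencies. The sketch below shows one way to get there from a term-document matrix. The two toy documents and the names `docs` and `word_freq` are assumptions made for illustration.

```r
library(tm)
library(wordcloud2)

# Two toy documents (assumption); real text would be cleaned first
docs <- c("data science needs data", "text data needs cleaning")

# Build a term-document matrix: one row per term, one column per document
tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))

# Add up each term's counts across documents and sort from most to least frequent
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# wordcloud2() expects a data frame with the words first and the frequencies second
word_freq <- data.frame(word = names(freq), freq = freq, row.names = NULL)

# Create the word cloud; type `cloud` at the console to display it
cloud <- wordcloud2(word_freq)
```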
Part 3 – Additional text mining from a term document matrix
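Once we have a term-document matrix, we can ask which terms are frequent and which terms tend to appear together. Here is a small sketch using four made-up documents; the thresholds (`lowfreq = 2`, `corlimit = 0.9`) are arbitrary choices for this example.

```r
library(tm)

# Four toy documents (assumption)
docs <- c("tax cuts boost economy",
          "tax reform boosts economy",
          "health care reform",
          "health care costs")

tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))

# Terms that occur at least twice across all documents
frequent <- findFreqTerms(tdm, lowfreq = 2)
print(frequent)

# Terms whose correlation with "tax" across documents is at least 0.9
assoc <- findAssocs(tdm, terms = "tax", corlimit = 0.9)
print(assoc$tax)
```

In this toy example, "economy" appears in exactly the same documents as "tax", so it is the only term that clears the 0.9 threshold.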
Part 4 – Text tidying and text mining for literary work
In this second tutorial, we will use the tidytext and gutenbergr packages to tidy and mine works from Project Gutenberg.
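To give a sense of what tidying looks like, here is a minimal sketch that tokenizes a small text, removes stop words and counts what is left. The two sentences in `book` are made up so that the example runs without downloading anything; a real work would have many more lines.

```r
library(dplyr)
library(tidytext)

# A toy "book" with one row per line of text (assumption)
book <- tibble(line = 1:2,
               text = c("The whale sank the ship.",
                        "The ship chased the whale."))

tidy_book <- book %>%
  unnest_tokens(word, text) %>%          # one lower-case word per row
  anti_join(stop_words, by = "word")     # drop rows that match a stop word

# Count the remaining words, most frequent first
word_counts <- tidy_book %>%
  count(word, sort = TRUE)
print(word_counts)
```

The result has one row per word, which is the tidy format that the rest of the tidyverse (dplyr, ggplot2) expects.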
Part 5 – Finding and extracting text data
In the second part of this tutorial, we will learn how to find and extract works from Project Gutenberg.
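Here is a short sketch of the two steps: finding works in the metadata that ships with gutenbergr, then downloading one of them by its ID. The choice of Jane Austen is only an example. The download line is commented out because it needs an internet connection.

```r
library(dplyr)
library(gutenbergr)

# Search the metadata bundled with the package (no download needed)
austen <- gutenberg_works(author == "Austen, Jane")

# Look at the IDs and titles that were found
austen %>%
  select(gutenberg_id, title) %>%
  print()

# Download one work using its gutenberg_id (1342 is Pride and Prejudice);
# uncomment to run with an internet connection
# pride <- gutenberg_download(1342)
```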
Part 6 – One last example
Key functions used in this chapter
Part 1
- read_lines(): the function that loads a text file into the environment, one line per element;
- Corpus(): the function that turns a text into a corpus;
- inspect(): the function that displays information about a corpus;
- tm_map(): the function that applies transformations to a corpus;
- TermDocumentMatrix(): the function that constructs a term-document matrix from a corpus;
- wordcloud2(): the function that creates a word cloud from a data frame of words and their frequencies;
- WCtheme(): the function that applies a theme to a word cloud;
- findFreqTerms(): the function that finds the terms in a term-document matrix that occur at least a given number of times;
- findMostFreqTerms(): the function that finds the most frequent terms in each document and returns them as a named vector of counts (one vector per document);
- findAssocs(): the function that finds associations between words and computes their correlation coefficient.
Part 2
- gutenberg_download(): the function that downloads works from Project Gutenberg;
- unnest_tokens(): the function that tokenizes a column;
- anti_join(): the function that returns all the rows in one dataset that have no match in another dataset;
- gutenberg_works(): the function that finds works on Project Gutenberg;
- gutenberg_metadata: the dataset (not a function, so no parentheses) that holds metadata about works on Project Gutenberg;
- theme_few(): the function that applies a theme to a ggplot, based on the aesthetics of Stephen Few;
- theme_tufte(): the function that applies a theme to a ggplot, based on the aesthetics of Edward Tufte;
- theme_hc(): the function that applies a theme to a ggplot, based on the aesthetics of Highcharts.