Chapter 8 – Working with Text

Christine Monnier

In this last chapter, we will expand on the work we did in chapter 7 and move from simple character vectors to more complex text data. Here, we will work with entire texts, from political speeches to works of literature. Working with full documents, as opposed to ready-made vectors, will require a lot more cleaning and pre-processing than we have done so far. We will use the tidytext package to help us get to tidy text, and we will go beyond pattern-matching.

We will also use data extracted from the Gutenberg Project, a website well worth your time, where you can find the full text of literary works that are in the public domain.

For those of you in Sociology 1205, this chapter covers the unit 8 data wrangling module. You will find the tutorial and exercise scripts in your RStudio Cloud environment.

Learning Objectives

In this chapter, we will cover the following topics:

pre-processing steps in text mining;
turning a document into a corpus;
tidying a corpus;
creating a term document matrix;
creating a wordcloud;
finding works from the Gutenberg Project;
extracting and downloading works from the Gutenberg Project;
text mining works of literature.

Part 1 – Preprocessing steps to tidy text

As mentioned, tidying text data often involves multiple steps. In addition, preparing data in order to generate a word cloud requires even more preprocessing. This is described in the tutorial below:
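
As a rough illustration of these preprocessing steps, the sketch below builds a corpus from a short, made-up character vector (a stand-in for a file loaded with read_lines()) and cleans it with tm_map(). The sample sentences and the order of the cleaning steps are assumptions chosen for illustration, not necessarily those used in the tutorial.

```r
library(tm)

# Hypothetical three-line text standing in for a file loaded with read_lines()
speech <- c("We the People, in 2024, will build a better future!",
            "The future belongs to the people who prepare for it.",
            "Prepare, build, and   lead.")

# Turn the character vector into a corpus (one document per element)
corpus <- Corpus(VectorSource(speech))

# Apply the cleaning steps one at a time with tm_map()
corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra spaces

# Display the cleaned documents
inspect(corpus)
```

Depending on the version of tm installed, tm_map() may print a warning about dropped documents when used on a corpus created this way; the cleaned text is still returned.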

Part 2 – Producing a word cloud
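
A minimal sketch of how a word cloud can be produced, assuming the text has already been cleaned: the corpus is converted to a term-document matrix, the matrix is summed into word frequencies, and the resulting data frame is passed to wordcloud2(). The three short documents and the object names (docs, word_freq, cloud) are hypothetical.

```r
library(tm)
library(wordcloud2)

# Hypothetical, already-cleaned text (lower case, no punctuation, no stop words)
docs <- c("people build future",
          "future belongs people prepare",
          "prepare build lead")
corpus <- Corpus(VectorSource(docs))

# Rows are terms, columns are documents
tdm <- TermDocumentMatrix(corpus)

# Sum each row to get the total frequency of each term
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = freq, row.names = NULL)

# wordcloud2() expects a data frame with the words first and the frequencies second
cloud <- wordcloud2(word_freq, size = 0.5)
cloud
```

With a real speech or book, the data frame would contain many more rows, and the size argument may need adjusting so that the most frequent words fit in the viewer.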

Part 3 – Additional text mining from a term document matrix
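
Once a term-document matrix exists, it can be queried directly. The sketch below, built on the same kind of made-up toy documents, shows the three tm functions listed at the end of this chapter; the thresholds (lowfreq, n, corlimit) are arbitrary values chosen for illustration.

```r
library(tm)

# Hypothetical, already-cleaned documents
docs <- c("people build future",
          "future belongs people prepare",
          "prepare build lead")
tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))

# Terms that appear at least twice across all documents
frequent <- findFreqTerms(tdm, lowfreq = 2)
frequent

# The two most frequent terms in each document
findMostFreqTerms(tdm, n = 2)

# Terms whose correlation with "people" is at least 0.5
assoc <- findAssocs(tdm, terms = "people", corlimit = 0.5)
assoc
```

With only three tiny documents the correlations are not meaningful; on a full text, findAssocs() gives a first idea of which words tend to appear in the same documents.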

Part 4 – Text tidying and text mining for literary work

In this second tutorial, we will use the tidytext and gutenbergr packages to tidy and mine works from the Gutenberg Project.
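
The tidytext approach can be sketched on a small scale before applying it to a whole book. In the example below, a made-up two-row tibble stands in for a downloaded work; unnest_tokens() splits the text into one word per row, and anti_join() removes the stop words listed in the stop_words dataset that ships with tidytext. The object names (book, tidy_book) are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line excerpt standing in for a downloaded book
book <- tibble(
  line = 1:2,
  text = c("It is a truth universally acknowledged,",
           "that a single man in possession of a good fortune, must be in want of a wife.")
)

tidy_book <- book %>%
  unnest_tokens(word, text) %>%        # one lower-case word per row
  anti_join(stop_words, by = "word")   # keep only words that are not stop words

# Count the remaining words, most frequent first
tidy_book %>%
  count(word, sort = TRUE)
```

Note that unnest_tokens() lowers the case and strips punctuation by default, so several of the cleaning steps from the first tutorial happen in a single call.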

Part 5 – Finding and extracting text data

In this second part of the tutorial, we will learn how to find and extract works from the Gutenberg Project.
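
As a hedged sketch of this workflow, the code below searches the metadata bundled with gutenbergr for works by one author, then downloads a single work by its ID. The choice of author and work is an assumption made for illustration. The download step needs an internet connection and depends on the availability of a mirror, so it is wrapped in tryCatch() here and returns NULL if it fails.

```r
library(dplyr)
library(stringr)
library(gutenbergr)

# Search the metadata that ships with the package (no download needed)
austen <- gutenberg_works(str_detect(author, "Austen"))
austen %>%
  select(gutenberg_id, title)

# Download one work by its ID (1342 is "Pride and Prejudice").
# This step requires an internet connection; NULL is returned on failure.
pride <- tryCatch(
  gutenberg_download(1342),
  error = function(e) NULL
)

# Each row of the result is one line of the book
if (!is.null(pride)) head(pride)
```

The resulting data frame has a text column, which is what unnest_tokens() needs in order to tokenize the book.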

Part 6 – One last example

Key functions used in this chapter

First tutorial (Parts 1 to 3)

read_lines(): the function that loads a text dataset into the environment;
Corpus(): the function that turns a text into a corpus;
inspect(): the function that displays information about a corpus;
tm_map(): the function that applies transformations to a corpus;
TermDocumentMatrix(): the function that constructs a term-document matrix from a corpus;
wordcloud2(): the function that creates a word cloud from a text data frame;
WCtheme(): the function that applies a theme to a word cloud;
findFreqTerms(): the function that finds frequent terms in a document;
findMostFreqTerms(): the function that finds the most frequent terms in a document and returns a vector;
findAssocs(): the function that finds associations between words and computes their correlation coefficient.

Second tutorial (Parts 4 to 6)

gutenberg_download(): the function that downloads works from the Gutenberg Project;
unnest_tokens(): the function that tokenizes a column;
anti_join(): the function that returns all the items in one dataset that have no match in another dataset;
gutenberg_works(): the function that finds works on the Gutenberg Project;
gutenberg_metadata: the dataset (not a function, so it is used without parentheses) that contains metadata about works on the Gutenberg Project;
theme_few(): the function that applies a theme to a ggplot, based on the aesthetics of Stephen Few;
theme_tufte(): the function that applies a theme to a ggplot, based on the aesthetics of Edward Tufte;
theme_hc(): the function that applies a theme to a ggplot, based on the Highcharts aesthetics.
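
The three theme functions above come from the ggthemes package and are added to a ggplot like any other layer. The sketch below uses made-up word counts (the words and numbers are purely illustrative, not results from any text) to show the same bar chart under each theme.

```r
library(ggplot2)
library(ggthemes)

# Hypothetical word counts, shaped like the output of count(word, sort = TRUE)
word_counts <- data.frame(
  word = c("love", "time", "house", "day"),
  n    = c(12, 9, 7, 4)
)

# Base plot: horizontal bars ordered by frequency
p <- ggplot(word_counts, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of occurrences")

# The same plot with each of the three themes
p + theme_few()
p + theme_tufte()
p + theme_hc()
```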

License

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License