Chapter 7 – Working with Character Strings
In this chapter, we are shifting gears considerably. Instead of working with datasets that need to be tidied with the tools we learned in dplyr and tidyr, we turn our attention to character strings, since a lot of data comes in the form of text of some kind. We need specific tools to work with text, words, and character strings. This is where the stringr package, part of the tidyverse, comes in handy.
For those of you in Sociology 1205, this chapter covers the unit 7 data wrangling module. You will find the tutorial and exercise scripts in your R Studio Cloud environment.
Character strings usually need quite a bit of cleaning up before they are tidy. In this chapter, we will work with some simple character strings to get started with what is called text mining. We will also be introduced to regular expressions, or regex.
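As a first taste of what a regular expression looks like, here is a minimal sketch. The vector `words` is made-up example data, not taken from the course scripts, and the sketch assumes the stringr package is installed.

```r
library(stringr)

# Hypothetical example data (an assumption for illustration only)
words <- c("apple", "banana", "cherry", "avocado")

# "^a" is a regular expression: "^" anchors the match to the start of the string,
# so this asks "does the string begin with the letter a?"
starts_with_a <- str_detect(words, "^a")
starts_with_a
# TRUE FALSE FALSE TRUE

# Without an anchor, the pattern can match anywhere inside the string
contains_an <- str_subset(words, "an")
contains_an
# "banana"
```

The pattern is written as an ordinary character string; the special characters inside it (here `^`) are what make it a regular expression.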
Learning Objectives
In this chapter, we will cover the following topics:
- introduction to the core concepts of text mining;
- introduction to the key functions of the stringr package;
- introduction to regular expressions (regex);
- pattern-matching within character strings;
Part 1 – Introduction to text mining
Part 2 – Introduction to stringr and regular expressions
Part 3 – Text mining practice 1
Part 4 – Text mining practice 2
Part 5 – Text mining practice 3
Part 6 – Text mining practice 4
Part 7 – Text mining practice 5
Key functions used in this chapter
- str_c(): the function that joins multiple strings into one;
- length(): the function that returns the length of a vector;
- str_length(): the function that returns the number of characters in every element of a character vector;
- which.min(): the function that returns the index of the smallest element in a vector;
- which.max(): the function that returns the index of the largest element in a vector;
- mean(): the function that returns the mean of a vector;
- median(): the function that returns the median of a vector;
- hist(): the function that plots a histogram;
- boxplot(): the function that plots a boxplot;
- str_sub(): the function that extracts parts of a string;
- str_detect(): the function that finds patterns to match and returns a logical vector of the results;
- str_subset(): the function that returns the elements of a character vector that contain a match for a pattern;
- str_view(): the function that displays the strings in a vector (in the viewer in R Studio), highlighting the parts that match a pattern;
- str_count(): the function that counts the number of matches of a pattern in each element of a vector;
- str_to_upper(): the function that converts all the characters in a vector to upper case;
- str_to_lower(): the function that converts all the characters in a vector to lower case;
- str_to_title(): the function that capitalizes the first letter of each word in a vector;
- str_split(): the function that splits a character string into pieces wherever a pattern matches;
- str_replace(): the function that replaces the first match of a pattern in each character string;
- barplot(): the function that creates a barplot;
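To make the list above more concrete, here is a minimal sketch that calls several of these functions on a tiny vector. The vector `fruit` is made-up example data, not taken from the course scripts, and the sketch assumes the stringr package is installed. `str_replace_all()` is not in the list above; it is included only to contrast with `str_replace()`.

```r
library(stringr)

# Hypothetical example vector (an assumption for illustration only)
fruit <- c("apple", "banana", "kiwi")

length(fruit)                         # number of elements: 3
n_chars <- str_length(fruit)          # characters per element: 5 6 4
longest <- fruit[which.max(n_chars)]  # element with the most characters: "banana"
mean(n_chars)                         # 5

joined <- str_c(fruit, collapse = ", ")  # "apple, banana, kiwi"
str_sub(fruit, 1, 3)                     # "app" "ban" "kiw"

str_detect(fruit, "an")               # FALSE TRUE FALSE
str_subset(fruit, "an")               # "banana"
a_counts <- str_count(fruit, "a")     # matches per element: 1 3 0

str_to_upper("kiwi")                  # "KIWI"
str_to_title("text mining")           # "Text Mining"

first_only <- str_replace("banana", "a", "o")      # "bonana" (first match only)
all_a      <- str_replace_all("banana", "a", "o")  # "bonono" (every match)

pieces <- str_split("a,b,c", ",")     # a list holding c("a", "b", "c")
```

Note that `length()` counts the elements of the vector, while `str_length()` counts the characters inside each element; the two are easy to confuse.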
Before moving on to the next chapter or the exercise (for those of you in Sociology 1205), test your understanding with this quiz.