Chapter 3 – Clustering
In this chapter, we introduce a form of unsupervised learning: clustering. We start with a multi-part lecture on the basic concepts of clustering and the general approach to it, followed by tutorials on the two main forms of clustering: hierarchical clustering and k-means clustering.
For those of you in Sociology 1205, this chapter covers the Doing Data Science module for Unit 3. You will find the scripts for the tutorials and exercise in our RStudio Cloud workspace.
Learning Objectives
In this chapter, we cover the following topics:
- understanding unsupervised learning and its purposes;
- understanding clustering as a form of unsupervised learning;
- understanding the concept of clusters;
- applying the concepts of distance to clustering;
- understanding the importance of scaling data;
- comparing hierarchical clustering and k-means clustering;
- applying the steps in the clustering process in R;
- interpreting the results of the clustering outputs in R.
Lecture – Part 1
Lecture – Part 2
Lecture – Part 3
Tutorial 1 – Part 1
Tutorial 1 – Part 2
Tutorial 2 – Part 1
Tutorial 2 – Part 2
Key functions used in this chapter
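- scale(): the function that puts multiple variables on a common scale (by default, it centers each variable and divides it by its standard deviation);
- dist(): the function that takes a dataset and computes a matrix of distances between its rows;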
- hclust(): the function that runs a hierarchical clustering algorithm;
- NbClust(): the function (from the NbClust package) that compares multiple indices to suggest the best number of clusters;
- barplot(): the function that creates a bar plot;
- cutree(): the function that cuts a hierarchical clustering tree into groups, given either the desired number of groups or a cut height;
- aggregate(): the function that splits the data into subgroups and computes descriptive statistics for each subgroup;
- rect.hclust(): the function that draws rectangles around the clusters on a dendrogram;
- kmeans(): the function that runs a k-means clustering algorithm;
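- round(): the function that rounds numbers to a specified number of decimal places;
- adjustedRandIndex(): the function (from the mclust package) that computes the adjusted Rand index, a measure of agreement between two sets of cluster assignments.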
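To see how these functions fit together, here is a minimal sketch of one possible workflow. It is not the tutorial script itself. It assumes the built-in USArrests dataset, a four-cluster solution, complete linkage, and object names such as arrests_scaled and hc_groups, all chosen purely for illustration. NbClust() is left out of the sketch because it needs the NbClust package, and the adjustedRandIndex() step only runs if the mclust package is installed.

```r
# Built-in dataset used only for illustration (an assumption, not the course data).
data(USArrests)

# 1. Put all variables on a common scale (mean 0, standard deviation 1).
arrests_scaled <- scale(USArrests)

# 2. Compute the Euclidean distance between every pair of rows (states).
arrests_dist <- dist(arrests_scaled, method = "euclidean")

# 3. Run hierarchical clustering on the distance matrix.
#    Complete linkage is an illustrative choice.
hc <- hclust(arrests_dist, method = "complete")

# 4. Plot the dendrogram and outline a 4-cluster solution (k = 4 is assumed).
plot(hc, cex = 0.6)
rect.hclust(hc, k = 4, border = "red")

# 5. Cut the tree into 4 groups and plot the number of states in each group.
hc_groups <- cutree(hc, k = 4)
barplot(table(hc_groups), xlab = "Cluster", ylab = "Number of states")

# 6. Describe each cluster by the mean of the original (unscaled) variables.
hc_profile <- aggregate(USArrests, by = list(cluster = hc_groups), FUN = mean)
print(round(hc_profile, 1))

# 7. Run k-means with the same number of clusters.
#    set.seed() makes the random starting centers reproducible.
set.seed(123)
km <- kmeans(arrests_scaled, centers = 4, nstart = 25)
print(round(km$centers, 2))

# 8. Compare the two solutions with a cross-tabulation.
print(table(hierarchical = hc_groups, kmeans = km$cluster))

# 9. Optionally quantify the agreement (requires the mclust package).
if (requireNamespace("mclust", quietly = TRUE)) {
  print(mclust::adjustedRandIndex(hc_groups, km$cluster))
}
```

Cluster labels are arbitrary, so cluster 1 from hclust() need not correspond to cluster 1 from kmeans(). This is why the comparison uses a cross-tabulation and the adjusted Rand index rather than matching labels directly.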