K-Means and Hierarchical Clustering Techniques

Vignette on implementing K-Means and Hierarchical clustering methods using California housing data; created as a class project for PSTAT197A in Fall 2023.

Contributors

Manuri Alwis, Jade Thai, Jennifer Rink, Kayla Katakis, and Kylie Maeda

Vignette Abstract

Clustering is an unsupervised machine learning method that identifies and groups similar observations in a dataset without using an outcome variable. It is typically used to organize data into groups that are easier to understand and work with. The data used in this vignette is California housing data from 1990, with one observation for each district in the state. The variables include median income, median house age, average number of rooms, average number of bedrooms, population, average occupancy, latitude, longitude, and median house value. The objective of the vignette is to learn about different clustering methods by applying them to the California housing data. This project covers two types of clustering: k-means and hierarchical. K-means partitions the data into k clusters in an attempt to minimize the variance within each cluster, while hierarchical clustering builds a hierarchy of nested clusters by grouping similar observations. The analysis found that hierarchical clustering was not well suited to a dataset of this size, even after applying the dimensionality-reduction technique Principal Component Analysis (PCA): many clusters overlapped and were difficult to distinguish in two-dimensional space. With k-means, the clusters formed were representative of existing economic and geographic regions of California. For example, a cluster concentrated in the coastal regions of the Bay Area, Los Angeles, and San Diego had the most expensive houses and the highest median income. Overall, clustering methods are a highly valuable tool in data analysis when used in the right scenarios, and an in-depth analysis of these methods is the basis of this vignette.
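As a rough illustration of the k-means idea described in the abstract, the sketch below alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its members. It is written in plain Python rather than the R used in the vignette scripts, and it is not the vignette's code: the function names (`kmeans`, `within_cluster_ss`), the `init` parameter, and the toy points are all assumptions made for this example, and none of the values come from housing.csv.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids).

    `init` is an optional list of k starting centroids (hypothetical
    parameter for this sketch); if omitted, k points are sampled at random.
    """
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = mean_point(members)
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Total within-cluster sum of squares, the quantity k-means tries to reduce."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Toy data: two well-separated groups (made-up values, not from housing.csv).
toy = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(toy, k=2, init=[toy[0], toy[3]])
total_ss = within_cluster_ss(toy, labels, centroids)
```

With the starting centroids fixed to one point from each group, the first three points end up in one cluster and the last three in the other. On real data the result depends on the starting centroids and on the scale of each variable, so implementations are usually run from several random starts.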

Repository Contents

In this repository, we have our main vignette, which contains our data exploration and clustering methods. We also have 3 sub-folders:

- data: contains the data used. Files include housing.csv and HCLUST_bootstrap.rds, a bootstrapping object for the hierarchical clustering method
- scripts: contains one script, vignette-script.R, with all the code used in the vignette markdown and line annotations
  - drafts: two scripts used to carry out the clustering techniques, k-means.R and hierarchical.R
- images: contains .png files of plots and images used in the final vignette

The repository also contains this README.md file, which contains the description, contributors, abstract, contents, and references for this project.

Reference List

California Housing Dataset: https://www.kaggle.com/datasets/camnugent/california-housing-prices

K-Means Clustering:

(LEDU), Education Ecosystem. "Understanding K-Means Clustering in Machine Learning." Medium, Towards Data Science, 12 Sept. 2018, https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1

"ML Clustering: When to Use Cluster Analysis, When to Avoid It." Explorium, 6 Aug. 2023, www.explorium.ai/blog/machine-learning/clustering-when-you-should-use-it-and-avoid-it/#.

Hierarchical Clustering:

Bock, Tim. "What Is Hierarchical Clustering?" Displayr, 13 Sept. 2022, www.displayr.com/what-is-hierarchical-clustering/#:~:text=Hierarchical%20clustering%2C%20also%20known%20as,broadly%20similar%20to%20each%20other.

Patlolla, Chaitanya Reddy. "Understanding the Concept of Hierarchical Clustering." Medium, Towards Data Science, 29 May 2020, https://towardsdatascience.com/understanding-the-concept-of-hierarchical-clustering-technique-c6e8243758ec
