---
title: "kmeans Clustering"
author: "Bert Gollnick"
output:
  html_document:
    toc: true
    toc_depth: 2
    toc_float: true
    code_folding: hide
    number_sections: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(stats))
suppressPackageStartupMessages(library(readr))
```
# Data Understanding
We work with a dataset of zoo animals and want to cluster the animals by their attributes.
More information on the data can be found [here](https://archive.ics.uci.edu/ml/datasets/zoo).
We work with these attributes:
1. animal name
2. hair
3. feathers
4. eggs
5. milk
6. airborne
7. aquatic
8. predator
9. toothed
10. backbone
11. breathes
12. venomous
13. fins
14. legs
15. tail
16. domestic
17. catsize
18. type
Our target variable "type" has 7 classes with integer values of 1 to 7.
These numbers represent these groups:
1 -- (41) aardvark, antelope, bear, boar, buffalo, calf, cavy, cheetah, deer, dolphin, elephant, fruitbat, giraffe, girl, goat, gorilla, hamster, hare, leopard, lion, lynx, mink, mole, mongoose, opossum, oryx, platypus, polecat, pony, porpoise, puma, pussycat, raccoon, reindeer, seal, sealion, squirrel, vampire, vole, wallaby, wolf
2 -- (20) chicken, crow, dove, duck, flamingo, gull, hawk, kiwi, lark, ostrich, parakeet, penguin, pheasant, rhea, skimmer, skua, sparrow, swan, vulture, wren
3 -- (5) pitviper, seasnake, slowworm, tortoise, tuatara
4 -- (13) bass, carp, catfish, chub, dogfish, haddock, herring, pike, piranha, seahorse, sole, stingray, tuna
5 -- (4) frog, frog, newt, toad
6 -- (8) flea, gnat, honeybee, housefly, ladybird, moth, termite, wasp
7 -- (10) clam, crab, crayfish, lobster, octopus, scorpion, seawasp, slug, starfish, worm
# Data Preparation
## Data Import
```{r}
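# if file does not exist, download it first
file_path <- "./data/zoo.csv"
if (!file.exists(file_path)) {
  dir.create("./data", showWarnings = FALSE)
  url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data"
  download.file(url = url,
                destfile = file_path)
}
# the downloaded UCI file has no header row, so read the first line as data
# (switch to header = TRUE if a local copy already carries column names)
zoo_raw <- read.csv(file = file_path, header = FALSE)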
```
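The `zoo.data` file from the UCI repository has no header row, so how `read.csv()` treats the first line decides whether the first animal survives the import. The following minimal sketch illustrates this on two hand-typed, shortened records; `sample_lines`, `with_header` and `without_header` are names made up for this illustration.

```{r}
# Two shortened records in the same comma-separated layout (illustration only)
sample_lines <- c("aardvark,1,0,0,1", "antelope,1,0,0,1")

# Default header = TRUE: the first record is consumed as column names
with_header <- read.csv(text = sample_lines)

# header = FALSE: every record is kept as a data row
without_header <- read.csv(text = sample_lines, header = FALSE)

nrow(with_header)
nrow(without_header)
```

With the default setting the first record is lost, so the row count is one lower than the number of lines in the file.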
## Setting Column Names
```{r}
colnames(zoo_raw) <- c("animal_name", "hair", "feathers", "eggs", "milk",
                       "airborne", "aquatic", "predator", "toothed", "backbone",
                       "breathes", "venomous", "fins", "legs", "tail",
                       "domestic", "catsize", "class_type")
```
## Data Manipulation
```{r}
# summary(zoo_raw)
```
There are no NAs that need to be handled. The data has 17 attributes (the animal name plus 16 descriptive features) and one class attribute.
There are 101 observations, which are distributed across the classes as follows.
```{r}
table(zoo_raw$class_type)
```
```{r}
file_class <- "./data/class.csv"
class_names <- read_csv(file_class) %>% as_tibble()
class_names %>%
  dplyr::select(Class_Number, Class_Type)
```
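kmeans assigns observations to clusters based on distances, so an attribute with a larger value range (here `legs`, the only non-binary attribute) would otherwise dominate the result. For this reason it is important to scale the data.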
```{r}
zoo_scaled <- zoo_raw %>%
  dplyr::select(-animal_name, -class_type) %>%
  scale()
```
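To illustrate why scaling matters for distance-based clustering, here is a small sketch on made-up values (not taken from the zoo data). The names `toy_animals`, `d_raw` and `d_scaled` are hypothetical and only serve this example.

```{r}
# Toy data (made-up values): one binary attribute and a leg count
toy_animals <- data.frame(hair = c(1, 0, 1),
                          legs = c(4, 4, 0),
                          row.names = c("a", "b", "c"))

# Euclidean distances before and after scaling
d_raw    <- as.matrix(dist(toy_animals))
d_scaled <- as.matrix(dist(scale(toy_animals)))

# a and b differ only in hair, a and c differ only in legs
d_raw["a", "b"]
d_raw["a", "c"]
d_scaled["a", "b"]
d_scaled["a", "c"]
```

On the raw values the difference in `legs` outweighs the difference in `hair`. After `scale()` both columns have unit standard deviation, and in this toy case the two differences contribute equally.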
# Model
## Number of Clusters
In this case we have domain knowledge and know that there are seven classes. Assume we did not know this: we want to find out whether the elbow method leads us to the same conclusion.
```{r}
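set.seed(123)
k_max <- 15
# one row per candidate k; wcss is filled in by the loop below
within_cluster_sum_squares <- tibble(k = 1:k_max,
                                     wcss = NA_real_)
for (i in within_cluster_sum_squares$k) {
  # nstart = 10 keeps the best of ten random initialisations for each k
  within_cluster_sum_squares$wcss[i] <- kmeans(x = zoo_scaled[, 1:16], centers = i, nstart = 10)$tot.withinss
}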
```
We visualise the results with **ggplot()**.
```{r}
g <- ggplot(within_cluster_sum_squares, aes(k, wcss))
g <- g + geom_line()
g <- g + scale_x_continuous(breaks = 1:k_max)
g <- g + labs(title = "Elbow Method\nfor Determining optimum Cluster Number")
g <- g + theme_bw()
g
```
Based on this method, four clusters would be the best bet. We stick to seven because of our domain knowledge.
It would actually be interesting to see which classes would be left out if we went with four clusters rather than seven.
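The quantity on the y-axis of the elbow plot, `tot.withinss`, is the sum of squared distances between each observation and the center of its assigned cluster. The sketch below recomputes it by hand on made-up points; `pts`, `fit`, `assigned_centers` and `wcss_by_hand` are hypothetical names. Starting centers are passed as a matrix so that the example does not depend on random initialisation.

```{r}
# Made-up 2-D points forming two obvious groups
pts <- matrix(c(0, 0,
                0, 2,
                10, 10,
                10, 12),
              ncol = 2, byrow = TRUE)

# Passing a matrix of starting centers keeps this sketch deterministic
fit <- kmeans(pts, centers = pts[c(1, 3), ])

# Look up the final center of the cluster each point was assigned to
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]

# Sum of squared distances to those centers
wcss_by_hand <- sum((pts - assigned_centers)^2)

all.equal(wcss_by_hand, fit$tot.withinss)
```

Adding clusters can only lower this value, which is why the elbow method looks for the point where the decrease flattens rather than for the minimum.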
## Model Creation
```{r}
mod_cluster <- kmeans(x = zoo_scaled, centers = 7)
```
`x` specifies the data. `centers` specifies the number of clusters (often called k).
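As a small illustration of these arguments, the following sketch runs `kmeans()` on made-up points and inspects two components of the returned object. The names `toy_points` and `toy_fit` are hypothetical. `nstart` is an optional argument of `kmeans()` that is shown here only as an assumption about how one might stabilise the result.

```{r}
# Made-up points: three near the origin, three near (10, 10)
toy_points <- rbind(c(0, 0), c(0, 1), c(1, 0),
                    c(10, 10), c(10, 11), c(11, 10))

# centers = 2 asks for two clusters; nstart = 5 tries five random starts
# and keeps the solution with the smallest total within-cluster sum of squares
toy_fit <- kmeans(x = toy_points, centers = 2, nstart = 5)

toy_fit$size     # number of points per cluster
toy_fit$cluster  # cluster label assigned to each row
```

The cluster labels themselves are arbitrary and can swap between runs; only the grouping of the rows is meaningful.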
## Model Evaluation
```{r}
zoo_raw$cluster <- mod_cluster$cluster
```
```{r}
g <- ggplot(zoo_raw, aes(y = cluster,
                         x = class_type,
                         col = factor(class_type)))
g <- g + geom_jitter()
g <- g + scale_x_continuous(breaks = 1:7)
g <- g + scale_y_continuous(breaks = 1:7)
g <- g + labs(x = "Actual Classes",
              y = "kmeans Cluster",
              title = "Actual Classes and kmeans Clusters")
g <- g + theme_bw()
g <- g + scale_color_discrete(name = "Actual Classes")
g
```
The numbering of the clusters is arbitrary and need not coincide with the class numbers. What matters is that each kmeans cluster matches one and only one actual class.
We see that it works perfectly for actual class 2: all of its points fall into a single cluster. It also works very well for the other classes.
We can also see it in the table view.
```{r}
table(pred = zoo_raw$cluster, actual = zoo_raw$class_type)
```
It works perfectly for classes 2, 4, and 6. For the other classes there are small errors.
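One common way to condense such a table into a single number is cluster purity: for each cluster, count the observations of its dominant class, then divide the total by the number of observations. This is not something the analysis above computes. It is a minimal sketch on made-up labels, and `toy_actual`, `toy_cluster`, `toy_tab` and `purity` are hypothetical names.

```{r}
# Made-up labels for six observations (not taken from the zoo data)
toy_actual  <- c(1, 1, 1, 2, 2, 3)
toy_cluster <- c(2, 2, 3, 1, 1, 3)

# Rows are clusters, columns are actual classes
toy_tab <- table(pred = toy_cluster, actual = toy_actual)

# For each cluster (row) take the size of its dominant class,
# then divide by the total number of observations
purity <- sum(apply(toy_tab, 1, max)) / sum(toy_tab)
purity
```

A purity of 1 would mean that every cluster contains observations of a single class only. Because the measure looks at the dominant class per cluster, it does not depend on how the clusters are numbered.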
Let's investigate some of the misclassifications.
```{r}
# wrongly predicted values: class 1 animals that were assigned to cluster 6
zoo_raw$animal_name[zoo_raw$class_type == 1 & zoo_raw$cluster == 6]
# correct class - Birds (class 2)
zoo_raw$animal_name[zoo_raw$class_type == 2]
```
# Acknowledgement
Thanks to the author of this paper for providing information on the dataset.
Creator: Richard Forsyth
Donor: Richard S. Forsyth, 8 Grosvenor Avenue, Mapperley Park, Nottingham NG3 5DX, 0602-621676