# Cluster analysis
The purpose of cluster analysis is to **categorize objects into homogeneous groups** so that objects within the same group are more closely related than objects in different groups. It differs from classification in that we have no information on possible existing classes and cannot compare clusters to classes. The assignment of objects in cluster analysis is based on the **differences between observations according to a set of variables**, although it is also possible to cluster observations on a single variable.
See section 14.3 up to subsection 14.3.4 in @hastie_elements_2017 for an introduction.
## Cluster analysis
Clustering methods can be divided into *model-based* (e.g. *mixture models*) and *distance-based* (*combinatorial*) methods. Model-based methods assume that an underlying model can explain the clusters in the data. An example is mixture models, according to which objects are i.i.d. samples originating from populations described by (Gaussian) probability density functions. Objects can then be clustered by estimating these densities from the data. Here, however, we consider only clustering based on distances between objects and clusters.
The distance-based methods can be further divided into *hierarchical clustering* and *partitioning* (*K-means*) methods. While partitioning is intrinsically a *divisive* process, *hierarchical clustering* can be applied *agglomeratively* as well as *divisively*. These two approaches are explained below.
Although the most **common application of cluster analysis is to categorize observations, it can also be applied when the aim is to group variables**. In the latter case distances are computed between variables across the observations, i.e. we treat variables as the objects to be clustered and vice versa.
As with the previous multivariate statistics methods, it is important to **be mindful of the variances** of the variables. Variables with higher variances will have more influence on the resulting clusters. If this is deemed undesirable, the data should be standardized prior to clustering. Conversely, it can sometimes be desirable to deliberately give more weight to some variables.
### Distance measures
The assignment of objects into groups should be such that observations are more similar within groups than between groups. This means that **clustering is done according to measures of *similarities* or *dissimilarities*, i.e. *distances***. For data matrix $X : n \times p$ with $n$ observations and $p$ variables the distances $d$ between objects $i$ and $j$ can be described as a *distance or proximity matrix* $D : n \times n$ where $d_{ij} = d(x_i,x_j)$. Usually the distances between observations in matrix $D$ must be calculated from data matrix $X$ using one of the many distance measures.
There are various measures for expressing similarity but **most common are Euclidean and Manhattan distances**. These can be calculated for numeric variables. *Euclidean distance* expresses a "direct" distance between two points because it is calculated according to the Pythagorean theorem. *Manhattan distance* can be thought of as a "city block" distance. The distances between objects $x_i$ and $x_j$ for all $p$ variables can be calculated respectively as follows:
$$d_{\text{Euclidean}}(x_i,x_j) = \left[\sum^p_{k=1}(x_{ik} - x_{jk})^2\right]^{1/2},$$
$$d_{\text{Manhattan}}(x_i,x_j) = \sum^p_{k=1} |x_{ik} - x_{jk}|.$$
Distance measures for ordinal and nominal variables also exist but are not explained here.
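As a minimal sketch in R, both measures can be computed directly from the rows of a numeric data matrix with the base function `dist()`; the tiny matrix below is only an illustration:

```r
# Pairwise distances between observations (rows of X)
X <- matrix(c(1, 2,
              4, 6,
              0, 0),
            ncol = 2, byrow = TRUE)
dist(X, method = "euclidean")  # e.g. rows 1 and 2: sqrt(3^2 + 4^2) = 5
dist(X, method = "manhattan")  # e.g. rows 1 and 2: |3| + |4| = 7
```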
## Hierarchical clustering
See @hastie_elements_2017 section 14.3.12, up to and including *Agglomerative Clustering*.
Hierarchical clustering is a process where clusters are constructed **incrementally** according to similarities between objects and clusters. Grouping can **start by combining objects into clusters, in which case the process is *agglomerative* ("bottom-up")**. In contrast, **a *divisive* process ("top-down") begins with a single cluster and divides it into smaller clusters and eventually into individual objects**. The agglomerative approach is more common and is the one explained here.
Agglomerative hierarchical clustering has the following steps.
1. The initial number of clusters is $n$, so each cluster contains a single object.
2. Calculate the distance matrix $D$ that contains the pairwise distances between clusters (initially between objects).
3. Find the smallest distance and merge the two closest clusters into a single cluster.
4. Calculate a new distance matrix $D$ that now includes the distances between the new cluster and all other clusters, computed with a linkage method (see below).
5. Repeat the previous two steps until all objects are in a single cluster.
In Jamovi: `snowCluster > Hierarchical clustering method`.
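A minimal sketch of these steps in R, using the built-in `hclust()` on a standardized example dataset (the use of `iris` and of average linkage are only assumptions for illustration):

```r
# Agglomerative hierarchical clustering: distance matrix, then repeated merging
X  <- scale(iris[, 1:4])              # standardize so no variable dominates
D  <- dist(X, method = "euclidean")   # step 2: pairwise distance matrix
hc <- hclust(D, method = "average")   # steps 3-5: merge closest clusters until one remains
head(hc$merge)                        # merge history; negative entries are single objects
head(hc$height)                       # distance at which each merge happened
```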
### Linkage methods
Computing the distance between two objects is simple because each object is a single point in Euclidean space. **Calculating distances becomes more complicated once we have clusters containing more than one point**. There are various approaches to finding the distance between a merged cluster $IJ$ and any other cluster $K$:
- single linkage: $d_{IJ,K} = \min(d_{I,K}, d_{J,K})$;
- complete linkage: $d_{IJ,K} = \max(d_{I,K}, d_{J,K})$;
- average linkage: $d_{IJ,K} = \frac{1}{n_{IJ} n_K} \sum_{i \in IJ} \sum_{k \in K} d_{ik}$;
- Ward's method: compares the within-cluster and between-cluster squared distances and merges the pair of clusters whose merge yields the smallest increase in the within-cluster sum of squares.
Single linkage tends to link objects serially ("chaining"), resulting in clusters with large diameters in which objects within the same cluster may not be very similar. In contrast, complete linkage tends to produce clusters with small diameters, with the result that an object can end up closer to members of another cluster than to some members of its own. Average linkage is a compromise between the two but is sensitive to the scale on which the distances are measured.
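To see how the linkage choice affects the solution, the same distance matrix can be clustered with different `method` arguments (a sketch; in R's `hclust()` Ward's criterion is usually obtained with `"ward.D2"` on Euclidean distances, and three clusters are assumed only for illustration):

```r
# Same distances, different linkage rules: compare resulting cluster sizes
D <- dist(scale(iris[, 1:4]))
sapply(c("single", "complete", "average", "ward.D2"),
       function(m) table(cutree(hclust(D, method = m), k = 3)))
# rows = cluster sizes, columns = linkage methods
```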
### Visualization
In hierarchical clustering the clusters are **estimated incrementally, resulting in a nested structure**. This tree-shaped structure can be visualized with a *dendrogram*, which provides a complete description of the clustering process.
In Jamovi: `snowCluster > Hierarchical clustering method > Plot | Plot dendrogram`.
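In R, the same tree can be drawn with the base `plot()` method for `hclust` objects (a sketch; average linkage on the standardized `iris` data is again only an example):

```r
# Dendrogram: the full merge history of the agglomerative process
hc <- hclust(dist(scale(iris[, 1:4])), method = "average")
plot(hc, labels = FALSE, hang = -1, main = "Average linkage dendrogram")
```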
### Number of clusters
The **number of clusters can be decided** using the dendrogram. Cutting the dendrogram at a certain height yields the number of clusters that exist at that height. A common rule for choosing the number of clusters is to cut the dendrogram across its longest vertical branches, i.e. in the height range over which the cluster solution stays unchanged for the longest stretch.
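In R, `cutree()` performs this cut, either for a chosen number of clusters or at a chosen height (the height value below is an arbitrary assumption):

```r
# Cut the dendrogram into clusters
hc <- hclust(dist(scale(iris[, 1:4])), method = "average")
table(cutree(hc, k = 3))    # cut so that exactly 3 clusters remain
table(cutree(hc, h = 2.5))  # cut at height 2.5 (assumed value)
```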
## K-means clustering
See @hastie_elements_2017 section 14.3.6.
This method groups objects by partitioning them **iteratively**. The **number of clusters $K$ has to be defined before estimation**. The goal is to partition the objects $x$ into $K$ clusters so that the distances between objects within a cluster are small compared to the distances to points outside the cluster. We can achieve this by assigning each object to the closest centroid. The centroid is the mean point of a cluster ($\bar x_1, \dots, \bar x_K$) and has to be calculated for each of the $K$ clusters, hence the name K-means. The optimal cluster means can be found by minimizing the following function:
$$ESS = \sum^K_{k = 1} \sum_{c(i)=k} (x_i - \bar x_k)^T(x_i - \bar x_k),$$
where $c(i)$ is the cluster containing $x_i$. Simply put, **we attempt to minimize the sum of squared distances of the objects from their cluster centroids, over all clusters**. This function $ESS$ is minimized iteratively.
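To make the objective concrete, a small helper can compute $ESS$ for any data matrix and cluster assignment (a sketch; the function name `ess()` is not from the text, and the value it returns corresponds to the `tot.withinss` component reported by R's `kmeans()`):

```r
# Within-cluster sum of squares (the ESS above) for assignment `cl`
ess <- function(X, cl) {
  X <- as.matrix(X)
  sum(sapply(unique(cl), function(k) {
    Xk  <- X[cl == k, , drop = FALSE]
    ctr <- colMeans(Xk)                 # cluster mean (centroid)
    sum(rowSums(sweep(Xk, 2, ctr)^2))   # squared distances to the centroid
  }))
}
```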
K-means clustering involves the following steps:
1. We start with initial cluster means, based on
- a random assignment of objects to $K$ clusters (whose means are then computed), or
- some (random) starting cluster means.
2. Calculate the squared Euclidean distance between each object and each cluster mean. Reassign each object to its nearest cluster mean, which decreases $ESS$.
3. Update cluster means.
4. Repeat the previous two steps until no object can be reassigned, i.e. each object is closest to its own cluster mean.
In Jamovi: `snowCluster > K-means clustering method`.
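A minimal sketch in R with the built-in `kmeans()` (the standardized `iris` data, $K = 3$ and 25 random starts are all assumptions for illustration):

```r
# K-means: K must be chosen in advance
X  <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3, nstart = 25)  # several random starts to avoid poor local minima
km$centers                                 # the K cluster means (centroids)
km$cluster                                 # cluster assignment c(i) of each object
km$tot.withinss                            # the minimized ESS
```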
An alternative is *K-medoids clustering*, in which the cluster centers are not mean values but actual objects (medoids).
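For K-medoids, one option is `pam()` from the `cluster` package (a sketch, assuming that package is available):

```r
# K-medoids: cluster centers are actual observations (medoids)
library(cluster)
pm <- pam(scale(iris[, 1:4]), k = 3)
pm$medoids   # each row is one of the original (standardized) observations
```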
### Number of clusters (Gap statistic)
See @hastie_elements_2017 section 14.3.11.
The Gap statistic is a technique used to **determine the optimal or natural number of clusters**. It compares the within-cluster dispersion of the data (the sum of average within-cluster distances) to the same quantity obtained from uniformly distributed reference data. A higher value indicates a better fit for that $K$, but only clusters that contribute substantially to the value should be considered. The optimal number of clusters is the smallest $k$ for which $Gap(k) \ge Gap(k+1) - se(k+1)$, i.e. the gap at $k$ is within one standard error of the gap at the next number of clusters. Essentially, we want the Gap value to be high without having too many clusters. If this rule indicates that $K = 1$, the data does not contain any natural clusters and clustering methods are not suitable.
In Jamovi: `snowCluster > K-means clustering method > Plots | Optimal number of clusters`.
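In R, the `cluster` package provides `clusGap()` together with `maxSE()`, whose `"Tibs2001SEmax"` rule matches the criterion described above (a sketch; the choices of `K.max`, `B` and `kmeans` as the clustering function are assumptions):

```r
# Gap statistic: compare observed within-cluster dispersion to a uniform reference
library(cluster)
set.seed(1)
gap <- clusGap(scale(iris[, 1:4]), FUNcluster = kmeans, nstart = 25,
               K.max = 8, B = 50)
gap$Tab                                   # logW, E.logW, gap and its standard error per K
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
```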
### Visualization
When we apply K-means clustering we do not obtain a nested structure and therefore **cannot express the clustering as a dendrogram**.
Instead, we can create a two-dimensional plot that shows the locations of the objects and the cluster centroids (the Cluster plot in Jamovi). The two dimensions can be obtained via PCA or multidimensional scaling.
In Jamovi: `snowCluster > K-means clustering method > Plots | Cluster plot`.
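A sketch of such a plot in R, using PCA scores for the two dimensions (the dataset and $K$ are again assumptions):

```r
# Cluster plot: objects and centroids projected onto the first two principal components
X   <- scale(iris[, 1:4])
km  <- kmeans(X, centers = 3, nstart = 25)
pca <- prcomp(X)
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "K-means clusters")
points(predict(pca, km$centers)[, 1:2], pch = 8, cex = 2)  # centroids in PC space
```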
To interpret the clusters we can inspect the mean values of the variables within each cluster.
In Jamovi: `snowCluster > K-means clustering method > Plots | Plot means across clusters`.
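In R, the same summary can be obtained with `aggregate()` (a sketch; showing the means on the original measurement scale is an assumption, not the only choice):

```r
# Cluster profiles: mean of each variable within each cluster
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
aggregate(iris[, 1:4], by = list(cluster = km$cluster), FUN = mean)
```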