In supervised machine learning tasks, each observation in the data is already assigned to one of a set of classes. For example, here we are given a dataset in which each observation is a set of physical attributes of an object. In a supervised task, the label column holds the class of each observation. The algorithm then uses these existing separations in the data to develop criteria for classifying new, unlabeled observations.
Label | Height | Width | Color | Mass | Round? |
---|---|---|---|---|---|
Apple | 6cm | 7cm | Red | 330g | TRUE |
Orange | 6cm | 7cm | Orange | 330g | TRUE |
Lemon | 5cm | 4cm | Yellow | 150g | FALSE |
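As a rough illustration of how labels drive a supervised method, the sketch below classifies a new fruit by finding the most similar labeled row in the table above (a 1-nearest-neighbor rule). The `distance` and `classify` helpers, the relative-difference scaling, and the `unknown` observation are all assumptions made for this example, not a prescribed method.

```python
# Labeled training data copied from the table above.
# Each row: (label, height_cm, width_cm, color, mass_g, is_round)
TRAINING = [
    ("Apple", 6, 7, "Red", 330, True),
    ("Orange", 6, 7, "Orange", 330, True),
    ("Lemon", 5, 4, "Yellow", 150, False),
]


def distance(a, b):
    """Dissimilarity between two attribute tuples (labels excluded)."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (bool, str)):
            # Categorical attribute: 0 if equal, 1 if different.
            total += 0.0 if x == y else 1.0
        else:
            # Numeric attribute: relative difference, so that grams do
            # not dominate centimetres (an assumption of this sketch).
            total += abs(x - y) / max(abs(x), abs(y), 1)
    return total


def classify(observation):
    """Return the label of the closest training row (1-nearest-neighbor)."""
    best = min(TRAINING, key=lambda row: distance(row[1:], observation))
    return best[0]


unknown = (5, 4, "Yellow", 140, False)  # hypothetical unlabeled fruit
print(classify(unknown))
```

The labels never enter the distance calculation. They are only used to name the answer, which is exactly the information an unsupervised task lacks.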
In contrast, in an unsupervised machine learning task there are either no labels, or that information is treated as just another attribute of the observation. In our fruit example, the object type is now just another characteristic of the observation, and is often unknown altogether:
Object | Height | Width | Color | Mass | Round? |
---|---|---|---|---|---|
Apple | 6cm | 7cm | Red | 330g | TRUE |
Orange | 6cm | 7cm | Orange | 330g | TRUE |
Lemon | 5cm | 4cm | Yellow | 150g | FALSE |
An unsupervised algorithm is not told how the data is structured or separated (barring parameter tuning); instead, the algorithm goes looking for structure and separation in the data.
Clustering algorithms aim to group the observations in the data into categories (classes) based on some notion of how similar the observations are to each other. For example, given a basket of fruit, a clustering algorithm tries to group what it thinks are apples together into one class, and what it thinks are oranges into another.
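To make the idea concrete, here is a minimal sketch of one clustering method, k-means, written with only the standard library. The measurements in `POINTS` are hypothetical, mass is expressed in units of 100 g so that it does not swamp height, and the starting centroids are fixed by hand so the run is repeatable. All of these are assumptions made for this example.

```python
import math

# Hypothetical (height_cm, mass in units of 100 g) for six unlabeled fruit.
POINTS = [
    (6.0, 3.3), (6.2, 3.4), (5.9, 3.2),
    (5.0, 1.5), (4.8, 1.4), (5.1, 1.6),
]


def k_means(points, centroids, iterations=10):
    """Plain k-means: return the cluster index assigned to each point."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return assignments


# Start from the first and last observations (an arbitrary, fixed choice).
clusters = k_means(POINTS, [POINTS[0], POINTS[-1]])
print(clusters)  # → [0, 0, 0, 1, 1, 1]
```

The output is only a set of group indices. The algorithm never learns that group 0 might be "apples"; attaching a meaning to each group is left to us.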
Dimension reduction techniques aim to decrease the number of columns (variables) in a dataset based on some criterion, such as which variables most separate the observations. For example, given the height, width, color, mass, and roundness of each fruit, a dimension reduction algorithm will try to determine the minimum number of attributes needed to tell the fruit apart: can we tell it's an apple with just the mass and color?
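The question of the minimum number of attributes can be sketched as a brute-force search over attribute subsets, which is a very simple form of feature selection. This is not how projection methods such as principal component analysis work; it is only meant to illustrate the goal on the three fruit rows from the tables above. The function name `smallest_separating_subset` is made up for this example.

```python
from itertools import combinations

ATTRIBUTES = ["Height", "Width", "Color", "Mass", "Round"]

# Rows copied from the fruit table; the fruit name is kept only for reference
# and is not used by the search.
FRUIT = {
    "Apple": ("6cm", "7cm", "Red", "330g", True),
    "Orange": ("6cm", "7cm", "Orange", "330g", True),
    "Lemon": ("5cm", "4cm", "Yellow", "150g", False),
}


def smallest_separating_subset(rows, attributes):
    """Return the first smallest attribute subset under which all rows differ."""
    for size in range(1, len(attributes) + 1):
        for subset in combinations(range(len(attributes)), size):
            # Keep only the chosen columns and check that no two rows collide.
            projected = {tuple(row[i] for i in subset) for row in rows}
            if len(projected) == len(rows):
                return [attributes[i] for i in subset]
    return None  # some rows are identical on every attribute


print(smallest_separating_subset(list(FRUIT.values()), ATTRIBUTES))  # → ['Color']
```

On these three rows, color alone is enough, because the apple and the orange share the same height, width, mass, and roundness. With more fruit, a larger subset would likely be needed.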
Generally speaking, in an unsupervised task there is no existing labeling to compare the results of the algorithm to; instead, we often evaluate reliability through repeated experiments, by computing the likelihood of our data under the fitted model, and through visualizations.
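As one illustration of the likelihood-based check, the sketch below compares how well two candidate models explain the same data: a single Gaussian versus two equally weighted Gaussians. The masses in `MASSES`, the split at 250 g, and the equal weights are all assumptions made for this example.

```python
import math

# Hypothetical fruit masses in grams (not taken from the tables above).
MASSES = [150, 155, 160, 320, 330, 340]


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def fit(values):
    """Return the mean and (population) standard deviation of values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def log_likelihood(data, components):
    """Log-likelihood of data under a mixture of (weight, mean, std) Gaussians."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, s) for w, m, s in components))
        for x in data
    )


# Model A: all fruit come from a single Gaussian.
one_group = [(1.0, *fit(MASSES))]

# Model B: two equally weighted Gaussians, one per group split at 250 g.
light = [m for m in MASSES if m < 250]
heavy = [m for m in MASSES if m >= 250]
two_groups = [(0.5, *fit(light)), (0.5, *fit(heavy))]

ll_one = log_likelihood(MASSES, one_group)
ll_two = log_likelihood(MASSES, two_groups)
print(f"one group: {ll_one:.2f}, two groups: {ll_two:.2f}")
```

A higher log-likelihood means the model explains the observed data better, and here the two-group model scores higher. Adding components tends to raise the likelihood on the data used for fitting, so in practice the comparison is usually made on held-out data or with a penalty for model complexity.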