- Supervised learning algorithm: training data is labeled
- Non-parametric method: no assumptions about the underlying data distribution
- Lazy learning algorithm: all computation is deferred until classification
- Instance-based learning algorithm: the function is approximated locally
- Majority voting algorithm: the test point is assigned the class label held by the majority of its k nearest neighbors
- One of the simplest classification algorithms
- Easy to implement
- No explicit training phase
- Algorithm does not perform any generalization of the training data
- Nonlinear decision boundaries between classes
- The entire (potentially large) training data set must be stored and searched at classification time
- Inputs can be both quantitative and qualitative
- Outputs are categorical values (classes of the data)
- It explains a categorical variable using the majority vote of the k nearest neighbors.
- Being non-parametric, it makes no assumptions about the underlying data
- Select the parameter k based on the data
- Requires a distance metric to define proximity between data points (Eg: Euclidean distance, Mahalanobis distance, Hamming distance, etc.)
- Compute the distance metric between the test data point and all the labeled data points
- Order the labeled data points in increasing order of this distance metric
- Select the top k labeled data points and look at their class labels
- Find the class label that the majority of these k labeled data points have and assign it to the test data point
- Parameter selection (k)
- The best choice of k depends on the data
- Larger values of k reduce the effect of noise on classification, but make boundaries between classes less distinct
- Small values of k make classification boundaries more specific, but noise in the data is more likely to cause misclassification
- Presence of noise
- Feature selection and scaling
- Remove irrelevant features
- When the no. of features is too large or features are redundant, feature extraction is required
- If features are carefully chosen, classification will be better
- Curse of dimensionality (as the no. of features increases, the no. of data points required to generalize accurately grows exponentially)
See the code here
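As a rough sketch of the classification steps listed above (not the linked code), here is a minimal NumPy implementation; the function name `knn_predict`, the toy data, and the choices of Euclidean distance and k = 3 are all illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify a single test point by majority vote of its k nearest neighbors."""
    # Compute the Euclidean distance from the test point to every labeled point
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Order the labeled points by increasing distance and keep the top k
    nearest = np.argsort(distances)[:k]
    # Assign the class label held by the majority of these k neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Illustrative toy data: two classes in 2-D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Feature scaling (see the bullets above): standardize so no single feature dominates the distance
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_scaled = (X_train - mu) / sigma

x_test = (np.array([1.1, 0.9]) - mu) / sigma
print(knn_predict(X_scaled, y_train, x_test, k=3))  # expected: class 0
```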
- One of the simplest unsupervised learning algorithms
- A technique to partition N observations into K clusters (K <= N) in which each observation belongs to the cluster with the nearest mean
- Works well for distance metrics where the mean is well defined (Eg: Euclidean distance)
Given N observations $(x_1, x_2, \ldots, x_N)$, where each observation is a d-dimensional real vector, K-means partitions the N observations into K sets $S = \{S_1, S_2, \ldots, S_K\}$ so as to minimize the within-cluster sum of squares: $\arg\min_{S} \sum_{i=1}^{K} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$, where $\mu_i$ is the mean of the points in $S_i$.
- Randomly choose K points as the initial cluster centers
- Compute the distance of each point from the cluster centers and assign each point to its closest center
- Recompute each cluster center as the mean of the points assigned to it
- If the change in mean is negligible OR no reassignment of points is required, stop. Else, repeat steps 2 and 3
- Elbow method: looks at the percentage of variance explained as a function of the no. of clusters (K)
- The point where the marginal decrease plateaus is an indicator of the optimal no. of clusters (see the sketch at the end of this section)
- Dendrogram
- Could converge to a local minimum, so the choice of initial cluster centers is very important
- If the clusters are not spherical, K-means can fail to identify them correctly
See the code here
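Not the linked code, but a rough from-scratch sketch of the iterative steps above (Lloyd's algorithm) together with a simple elbow-method loop; the function name `kmeans`, the toy data, the tolerance, and the range of K are all illustrative:

```python
import numpy as np

def kmeans(X, K, max_iters=100, tol=1e-6, seed=0):
    """Lloyd's algorithm: returns centers, labels, and the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(K)])
        # Step 4: stop when the change in the means is negligible
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    # Final assignment and within-cluster sum of squares (inertia)
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    inertia = ((X - centers[labels]) ** 2).sum()
    return centers, labels, inertia

# Illustrative data: three well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [4, 4], [0, 4])])

# Elbow method: within-cluster sum of squares vs. K; the decrease levels off near the true K (3 here)
for K in range(1, 7):
    _, _, inertia = kmeans(X, K)
    print(K, round(inertia, 2))
```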