Author: Zuyang Cao
This project employs two machine learning methods to classify the Fashion-MNIST dataset:
- Maximum-likelihood (ML) estimation under a Gaussian assumption, followed by Bayes' rule
- K-Nearest Neighbors (KNN)
Two dimensionality reduction techniques are applied with both classifiers:
- PCA (Principal Component Analysis)
- LDA (Linear Discriminant Analysis)
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from one of 10 classes.
Figure 1. Visualized Dataset
- pca_target_dim: Target dimension that PCA reduces the data to.
- components_number: Number of LDA components (at most n_classes - 1) for dimensionality reduction.
- neighbor_num: Number of neighbors used in the KNN vote.
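As a rough sketch (not the project's actual code), these parameters could be wired into scikit-learn pipelines as follows; the variable names mirror the parameter list above, and the data here is synthetic rather than Fashion-MNIST:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical defaults mirroring the parameter list above.
pca_target_dim = 30      # PCA output dimension
components_number = 9    # LDA components (<= n_classes - 1)
neighbor_num = 7         # K in KNN

# Synthetic stand-in for the 28x28 = 784-dimensional Fashion-MNIST vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 784))
y = rng.integers(0, 10, size=200)

pca_knn = make_pipeline(PCA(n_components=pca_target_dim),
                        KNeighborsClassifier(n_neighbors=neighbor_num))
lda_knn = make_pipeline(
    LinearDiscriminantAnalysis(n_components=components_number),
    KNeighborsClassifier(n_neighbors=neighbor_num))

pca_knn.fit(X, y)
lda_knn.fit(X, y)
print(pca_knn.named_steps["pca"].transform(X).shape)  # (200, 30)
```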
- PCA
Figure 2. PCA training set 2D
Figure 3. PCA training set 3D
- LDA
Figure 4. LDA training set 2D
Figure 5. LDA training set 3D
For more visualization images, please refer to the visualization folder.
- K-Neighbors
Figure 6. Accuracy and K Number
From Figure 6, it is clear that KNN reaches 100% accuracy on the training set when K is set to 1, a typical case of overfitting. As K increases, the accuracy on the test set rises slightly and stabilizes once K reaches 7, so the default K in this project is set to 7.
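The K=1 overfitting effect is easy to reproduce: with K=1, every training point is its own nearest neighbor, so training accuracy is exactly 1.0. A minimal sketch on synthetic two-class data (not the Fashion-MNIST features):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-class data standing in for the reduced Fashion-MNIST features.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (300, 30)), rng.normal(1, 1, (300, 30))])
y = np.array([0] * 300 + [1] * 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))
# K = 1 memorizes the training set (train accuracy 1.0), while test
# accuracy typically improves and stabilizes as K grows.
```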
- Dimension Reduction Parameters
Figure 7. Accuracy with PCA and LDA
When the dimension number N is larger than 10, PCA accuracy increases as N increases, while LDA accuracy stays constant. According to the scikit-learn documentation, n_components for LDA has an upper limit of min(n_features, n_classes - 1). In this case the class number is 10, so the upper limit for LDA is 9 components, which means that in Figure 7 the LDA projection is always 9-dimensional and its accuracy stays at a fixed value.
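This cap can be checked directly: scikit-learn's LinearDiscriminantAnalysis rejects an n_components above the limit. A small demonstration on synthetic 10-class data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 10-class data: LDA can project to at most n_classes - 1 = 9 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = rng.integers(0, 10, size=500)

lda = LinearDiscriminantAnalysis(n_components=9).fit(X, y)
print(lda.transform(X).shape)  # (500, 9)

try:
    LinearDiscriminantAnalysis(n_components=10).fit(X, y)
except ValueError as e:
    # n_components cannot exceed min(n_features, n_classes - 1)
    print("rejected:", e)
```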
Figure 8. Accuracy with Low PCA and LDA Value
With N below 10, both PCA and LDA accuracies increase monotonically. Based on Figures 7 and 8, the default value of N is set to 30 in this project to get a stable result.
The Gaussian-based Bayes classifier is a simple self-built class, so its accuracy may be lower than a built-in classifier from scikit-learn or other libraries.
The PCA dimension is set to 30 and LDA is left at its default in both methods.
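The self-built classifier is not reproduced here; the sketch below shows the general technique it names (per-class ML Gaussian estimates combined via Bayes' rule), tested on synthetic blobs rather than Fashion-MNIST. All names and the covariance regularization are this sketch's own choices, not the project's:

```python
import numpy as np

class GaussianBayes:
    """Per-class ML Gaussian estimates combined via Bayes' rule.

    A minimal sketch, not the project's actual implementation.
    """

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.params_ = []
        for c in self.classes_:
            Xc = X[y == c]
            mean = Xc.mean(axis=0)            # ML estimate of the class mean
            cov = np.cov(Xc, rowvar=False)    # class covariance estimate
            cov += 1e-6 * np.eye(X.shape[1])  # small ridge for stability
            prior = len(Xc) / len(X)          # empirical class prior
            self.params_.append((mean, cov, prior))
        return self

    def predict(self, X):
        scores = []
        for mean, cov, prior in self.params_:
            diff = X - mean
            _, logdet = np.linalg.slogdet(cov)
            maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
            # log p(x|c) + log p(c), dropping the shared constant term
            scores.append(-0.5 * (logdet + maha) + np.log(prior))
        return self.classes_[np.argmax(scores, axis=0)]

# Well-separated synthetic blobs: the classifier should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(5, 1, (100, 5))])
y = np.array([0] * 100 + [1] * 100)
clf = GaussianBayes().fit(X, y)
print((clf.predict(X) == y).mean())
```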
Dataset | Bayes Accuracy | KNN Accuracy
--- | --- | ---
LDA training set | 75.12% | 87.00%
PCA training set | 71.91% | 88.59%
LDA test set | 73.70% | 83.06%
PCA test set | 71.58% | 85.46%
For a sample of the Bayes run's output log, please refer to the Travis build log; the results appear at the end of the log.