Author: Zuyang Cao
This project employs two machine learning methods to classify the Fashion-MNIST dataset:
- Maximum-likelihood (ML) estimation under a Gaussian assumption, followed by Bayes' rule
- K-Nearest Neighbors (KNN)
Two dimensionality reduction techniques are applied with both classifiers:
- PCA (Principal Component Analysis)
- LDA (Linear Discriminant Analysis)
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from one of 10 classes.
Figure 1. Visualized Dataset
- pca_target_dim: Target dimension that PCA reduces the data to.
- components_number: Number of LDA components (at most n_classes - 1) for dimensionality reduction.
- neighbor_num: Number of neighbors used in the KNN vote.
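As a rough sketch (not the project's actual code), these parameters could be wired into scikit-learn pipelines as follows; the variable names mirror the parameter list above, and the data here is synthetic rather than Fashion-MNIST:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical defaults mirroring the parameter list above.
pca_target_dim = 30      # PCA output dimension
components_number = 9    # LDA components (<= n_classes - 1)
neighbor_num = 7         # K in KNN

# Synthetic stand-in for the 28x28 = 784-dimensional Fashion-MNIST vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 784))
y = rng.integers(0, 10, size=200)

pca_knn = make_pipeline(PCA(n_components=pca_target_dim),
                        KNeighborsClassifier(n_neighbors=neighbor_num))
lda_knn = make_pipeline(
    LinearDiscriminantAnalysis(n_components=components_number),
    KNeighborsClassifier(n_neighbors=neighbor_num))

pca_knn.fit(X, y)
lda_knn.fit(X, y)
print(pca_knn.named_steps["pca"].transform(X).shape)  # (200, 30)
```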
- PCA
Figure 2. PCA training set 2D
Figure 3. PCA training set 3D
- LDA
Figure 4. LDA training set 2D
Figure 5. LDA training set 3D
For more visualization images, please refer to the visualization folder.
- K-Neighbors
Figure 6. Accuracy and K Number
From Figure 6, it is clear that KNN reaches 100% accuracy on the training set when K is set to 1, a typical case of overfitting. As K increases, the accuracy on the test set rises slightly and stabilizes once K reaches 7, so the default K in this project is set to 7.
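The K=1 overfitting effect is easy to reproduce: with K=1, every training point is its own nearest neighbor, so training accuracy is exactly 1.0. A minimal sketch on synthetic two-class data (not the Fashion-MNIST features):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-class data standing in for the reduced Fashion-MNIST features.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (300, 30)), rng.normal(1, 1, (300, 30))])
y = np.array([0] * 300 + [1] * 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))
# K = 1 memorizes the training set (train accuracy 1.0), while test
# accuracy typically improves and stabilizes as K grows.
```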
- Dimension Reduction Parameters
Figure 7. Accuracy with PCA and LDA
When the dimension number N is larger than 10, PCA accuracy increases as N increases, while LDA accuracy stays constant. According to the scikit-learn documentation, n_components for LDA has an upper limit of min(n_features, n_classes - 1). In this case the class number is 10, so the upper limit for LDA is 9 components, which means that in Figure 7 the LDA projection is always 9-dimensional and its accuracy stays at a fixed value.
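This cap can be checked directly: scikit-learn's LinearDiscriminantAnalysis rejects an n_components above the limit. A small demonstration on synthetic 10-class data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 10-class data: LDA can project to at most n_classes - 1 = 9 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = rng.integers(0, 10, size=500)

lda = LinearDiscriminantAnalysis(n_components=9).fit(X, y)
print(lda.transform(X).shape)  # (500, 9)

try:
    LinearDiscriminantAnalysis(n_components=10).fit(X, y)
except ValueError as e:
    # n_components cannot exceed min(n_features, n_classes - 1)
    print("rejected:", e)
```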
Figure 8. Accuracy with Low PCA and LDA Value
With N below 10, both PCA and LDA accuracies increase monotonically. Based on Figures 7 and 8, the default value of N is set to 30 in this project to get a stable result.
The Gaussian-based Bayes classifier is a simple self-built class, so its accuracy may be lower than a built-in classifier from scikit-learn or other libraries.
The PCA dimension is set to 30 and LDA is left at its default in both methods.
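The self-built classifier is not reproduced here; the sketch below shows the general technique it names (per-class ML Gaussian estimates combined via Bayes' rule), tested on synthetic blobs rather than Fashion-MNIST. All names and the covariance regularization are this sketch's own choices, not the project's:

```python
import numpy as np

class GaussianBayes:
    """Per-class ML Gaussian estimates combined via Bayes' rule.

    A minimal sketch, not the project's actual implementation.
    """

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.params_ = []
        for c in self.classes_:
            Xc = X[y == c]
            mean = Xc.mean(axis=0)            # ML estimate of the class mean
            cov = np.cov(Xc, rowvar=False)    # class covariance estimate
            cov += 1e-6 * np.eye(X.shape[1])  # small ridge for stability
            prior = len(Xc) / len(X)          # empirical class prior
            self.params_.append((mean, cov, prior))
        return self

    def predict(self, X):
        scores = []
        for mean, cov, prior in self.params_:
            diff = X - mean
            _, logdet = np.linalg.slogdet(cov)
            maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
            # log p(x|c) + log p(c), dropping the shared constant term
            scores.append(-0.5 * (logdet + maha) + np.log(prior))
        return self.classes_[np.argmax(scores, axis=0)]

# Well-separated synthetic blobs: the classifier should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(5, 1, (100, 5))])
y = np.array([0] * 100 + [1] * 100)
clf = GaussianBayes().fit(X, y)
print((clf.predict(X) == y).mean())
```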
Dataset | Bayes Accuracy | KNN Accuracy
--- | --- | ---
LDA training set | 75.12% | 87.00%
PCA training set | 71.91% | 88.59%
LDA test set | 73.70% | 83.06%
PCA test set | 71.58% | 85.46%
For a sample of the Bayes run's output log, please refer to the Travis build log; the results appear at the end of the log.