Lab 2020
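I will give directions for running this lab through repl.it, but you should feel free to use whatever Python IDE you want. If you're using something like PyCharm or Spyder, the packages you need are: numpy, pandas, sklearn, PIL, matplotlib, requests, progressbar. (With pip, sklearn installs as scikit-learn and PIL installs as Pillow.) The code below also imports google_images_download, so you will need that package too.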
- Go to repl.it, and make a new account. Make a new Python repl. (You can create a new Python repl by clicking here.)
- We will begin by training a neural net to classify images of black bears and polar bears.
- Paste the following code into your file:
from google_images_download import googleimagesdownload
import os
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
import random
# constants
QUERIES = ["polar bear", "black bear"] # used in image searches
TAGS = ["polarbear", "blackbear"] # used in filenames (no spaces)
IMG_DIR = "images"
#setup a standard image size; this will distort some images but will get everything into the same shape
STANDARD_SIZE = (200, 200)
def download_images(query, tag):
    # skip the download if this tag's folder already holds at least 3 files
    if os.path.isdir(IMG_DIR + "/" + tag) and len([f for f in os.listdir(IMG_DIR + "/" + tag)]) >= 3:
        print("It looks like you've already downloaded images for '" + query + "'.")
        print("Skipping download.")
        print("To re-download images, remove all images in the " + IMG_DIR + "/" + tag + " folder.")
        return
    settings = {}
    settings["keywords"] = query
    settings["limit"] = 30 # how many images to download
    settings["output_directory"] = IMG_DIR
    settings["no_numbering"] = True
    settings["image_directory"] = tag
    settings["format"] = "jpg"
    settings["color_type"] = "full-color"
    settings["size"] = "medium"
    settings["type"] = "photo"
    downloader = googleimagesdownload()
    downloader.download(settings)
    # make sure images can be loaded
    imglist = [IMG_DIR + "/" + tag + "/" + f for f in os.listdir(IMG_DIR + "/" + tag) if not f.startswith('.')]
    for img in imglist:
        try:
            Image.open(img)
        except OSError:
            print(img, "can't be opened--will delete.")
            os.remove(img)
def img_to_matrix(filename, verbose=False):
    """
    takes a filename and turns it into a numpy array of RGB pixels
    """
    img = Image.open(filename)
    if verbose:
        print("changing size from %s to %s" % (str(img.size), str(STANDARD_SIZE)))
    img2 = img.resize(STANDARD_SIZE, Image.BILINEAR)
    img3 = list(img2.getdata())
    img4 = [list(x) for x in img3]
    img5 = np.array(img4)
    return img5
def flatten_image(img):
    """
    takes in an (m, n) numpy array and flattens it
    into a 1-D array of m * n values
    """
    s = img.shape[0] * img.shape[1]
    img_wide = img.reshape(1, s)
    return img_wide[0]
def evaluate(classifier, test_set, correct_outputs):
    right = 0
    wrong = 0
    predictions = classifier.predict(test_set)
    for i in range(0, len(predictions)):
        if predictions[i] == correct_outputs[i]:
            right += 1
        else:
            wrong += 1
    return right, wrong
def shuffle_data(X, y):
    # horribly inefficient shuffling algorithm
    for i in range(0, len(y)*2):
        a = random.randint(0, len(y)-1)
        b = random.randint(0, len(y)-1)
        temp = X[a]
        X[a] = X[b]
        X[b] = temp
        temp = y[a]
        y[a] = y[b]
        y[b] = temp
First, look at the CONSTANTS section of the code.
- QUERIES: this stores a list of the queries you would type into an image search engine to find images of a particular type. These queries are given to Google Images, and the matching images are downloaded automatically.
- TAGS: this stores a list of "tags," which should be short words that describe each query. Don't use spaces, because whatever you put here will be used to name folders on your computer.
- IMG_DIR: this stores the name of the folder where the images should be downloaded.
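To see how a query, its tag, and its download folder relate, here is a small sketch. The helper name make_tag is hypothetical and not part of the lab code (the lab just lists the tags by hand); it only shows one way a folder-safe tag could be derived from a query.

```python
QUERIES = ["polar bear", "black bear"]
IMG_DIR = "images"

def make_tag(query):
    """Hypothetical helper: lowercase the query and drop spaces so it is safe as a folder name."""
    return query.lower().replace(" ", "")

tags = [make_tag(q) for q in QUERIES]

# Each tag names a subfolder of IMG_DIR, built the same way the lab code builds it.
folders = [IMG_DIR + "/" + t for t in tags]

print(tags)     # ['polarbear', 'blackbear']
print(folders)  # ['images/polarbear', 'images/blackbear']
```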
At the end of your code, put these two lines:
download_images(QUERIES[0], TAGS[0])
download_images(QUERIES[1], TAGS[1])
Then run the file. This should download a bunch of images of black bears and polar bears and put them in your images directory. By default, the code downloads 30 photos of each bear, but this can be changed in the download_images function (the settings["limit"] value).
Open all the images of black bears and make sure each one is a black bear. Do the same for the polar bears. Delete any images that do not appear to be of black bears or polar bears, or which your computer can't open.
Paste the following code at the end of your file:
DIRS = [IMG_DIR + "/" + TAGS[0], IMG_DIR + "/" + TAGS[1]]
tag0images = [DIRS[0] + "/" + f for f in os.listdir(DIRS[0]) if not f.startswith('.')]
tag1images = [DIRS[1] + "/" + f for f in os.listdir(DIRS[1]) if not f.startswith('.')]
images = tag0images + tag1images
labels = [TAGS[0]] * len(tag0images) + [TAGS[1]] * len(tag1images)
shuffle_data(images, labels)
data = []
for image in images:
    img = img_to_matrix(image)
    img = flatten_image(img)
    data.append(img)
data = np.array(data)
This code:
- makes a list of image file names called images.
- makes a list of the correct labels (black bear/polar bear) for each image, based on which tag's folder the image came from.
- shuffles the two lists (keeping the correct labels with the right images) so that we don't have all the black bears first followed by all the polar bears.
- squashes each image into a 200x200 pixel grid (so there will be some distortion). Each pixel is represented by a tuple of RGB (red/green/blue) values between 0-255.
- each image is then "flattened" into a single 1-dimensional array of 200 * 200 * 3 = 120,000 values.
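The shuffling step is the one most likely to go wrong, since a label that drifts away from its image silently ruins the training data. As an alternative illustration (not the lab's shuffle_data function), the sketch below applies one random ordering to both lists. The name shuffle_together and the file names are made up for the example, and unlike shuffle_data it returns new lists instead of changing the originals.

```python
import random

def shuffle_together(items, labels, seed=None):
    """Return shuffled copies of both lists, using the same random order for each."""
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)
    return [items[i] for i in order], [labels[i] for i in order]

# Made-up file names; the folder part of each name matches its label.
images = ["polarbear/a.jpg", "polarbear/b.jpg", "blackbear/c.jpg", "blackbear/d.jpg"]
labels = ["polarbear", "polarbear", "blackbear", "blackbear"]

shuffled_images, shuffled_labels = shuffle_together(images, labels, seed=0)
for name, label in zip(shuffled_images, shuffled_labels):
    print(name, "->", label)
```

Whatever order comes out, every file name should still be printed next to the label that matches its folder.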
We store the data in a NumPy array. NumPy is a numerical/scientific computing library for Python and a core part of the broader SciPy ecosystem.
Type data.shape in the IPython window (lower right corner).
Out[24]: (56, 120000)
This shows you that data is a 2-D array with 56 rows and 120,000 columns. There is one row for each image, indicating there were 56 images in our data set. (This number might be different for you, because although we told Python to download 60 images total [30 bears of each type], you may have had to delete some that weren't bears or couldn't be opened.) Each row represents a picture, which we know is 200-by-200 pixels (40,000 pixels per image), and each pixel contains 3 numbers (RGB), so that's 120,000 integers per image.
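If you want to check that arithmetic without any real images, the sketch below builds stand-in arrays of the same shapes. The names fake_pixels and fake_data are made up for the example, and the pixel values are all zeros rather than real image data.

```python
import numpy as np

STANDARD_SIZE = (200, 200)
width, height = STANDARD_SIZE

# Stand-in for one resized image: one row per pixel, three columns (R, G, B).
fake_pixels = np.zeros((width * height, 3), dtype=np.uint8)

# Flatten to one long row of numbers, which is the effect of flatten_image.
flat = fake_pixels.reshape(-1)

# Stack 56 such rows to mimic the data array described in the text.
fake_data = np.array([flat] * 56)

print(fake_pixels.shape)  # (40000, 3)
print(flat.shape)         # (120000,)
print(fake_data.shape)    # (56, 120000)
```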
Type data[0]
Out[25]: array([130, 117, 53, ..., 85, 84, 46])
This shows you the first image in the array (or its RGB values).
So now we have 60 or so images, each one converted into an array of 120,000 values. Our goal is to train a neural net to take an image as input and output a "0" (polar bear) or a "1" (black bear). The problem is that 120,000 different inputs to a neural net is way too many. We can shrink the data while still preserving a lot of the information through a family of techniques called dimensionality reduction. One algorithm for doing this is called PCA, or principal component analysis. Fortunately, scikit-learn (the sklearn package) has PCA built in.
PCA is an algorithm that reduces an n-dimensional data set (i.e., the data has n features) to an m-dimensional set, where m < n. PCA tries to preserve as much of the "structure" of the data as possible. For instance, we can use PCA to reduce our 120,000-dimensional data set to TWO dimensions.
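To make the idea a little more concrete, here is a minimal sketch of PCA done by hand with NumPy's singular value decomposition on a small synthetic data set. This is not what the lab uses (the lab calls scikit-learn's PCA class); the function name pca_reduce and the random data are assumptions made purely for illustration.

```python
import numpy as np

def pca_reduce(data, m):
    """Project each row of data (n features) onto the top m principal components."""
    centered = data - data.mean(axis=0)  # PCA works on mean-centered data
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:m].T
    variance = singular_values ** 2
    kept_fraction = variance[:m].sum() / variance.sum()
    return reduced, kept_fraction

rng = np.random.default_rng(0)
# 20 synthetic "images" with 50 features each; almost all of the variation
# comes from 2 hidden directions, plus a little noise.
hidden = rng.normal(size=(20, 2))
mixing = rng.normal(size=(2, 50))
data = hidden @ mixing + 0.01 * rng.normal(size=(20, 50))

reduced, kept_fraction = pca_reduce(data, 2)
print(reduced.shape)  # (20, 2)
print("fraction of variance kept:", round(float(kept_fraction), 4))
```

Because this synthetic data was built from only two underlying directions, two components keep nearly all of the variance. Real image data will usually keep much less.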
Paste this code at the end of the file:
pca = PCA(n_components=2)
X = pca.fit_transform(data)
df = pd.DataFrame({"x": X[:, 0], "y": X[:, 1], "label":labels})
colors = ["blue", "red"]
for label, color in zip(df['label'].unique(), colors):
    mask = df['label']==label
    plt.scatter(df[mask]['x'], df[mask]['y'], c=color, label=label)
plt.legend()
plt.show()
Run the code. You should see a graph appear that plots the polar bear and black bear images in two dimensions.
The scales on the axes are relatively meaningless, since these are the two dimensions that PCA has produced, each of which is a weighted combination of the original 120,000 dimensions chosen to capture as much of the variation in the data as possible.
But the point is that the data now appears (mostly) linearly separable! This is great! This means it should be possible to train a neural net (or even a single-layer perceptron network) to classify images of bears into polar bear/black bear categories.
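"Linearly separable" means a single straight line can split the two groups. The sketch below runs the classic perceptron learning rule on six made-up 2-D points to show why that is enough for a network with no hidden layer. It is an illustration only: the points and the function name train_perceptron are invented, and this is not how the lab's MLPClassifier is trained.

```python
import numpy as np

def train_perceptron(points, targets, epochs=100):
    """Classic perceptron rule; targets are 0 or 1. Returns (weights, bias)."""
    weights = np.zeros(points.shape[1])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, t in zip(points, targets):
            prediction = 1 if x @ weights + bias > 0 else 0
            if prediction != t:
                # Nudge the dividing line toward the misclassified point.
                weights += (t - prediction) * x
                bias += (t - prediction)
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side of the line
            break
    return weights, bias

# Made-up points: class 0 sits on the left, class 1 on the right.
points = np.array([[-3.0, -1.0], [-2.0, 0.5], [-2.5, 1.0],
                   [2.0, 0.0], [3.0, -1.5], [2.5, 1.0]])
targets = np.array([0, 0, 0, 1, 1, 1])

weights, bias = train_perceptron(points, targets)
predictions = (points @ weights + bias > 0).astype(int)
print("predictions:", predictions)
print("targets:    ", targets)
```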
If the graph does not show a clear division between the red dots and the blue dots, let Prof Kirlin know.
Paste in this code:
nn_trainer = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=())
labels_ints = [TAGS.index(x) for x in labels]
classifier = nn_trainer.fit(X, labels_ints)
print("Training and testing on same data (poor practice).")
results = evaluate(classifier, X, labels_ints)
print("Number predicted correctly:", results[0])
print("Number predicted incorrectly:", results[1])
print("Accuracy: ", results[0] / (results[0] + results[1]))
Run the code. This makes a neural net that is trained on all of the images (all polar bears and all black bears) and then tested on those same images. You should see the lines after the evaluate call print the number of images classified correctly and the number classified incorrectly. Note that the NN initializes its weights randomly, so if you run this code a second time, you might get a different NN, with different weights, that classifies things differently. You should get very good results here (not too many incorrectly classified images). If your results are bad, just re-run the code, and it will probably do better.
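The counting that evaluate does can be tried out without training anything. In the sketch below, ThresholdClassifier is a hypothetical stand-in with a predict method (it is not a real neural net), and evaluate_vectorized is an assumed alternative name for a version of evaluate that compares whole arrays at once.

```python
import numpy as np

class ThresholdClassifier:
    """Hypothetical stand-in for a trained model: predicts 1 when the first feature is positive."""
    def predict(self, rows):
        return (np.asarray(rows)[:, 0] > 0).astype(int)

def evaluate_vectorized(classifier, test_set, correct_outputs):
    """Same counts as the lab's evaluate(), computed with one array comparison."""
    predictions = classifier.predict(test_set)
    right = int(np.sum(predictions == np.asarray(correct_outputs)))
    wrong = len(predictions) - right
    return right, wrong

# Four made-up samples; the third one is deliberately misclassified by the stand-in.
test_set = [[-1.0, 0.2], [2.0, 0.1], [0.5, -0.3], [-0.7, 0.9]]
correct_outputs = [0, 1, 0, 0]

right, wrong = evaluate_vectorized(ThresholdClassifier(), test_set, correct_outputs)
print("Number predicted correctly:", right)      # 3
print("Number predicted incorrectly:", wrong)    # 1
print("Accuracy:", right / (right + wrong))      # 0.75
```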
The first line of the code makes a neural network with no hidden layer. That's the hidden_layer_sizes=() part. If you want a NN with one hidden layer containing, for example, 5 nodes, you would use hidden_layer_sizes=(5,). Yes, you need the comma. If you want two hidden layers, use hidden_layer_sizes=(5,5). You can change those numbers to whatever you want.
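The comma matters because of how Python itself reads parentheses, which you can check directly. The variable names here are made up for the example.

```python
one_layer = (5,)     # tuple with one element: one hidden layer of 5 nodes
not_a_tuple = (5)    # just the integer 5 wrapped in parentheses
two_layers = (5, 5)  # two hidden layers of 5 nodes each
no_layers = ()       # empty tuple: no hidden layers

print(type(one_layer).__name__, len(one_layer))    # tuple 1
print(type(not_a_tuple).__name__)                  # int
print(type(two_layers).__name__, len(two_layers))  # tuple 2
print(type(no_layers).__name__, len(no_layers))    # tuple 0
```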
It is normally considered a bad idea to train and test on exactly the same data. This is like taking an exam in a class where you were told ahead of time exactly what the questions will be: you can just memorize the answers and not actually learn anything.
Instead, what we normally do is hold back some of our data to serve as a testing set that is independent of the training set.
Paste in this code:
# divide our data in half for training/testing
data_length = X.shape[0]
half_length = int(data_length / 2)
X_first_half = X[:half_length]
X_second_half = X[half_length:]
y_first_half = labels_ints[:half_length]
y_second_half = labels_ints[half_length:]
# train on first half, test on second
classifier = nn_trainer.fit(X_first_half, y_first_half)
print("\nTraining on first half, testing on first half (poor practice).")
results = evaluate(classifier, X_first_half, y_first_half)
print("Number predicted correctly:", results[0])
print("Number predicted incorrectly:", results[1])
print("Accuracy: ", results[0] / (results[0] + results[1]))
print("\nTraining on first half, testing on second half (good practice).")
results = evaluate(classifier, X_second_half, y_second_half)
print("Number predicted correctly:", results[0])
print("Number predicted incorrectly:", results[1])
print("Accuracy:", results[0] / (results[0] + results[1]))
Run this code. You should see pretty good accuracy on the training set (obviously because you trained on it), but the accuracy on the test set should be decent as well.
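The lab splits the data exactly in half, but with so few images you may want to try other proportions. Here is a sketch of a helper that holds back a configurable fraction for testing. The name split_train_test and the sample data are hypothetical, and the helper assumes the rows have already been shuffled.

```python
def split_train_test(X, y, test_fraction=0.5):
    """Hypothetical helper: the last test_fraction of the rows becomes the test set."""
    n_test = int(len(X) * test_fraction)
    n_train = len(X) - n_test
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Twelve made-up, already shuffled samples with two features each.
X = [[i, i * 2] for i in range(12)]
y = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = split_train_test(X, y, test_fraction=0.25)
print(len(X_train), "training rows,", len(X_test), "test rows")  # 9 training rows, 3 test rows
```

Since it only uses slicing, the same helper should work on the NumPy array X and the labels_ints list from the lab code.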
Now I want you to repeat these steps with two image categories of your choice. Try to pick two categories that are easy to tell apart visually, but where the objects within each category all look similar.
Things to play around with
- Image categories
- n_components in PCA
- Number of hidden layers and number of nodes in NN.
Your goal is to pick two categories you like and get the NN classifier to do well.
Before you submit, show me your code running, then upload it to Moodle.