Phil Kirlin edited this page Dec 1, 2020 · 26 revisions

Machine learning lab with real data

Setup

I will give directions for running this lab through repl.it, but you should feel free to use whatever Python IDE you want. If you're using something like PyCharm or Spyder, the packages you need are: numpy, pandas, scikit-learn (imported as sklearn), Pillow (imported as PIL), matplotlib, requests, progressbar.

  • Go to repl.it, and make a new account. Make a new Python repl. (You can create a new Python repl by clicking here.)

Adding a second file

  • In repl.it, you will start with a blank main.py file. The first thing we need to do is add a second file. In the Files sidebar on the left, click "add file" (looks like a file with a plus sign). Type in the name image_downloader.py. Then paste the code at this link into that file.

Getting ready to download images

  • We will begin by training a neural net to classify images of black bears and polar bears.

  • Switch back to main.py.

  • Paste the following code into your file:

import os
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
import random
from image_downloader import simple_image_download

# constants
QUERIES = ["polar bear", "black bear"]  # used in image searches
SUBDIRS = [s.replace(" ", "_") for s in QUERIES] # same as the queries, but no spaces
IMG_DIR = "simple_images"

#setup a standard image size; this will distort some images but will get everything into the same shape
STANDARD_SIZE = (200, 200)

def download_images(query):
    subdir = query.replace(" ", "_")
    if os.path.isdir(IMG_DIR + "/" + subdir) and len([f for f in os.listdir(IMG_DIR + "/" + subdir)]) >= 3:
        print("It looks like you've already downloaded images for '" + query + "'.")
        print("  Skipping download.")
        print("  To re-download images, remove all images in the " + IMG_DIR + "/" + subdir + " folder.")
        return
    
    response = simple_image_download()
    print("Downloading images for", query, "(might take 30 seconds)")
    response.download(query, 30)
    # print(response.urls(subdir, 30))
    
    # make sure images can be loaded
    imglist = [IMG_DIR + "/" + subdir + "/" + f for f in os.listdir(IMG_DIR + "/" + subdir) if not f.startswith('.')]
    for img in imglist:
        try:
            Image.open(img)
        except OSError:
            print(img, "can't be opened--will delete.")
            os.remove(img)
    
    
def img_to_matrix(filename, verbose=False):
    """
    takes a filename and turns it into a numpy array of RGB pixels
    """
    img = Image.open(filename)
    if verbose==True:
        print("changing size from %s to %s" % (str(img.size), str(STANDARD_SIZE)))
    # convert to RGB first so grayscale or palette images also have three bands
    img2 = img.convert("RGB").resize(STANDARD_SIZE, Image.BILINEAR)
    img3 = list(zip(list(img2.getdata(0)), list(img2.getdata(1)), list(img2.getdata(2))))
    img4 = [list(x) for x in img3]
    img5 = np.array(img4)
    return img5

def flatten_image(img):
    """
    takes in an (m, n) numpy array and flattens it
    into a 1-D array of length m * n
    """
    s = img.shape[0] * img.shape[1]
    img_wide = img.reshape(1, s)
    return img_wide[0]

def evaluate(classifier, test_set, correct_outputs):
    right = 0
    wrong = 0
    predictions = classifier.predict(test_set)
    for i in range(0, len(predictions)):
        if predictions[i] == correct_outputs[i]:
            right += 1
        else:
            wrong += 1
    return right, wrong

def shuffle_data(X, y):
    # horribly inefficient shuffling algorithm
    for i in range(0, len(y)*2):
        a = random.randint(0, len(y)-1)
        b = random.randint(0, len(y)-1)
        temp = X[a]
        X[a] = X[b]
        X[b] = temp
        temp = y[a]
        y[a] = y[b]
        y[b] = temp

Exploring the code

First, look at the constants section of the code (under the "# constants" comment).

QUERIES: this stores a list of queries you would use in an image search engine to find images of a particular type. These queries will be given to Google Images and the corresponding images automatically downloaded.

SUBDIRS: this is just the queries with spaces replaced with underscores. The downloaded images will go into these directories.

IMG_DIR: this stores the name of the folder where the images should be downloaded.
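As a quick illustration of how these three constants fit together, the sketch below rebuilds the folder paths that the lab code reads from. It uses os.path.join rather than the "/" concatenation in the lab file (that is my substitution, not what the lab code does), and the name image_dirs is made up for this example.

```python
import os

QUERIES = ["polar bear", "black bear"]
SUBDIRS = [q.replace(" ", "_") for q in QUERIES]
IMG_DIR = "simple_images"

# One folder per query, e.g. simple_images/polar_bear
# (image_dirs is a hypothetical name used only in this sketch)
image_dirs = [os.path.join(IMG_DIR, subdir) for subdir in SUBDIRS]

for query, folder in zip(QUERIES, image_dirs):
    print(repr(query), "->", folder)
```

If you change QUERIES later in the lab, the folder names change with it, so nothing else in this part needs editing.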

Doing the downloads

At the end of your code, put these two lines:

download_images(QUERIES[0])
download_images(QUERIES[1])

Then run the file. This should download a bunch of images of black bears and polar bears and put them in the simple_images directory, with one subfolder per query. By default, the code downloads 30 photos of each bear, but this can be changed in the download_images function.

Open all the images of black bears and make sure each one is a black bear. Do the same for the polar bears. Delete any images that do not appear to be of black bears or polar bears, or which your computer can't open.

Turning images into numbers

Paste the following code at the end of your file:

DIRS = [IMG_DIR + "/" + SUBDIRS[0], IMG_DIR + "/" + SUBDIRS[1]]
tag0images = [DIRS[0] + "/" + f for f in os.listdir(DIRS[0]) if not f.startswith('.')]
tag1images = [DIRS[1] + "/" + f for f in os.listdir(DIRS[1]) if not f.startswith('.')]
images = tag0images + tag1images
labels = [SUBDIRS[0]] * len(tag0images) + [SUBDIRS[1]] * len(tag1images)

print("Transforming images into big arrays... (might take up to 30 seconds)")
shuffle_data(images, labels)
data = []
for image in images:
    img = img_to_matrix(image)
    img = flatten_image(img)
    data.append(img)
data = np.array(data)
print("Shape of the data =", data.shape)
print("First image as an array =", data[0])

This code:

  • makes a list of image file names called images.
  • makes a list of the correct labels (black bear/polar bear) for each image, based on which folder the file is in.
  • shuffles the two lists (keeping the correct labels with the right images) so that we don't have all the black bears first followed by all the polar bears.
  • squashes each image into a 200x200 pixel grid (so there will be some distortion). Each pixel is represented by a tuple of RGB (red/green/blue) values between 0 and 255.
  • each image is then "flattened" into a single 1-dimensional array of 200 * 200 * 3 = 120,000 values.
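The shuffling step in the list above has to move each image and its label together. Here is one possible way to do that with the standard library, offered as a sketch rather than a replacement for the lab's shuffle_data function. The function name shuffle_together and the file names are invented for the example.

```python
import random

def shuffle_together(items, labels, seed=None):
    """Shuffle two equal-length lists in place, keeping items[i] paired with labels[i]."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    pairs = list(zip(items, labels))
    random.Random(seed).shuffle(pairs)  # a seed makes the order repeatable
    items[:] = [item for item, _ in pairs]
    labels[:] = [label for _, label in pairs]

# hypothetical file names, just to show the pairing survives the shuffle
files = ["polar_bear/a.jpg", "polar_bear/b.jpg", "black_bear/c.jpg", "black_bear/d.jpg"]
tags = ["polar_bear", "polar_bear", "black_bear", "black_bear"]
shuffle_together(files, tags, seed=0)
for f, t in zip(files, tags):
    print(t, f)
```

Zipping the two lists into pairs before shuffling means a label can never get separated from its image, which is the property the lab relies on.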

We store the data in a NumPy array. NumPy is a numerical/scientific computing library for Python.

Now run the code, and you should see:

Shape of the data = (60, 120000)

This shows you that the data is a 2-D array with 60 rows and 120,000 columns. There is one row for each image, indicating there were 60 images in our data set. (This number might be different for you, because although we told Python to download 60 images total [30 bears of each type], you may have had to delete some that weren't bears or couldn't be opened.) Each row represents a picture, which we know is 200-by-200 pixels (40,000 pixels per image), and each pixel contains 3 numbers (RGB), so that's 120,000 integers per image.

Then you'll see:

First image as an array = [ 87 114  63 ...  13 137 177]

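This shows you the first row of the array, which holds the RGB values of the first image. The "..." is NumPy leaving out the middle of the row so the output fits on one line.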
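To make the flattening and the 120,000 figure concrete, here is a tiny pure-Python sketch. It does not use NumPy or PIL (the lab code does the same thing with a NumPy reshape), and the 2x2 "image" and the name tiny_pixels are made up for illustration.

```python
WIDTH, HEIGHT, CHANNELS = 200, 200, 3
print(WIDTH * HEIGHT * CHANNELS)  # 120000 values per image

# A made-up 2x2 "image": one (R, G, B) tuple per pixel, listed row by row
tiny_pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (87, 114, 63)]

# Flattening just lays the R, G, B values of each pixel end to end
flat = [value for pixel in tiny_pixels for value in pixel]
print(len(flat))  # 12, i.e. 2 * 2 * 3
print(flat)       # [255, 0, 0, 0, 255, 0, 0, 0, 255, 87, 114, 63]
```

The real images work the same way, only with 40,000 pixels instead of 4.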

Manipulating the data

So now we have 60 or so images, each one converted into an array of 120,000 values. Our goal is to train a neural net to take an image as input and output a "1" or "0" depending on whether the image is a black bear or a polar bear. The problem is that 120,000 different inputs to a neural net is way too many. We can shrink the number of inputs while still preserving a lot of the information through a family of techniques called dimensionality reduction. One algorithm for doing this is called PCA, or principal component analysis. Fortunately, scikit-learn (the sklearn package we imported) has PCA built in.

PCA

PCA is an algorithm that reduces an n-dimensional data set (i.e., the data has n features) to an m-dimensional set, where n > m. PCA tries to preserve as much of the "structure" of the data as possible. For instance, we can use PCA to reduce our 120,000-dimensional data set to TWO dimensions.

Paste this code at the end of the file:

print("Running PCA...")
pca = PCA(n_components=2)
X = pca.fit_transform(data)
df = pd.DataFrame({"x": X[:, 0], "y": X[:, 1], "label":labels})
colors = ["blue", "red"]
for label, color in zip(df['label'].unique(), colors):
    mask = df['label']==label
    plt.scatter(df[mask]['x'], df[mask]['y'], c=color, label=label)
plt.legend()
plt.savefig("plot.png")  # Can also replace with plt.show() if you're in PyCharm.

Run the code. You should see a file appear called plot.png in the left column. The plot illustrates the polar bear and black bear images, plotted by reducing the 120,000-dimensional data down to 2 dimensions. The scales on the axes are relatively meaningless, since these are the two dimensions that PCA has produced, which are aggregates of the "most important" dimensions in the original 120,000-dimensional data.

But the point is that the data now appears (mostly) linearly separable! This is great! This means it should be possible to train a neural net (or even a single-layer perceptron network) to classify images of bears into polar bear/black bear categories.

If the graph does not show a clear division between the red dots and the blue dots, let Prof. Kirlin know.
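If you are curious what PCA is doing under the hood, the sketch below reproduces the basic idea with plain NumPy: center the data, find the directions of greatest variance with an SVD, and project onto the first few. This is an aside, not part of the lab file. The function name pca_project and the random toy data are my own, and the signs of the resulting coordinates may differ from what sklearn's PCA gives.

```python
import numpy as np

def pca_project(data, n_components):
    """Project the rows of data onto their top n_components principal directions."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    # Fraction of the total variance captured by each kept direction
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return projected, explained[:n_components]

rng = np.random.default_rng(0)
toy = rng.normal(size=(6, 5))      # 6 made-up "images" with 5 features each
projected, explained = pca_project(toy, 2)
print(projected.shape)             # (6, 2)
print(explained)                   # share of variance kept by each of the 2 axes
```

The second printed line is a rough answer to "how much information did we keep?"; the closer the two numbers sum to 1, the less was lost by dropping the other dimensions.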

Let's train a NN

Paste in this code:

print("Training a neural network...")
nn_trainer = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=())
labels_ints = [SUBDIRS.index(x) for x in labels]
classifier = nn_trainer.fit(X, labels_ints)
print("Training and testing on same data (poor practice).")
results = evaluate(classifier, X, labels_ints)
print("Number predicted correctly:", results[0])
print("Number predicted incorrectly:", results[1])
print("Accuracy:", results[0] / (results[0] + results[1]))

Run the code. This makes a neural net that is trained on all of the images (all polar bears and all black bears). It is tested on all the images as well. The print lines after the call to evaluate show the number of images classified correctly and the number classified incorrectly, followed by the accuracy. Note that the NN initializes weights randomly, so if you run this code a second time, you might get a different NN, with different weights, that might classify things differently. You should get very good results here (not too many incorrectly-classified images). If your results are bad, just re-run the code, and it will probably get better results.

If you get a warning like ConvergenceWarning: lbfgs failed to converge (status=2): ABNORMAL_TERMINATION_IN_LNSRCH, it is only a warning and the program keeps running. Try re-running the code until it goes away. :-)
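Two small steps in the training code are easy to miss: the text labels are turned into the integers 0 and 1 using their position in SUBDIRS, and the accuracy is just "number right divided by total". The sketch below shows both on made-up data; the accuracy function and the toy predictions are illustrative and are not part of the lab file.

```python
SUBDIRS = ["polar_bear", "black_bear"]

# Text labels become integers: polar_bear -> 0, black_bear -> 1
labels = ["black_bear", "polar_bear", "black_bear", "black_bear"]
labels_ints = [SUBDIRS.index(label) for label in labels]
print(labels_ints)  # [1, 0, 1, 1]

def accuracy(predictions, correct_outputs):
    """Fraction of predictions that match the correct outputs."""
    if len(predictions) != len(correct_outputs) or len(predictions) == 0:
        raise ValueError("need two non-empty lists of the same length")
    right = sum(1 for p, c in zip(predictions, correct_outputs) if p == c)
    return right / len(predictions)

toy_predictions = [1, 0, 0, 1]  # made-up classifier output, one mistake
print(accuracy(toy_predictions, labels_ints))  # 0.75
```

Because the integers come from the order of SUBDIRS, swapping the two queries would also swap which bear is "0" and which is "1".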

What this code does

The line that calls MLPClassifier makes a neural network with no hidden layer. That's the hidden_layer_sizes=() part. If you want a NN with one hidden layer containing, for example, 5 nodes, you would use hidden_layer_sizes=(5,). Yes, you need the comma; it is what makes (5,) a tuple. If you want two hidden layers, use hidden_layer_sizes=(5,5). You can change those numbers to whatever you want.
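The comma matters because of how Python itself reads parentheses, which you can check without any machine learning library. The variable names below are just for this demonstration.

```python
no_hidden = ()        # empty tuple: zero hidden layers
one_hidden = (5,)     # one-element tuple: one hidden layer of 5 nodes
not_a_tuple = (5)     # just the integer 5 in parentheses
two_hidden = (5, 5)   # two hidden layers of 5 nodes each

for name, value in [("()", no_hidden), ("(5,)", one_hidden),
                    ("(5)", not_a_tuple), ("(5, 5)", two_hidden)]:
    print(name, "is a", type(value).__name__)
```

Reading the tuple left to right gives the size of each hidden layer in order, so (10, 3) would mean a layer of 10 nodes followed by a layer of 3.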

Why this is bad

It is normally considered a bad idea to train and test on exactly the same data. This is like taking an exam in a class where you were told ahead of time exactly what the questions will be: you can just memorize the answers and not actually learn anything.

Instead, what we normally do is hold back some of our data to serve as a testing set that is independent of the training set.

Paste in this code:

# divide our data in half for training/testing
data_length = X.shape[0]
half_length = int(data_length / 2)
X_first_half = X[:half_length]
X_second_half = X[half_length:]
y_first_half = labels_ints[:half_length]
y_second_half = labels_ints[half_length:]

# train on first half, test on second
classifier = nn_trainer.fit(X_first_half, y_first_half)

print("\nTraining on first half, testing on first half (poor practice).")
results = evaluate(classifier, X_first_half, y_first_half)
print("Number predicted correctly:", results[0])
print("Number predicted incorrectly:", results[1])
print("Accuracy:", results[0] / (results[0] + results[1]))

print("\nTraining on first half, testing on second half (good practice).")
results = evaluate(classifier, X_second_half, y_second_half)
print("Number predicted correctly:", results[0])
print("Number predicted incorrectly:", results[1])
print("Accuracy:", results[0] / (results[0] + results[1]))

Run this code. You should see pretty good accuracy on the training set (obviously because you trained on it), but the accuracy on the test set should be decent as well.
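The half-and-half split above is one choice; holding back a smaller fraction for testing is also common. Here is a sketch of a reusable split, assuming the data has already been shuffled (as ours was). The function name holdout_split, the test_fraction parameter, and the toy data are all invented for this example; scikit-learn also ships a train_test_split function in sklearn.model_selection that does a similar job.

```python
def holdout_split(X, y, test_fraction=0.5):
    """Split already-shuffled data into training and testing portions."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = int(round(len(X) * test_fraction))
    n_train = len(X) - n_test
    # Slicing works for plain lists and for NumPy arrays alike
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

X_toy = [[i, i * 2] for i in range(8)]   # 8 made-up samples with 2 features each
y_toy = [i % 2 for i in range(8)]
X_train, X_test, y_train, y_test = holdout_split(X_toy, y_toy, test_fraction=0.25)
print(len(X_train), len(X_test))  # 6 2
```

With only 60 or so images, a smaller test set leaves more data for training but makes the test accuracy jumpier, so it is worth trying a couple of fractions.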

Repeating with different data

Now I want you to repeat these steps with two image categories of your choice. Try to pick two categories that are easy to tell apart visually, but where the objects within each category all look similar.

Things to play around with

  • Image categories
  • n_components in PCA
  • Number of hidden layers and number of nodes in NN.

Your goal is to pick two categories you like and get the NN classifier to do well.