Bwhiz/Galaxy-Zoo-Image-Analysis

Galaxy-Zoo-Image-Analysis

In this notebook, I investigate whether the Galaxy Zoo image dataset can serve as a data source for image classification.

This repo briefly covers using the dataset from Kaggle's Galaxy Zoo - The Galaxy Challenge competition for a classification problem.

The competition can be found here. The data was generated by hundreds of thousands of volunteers, who classified the shapes of the galaxies in the images by eye. The crowdsourced volunteers who took part in the Galaxy Zoo 2 project split the galaxy morphologies into 37 categories.

Each of the 37 categories is a column of floating-point numbers between 0 and 1 inclusive. These values are related to probabilities for each category: a high number (close to 1) indicates that many users identified this morphology category for the galaxy with a high level of confidence, while a low number (close to 0) indicates the feature is likely not present. For a detailed description of the data, visit the project's tree page.
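To make the column layout concrete, here is a minimal sketch of reading such a table and finding the highest-scoring category for each galaxy. It is only an illustration, not the notebook's code: the CSV contents, the header names (`galaxy_id`, `cat_a`, ...), the helper `strongest_category` and the `min_score` threshold are all hypothetical, and the real file has 37 category columns rather than three. It uses plain Python so that it runs without dependencies; in a notebook, pandas would be the more natural tool.

```python
import csv
import io

# Made-up stand-in for the solutions table; the real file has 37 category columns.
# The header names below are hypothetical placeholders.
TOY_CSV = """galaxy_id,cat_a,cat_b,cat_c
101,0.90,0.05,0.05
102,0.10,0.80,0.10
103,0.40,0.35,0.25
"""


def strongest_category(row, min_score=0.5):
    """Return (category, score, confident) for one galaxy's values."""
    scores = {name: float(value) for name, value in row.items() if name != "galaxy_id"}
    # Values are expected to lie between 0 and 1 inclusive.
    if not all(0.0 <= s <= 1.0 for s in scores.values()):
        raise ValueError("value outside [0, 1]")
    category = max(scores, key=scores.get)
    # min_score is an assumed cut-off for calling the top category "confident".
    return category, scores[category], scores[category] >= min_score


summary = {
    int(row["galaxy_id"]): strongest_category(row)
    for row in csv.DictReader(io.StringIO(TOY_CSV))
}
for galaxy_id, (category, score, confident) in summary.items():
    print(galaxy_id, category, score, confident)
```

In this toy table, galaxy 103 has no category above the assumed threshold, which is the kind of ambiguous case that makes a single hard label difficult to assign.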

At first this problem seemed like a straightforward classification task, but the competition's overview states: "...This competition asks you to analyze the JPG images of galaxies to find automated metrics that reproduce the probability distributions derived from human classifications. For each galaxy, determine the probability that it belongs in a particular class. Can you write an algorithm that behaves as well as the crowd does?" In other words, we are asked to reproduce the distribution of classifications that the project's participants gave to the features present in each galaxy, rather than to assign a single class per galaxy.

This came as a disappointment at the time, because I needed custom images for an image classification project around galaxies. So I asked myself: can I explore this dataset and, based on the morphologies and the values the participants assigned to them, create a column that can be used to classify each galaxy as if this were a straightforward classification task? The notebook shows my approach to answering this question, along with the hypothesis I worked with, the methodology, and the conclusion I drew from my analysis of the dataset. Over the course of this project, I was able to:

  • carry out EDA on the dataset,
  • carry out some basic image preprocessing,
  • use the K-means clustering algorithm, together with an implementation of the Kneedle algorithm, to determine the optimal K value,
  • include some helpful code blocks for handling certain contingencies that may come up while styling plots or plotting images based on a particular external file.
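As a rough illustration of the clustering step listed above, the sketch below runs a small K-means over a range of K values and then picks the "knee" of the resulting inertia curve. Everything in it is an assumption made for illustration: the toy 2-D points stand in for whatever features the notebook clusters, the deterministic farthest-point seeding replaces the usual random or k-means++ initialization, and `knee_point` is a simplified heuristic in the spirit of Kneedle (the largest gap between the normalized curve and the straight line joining its endpoints), not the full algorithm. The notebook's own implementation may differ.

```python
# Hypothetical toy data: three tight, well-separated 2-D blobs (not galaxy features).
OFFSETS = [(0.5, 0.5), (-0.5, 0.5), (0.5, -0.5), (-0.5, -0.5)]
BLOB_CENTERS = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in BLOB_CENTERS for dx, dy in OFFSETS]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centers(points, k):
    """Deterministic farthest-point seeding (an assumption; k-means++ is more common)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centers = init_centers(points, k)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def knee_point(ks, inertias):
    """Pick the k with the largest gap below the chord of the normalized curve.

    Assumes a decreasing curve, so the chord runs from (0, 1) to (1, 0).
    """
    x_span = ks[-1] - ks[0]
    y_min, y_max = min(inertias), max(inertias)
    y_span = (y_max - y_min) or 1.0
    best_k, best_gap = ks[0], float("-inf")
    for k, inertia in zip(ks, inertias):
        x = (k - ks[0]) / x_span
        y = (inertia - y_min) / y_span
        gap = (1.0 - x) - y
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


ks = list(range(1, 7))
inertias = [kmeans_inertia(points, k) for k in ks]
best_k = knee_point(ks, inertias)
print(best_k)
```

On real image features the curve is rarely this clean, so the chosen K is better treated as a starting point for inspection than as a definitive answer.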

Feel free to reuse any portion of my code, but please be kind enough to leave a reference to this repo.
