
Uses computer vision techniques to extract geometric and color features from raw image data, then uses those features in ML models to predict the probability of malignant skin cancer from an image of a skin lesion.


Skin Cancer Classifier Project

Author

Daniel Kaijzer

Project Overview

This project is a web application that allows users to upload an image of a skin lesion and receive a prediction of the probability that the lesion is malignant. The application uses a machine learning model trained on a dataset of skin lesion images and associated metadata to make its predictions.

Usage

Running the Web Application

Launch the Streamlit application:

streamlit run app/webapp.py

Once launched:

  1. In the left column, enter:
  • Patient's age
  • Lesion diameter
  • Patient's sex
  • Anatomical location
  2. In the right column, upload the lesion image
  • Image requirements: between 90x90 and 2000x2000 pixels
  • Square images work best but are not required
  3. Click "Analyze Lesion" to get:
  • Malignancy probability
  • Generated lesion mask
  • Risk level assessment (low/medium/high)
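
For reference, here is a minimal sketch of that two-column layout; the widget labels and option lists are illustrative assumptions, and app/webapp.py is the actual implementation.

```python
# Minimal sketch of the two-column input layout (labels/options are placeholders).
import streamlit as st

left, right = st.columns(2)

with left:
    age = st.number_input("Patient's age", min_value=0, max_value=120, value=50)
    diameter = st.number_input("Lesion diameter (mm)", min_value=0.0, value=5.0)
    sex = st.selectbox("Patient's sex", ["male", "female"])
    site = st.selectbox(
        "Anatomical location",
        ["head/neck", "upper extremity", "lower extremity", "torso"],
    )

with right:
    uploaded = st.file_uploader("Upload lesion image", type=["jpg", "jpeg", "png"])

if st.button("Analyze Lesion") and uploaded is not None:
    ...  # run feature extraction and the serialized model on the inputs
```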

Using image_feature_extractor.py independently:

python src/image_feature_extractor.py
  1. You will be prompted to select from 5 test images.
  2. Once an image is chosen, the program processes it and compares the ground truth values against the values extracted directly from the image.
  3. For more test data: https://www.kaggle.com/competitions/isic-2024-challenge/data

Project Structure

project_root/
├── app/
│   └── webapp.py              # Streamlit application
├── data/
│   └── test-metadata.csv      # Ground truth metadata for test images
├── docs/
│   └── DK_Skin_Cancer_Classifier_Presentation.pdf
├── models/
│   ├── encoder.pkl           # Required for inference
│   ├── model.pkl            # Serialized tabular-only model
│   └── feature_columns.json  # Required for inference
├── notebooks/
│   ├── eda.ipynb           # Exploratory data analysis
│   └── model.ipynb         # Model development process
├── src/
│   ├── __init__.py
│   └── image_feature_extractor.py  # Feature extraction module
├── Test_Images/            # Sample images from holdout test set and user input metadata for each image
├── requirements.txt        # Project dependencies
└── setup.py               # Package installation configuration

Model Development

The eda.ipynb notebook contains basic exploratory data analysis performed on the training data CSV file. The model.ipynb notebook documents the process of creating the ML models and compares the performance of the tabular-only model against the hybrid model, which combines tabular data with a CNN: the CNN creates an embedding matrix that is combined with the tabular data to retrain the LightGBM-based model.
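
As a rough illustration of the embedding step, the sketch below extracts a pooled feature vector from a pretrained EfficientNet, assuming torchvision; the actual fine-tuning happens in model.ipynb (train_embedding_model_with_hdf5), and the embedding width depends on the EfficientNet variant.

```python
# Sketch: turn a pretrained EfficientNet into an image-embedding extractor.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
model.classifier = torch.nn.Identity()  # drop the head; keep the pooled feature vector
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# "lesion.jpg" is a placeholder path for illustration.
img = preprocess(Image.open("lesion.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    embedding = model(img)  # one row per image; width set by the backbone
```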

Model Architecture

  1. Tabular-only Model:
  • This model uses only the tabular data (patient metadata and derived features) for prediction.
  • It is an ensemble of LightGBM models (a VotingClassifier) with different random states for diversity (see the sketch after this list).
  • Each LightGBM model is wrapped in a Pipeline with a RandomUnderSampler to handle class imbalance.
  • The models are trained using cross-validation with StratifiedGroupKFold to ensure proper data splitting.
  • Feature importances are calculated by averaging importances across all folds and models.
  • The current version of the model has a pAUC of 0.1695 on the holdout test set, quite close to the best score from the ISIC 2024 competition, 0.17264.
  2. Hybrid Model (Tabular + CNN):
  • This model combines the tabular data with embeddings generated by a CNN.
  • The CNN (EfficientNet-B0) is pre-trained on ImageNet and fine-tuned on a balanced subset of the training data.
  • The CNN generates a 1792-dimensional embedding for each image.
  • The image embeddings are combined with the tabular features to create an enhanced feature set.
  • The same ensemble of LightGBM models is then trained on this enhanced feature set.
  • The CNN architecture and training process are defined in the train_embedding_model_with_hdf5 function.
  • NOTE: I have decided not to use this model in the webapp because it does not perform as well as the tabular-only model (see slides for comparison). The code to build it is in model.ipynb.
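
Below is a minimal sketch of the tabular ensemble described in item 1, assuming scikit-learn, imbalanced-learn, and LightGBM; the estimator names, seeds, and hyperparameters are placeholders, not the values used in model.ipynb.

```python
# Sketch: VotingClassifier of undersampled LightGBM pipelines.
from lightgbm import LGBMClassifier
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import StratifiedGroupKFold

def make_member(seed: int) -> Pipeline:
    # Each ensemble member undersamples the majority (benign) class
    # before fitting its own LightGBM model.
    return Pipeline([
        ("sampler", RandomUnderSampler(random_state=seed)),
        ("lgbm", LGBMClassifier(random_state=seed)),
    ])

ensemble = VotingClassifier(
    estimators=[(f"lgbm_{s}", make_member(s)) for s in (0, 1, 2)],
    voting="soft",  # average predicted probabilities across members
)

# StratifiedGroupKFold keeps all lesions from one patient in the same fold
# while preserving the malignant/benign ratio per fold.
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
# for train_idx, val_idx in cv.split(X, y, groups=patient_ids): ...
```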

Image Feature Extractor Program

This program uses OpenCV to extract various geometric and color features directly from the image files. This enables the Streamlit web app to function without requiring too much user input. Here's a breakdown of its main components:

  1. create_masks function:
  • Converts the image to the LAB color space.
  • Applies thresholding on the L, A, and B channels to identify potential lesion regions.
  • Performs morphological operations (opening and closing) to refine the binary mask.
  • Selects the darkest contour as the lesion region.
  • Creates lesion and surrounding-area masks.
  2. calculate_shape_features function:
  • Calculates shape-related features such as area, perimeter, minor axis length, eccentricity, and area-perimeter ratio.
  • Uses OpenCV functions such as contourArea, arcLength, minAreaRect, and moments for feature extraction.
  3. calculate_color_features function:
  • Calculates color-related features in the LAB color space.
  • Computes means and differences of L, A, and B values inside and outside the lesion.
  • Calculates derived features such as hue, chroma, and color differences.
  4. visualize_analysis function:
  • Creates visualizations to illustrate the analysis process.
  • Displays the original image, L and A channels, detected lesion contour, and masks.
  • Plots color distributions inside and outside the lesion.
  5. analyze_lesion function:
  • Main function that orchestrates the entire analysis pipeline.
  • Reads the image, creates masks, calculates the scale factor, and computes shape and color features.
  • Calls the visualize_analysis function to generate visualizations.

The image_feature_extractor.py file provides a comprehensive set of features that capture important characteristics of skin lesions. These features are used along with patient metadata to train the ML models for malignancy prediction.
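
The sketch below illustrates the masking and feature steps with OpenCV and NumPy. The threshold choice and contour-selection rule are simplified placeholders: the module itself thresholds all three LAB channels and selects the darkest contour, while this sketch thresholds only L and takes the largest contour.

```python
# Sketch: LAB-based lesion masking plus shape and color features.
import cv2
import numpy as np

img = cv2.imread("lesion.jpg")  # placeholder path
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
L, A, B = cv2.split(lab)

# Threshold the L channel: lesions are typically darker than surrounding skin.
_, mask = cv2.threshold(L, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Morphological opening/closing to remove noise and fill small holes.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

# Pick a contour as the lesion region (largest here; the module uses darkest).
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
lesion = max(contours, key=cv2.contourArea)

# Shape features.
area = cv2.contourArea(lesion)
perimeter = cv2.arcLength(lesion, closed=True)
(_, _), (w, h), _ = cv2.minAreaRect(lesion)
minor_axis = min(w, h)

# Color features: mean LAB values inside vs. outside the lesion mask.
lesion_mask = np.zeros(L.shape, dtype=np.uint8)
cv2.drawContours(lesion_mask, [lesion], -1, 255, thickness=cv2.FILLED)
mean_inside = cv2.mean(lab, mask=lesion_mask)[:3]
mean_outside = cv2.mean(lab, mask=cv2.bitwise_not(lesion_mask))[:3]
color_diff = tuple(i - o for i, o in zip(mean_inside, mean_outside))
```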

Requirements

  • See requirements.txt file

Acknowledgments
