Daniel Kaijzer
This project is a web application that allows users to upload an image of a skin lesion and receive a predicted probability that the lesion is malignant. The prediction comes from a machine learning model trained on a dataset of skin lesion images and associated metadata.
Launch the Streamlit application:

    streamlit run app/webapp.py
Once launched:
- In the left column, enter:
  - Patient's age
  - Lesion diameter
  - Patient's sex
  - Anatomical location
- In the right column, upload the lesion image:
  - Image requirements: between 90x90 and 2000x2000 pixels
  - Square images work best, but are not required
- Click "Analyze Lesion" to get:
  - Malignancy probability
  - Generated lesion mask
  - Risk level assessment (low/medium/high)
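For orientation, here is a minimal sketch of how a two-column layout like this can be wired up in Streamlit. The widget labels and option lists below are assumptions for illustration, not the exact code in `app/webapp.py`:

```python
import streamlit as st

left, right = st.columns(2)

with left:
    age = st.number_input("Patient's age", min_value=0, max_value=120, value=50)
    diameter = st.number_input("Lesion diameter (mm)", min_value=0.0, value=5.0)
    sex = st.selectbox("Patient's sex", ["male", "female"])
    site = st.selectbox(
        "Anatomical location",
        ["head/neck", "torso", "upper extremity", "lower extremity"],
    )

with right:
    uploaded = st.file_uploader("Upload lesion image", type=["jpg", "jpeg", "png"])

if st.button("Analyze Lesion") and uploaded is not None:
    # Feature extraction and model inference would run here, then display
    # the malignancy probability, lesion mask, and risk level.
    st.success("Analysis complete")
```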
To test the image feature extractor on sample data, run:

    python src/image_feature_extractor.py

- You will be prompted to select one of 5 test images
- Once an image is chosen, the program processes it and compares the ground-truth metadata values against the values extracted directly from the image
- For more test data: https://www.kaggle.com/competitions/isic-2024-challenge/data
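Internally, the comparison boils down to extracting features from the chosen image and diffing them against the matching row of `data/test-metadata.csv`. A rough sketch, assuming `analyze_lesion` returns a dict of feature values and the metadata is keyed by an ISIC image ID (both assumptions for illustration):

```python
import pandas as pd
from src.image_feature_extractor import analyze_lesion  # main extraction entry point

image_id = "ISIC_0000000"  # hypothetical ID of the chosen test image
meta = pd.read_csv("data/test-metadata.csv")
truth = meta.loc[meta["isic_id"] == image_id].iloc[0]

extracted = analyze_lesion(f"Test_Images/{image_id}.jpg")
for name, value in extracted.items():
    if name in truth.index:  # compare only features present in the ground truth
        print(f"{name}: extracted={value:.3f}, ground truth={truth[name]:.3f}")
```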
    project_root/
    ├── app/
    │   └── webapp.py                    # Streamlit application
    ├── data/
    │   └── test-metadata.csv            # Ground truth metadata for test images
    ├── docs/
    │   └── DK_Skin_Cancer_Classifier_Presentation.pdf
    ├── models/
    │   ├── encoder.pkl                  # Required for inference
    │   ├── model.pkl                    # Serialized tabular-only model
    │   └── feature_columns.json         # Required for inference
    ├── notebooks/
    │   ├── eda.ipynb                    # Exploratory data analysis
    │   └── model.ipynb                  # Model development process
    ├── src/
    │   ├── __init__.py
    │   └── image_feature_extractor.py   # Feature extraction module
    ├── Test_Images/                     # Sample images from the holdout test set, plus user-input metadata for each image
    ├── requirements.txt                 # Project dependencies
    └── setup.py                         # Package installation configuration
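The three files under `models/` are everything the web app needs at inference time. A minimal loading sketch, assuming standard `pickle`/`json` serialization (the actual notebook may have saved them differently):

```python
import json
import pickle

with open("models/model.pkl", "rb") as f:        # serialized tabular-only model
    model = pickle.load(f)
with open("models/encoder.pkl", "rb") as f:      # categorical encoder for metadata
    encoder = pickle.load(f)
with open("models/feature_columns.json") as f:   # column order expected by the model
    feature_columns = json.load(f)
```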
The `eda.ipynb` notebook contains basic exploratory data analysis performed on the training data CSV file. The `model.ipynb` notebook documents the process of creating the ML models and compares the performance of the tabular-only model against a hybrid model that combines tabular data with a CNN: the CNN creates an embedding matrix that is merged with the tabular data to retrain the LGBM-based model (see the sketch below).
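A minimal sketch of the embedding idea, using torchvision's pre-trained EfficientNet-B0 as a feature extractor; the fine-tuning step and the exact embedding head from `model.ipynb` are omitted, and the placeholder inputs are assumptions:

```python
import numpy as np
import torch
from torchvision import models

# Pre-trained EfficientNet-B0 with its classification head removed, so the
# forward pass returns the pooled feature vector for each image.
backbone = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
backbone.classifier = torch.nn.Identity()
backbone.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)        # placeholder preprocessed images
    embeddings = backbone(batch).numpy()       # one embedding row per image

tabular = np.random.rand(4, 30)                # placeholder tabular features
enhanced = np.hstack([tabular, embeddings])    # combined feature set for LightGBM
```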
- Tabular-only Model:
  - Uses only the tabular data (patient metadata and derived features) for prediction.
  - An ensemble of LightGBM models (VotingClassifier) with different random states for diversity.
  - Each LightGBM model is wrapped in a Pipeline with a RandomUnderSampler to handle class imbalance.
  - The models are trained using cross-validation with StratifiedGroupKFold to ensure proper data splitting (see the training sketch after this list).
  - Feature importances are calculated by averaging importances across all folds and models.
  - The current version of the model has a pAUC of 0.1695 on the holdout test set, quite close to the best score from the ISIC 2024 competition, 0.17264.
- Hybrid Model (Tabular + CNN):
  - Combines the tabular data with embeddings generated by a CNN.
  - The CNN (EfficientNet-B0) is pre-trained on ImageNet and fine-tuned on a balanced subset of the training data.
  - The CNN generates a 1792-dimensional embedding for each image.
  - The image embeddings are combined with the tabular features to create an enhanced feature set.
  - The same ensemble of LightGBM models is then trained on this enhanced feature set.
  - The CNN architecture and training process are defined in the `train_embedding_model_with_hdf5` function.
  - NOTE: I have decided not to use this model in the webapp because it does not perform as well as the tabular-only model (see slides for comparison). The code to build it is in `model.ipynb`.
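A minimal sketch of the tabular training setup described above, with synthetic data standing in for the real features; the hyperparameters, number of seeds, and scoring are placeholders rather than the tuned values from `model.ipynb`:

```python
import lightgbm as lgb
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import StratifiedGroupKFold, cross_val_score

# Synthetic stand-ins: imbalanced labels and a patient ID per sample.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.05).astype(int)        # ~5% positive class
patient_ids = rng.integers(0, 200, size=1000)

# One undersampled LightGBM pipeline per random seed; soft voting averages them.
estimators = [
    (
        f"lgb_{seed}",
        Pipeline([
            ("sampler", RandomUnderSampler(random_state=seed)),
            ("classifier", lgb.LGBMClassifier(random_state=seed)),
        ]),
    )
    for seed in (0, 1, 2)
]
ensemble = VotingClassifier(estimators=estimators, voting="soft")

# StratifiedGroupKFold keeps the class ratio per fold while ensuring all
# samples from the same patient land in the same fold.
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(ensemble, X, y, cv=cv, groups=patient_ids, scoring="roc_auc")
print(scores.mean())
```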
This program uses OpenCV to extract various geometric and color features directly from the image files, which lets the Streamlit web app function without requiring too much user input. Here's a breakdown of its main components:
1. `create_masks` function:
   - Converts the image to the LAB color space.
   - Applies thresholding on the L, A, and B channels to identify potential lesion regions.
   - Performs morphological operations (opening and closing) to refine the binary mask.
   - Selects the darkest contour as the lesion region.
   - Creates lesion and surrounding-area masks.
2. `calculate_shape_features` function:
   - Calculates shape-related features like area, perimeter, minor axis length, eccentricity, and area-perimeter ratio.
   - Uses OpenCV functions like `contourArea`, `arcLength`, `minAreaRect`, and `moments` for feature extraction.
3. `calculate_color_features` function:
   - Calculates color-related features in the LAB color space.
   - Computes means and differences of L, A, and B values inside and outside the lesion.
   - Calculates derived features like hue, chroma, and color differences.
4. `visualize_analysis` function:
   - Creates visualizations to illustrate the analysis process.
   - Displays the original image, L and A channels, detected lesion contour, and masks.
   - Plots color distributions inside and outside the lesion.
5. `analyze_lesion` function:
   - Main function that orchestrates the entire analysis pipeline (a condensed sketch appears below).
   - Reads the image, creates masks, calculates the scale factor, and computes shape and color features.
   - Calls the `visualize_analysis` function to generate visualizations.

The `image_feature_extractor.py` file provides a comprehensive set of features that capture important characteristics of skin lesions. These features are then used along with patient metadata to train the ML models for malignancy prediction.
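For concreteness, here is a condensed sketch of the pipeline these functions implement. The thresholding strategy, kernel sizes, surrounding-ring construction, and file path are illustrative assumptions; the actual module's logic and values differ:

```python
import cv2
import numpy as np

def create_masks(bgr):
    """LAB threshold -> morphology -> pick the darkest contour as the lesion."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    L = lab[:, :, 0]
    # Otsu threshold on L: dark (low-lightness) regions are lesion candidates.
    _, mask = cv2.threshold(L, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove specks
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    def mean_lightness(contour):
        m = np.zeros_like(L)
        cv2.drawContours(m, [contour], -1, 255, -1)
        return cv2.mean(L, mask=m)[0]

    lesion_contour = min(contours, key=mean_lightness)      # darkest region wins
    lesion_mask = np.zeros_like(L)
    cv2.drawContours(lesion_mask, [lesion_contour], -1, 255, -1)
    # Ring of surrounding skin: dilate the lesion mask and subtract the lesion.
    surround_mask = cv2.dilate(lesion_mask, kernel, iterations=10) - lesion_mask
    return lesion_contour, lesion_mask, surround_mask

def calculate_shape_features(contour):
    area = cv2.contourArea(contour)
    perimeter = cv2.arcLength(contour, True)
    (_, _), (w, h), _ = cv2.minAreaRect(contour)
    mu = cv2.moments(contour)
    # Eccentricity from the eigenvalues of the second-order central moments.
    root = np.sqrt(4 * mu["mu11"] ** 2 + (mu["mu20"] - mu["mu02"]) ** 2)
    lam_max = (mu["mu20"] + mu["mu02"] + root) / 2
    lam_min = (mu["mu20"] + mu["mu02"] - root) / 2
    return {
        "area": area,
        "perimeter": perimeter,
        "minor_axis_length": min(w, h),
        "eccentricity": np.sqrt(1 - lam_min / lam_max),
        "area_perimeter_ratio": area / perimeter,
    }

def calculate_color_features(bgr, lesion_mask, surround_mask):
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    inside = [cv2.mean(lab[:, :, i], mask=lesion_mask)[0] for i in range(3)]
    outside = [cv2.mean(lab[:, :, i], mask=surround_mask)[0] for i in range(3)]
    a, b = inside[1] - 128, inside[2] - 128   # OpenCV offsets A/B by 128
    return {
        "L_inside": inside[0],
        "delta_L": inside[0] - outside[0],    # lightness contrast with skin
        "chroma": float(np.hypot(a, b)),
        "hue": float(np.degrees(np.arctan2(b, a))),
    }

image = cv2.imread("Test_Images/example.jpg")  # hypothetical sample path
contour, lesion, surround = create_masks(image)
features = {**calculate_shape_features(contour),
            **calculate_color_features(image, lesion, surround)}
```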
- For dependencies, see the `requirements.txt` file
- The dataset used for training the machine learning model was obtained from the ISIC Archive.
- Link to ISIC 2024 Kaggle competition: https://www.kaggle.com/competitions/isic-2024-challenge/overview
- My tabular model builds on the Kaggle notebook made by Farukcan Saglam: https://www.kaggle.com/code/greysky/isic-2024-only-tabular-data