Daniel Kaijzer
This project is a web application that allows users to upload an image of a skin lesion and receive a predicted probability that the lesion is malignant. The prediction comes from a machine learning model trained on a dataset of skin lesion images and associated metadata.
Launch the Streamlit application:

    streamlit run app/webapp.py
Once launched:
- In the left column, enter:
  - Patient's age
  - Lesion diameter
  - Patient's sex
  - Anatomical location
- In the right column, upload the lesion image:
  - Image requirements: between 90x90 and 2000x2000 pixels
  - Square images work best, but are not required
- Click "Analyze Lesion" to get:
  - Malignancy probability
  - Generated lesion mask
  - Risk level assessment (low/medium/high)
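For orientation, here is a minimal sketch of how a two-column layout like this can be wired up in Streamlit. The widget labels and option lists below are assumptions for illustration, not the exact code in `app/webapp.py`:

```python
import streamlit as st

left, right = st.columns(2)

with left:
    age = st.number_input("Patient's age", min_value=0, max_value=120, value=50)
    diameter = st.number_input("Lesion diameter (mm)", min_value=0.0, value=5.0)
    sex = st.selectbox("Patient's sex", ["male", "female"])
    site = st.selectbox(
        "Anatomical location",
        ["head/neck", "torso", "upper extremity", "lower extremity"],
    )

with right:
    uploaded = st.file_uploader("Upload lesion image", type=["jpg", "jpeg", "png"])

if st.button("Analyze Lesion") and uploaded is not None:
    # Feature extraction and model inference would run here, then display
    # the malignancy probability, lesion mask, and risk level.
    st.success("Analysis complete")
```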
To test the image feature extractor on sample data, run:

    python src/image_feature_extractor.py

- You will be prompted to select one of 5 test images
- Once an image is chosen, the program processes it and compares the ground-truth metadata values against the values extracted directly from the image
- For more test data: https://www.kaggle.com/competitions/isic-2024-challenge/data
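Internally, the comparison boils down to extracting features from the chosen image and diffing them against the matching row of `data/test-metadata.csv`. A rough sketch, assuming `analyze_lesion` returns a dict of feature values and the metadata is keyed by an ISIC image ID (both assumptions for illustration):

```python
import pandas as pd
from src.image_feature_extractor import analyze_lesion  # main extraction entry point

image_id = "ISIC_0000000"  # hypothetical ID of the chosen test image
meta = pd.read_csv("data/test-metadata.csv")
truth = meta.loc[meta["isic_id"] == image_id].iloc[0]

extracted = analyze_lesion(f"Test_Images/{image_id}.jpg")
for name, value in extracted.items():
    if name in truth.index:  # compare only features present in the ground truth
        print(f"{name}: extracted={value:.3f}, ground truth={truth[name]:.3f}")
```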
    project_root/
    ├── app/
    │   └── webapp.py                    # Streamlit application
    ├── data/
    │   └── test-metadata.csv            # Ground truth metadata for test images
    ├── docs/
    │   └── DK_Skin_Cancer_Classifier_Presentation.pdf
    ├── models/
    │   ├── encoder.pkl                  # Required for inference
    │   ├── model.pkl                    # Serialized tabular-only model
    │   └── feature_columns.json         # Required for inference
    ├── notebooks/
    │   ├── eda.ipynb                    # Exploratory data analysis
    │   └── model.ipynb                  # Model development process
    ├── src/
    │   ├── __init__.py
    │   └── image_feature_extractor.py   # Feature extraction module
    ├── Test_Images/                     # Sample images from the holdout test set, plus user-input metadata for each image
    ├── requirements.txt                 # Project dependencies
    └── setup.py                         # Package installation configuration
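The three files under `models/` are everything the web app needs at inference time. A minimal loading sketch, assuming standard `pickle`/`json` serialization (the actual notebook may have saved them differently):

```python
import json
import pickle

with open("models/model.pkl", "rb") as f:        # serialized tabular-only model
    model = pickle.load(f)
with open("models/encoder.pkl", "rb") as f:      # categorical encoder for metadata
    encoder = pickle.load(f)
with open("models/feature_columns.json") as f:   # column order expected by the model
    feature_columns = json.load(f)
```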
The `eda.ipynb` notebook contains basic exploratory data analysis performed on the training data CSV file. The `model.ipynb` notebook documents the process of creating the ML models and compares the performance of the tabular-only model against a hybrid model that combines tabular data with a CNN: the CNN creates an embedding matrix that is merged with the tabular data to retrain the LGBM-based model (see the sketch below).
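A minimal sketch of the embedding idea, using torchvision's pre-trained EfficientNet-B0 as a feature extractor; the fine-tuning step and the exact embedding head from `model.ipynb` are omitted, and the placeholder inputs are assumptions:

```python
import numpy as np
import torch
from torchvision import models

# Pre-trained EfficientNet-B0 with its classification head removed, so the
# forward pass returns the pooled feature vector for each image.
backbone = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
backbone.classifier = torch.nn.Identity()
backbone.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)        # placeholder preprocessed images
    embeddings = backbone(batch).numpy()       # one embedding row per image

tabular = np.random.rand(4, 30)                # placeholder tabular features
enhanced = np.hstack([tabular, embeddings])    # combined feature set for LightGBM
```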
- Tabular-only Model:
  - Uses only the tabular data (patient metadata and derived features) for prediction.
  - An ensemble of LightGBM models (VotingClassifier) with different random states for diversity.
  - Each LightGBM model is wrapped in a Pipeline with a RandomUnderSampler to handle class imbalance.
  - The models are trained using cross-validation with StratifiedGroupKFold to ensure proper data splitting (see the training sketch after this list).
  - Feature importances are calculated by averaging importances across all folds and models.
  - The current version of the model has a pAUC of 0.1695 on the holdout test set, quite close to the best score from the ISIC 2024 competition, 0.17264.
- Hybrid Model (Tabular + CNN):
  - Combines the tabular data with embeddings generated by a CNN.
  - The CNN (EfficientNet-B0) is pre-trained on ImageNet and fine-tuned on a balanced subset of the training data.
  - The CNN generates a 1792-dimensional embedding for each image.
  - The image embeddings are combined with the tabular features to create an enhanced feature set.
  - The same ensemble of LightGBM models is then trained on this enhanced feature set.
  - The CNN architecture and training process are defined in the `train_embedding_model_with_hdf5` function.
  - NOTE: I have decided not to use this model in the webapp because it does not perform as well as the tabular-only model (see slides for comparison). The code to build it is in `model.ipynb`.
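A minimal sketch of the tabular training setup described above, with synthetic data standing in for the real features; the hyperparameters, number of seeds, and scoring are placeholders rather than the tuned values from `model.ipynb`:

```python
import lightgbm as lgb
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import StratifiedGroupKFold, cross_val_score

# Synthetic stand-ins: imbalanced labels and a patient ID per sample.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.05).astype(int)        # ~5% positive class
patient_ids = rng.integers(0, 200, size=1000)

# One undersampled LightGBM pipeline per random seed; soft voting averages them.
estimators = [
    (
        f"lgb_{seed}",
        Pipeline([
            ("sampler", RandomUnderSampler(random_state=seed)),
            ("classifier", lgb.LGBMClassifier(random_state=seed)),
        ]),
    )
    for seed in (0, 1, 2)
]
ensemble = VotingClassifier(estimators=estimators, voting="soft")

# StratifiedGroupKFold keeps the class ratio per fold while ensuring all
# samples from the same patient land in the same fold.
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(ensemble, X, y, cv=cv, groups=patient_ids, scoring="roc_auc")
print(scores.mean())
```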
This program uses OpenCV to extract various geometric and color features directly from the image files, which lets the Streamlit web app function without requiring too much user input. Here's a breakdown of its main components:
1. `create_masks` function:
   - Converts the image to the LAB color space.
   - Applies thresholding on the L, A, and B channels to identify potential lesion regions.
   - Performs morphological operations (opening and closing) to refine the binary mask.
   - Selects the darkest contour as the lesion region.
   - Creates lesion and surrounding-area masks.
2. `calculate_shape_features` function:
   - Calculates shape-related features like area, perimeter, minor axis length, eccentricity, and area-perimeter ratio.
   - Uses OpenCV functions like `contourArea`, `arcLength`, `minAreaRect`, and `moments` for feature extraction.
3. `calculate_color_features` function:
   - Calculates color-related features in the LAB color space.
   - Computes means and differences of L, A, and B values inside and outside the lesion.
   - Calculates derived features like hue, chroma, and color differences.
4. `visualize_analysis` function:
   - Creates visualizations to illustrate the analysis process.
   - Displays the original image, L and A channels, detected lesion contour, and masks.
   - Plots color distributions inside and outside the lesion.
5. `analyze_lesion` function:
   - Main function that orchestrates the entire analysis pipeline (a condensed sketch appears below).
   - Reads the image, creates masks, calculates the scale factor, and computes shape and color features.
   - Calls the `visualize_analysis` function to generate visualizations.

The `image_feature_extractor.py` file provides a comprehensive set of features that capture important characteristics of skin lesions. These features are then used along with patient metadata to train the ML models for malignancy prediction.
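For concreteness, here is a condensed sketch of the pipeline these functions implement. The thresholding strategy, kernel sizes, surrounding-ring construction, and file path are illustrative assumptions; the actual module's logic and values differ:

```python
import cv2
import numpy as np

def create_masks(bgr):
    """LAB threshold -> morphology -> pick the darkest contour as the lesion."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    L = lab[:, :, 0]
    # Otsu threshold on L: dark (low-lightness) regions are lesion candidates.
    _, mask = cv2.threshold(L, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove specks
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    def mean_lightness(contour):
        m = np.zeros_like(L)
        cv2.drawContours(m, [contour], -1, 255, -1)
        return cv2.mean(L, mask=m)[0]

    lesion_contour = min(contours, key=mean_lightness)      # darkest region wins
    lesion_mask = np.zeros_like(L)
    cv2.drawContours(lesion_mask, [lesion_contour], -1, 255, -1)
    # Ring of surrounding skin: dilate the lesion mask and subtract the lesion.
    surround_mask = cv2.dilate(lesion_mask, kernel, iterations=10) - lesion_mask
    return lesion_contour, lesion_mask, surround_mask

def calculate_shape_features(contour):
    area = cv2.contourArea(contour)
    perimeter = cv2.arcLength(contour, True)
    (_, _), (w, h), _ = cv2.minAreaRect(contour)
    mu = cv2.moments(contour)
    # Eccentricity from the eigenvalues of the second-order central moments.
    root = np.sqrt(4 * mu["mu11"] ** 2 + (mu["mu20"] - mu["mu02"]) ** 2)
    lam_max = (mu["mu20"] + mu["mu02"] + root) / 2
    lam_min = (mu["mu20"] + mu["mu02"] - root) / 2
    return {
        "area": area,
        "perimeter": perimeter,
        "minor_axis_length": min(w, h),
        "eccentricity": np.sqrt(1 - lam_min / lam_max),
        "area_perimeter_ratio": area / perimeter,
    }

def calculate_color_features(bgr, lesion_mask, surround_mask):
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    inside = [cv2.mean(lab[:, :, i], mask=lesion_mask)[0] for i in range(3)]
    outside = [cv2.mean(lab[:, :, i], mask=surround_mask)[0] for i in range(3)]
    a, b = inside[1] - 128, inside[2] - 128   # OpenCV offsets A/B by 128
    return {
        "L_inside": inside[0],
        "delta_L": inside[0] - outside[0],    # lightness contrast with skin
        "chroma": float(np.hypot(a, b)),
        "hue": float(np.degrees(np.arctan2(b, a))),
    }

image = cv2.imread("Test_Images/example.jpg")  # hypothetical sample path
contour, lesion, surround = create_masks(image)
features = {**calculate_shape_features(contour),
            **calculate_color_features(image, lesion, surround)}
```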
- For dependencies, see the `requirements.txt` file
- The dataset used for training the machine learning model was obtained from the ISIC Archive.
- Link to ISIC 2024 Kaggle competition: https://www.kaggle.com/competitions/isic-2024-challenge/overview
- My tabular model builds on the Kaggle notebook made by Farukcan Saglam: https://www.kaggle.com/code/greysky/isic-2024-only-tabular-data