This repository contains a mushroom classification project built with logistic regression and random forest models. The goal is to classify mushrooms as either edible or poisonous based on a dataset of mushroom characteristics. The project is implemented in Python and follows a structured pipeline: data extraction, transformation, modeling, and evaluation.
The project is organized into the following directories and files:
- dataset/: Contains the raw mushroom dataset.
- etl/: Contains scripts for data extraction, transformation, and loading.
- model/: Contains the implementation of the logistic regression and random forest models.
- Report.pdf: Contains the LaTeX report of the project.
To get started with this project, follow these steps:
- Clone the repository:
  git clone https://github.com/dembA7/Binary-Classification.git
- Install the required packages:
  You can install the required Python packages using pip. Make sure you have pandas, numpy, matplotlib, seaborn, and scikit-learn installed.
  pip install pandas numpy matplotlib seaborn scikit-learn
- Prepare the data:
  Run the ETL script to prepare the data:
  python etl/etl.py
- Train the model:
  Run the main script to train and evaluate both the logistic regression and random forest models:
  python model/main.py
The dataset used for this project is the Mushroom Dataset from the UCI Machine Learning Repository. It contains various attributes related to mushrooms, such as color, odor, and habitat, and is used to predict whether a mushroom is edible or poisonous.
The logistic regression model is implemented in model/regression.py and is used to classify mushrooms based on the features provided. The implementation includes:
- Sigmoid Function: For computing probabilities.
- Cost Function: Cross-entropy loss function.
- Gradient Descent: For optimizing the model parameters.
- Evaluation: Accuracy, confusion matrix, and cost evolution.
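The pieces listed above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the code in model/regression.py: the function names, the toy data, and the default learning rate and epoch count are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, row):
    """Probability of the positive class for one feature row."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


def cross_entropy(weights, bias, X, y, eps=1e-12):
    """Mean binary cross-entropy over the dataset."""
    total = 0.0
    for row, label in zip(X, y):
        # Clip the probability so log() never sees exactly 0.
        p = min(max(predict_proba(weights, bias, row), eps), 1.0 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(X)


def train(X, y, learning_rate=0.1, epochs=200):
    """Batch gradient descent; returns weights, bias, and the cost per epoch."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    costs = []
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            error = predict_proba(weights, bias, row) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
        costs.append(cross_entropy(weights, bias, X, y))
    return weights, bias, costs


# Made-up toy data: the label is 1 whenever the first feature is set.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
weights, bias, costs = train(X, y)
predictions = [int(predict_proba(weights, bias, row) >= 0.5) for row in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
```

The `costs` list is the kind of series a cost-evolution plot would draw, and `predictions` compared against `y` gives the accuracy and confusion-matrix inputs.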
The random forest model is implemented in model/forest.py and builds multiple decision trees to improve the accuracy and robustness of the classification. The implementation includes:
- Bagging: Sampling subsets of data and features for each tree.
- Majority Voting: Combining tree predictions for classification.
- Evaluation: Accuracy and confusion matrix.
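As a rough illustration of bagging and majority voting, the sketch below trains each ensemble member on a bootstrap sample of rows and a random subset of features, then lets the members vote. To stay short it swaps in one-feature decision stumps for full decision trees, so it is not the algorithm in model/forest.py; every function name, the parameter defaults, and the toy data are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement (the row-sampling part of bagging)."""
    indices = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in indices], [y[i] for i in indices]


def fit_stump(X, y, feature_indices):
    """Stand-in for a tree: keep the single feature that best predicts y."""
    best = None
    for j in feature_indices:
        # Collect the labels seen for each value of feature j.
        by_value = {}
        for row, label in zip(X, y):
            by_value.setdefault(row[j], []).append(label)
        # Predict the most common label for each value.
        rule = {v: Counter(labels).most_common(1)[0][0] for v, labels in by_value.items()}
        correct = sum(rule[row[j]] == label for row, label in zip(X, y))
        if best is None or correct > best[0]:
            best = (correct, j, rule)
    default = Counter(y).most_common(1)[0][0]  # fallback for unseen values
    _, feature, rule = best
    return feature, rule, default


def predict_stump(stump, row):
    feature, rule, default = stump
    return rule.get(row[feature], default)


def fit_forest(X, y, n_trees=15, n_features=1, seed=0):
    """Each member sees its own bootstrap sample and feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        X_boot, y_boot = bootstrap_sample(X, y, rng)
        features = rng.sample(range(len(X[0])), n_features)
        forest.append(fit_stump(X_boot, y_boot, features))
    return forest


def predict_forest(forest, row):
    """Majority vote over the members' predictions."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Made-up toy rows: (odor, habitat) pairs with invented labels.
X = [["foul", "woods"], ["none", "grass"]] * 4
y = ["poisonous", "edible"] * 4
forest = fit_forest(X, y)
```

An odd `n_trees` avoids tied votes in the two-class case.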
The model/tuning.py script performs a grid search over the learning rate and number of epochs for the logistic regression model, scoring each combination on a validation set. It selects the combination of hyperparameters that maximizes validation accuracy.
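A grid search of this kind might look like the following sketch. It assumes a small stand-in trainer so the example is self-contained; `train_logistic`, `accuracy`, `grid_search`, the candidate values, and the toy split are hypothetical and do not come from model/tuning.py.

```python
import itertools
import math


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-500.0, min(500.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, learning_rate, epochs):
    """Hypothetical stand-in trainer: batch gradient descent, bias stored last."""
    params = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(params)
        for row, label in zip(X, y):
            inputs = list(row) + [1.0]  # the constant 1.0 feeds the bias term
            error = sigmoid(sum(p * x for p, x in zip(params, inputs))) - label
            for j, x in enumerate(inputs):
                grads[j] += error * x
        params = [p - learning_rate * g / len(X) for p, g in zip(params, grads)]
    return params


def accuracy(params, X, y):
    """Fraction of rows whose thresholded probability matches the label."""
    hits = 0
    for row, label in zip(X, y):
        score = sum(p * x for p, x in zip(params, list(row) + [1.0]))
        hits += int((sigmoid(score) >= 0.5) == bool(label))
    return hits / len(X)


def grid_search(X_train, y_train, X_val, y_val, learning_rates, epoch_options):
    """Try every (learning rate, epochs) pair; keep the best validation accuracy."""
    best = None
    for lr, epochs in itertools.product(learning_rates, epoch_options):
        params = train_logistic(X_train, y_train, lr, epochs)
        score = accuracy(params, X_val, y_val)
        # Strict '>' keeps the first combination on ties.
        if best is None or score > best["accuracy"]:
            best = {"learning_rate": lr, "epochs": epochs, "accuracy": score}
    return best


# Made-up toy split, only to show the call shape.
X_train, y_train = [[1, 0], [1, 1], [0, 1], [0, 0]], [1, 1, 0, 0]
X_val, y_val = [[1, 0], [0, 1]], [1, 0]
best = grid_search(X_train, y_train, X_val, y_val, [0.01, 0.1, 1.0], [10, 100])
```

Scoring on held-out validation rows rather than the training rows is what keeps the chosen hyperparameters from simply rewarding overfitting.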
The model/graphs.py file contains code for generating various visualizations, including:
- Confusion Matrix: Visualizing the confusion matrix.
- Probability Distributions: Visualizing the confidence of predictions.
- Cost Evolution: Showing the progression of the cost function during training.
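For reference, the numbers behind a confusion-matrix plot can be computed as below. This is only a sketch of the counting step, with no plotting; the function names, the choice of "poisonous" as the positive class, and the sample labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive="poisonous"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Share of predictions that fall on the matrix diagonal."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


# Made-up labels and predictions.
y_true = ["poisonous", "edible", "poisonous", "edible", "edible"]
y_pred = ["poisonous", "edible", "edible", "edible", "poisonous"]
counts = confusion_counts(y_true, y_pred)

# Rows are the true class, columns the predicted class (negative first).
matrix = [[counts["tn"], counts["fp"]],
          [counts["fn"], counts["tp"]]]
```

A nested list in this layout can be handed to a heatmap routine to draw the matrix.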
After setting up the environment and running the scripts, you can find the following outputs:
- Model Metrics: Accuracy, cost, and other performance metrics.
- Visualizations: Plots for probability distributions, cost evolution, and confusion matrices.
This project is licensed under the GPL-3.0 License - see the LICENSE file for details.
- The Mushroom Dataset from the UCI Machine Learning Repository.
- Various online resources and documentation used for implementing the model and visualizations.