- Overview
- Models Used
- Data Preprocessing
- Models Training and Evaluation
- Data Visualization
- Findings
- Output
- Conclusion
This project aims to classify Iris flowers into three species—setosa, versicolor, and virginica—based on their sepal and petal measurements using machine learning techniques. The dataset comprises 150 samples evenly distributed among these species, making it a standard benchmark for introductory classification tasks.
Two primary models were employed:
- Logistic Regression: A linear model suitable for binary and multi-class classification tasks.
- Random Forest Classifier: An ensemble learning method effective for handling complex classification problems.
The Iris dataset was loaded from a CSV file containing 150 records and 5 attributes: sepal length, sepal width, petal length, petal width, and species.
- Summary Statistics: Provided insights into the distribution and variation of sepal and petal measurements.
- Pair Plot: Visualized relationships between features across different species.
- Correlation Heatmap: Showed feature correlations, aiding in feature selection.
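
The loading and summary steps above could look roughly like the sketch below. It assumes scikit-learn's bundled copy of the Iris data as a stand-in for the project's CSV file, whose path is not given here, so the column names and species labels (`setosa` rather than `Iris-setosa`) may differ from the original file.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the bundled dataset replaces the project's CSV.
# With the real file, a pd.read_csv(...) call would take the place of load_iris.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})

# Map the integer class codes to readable species names.
code_to_name = {code: str(name) for code, name in enumerate(iris.target_names)}
df["species"] = df["species"].map(code_to_name)

print(df.shape)                      # (150, 5): four measurements plus species
print(df["species"].value_counts())  # samples per species

# Summary statistics for the four numeric measurements.
summary = df.describe()
print(summary)

# Pairwise Pearson correlations between the measurements.
corr = df.drop(columns="species").corr()
print(corr.round(2))
```
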
- Splitting Data: The dataset was split into training (80%) and testing (20%) sets.
- Logistic Regression: Trained a linear model for classification.
- Random Forest Classifier: Trained an ensemble model to handle complex relationships.
- Best Parameters: The optimal parameters found for the Random Forest Classifier were {'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}. These parameters were selected by cross-validation to maximize accuracy.
- Best Random Forest Accuracy: The model achieved an accuracy of 100% on the test dataset, correctly classifying all 30 held-out samples.
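
A minimal sketch of the split, training, and parameter search described above follows. The search grid, `cv=5`, `max_iter`, and the `random_state` values are illustrative assumptions, not the project's actual settings. The per-class test counts reported in this README (10, 9, and 11) suggest the split was not stratified, so `stratify` is left unset.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for the CSV file.
X, y = load_iris(return_X_y=True)

# 80% training / 20% testing; the seed is an arbitrary choice for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic Regression baseline (max_iter raised to avoid convergence warnings).
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Hypothetical grid for the Random Forest search.
param_grid = {
    "n_estimators": [50, 200],
    "max_depth": [10, 20],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [2],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training set only
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Logistic Regression test accuracy:", log_reg.score(X_test, y_test))
print("Best Random Forest test accuracy:", search.score(X_test, y_test))
```

Because the search only ever sees the training folds, the test accuracy printed at the end remains an estimate on data the models were not tuned against.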
The classification report provides a detailed breakdown of how well the model performed for each species:
- Precision: The fraction of samples predicted as a given species that actually belong to it.
- Recall: The fraction of samples of a given species that the model correctly identified.
- F1-score: Harmonic mean of precision and recall, providing a single metric to evaluate the model's performance.
- Support: Number of samples in each class.
For example:
- Iris-setosa: The model correctly classified all 10 samples of Iris-setosa, achieving perfect precision, recall, and F1-score.
- Iris-versicolor: Similarly, all 9 samples of Iris-versicolor were correctly classified.
- Iris-virginica: All 11 samples of Iris-virginica were also classified correctly.
The overall accuracy of 100% indicates that the model successfully learned the patterns in the data and accurately classified the Iris flowers into their respective species.
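
To make the four report columns concrete, the sketch below computes them from raw counts using only the standard library. The helper name `per_class_metrics` is hypothetical, and the second prediction list contains a deliberately introduced error for illustration; it is not a result from this project.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} from raw counts."""
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)   # how often each label was predicted
    support = Counter(y_true)     # how often each label actually occurs
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / support[label] if support[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, support[label])
    return report

# Test-set composition reported above: 10 setosa, 9 versicolor, 11 virginica.
y_true = ["Iris-setosa"] * 10 + ["Iris-versicolor"] * 9 + ["Iris-virginica"] * 11

# Perfect predictions: every metric is 1.0 for every class.
perfect = per_class_metrics(y_true, list(y_true))

# Hypothetical single mistake: one versicolor sample predicted as virginica.
y_pred = list(y_true)
y_pred[10] = "Iris-virginica"
one_error = per_class_metrics(y_true, y_pred)

for label, row in one_error.items():
    print(label, [round(v, 3) for v in row])
```

In the one-error case, versicolor keeps perfect precision but its recall drops to 8/9, while virginica keeps perfect recall but its precision drops to 11/12. This is why the report lists both metrics per class rather than accuracy alone.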
- Pair Plot: Visualizes relationships between sepal length, sepal width, petal length, and petal width across different species.
- Correlation Heatmap: Shows the pairwise correlation coefficients between these features, which helps identify redundant features during feature selection.
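
Both plots can be approximated with pandas and matplotlib alone, as in the sketch below. The plotting library actually used in this project is not stated, so the calls, the color map, and the figure sizes here are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the CSV file.
iris = load_iris(as_frame=True)
features = iris.data          # four measurement columns
species_codes = iris.target   # 0, 1, 2 for the three species

# Pair plot: every feature against every other, points colored by species.
axes = pd.plotting.scatter_matrix(
    features, c=species_codes, figsize=(8, 8), diagonal="hist"
)
pair_fig = plt.gcf()

# Correlation heatmap: Pearson coefficients drawn as a colored grid.
corr = features.corr()
heat_fig, ax = plt.subplots(figsize=(6, 5))
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
heat_fig.colorbar(image, ax=ax, label="Pearson r")
heat_fig.tight_layout()

# Call plt.show() or fig.savefig(...) on either figure to view the result.
```
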
- Summary statistics provided insights into the distribution and variation of sepal and petal measurements.
- Pair plots visually represented the clustering of different species based on their measurements.
- The correlation heatmap highlighted strong correlations between certain features, notably petal length and petal width, which is relevant when choosing features for classification.
- Both Logistic Regression and Random Forest Classifier achieved perfect accuracy of 100% on the test dataset.
- Precision, recall, and F1-score metrics confirmed the models' ability to effectively distinguish between Iris species.
The output from the models includes:
- Best Parameters for Random Forest Classifier: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 50}
- Best Random Forest Accuracy: 1.0
- Classification Reports for Logistic Regression and Random Forest Classifier, showing precision, recall, and F1-score metrics for each Iris species.
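
Output of this kind can be produced with `classification_report` from scikit-learn, as sketched below. The model settings and the seed are assumptions chosen for illustration, and the species names come from scikit-learn's bundled dataset rather than the project's CSV.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Hypothetical settings; the project's tuned parameters are listed above.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

reports = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    # One row per species with precision, recall, F1-score, and support.
    reports[name] = classification_report(
        y_test, y_pred, target_names=iris.target_names
    )
    print(f"{name} accuracy: {accuracy_score(y_test, y_pred)}")
    print(reports[name])
```
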
This project demonstrated the application of machine learning models to classify Iris flowers based on their morphological measurements with high accuracy. The selected models, Logistic Regression and Random Forest Classifier, performed exceptionally well, showcasing their effectiveness for such classification tasks. By leveraging data preprocessing, visualization, and thorough evaluation techniques, this project provides a robust framework for introductory classification tasks.