Titanic Kaggle Competition - README

This repository contains code for the Titanic competition on Kaggle. The objective is to predict whether a passenger aboard the Titanic survived, based on features such as age, sex, and ticket class.

Prerequisites

Before running the code, ensure you have the following Python libraries installed:

  • pandas
  • numpy
  • scikit-learn
  • matplotlib

You can install them using pip:

pip install pandas numpy scikit-learn matplotlib

Usage

  1. Clone the repository to your local machine:
git clone https://github.com/your-username/titanic-kaggle.git
cd titanic-kaggle
  2. Download the Titanic dataset (train.csv) from Kaggle and update the path where the code reads it:
df = pd.read_csv("/Users/pb/Downloads/titanic/train.csv")
  3. Data Preprocessing (a sketch of these steps follows the bullets below)

    • The "Sex" column is encoded with LabelEncoder, converting the categorical values to 0 and 1.

    • Missing values in the "Age" column are replaced with the mean age of the dataset.

    • Columns not used as features ("Embarked", "PassengerId", "Name", "Ticket", and "Cabin") are dropped.
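
A minimal sketch of this preprocessing, assuming the DataFrame is named df as in the snippet above and that the column names follow Kaggle's train.csv schema (the exact code in titanic.py may differ):

from sklearn.preprocessing import LabelEncoder

# Encode "Sex" as 0/1 (LabelEncoder assigns the integers in alphabetical order of the labels).
df["Sex"] = LabelEncoder().fit_transform(df["Sex"])

# Replace missing ages with the mean age of the dataset.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Drop the columns that are not used as features.
df = df.drop(columns=["Embarked", "PassengerId", "Name", "Ticket", "Cabin"])

# Separate features and target ("Survived" is the label column in train.csv).
X = df.drop(columns=["Survived"])
y = df["Survived"]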

  4. Split the dataset into training and test sets using train_test_split.

  5. Model Selection and Training (a combined sketch of the split and training steps follows the bullets below)

    • The model chosen for this task is a Random Forest Classifier with 100 estimators.

    • The model is trained on the training data using the fit method.
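
A sketch of the split and training steps, reusing the X and y names from the preprocessing sketch above; the test-set fraction and random_state are illustrative assumptions, not values confirmed by the repository:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Hold out a test set (test_size=0.2 and random_state=42 are assumed values).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest with 100 estimators; oob_score=True exposes the out-of-bag score used in step 6.
model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
model.fit(X_train, y_train)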

  6. Model Evaluation (a sketch follows the bullets below)

    • The accuracy of the model is computed on the test set using accuracy_score.

    • The "out-of-bag" (oob) score of the Random Forest Classifier is also displayed.

    • ROC-AUC score, precision, recall, and F1-score are computed using various evaluation metrics from scikit-learn.
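
A sketch of the evaluation step, reusing the model, X_test, and y_test names from the sketch above:

from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_score, recall_score, f1_score)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive (survived) class

print("Accuracy:", accuracy_score(y_test, y_pred))
print("OOB score:", model.oob_score_)
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))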

  7. Visualization (a sketch follows the bullets below)

    • Precision-Recall Curve is plotted to visualize the precision and recall trade-off.

    • Precision vs. Recall plot is displayed to explore the relationship between precision and recall.

    • ROC Curve is plotted to visualize the true positive rate (sensitivity) against the false positive rate (1-specificity).
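
A sketch of the three plots, assuming the y_test and y_proba names from the evaluation sketch above; styling details in titanic.py may differ:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, roc_curve

precision, recall, pr_thresholds = precision_recall_curve(y_test, y_proba)
fpr, tpr, roc_thresholds = roc_curve(y_test, y_proba)

# Precision and recall as functions of the decision threshold.
plt.plot(pr_thresholds, precision[:-1], label="Precision")
plt.plot(pr_thresholds, recall[:-1], label="Recall")
plt.xlabel("Threshold")
plt.legend()
plt.show()

# Precision vs. recall trade-off.
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()

# ROC curve: true positive rate against false positive rate.
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")  # chance level
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()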

  8. Running the Code

    Ensure the required libraries are installed and that train.csv is at the path passed to pd.read_csv() (or adjust that path), then run the script from your Python environment:

python titanic.py

Metrics:

  1. Accuracy: 81.01%
  2. OOB score: 81.18%
  3. acc_random_forest: 98.03
  4. ROC-AUC score: 0.9970
  5. Precision: 0.781
  6. Recall: 0.705

Plots

  • Precision and Recall Plot
  • Precision vs. Recall Plot
  • ROC Curve

Disclaimer

Keep in mind that this is a basic implementation, and there are many ways to improve the model's performance, such as hyperparameter tuning, feature engineering, or using different machine learning algorithms. This code serves as a starting point for your exploration in the Titanic Kaggle competition.

Feel free to explore, modify, and experiment with the code to enhance your results.

Happy coding and good luck with the competition!
