https://www.kaggle.com/competitions/titanic
- R : Random Forest (2023.12.02)
- Python : HGB 1.2 (2022.07.31)
- Python : HGB 1.1 (2022.07.28)
- Python : HGB 1.0 (2022.07.27)
- 1st attempt to apply the Random Forest method in R
- Pre-processing : Performed label encoding for `Sex` and `Embarked`

  ```r
  # Convert the factor levels to integer codes
  df$Sex <- as.numeric(factor(df$Sex))
  df$Embarked <- as.numeric(factor(df$Embarked))
  ```

- Utilized the `randomForest()` function without specific options

  ```r
  md <- randomForest(Survived ~ ., data = df_train)
  pred <- predict(md, newdata = df_valid, type = "class")
  ```
- Performance Scores (Accuracy) : Not so different from the previous trials with HGB
  - Test (`train.csv`) : 0.8379888
  - Submission : 0.76794
- Kaggle Code : Random Forest in Titanic (Version 6)
- `Titanic_RandomForest.r` is executable in a local environment
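For comparison with the earlier Python attempts, the same label-encode-then-fit flow can be sketched with scikit-learn's `RandomForestClassifier` (toy data and default settings below are illustrative, not the author's actual pipeline):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the Titanic training frame (hypothetical rows)
df_train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Embarked": ["S", "C", "Q", "S", "S", "C"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Label encoding: same idea as as.numeric(factor(...)) in R
for col in ["Sex", "Embarked"]:
    df_train[col] = df_train[col].astype("category").cat.codes

X = df_train.drop(columns="Survived")
y = df_train["Survived"]

# Random forest with default options, mirroring the R call
rf = RandomForestClassifier(random_state=604).fit(X, y)
pred = rf.predict(X)
```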
- More advanced HGB (Histogram-based Gradient Boosting)
- Additionally convert `Pclass` to a categorical variable

  ```python
  df = pd.get_dummies(df, columns=["Pclass", "Embarked", "Sex"])
  ```

- Change the parameter `max_iter` from 1000 to 3000

  ```python
  hgb = HistGradientBoostingClassifier(max_leaf_nodes=5, learning_rate=0.01,
                                       max_iter=3000, random_state=604)
  ```
- Performance Scores (Accuracy)
  - Test (`train.csv`) : 0.8547486033519553
  - Submission : 0.74641 (rather stepped back??)
- Kaggle Code : HGB(Histogram-based Gradient Boosting) in Titanic (Version 1.21)
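To make the `pd.get_dummies` step concrete, here is a minimal sketch (toy values, not the real data) of how `Pclass` is expanded into one indicator column per class alongside `Embarked` and `Sex`:

```python
import pandas as pd

# Toy frame standing in for the Titanic data (hypothetical values)
df = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "S"],
})

# One-hot encode Pclass in addition to Embarked and Sex
df = pd.get_dummies(df, columns=["Pclass", "Embarked", "Sex"])
cols = sorted(df.columns)
```

Each original column is replaced by `column_value` indicator columns, so the three-column frame becomes seven columns here.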
- HGB (Histogram-based Gradient Boosting) with some parameter changes
- I set `max_iter=1000` in my dream last night …… omg

  ```python
  hgb = HistGradientBoostingClassifier(max_leaf_nodes=5, learning_rate=0.01,
                                       max_iter=1000, random_state=604)
  ```
- Performance Scores (Accuracy)
  - Test (in `train.csv`) : 0.8435754189944135
  - Submission : 0.76555 (somewhat improved but I'm still thirsty!)
- Kaggle Code : HGB(Histogram-based Gradient Boosting) in Titanic (Version 1.1)
- HGB (Histogram-based Gradient Boosting) with default parameters
- Use `HistGradientBoostingClassifier()` from `sklearn`
- Pre-processing
  - Remove 4 variables : 1 PassengerId, 3 Name, 8 Ticket (useless) / 10 Cabin (too many NaN)
  - Replace 3 variables : 4 Sex (categorical), 5 Age (fill NaN), 11 Embarked (fill NaN, categorical)
- Performance Scores (Accuracy)
  - Training : 0.9459309962075663
  - Validation : 0.8217275682064414
  - Test (in `train.csv`) : 0.8324022346368715
  - Submission : 0.75598 (disappointed ……)
- Kaggle Code : HGB(Histogram-based Gradient Boosting) in Titanic (Version 1.0)
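The pre-processing steps listed above (drop 4 columns, fill NaN in `Age` and `Embarked`, encode the categoricals) can be sketched in pandas as follows; the rows and fill strategies (median/mode) are illustrative assumptions, not necessarily the author's exact choices:

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows)
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "Ticket": ["t1", "t2", "t3"],
    "Cabin": [None, "C85", None],
    "Sex": ["male", "female", "male"],
    "Age": [22.0, None, 30.0],
    "Embarked": ["S", None, "Q"],
})

# Remove the 4 variables: PassengerId, Name, Ticket (useless) / Cabin (too many NaN)
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

# Replace the 3 variables: fill NaN, then encode the categoricals
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
df = pd.get_dummies(df, columns=["Sex", "Embarked"])
```

After this the frame is fully numeric with no missing values, ready for `HistGradientBoostingClassifier().fit(...)`.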