ENH Improve wording in stratification notebook (#760)
Co-authored-by: ArturoAmorQ <arturo.amor-quiroz@polytechnique.edu>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
3 people authored May 17, 2024
1 parent 9a9e74f commit ca7d1d7
Showing 1 changed file with 40 additions and 31 deletions.

python_scripts/cross_validation_stratification.py
@@ -36,10 +36,11 @@
model = make_pipeline(StandardScaler(), LogisticRegression())

# %% [markdown]
- # Once we created our model, we will use the cross-validation framework to
- # evaluate it. We will use the `KFold` cross-validation strategy. We will define
- # a dataset with nine samples and repeat the cross-validation three times (i.e.
- # `n_splits`).
+ # Once the model is created, we can evaluate it using cross-validation. We
+ # start by using the `KFold` strategy.
+ #
+ # Let's review how this strategy works. For that purpose, we define a dataset
+ # with nine samples and split it into three folds (i.e. `n_splits=3`).

# %%
import numpy as np
@@ -51,12 +52,12 @@
print("TRAIN:", train_index, "TEST:", test_index)

# %% [markdown]
- # By defining three splits, we will use three samples for testing and six for
- # training each time. `KFold` does not shuffle by default. It means that it will
- # select the three first samples for the testing set at the first split, then
- # the next three samples for the second split, and the three next for the
- # last split. In the end, all samples have been used in testing at least once
- # among the different splits.
+ # By defining three splits, we use three samples (one fold) for testing and
+ # six (two folds) for training each time. `KFold` does not shuffle by default.
+ # This means that the first three samples are selected for the testing set at
+ # the first split, the next three samples for the second split, and the last
+ # three for the third split. In the end, every sample is used for testing
+ # exactly once across the different splits.
#
# Now, let's apply this strategy to check the generalization performance of our
# model.
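# %% [markdown]
# The evaluation code itself is collapsed from this diff. A minimal sketch of
# that step, assuming `data` and `target` hold the iris features and labels
# loaded at the top of the notebook (the exact call in the collapsed lines may
# differ):

# %%
from sklearn.model_selection import KFold, cross_validate

cv = KFold(n_splits=3)
results = cross_validate(model, data, target, cv=cv)
# Each of the three test scores is 0.0 on the ordered iris target.
print(results["test_score"])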
@@ -73,8 +74,8 @@

# %% [markdown]
# It is a real surprise that our model cannot correctly classify any sample in
- # any cross-validation split. We will now check our target's value to understand
- # the issue.
+ # any cross-validation split. We now inspect the target values to understand
+ # the issue.

# %%
import matplotlib.pyplot as plt
@@ -86,18 +87,17 @@
_ = plt.title("Class value in target y")

# %% [markdown]
- # We see that the target vector `target` is ordered. It will have some
- # unexpected consequences when using the `KFold` cross-validation. To illustrate
- # the consequences, we will show the class count in each fold of the
- # cross-validation in the train and test set.
+ # We see that the target vector `target` is ordered. This has some unexpected
+ # consequences when using `KFold` cross-validation. To illustrate these
+ # consequences, we show the class counts in each fold of the cross-validation,
+ # for both the train and test sets.
#
# Let's compute the class counts for both the training and testing sets using
# the `KFold` cross-validation, and plot this information in a bar plot.
#
- # We will iterate given the number of split and check how many samples of each
- # are present in the training and testing set. We will store the information
- # into two distincts lists; one for the training set and one for the testing
- # set.
+ # We iterate over the splits and count how many samples of each class are
+ # present in the training and testing sets. We then store this information in
+ # two distinct lists: one for the training set and one for the testing set.

# %%
import pandas as pd
@@ -114,8 +114,8 @@
test_cv_counts.append(target_test.value_counts())
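# %% [markdown]
# The body of this cell is collapsed in the diff. Based on the names that are
# visible (`test_cv_counts`, `target_test`, and `train_cv_counts` further
# down), it presumably reads along the lines of the following sketch; the
# exact loop structure and the name `target_train` are assumptions.

# %%
from sklearn.model_selection import KFold

n_splits = 3
cv = KFold(n_splits=n_splits)
train_cv_counts = []
test_cv_counts = []
for train_index, test_index in cv.split(data, target):
    # Count the occurrences of each class in this fold's train and test sets.
    target_train = target.iloc[train_index]
    target_test = target.iloc[test_index]
    train_cv_counts.append(target_train.value_counts())
    test_cv_counts.append(target_test.value_counts())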

# %% [markdown]
- # To plot the information on a single figure, we will concatenate the
- # information regarding the fold within the same dataset.
+ # To plot the information on a single figure, we concatenate the per-fold
+ # counts into a single dataframe.

# %%
train_cv_counts = pd.concat(
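# The rest of this statement is collapsed in the diff; it presumably continues
# along these lines, where the "Fold #..." labels in `keys=` are an assumption:
#     train_cv_counts, axis=1, keys=[f"Fold #{idx}" for idx in range(n_splits)]
# )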
@@ -138,13 +138,13 @@
train_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Training set")
_ = plt.title("Training set class counts")

# %%
test_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Test set")
_ = plt.title("Test set class counts")

# %% [markdown]
# We can confirm that in each fold, only two of the three classes are present in
@@ -168,7 +168,7 @@
# 90%. Now that we solved our first issue, it would be interesting to check if
# the class frequency in the training and testing set is equal to our original
# set's class frequency. It would ensure that we are training and testing our
- # model with a class distribution that we will encounter in production.
+ # model with a class distribution that we would encounter in production.
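# %% [markdown]
# The fix for "our first issue" is applied in lines collapsed above this hunk:
# shuffling the samples before splitting (as the later plot titles "with
# shuffling" confirm). A minimal sketch of it, where `random_state=0` is an
# arbitrary choice:

# %%
from sklearn.model_selection import KFold, cross_validate

cv = KFold(n_splits=3, shuffle=True, random_state=0)
results = cross_validate(model, data, target, cv=cv)
print(results["test_score"])  # around the 90% accuracy mentioned above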

# %%
train_cv_counts = []
@@ -191,13 +191,13 @@
train_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Training set")
_ = plt.title("Training set class counts\n(with suffling)")

# %%
test_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Test set")
_ = plt.title("Test set class counts\n(with suffling)")

# %% [markdown]
# We see that neither the training nor the testing sets have the same class
@@ -242,18 +242,27 @@
train_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Training set")
_ = plt.title("Training set class counts\n(with stratifying)")

# %%
test_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Test set")
_ = plt.title("Test set class counts\n(with stratifying)")

# %% [markdown]
# In this case, we observe that the class counts are very close both in the
# train set and the test set. The difference is due to the small number of
# samples in the iris dataset.
#
- # In conclusion, this is a good practice to use stratification within the
- # cross-validation framework when dealing with a classification problem.
+ # In other words, stratifying is more effective than just shuffling when it
+ # comes to making sure that the distributions of classes in all the folds are
+ # representative of the entire dataset. As training and testing folds have
+ # similar class distributions, stratifying leads to a more realistic measure of
+ # the model’s ability to generalize. This is especially important when the
+ # performance metrics depend on the proportion of the positive class, as we
+ # will see in a future notebook.
+ #
+ # The interested reader can learn about other stratified cross-validation
+ # techniques in the [scikit-learn user
+ # guide](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-with-stratification-based-on-class-labels).
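# %% [markdown]
# As a recap, a minimal sketch of the stratified evaluation this notebook
# recommends, assuming the same `model`, `data` and `target` as above:

# %%
from sklearn.model_selection import StratifiedKFold, cross_validate

cv = StratifiedKFold(n_splits=3)
results = cross_validate(model, data, target, cv=cv)
# Train and test folds now preserve the class proportions of the full dataset.
print(results["test_score"])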
