Commit: General wording
ArturoAmorQ committed May 13, 2024
1 parent 625f9c8 commit 7b3a18a
Showing 1 changed file with 27 additions and 31 deletions.
python_scripts/ensemble_bagging.py (58 changes: 27 additions & 31 deletions)
@@ -8,16 +8,15 @@
# %% [markdown]
# # Bagging
#
-# This notebook introduces a very natural strategy to build ensembles of machine
-# learning models named "bagging".
+# In this notebook we introduce a very natural strategy to build ensembles of
+# machine learning models, named "bagging".
#
# "Bagging" stands for Bootstrap AGGregatING. It uses bootstrap resampling
# (random sampling with replacement) to learn several models on random
# variations of the training set. At predict time, the predictions of each
# learner are aggregated to give the final predictions.
#
-# First, we will generate a simple synthetic dataset to get insights regarding
-# bootstraping.
+# We first create a simple synthetic dataset to better understand bootstrapping.

# %%
import pandas as pd
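
Before the notebook builds this machinery by hand, here is a minimal, self-contained sketch of the overall idea using scikit-learn's `BaggingRegressor`. The toy data and estimator below are illustrative and not part of the notebook, and the `estimator` parameter name assumes a recent scikit-learn version:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Illustrative toy data: a noisy non-linear function of a single feature.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(100, 1))
y_toy = X_toy.ravel() ** 2 + rng.normal(scale=0.5, size=100)

# 20 trees, each fitted on a bootstrap sample of the training set;
# their predictions are averaged at predict time.
bagged_trees = BaggingRegressor(
    estimator=DecisionTreeRegressor(), n_estimators=20, random_state=0
)
bagged_trees.fit(X_toy, y_toy)
print(bagged_trees.predict(X_toy[:3]))
```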
@@ -55,9 +54,8 @@ def generate_data(n_samples=30):

# %% [markdown]
#
-# The relationship between our feature and the target to predict is non-linear.
-# However, a decision tree is capable of approximating such a non-linear
-# dependency:
+# The target to predict is a non-linear function of the only feature. However, a
+# decision tree is capable of approximating such a non-linear dependency:

# %%
from sklearn.tree import DecisionTreeRegressor
@@ -84,16 +82,16 @@ def generate_data(n_samples=30):
#
# ## Bootstrap resampling
#
-# Given a dataset with `n` data points, bootstrapping corresponds to resampling
-# with replacement `n` out of such `n` data points uniformly at random.
+# Bootstrapping involves uniformly resampling `n` data points from a dataset of
+# `n` points, with replacement, ensuring each sample has an equal chance of
+# selection.
#
# As a result, the output of the bootstrap sampling procedure is another dataset
-# with also n data points, but likely with duplicates. As a consequence, there
-# are also data points from the original dataset that are never selected to
-# appear in a bootstrap sample (by chance). Those data points that are left away
-# are often referred to as the out-of-bag sample.
+# with `n` data points, likely containing duplicates. Consequently, some data
+# points from the original dataset may not be selected for a bootstrap sample.
+# These unselected data points are often referred to as the out-of-bag sample.
#
-# We will create a function that given `data` and `target` will return a
+# We now create a function that, given `data` and `target`, returns a
# resampled variation `data_bootstrap` and `target_bootstrap`.
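
For reference, one possible implementation of such a helper is sketched below, assuming `data` and `target` are pandas objects; the notebook's own `bootstrap_sample` (whose signature appears in the next hunk) may differ in its details:

```python
import numpy as np


def bootstrap_sample(data, target, seed=0):
    """Return a bootstrap variation of (data, target)."""
    rng = np.random.default_rng(seed)
    n_samples = len(target)
    # Draw n_samples row indices uniformly at random, with replacement.
    bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)
    # Positional indexing keeps duplicated rows when an index is drawn twice.
    data_bootstrap = data.iloc[bootstrap_indices]
    target_bootstrap = target.iloc[bootstrap_indices]
    return data_bootstrap, target_bootstrap
```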


@@ -116,7 +114,7 @@ def bootstrap_sample(data, target, seed=0):

# %% [markdown]
#
-# We will generate 3 bootstrap samples and qualitatively check the difference
+# We generate 3 bootstrap samples and qualitatively check the difference
# with the original dataset.

# %%
@@ -179,9 +177,9 @@ def bootstrap_sample(data, target, seed=0):
# %% [markdown]
#
# On average, roughly 63.2% of the original data points of the original dataset
-# will be present in a given bootstrap sample. Since the bootstrap sample has
-# the same size as the original dataset, there will be many samples that are in
-# the bootstrap sample multiple times.
+# are present in a given bootstrap sample. Since the bootstrap sample has the
+# same size as the original dataset, there are many samples that are in the
+# bootstrap sample multiple times.
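
The 63.2% figure follows from a short computation: the probability that a given point is never drawn in `n` draws with replacement is `(1 - 1/n) ** n`, which tends to `1/e ≈ 0.368` as `n` grows, so about 63.2% of the points appear at least once. A quick numerical check, using the default `n_samples=30` of `generate_data` above:

```python
import math

n = 30  # default n_samples of generate_data
fraction_selected = 1 - (1 - 1 / n) ** n
print(f"expected fraction of points in a bootstrap sample: {fraction_selected:.3f}")
print(f"asymptotic value: {1 - 1 / math.e:.3f}")
```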
#
# Using bootstrap we are able to generate many datasets, all slightly different.
# We can fit a decision tree for each of these datasets and they all shall be
@@ -224,7 +222,7 @@ def bootstrap_sample(data, target, seed=0):
# %% [markdown]
# ## Aggregating
#
-# Once our trees are fitted, we are able to get predictions for each of them. In
+# Once our trees are fitted, we are able to get predictions from each of them. In
# regression, the most straightforward way to combine those predictions is just
# to average them: for a given test data point, we feed the input feature values
# to each of the `n` trained models in the ensemble and as a result compute `n`
@@ -262,7 +260,7 @@ def bootstrap_sample(data, target, seed=0):

# %% [markdown]
#
-# The unbroken red line shows the averaged predictions, which would be the final
+# The continuous red line shows the averaged predictions, which would be the final
# predictions given by our 'bag' of decision tree regressors. Note that the
# predictions of the ensemble are more stable because of the averaging operation.
# As a result, the bag of trees as a whole is less likely to overfit than the
@@ -338,15 +336,14 @@ def bootstrap_sample(data, target, seed=0):

# %% [markdown]
# We used a low value of the opacity parameter `alpha` to better appreciate the
-# overlap in the prediction functions of the individual trees.
-#
-# This visualization gives some insights on the uncertainty in the predictions
-# in different areas of the feature space.
+# overlap in the prediction functions of the individual trees. Such
+# visualization also gives us an intuition on the variance in the predictions
+# across different zones of the feature space.
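
As an aside, this kind of plot can be reproduced on toy data with a few lines of matplotlib. The sketch below is self-contained and only illustrates the low-`alpha` overlay and the averaged (red) prediction; it is not the notebook's plotting code:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=30)
grid = np.linspace(-3, 3, 300).reshape(-1, 1)

tree_predictions = []
for seed in range(10):
    # One tree per bootstrap sample of the toy training set.
    indices = np.random.default_rng(seed).choice(len(y), size=len(y), replace=True)
    tree = DecisionTreeRegressor(max_depth=3).fit(X[indices], y[indices])
    predictions = tree.predict(grid)
    tree_predictions.append(predictions)
    plt.plot(grid.ravel(), predictions, color="tab:blue", alpha=0.1)

# The ensemble prediction is the average of the individual trees.
plt.plot(grid.ravel(), np.mean(tree_predictions, axis=0), color="tab:red")
plt.scatter(X.ravel(), y, color="black", s=10)
plt.show()
```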
#
# ## Bagging complex pipelines
#
-# While we used a decision tree as a base model, nothing prevents us of using
-# any other type of model.
+# Even if here we used a decision tree as a base model, nothing prevents us from
+# using any other type of model.
#
# As we know that the original data generating function is a noisy polynomial
# transformation of the input variable, let us try to fit a bagged polynomial
@@ -361,15 +358,14 @@

polynomial_regressor = make_pipeline(
MinMaxScaler(),
-PolynomialFeatures(degree=4),
+PolynomialFeatures(degree=4, include_bias=False),
Ridge(alpha=1e-10),
)

# %% [markdown]
-# This pipeline first scales the data to the 0-1 range with `MinMaxScaler`. Then
-# it extracts degree-4 polynomial features. The resulting features will all stay
-# in the 0-1 range by construction: if `x` lies in the 0-1 range then `x ** n`
-# also lies in the 0-1 range for any value of `n`.
+# This pipeline first scales the data to the 0-1 range using `MinMaxScaler`. It
+# then generates degree-4 polynomial features. By design, these features remain
+# in the 0-1 range, as any power of `x` within this range also stays within 0-1.
#
# Then the pipeline feeds the resulting non-linear features to a regularized
# linear regression model for the final prediction of the target variable.
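
To actually bag this pipeline, one option is to hand it to `BaggingRegressor`, which then takes care of the bootstrap resampling and of averaging the regularized polynomial regressions. A sketch, where the `estimator` parameter name assumes a recent scikit-learn version and the training-set names in the final comment are illustrative:

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

polynomial_regressor = make_pipeline(
    MinMaxScaler(),
    PolynomialFeatures(degree=4, include_bias=False),
    Ridge(alpha=1e-10),
)

# Bag 100 copies of the pipeline: each copy is fitted on its own bootstrap
# sample and their predictions are averaged at predict time.
bagging = BaggingRegressor(
    estimator=polynomial_regressor,
    n_estimators=100,
    random_state=0,
)
# bagging.fit(data_train, target_train); bagging.predict(data_test)
```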
