Add illustration of the effect of scaling data #454

Merged
merged 3 commits into from
Sep 23, 2021

Conversation

ArturoAmorQ
Collaborator

@ArturoAmorQ ArturoAmorQ commented Sep 13, 2021

closes #446

Adds code to visualize the effect of StandardScaler on the adult_census dataset

@ArturoAmorQ ArturoAmorQ changed the title from "Fix #446" to "Fix #446 (Show illustration of StandardScaler)" Sep 13, 2021
@glemaitre
Collaborator

Side note @ArturoAmorQ: instead of adding "Fixes #xxx" in the title, you can add "closes #xxx" in the description, i.e. the first message of the PR. The advantage is that issue #xxx will be closed automatically when we merge.

Collaborator

@glemaitre glemaitre left a comment

If we go with this synthetic example, these would be my comments. However, I will propose an alternative.

# ```

# %%
import matplotlib.pyplot as plt
Collaborator

I would not import `matplotlib` that early. I would delay it and put it in the cell where we make the first plot.

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# we generate a synthetic dataset "X" with gaussian blobs
Collaborator

I think that we should not have comments in the code itself. It would be better to write this as a real sentence in a markdown cell.


# we generate a synthetic dataset "X" with gaussian blobs
centers = [[0, 2], [3, 0.5]]
X, _ =make_blobs(n_samples=100, n_features=2, centers=centers,
Collaborator

Suggested change
X, _ =make_blobs(n_samples=100, n_features=2, centers=centers,
X, _ = make_blobs(n_samples=100, n_features=2, centers=centers,

centers = [[0, 2], [3, 0.5]]
X, _ =make_blobs(n_samples=100, n_features=2, centers=centers,
cluster_std=0.5, center_box=(1, 10.0),
shuffle=True, random_state=0)
Collaborator

At this stage, I think that we should plot the original dataset.
Then we can add some text introducing the scikit-learn scaling, and finally a new cell that plots the scaled data.

python_scripts/02_numerical_pipeline_scaling.py (outdated, resolved)
@glemaitre
Collaborator

In the first notebook, we used `plt.hist`. Instead of generating a new dataset that is a bit dissociated from the notebook, we could focus on a single feature. It avoids having to explain anything regarding Gaussian blobs and will be more intuitive.

In short, we could plot the histogram of the "age" column before and after scaling. Basically, it would give these figures:
[image: draft]

# %%
fig, axs = plt.subplots(ncols=2, sharey=True, figsize=(6, 4))
axs[0].hist(data_train["age"], bins=15)
axs[1].hist(data_train_scaled[:, 0], bins=15)
axs[0].set_xlabel("Age")
axs[1].set_xlabel("Normalized Age")
axs[0].set_ylabel("Frequency")
fig.suptitle("Histogram of the age feature \nbefore and after normalization", y=1.05)

(I would not use the code above and instead do 2 separate figures to have more basic matplotlib code.)
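A minimal sketch of what the two-separate-figures variant might look like. The `age` array below is randomly generated, hypothetical stand-in data, not the actual `data_train["age"]` column, and the titles are illustrative only.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the "age" column (NOT the real adult_census data).
rng = np.random.default_rng(0)
age = rng.integers(17, 91, size=500).astype(float).reshape(-1, 1)

# Standardize the single feature: subtract its mean, divide by its std.
age_scaled = StandardScaler().fit_transform(age)

# First figure: histogram of the original values.
fig_before, ax_before = plt.subplots()
ax_before.hist(age[:, 0], bins=15)
ax_before.set_xlabel("Age")
ax_before.set_ylabel("Frequency")
ax_before.set_title("Before scaling")

# Second figure: histogram of the scaled values, same number of bins.
fig_after, ax_after = plt.subplots()
ax_after.hist(age_scaled[:, 0], bins=15)
ax_after.set_xlabel("Normalized Age")
ax_after.set_ylabel("Frequency")
ax_after.set_title("After scaling")

# In a script (outside a notebook), call plt.show() to display both figures.
```

Keeping each figure in its own block avoids `sharey` and axis indexing, at the cost of not seeing both histograms side by side.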

The nice part is that we show that the distribution does not change. Only the x-axis values are changing. We could also use the transformer to transform the minimum and maximum age on the x-axis, to explain how we get these new values in the scaled data. We can then easily emphasize that the values become negative.
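To make the last point concrete, here is a small sketch that reproduces by hand the computation a standard scaler performs (subtract the mean, divide by the population standard deviation) and applies it to the minimum and maximum age. The `ages` list is made-up illustrative data and `scale` is a hypothetical helper, not part of the notebook.

```python
from statistics import fmean, pstdev

# Hypothetical ages standing in for the "age" column of the training set.
ages = [17, 22, 25, 31, 38, 38, 45, 52, 60, 90]

mean_age = fmean(ages)
std_age = pstdev(ages)  # population standard deviation (ddof=0)


def scale(age):
    """Standardize one age: subtract the mean, divide by the standard deviation."""
    return (age - mean_age) / std_age


scaled_min = scale(min(ages))
scaled_max = scale(max(ages))
print(f"min age {min(ages)} -> {scaled_min:.2f}")
print(f"max age {max(ages)} -> {scaled_max:.2f}")
```

Any age below the mean maps to a negative value, which is why the scaled x-axis extends below zero even though no age is negative.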

@glemaitre glemaitre changed the title from "Fix #446 (Show illustration of StandardScaler)" to "Add illustration of the effect of scaling data" Sep 16, 2021
@ArturoAmorQ
Collaborator Author

The idea of this PR and #446 is to show a figure similar to the one we show in Quiz M1.2 Q4, i.e., using two features to create a plot in the Cartesian plane. The goal is to improve the score for both Q4 (64.7%) and Q5 (40.7%).

I am not sure if people will be able to see the connection between a histogram and said figures. The fact is that the text explanation has proven insufficient for conveying the concept, so I would assume this connection is not that easy to make.

> It avoids having to explain anything regarding Gaussian blobs and will be more intuitive.

Using the dataset could be a good idea as long as we plot it in the Cartesian plane, but letting people play with the position of the centers is worth it in my opinion. Maybe we can find a workaround, like using the dataset and encouraging people to visit this example in the documentation, or just adding a fixed image as mentioned in #446. Thoughts on that?

@glemaitre
Collaborator

> The goal is to improve the score for both Q4 (64.7%) and Q5 (40.7%).

65% is fine. The second question is difficult because you need to think that features can become negative and this aspect was discussed in the course before. Thus, I am fine with adding basically the answer in the notebook with the data at hand.

> I am not sure if people will be able to see the connection between a histogram and said figures. The fact is that the text explanation has proven insufficient for conveying the concept, so I would assume this connection is not that easy to make.

Adding the new dataset and figure just allows answering the quiz without much thinking, but it is completely disconnected from the course. I would really not do that just for the sake of people getting the right answer. Instead, we should make explicit, on the data at hand, what we failed to explain at first.

> Using the dataset could be a good idea as long as we plot it in the Cartesian plane

The histogram is useful there. It shows that there is indeed no change in the distribution. We can still plot things in the feature space, though. I could imagine a strip plot together with the marginal distribution on top.
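As a quick check of the "no change in the distribution" claim, the sketch below counts values per bin before and after standardization, using the same number of equal-width bins in both cases. The `ages` list is made-up data and `bin_counts` is a hypothetical helper written for this illustration.

```python
from statistics import fmean, pstdev


def bin_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min(values), max(values)]."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in values:
        # Clamp so that the maximum falls in the last bin instead of overflowing.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical ages standing in for the "age" column.
ages = [17, 19, 23, 23, 28, 31, 35, 35, 36, 41, 44, 47, 52, 58, 63, 71, 90]
mean_age, std_age = fmean(ages), pstdev(ages)
scaled_ages = [(age - mean_age) / std_age for age in ages]

counts_before = bin_counts(ages, n_bins=15)
counts_after = bin_counts(scaled_ages, n_bins=15)
print(counts_before)
print(counts_after)
```

Standardization is an affine map with a positive slope, so it shifts and stretches the bin edges together with the data; the bar heights stay the same and only the x-axis labels change.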

> but letting people play with the position of the centers is worth it in my opinion

IMO, it does not bring any value.

> Maybe we can find a workaround, like using the dataset and encouraging people to visit this example in the documentation, or just adding a fixed image as mentioned in #446.

This example is quite complex, but it is worth linking to it, explaining that StandardScaler is only one option and that other preprocessors exist.

@ArturoAmorQ
Collaborator Author

I committed a new illustration using the adult_census data-set to avoid creating an ad hoc data-set with make_blobs.

Collaborator

@lesteve lesteve left a comment

A few comments

@@ -192,7 +192,23 @@
data_train_scaled.describe()

# %% [markdown]
# We can easily combine these sequential operations with a scikit-learn
# We can use a jointplot to visualize the histograms and scatterplot of any
Collaborator

I think that before this you can add a sentence saying that, in the previous cell, the mean of every column is close to 0 and the standard deviation of every column is close to 1, which is the purpose of StandardScaler.
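A minimal sketch of how that statement could be checked numerically. The two columns are randomly generated, hypothetical stand-ins for numerical features and are not the adult_census data used in the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the numerical columns of the training set.
rng = np.random.default_rng(0)
data_train = np.column_stack(
    [
        rng.integers(17, 91, size=1_000),  # an "age"-like column
        rng.integers(1, 100, size=1_000),  # an "hours-per-week"-like column
    ]
).astype(float)

# Learn the per-column mean and standard deviation, then apply the scaling.
scaler = StandardScaler()
data_train_scaled = scaler.fit_transform(data_train)

# Per-column statistics after scaling (NumPy's std uses ddof=0 by default).
column_means = data_train_scaled.mean(axis=0)
column_stds = data_train_scaled.std(axis=0)
print(column_means, column_stds)
```

Note that pandas' `describe()` reports the sample standard deviation (ddof=1), while `StandardScaler` normalizes with the population standard deviation (ddof=0), which is one reason the displayed value is close to 1 rather than exactly 1.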

Collaborator Author

Done, thanks for the comments!

python_scripts/02_numerical_pipeline_scaling.py (outdated, resolved; 3 threads)
@lesteve lesteve merged commit f7d3890 into INRIA:master Sep 23, 2021
@ArturoAmorQ ArturoAmorQ deleted the AddScaler branch September 24, 2021 13:08
Successfully merging this pull request may close these issues.

Show illustration of StandardScaler() in lecture notebook
3 participants