Add illustration of the effect of scaling data #454

ArturoAmorQ · 2021-09-13T09:51:27Z

closes #446

Adds code to visualize the effect of StandardScaler on the adult_census dataset

glemaitre · 2021-09-16T14:11:04Z

Side note @ArturoAmorQ: instead of adding "Fixes #xxx" in the title, you can add "closes #xxx" in the description of the PR (the first message). The advantage is that issue #xxx will be closed automatically when we merge.

glemaitre

If we go with this synthetic example, these would be my comments. However, I will propose an alternative.

glemaitre · 2021-09-16T14:43:00Z

python_scripts/02_numerical_pipeline_scaling.py

+# ```
+
+# %%
+import matplotlib.pyplot as plt


I would not import `matplotlib` that early. I would delay the import and put it in the cell where we do the first plot.

glemaitre · 2021-09-16T14:44:09Z

python_scripts/02_numerical_pipeline_scaling.py

+import matplotlib.pyplot as plt
+from sklearn.datasets import make_blobs
+
+# we generate a synthetic dataset "X" with gaussian blobs


I think that we should not have comments in the code itself. It would be better to write this as a real sentence in a markdown cell.

glemaitre · 2021-09-16T14:44:17Z

python_scripts/02_numerical_pipeline_scaling.py

+
+# we generate a synthetic dataset "X" with gaussian blobs
+centers = [[0, 2], [3, 0.5]]
+X, _ =make_blobs(n_samples=100, n_features=2, centers=centers,


Suggested change

X, _ =make_blobs(n_samples=100, n_features=2, centers=centers,

X, _ = make_blobs(n_samples=100, n_features=2, centers=centers,

glemaitre · 2021-09-16T14:45:15Z

python_scripts/02_numerical_pipeline_scaling.py

+centers = [[0, 2], [3, 0.5]]
+X, _ =make_blobs(n_samples=100, n_features=2, centers=centers,
+                 cluster_std=0.5, center_box=(1, 10.0),
+                 shuffle=True, random_state=0)


At this stage, I think that we should plot the original dataset.
Then add some text introducing the scikit-learn scaling, and finally a new cell that plots the scaled data.
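The cell order suggested here (plot the raw data, scale, then plot again) could look roughly like the sketch below. It is only an illustration, not the code that was merged. It drops `center_box` from the call on the assumption that this parameter only matters when the centers are drawn at random. The figure titles are made up.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature dataset with the centers used in the diff above.
centers = [[0, 2], [3, 0.5]]
X, _ = make_blobs(n_samples=100, n_features=2, centers=centers,
                  cluster_std=0.5, shuffle=True, random_state=0)

# First cell: scatter plot of the original data.
fig_raw, ax_raw = plt.subplots()
ax_raw.scatter(X[:, 0], X[:, 1])
ax_raw.set_title("Original data")  # hypothetical title

# Second cell: fit the scaler and transform the data.
X_scaled = StandardScaler().fit_transform(X)

# Third cell: scatter plot of the scaled data, in a separate figure.
fig_scaled, ax_scaled = plt.subplots()
ax_scaled.scatter(X_scaled[:, 0], X_scaled[:, 1])
ax_scaled.set_title("After StandardScaler")  # hypothetical title
```

In a notebook the two figures render on their own. In a plain script a call to `plt.show()` would be needed at the end.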

python_scripts/02_numerical_pipeline_scaling.py

glemaitre · 2021-09-16T15:15:43Z

In the first notebook, we used `plt.hist`. Instead of generating a new dataset that is a bit dissociated from the notebook, we could focus on a single feature. It avoids having to explain anything regarding Gaussian blobs and will be more intuitive.

In short, we could plot the histogram of the "age" column before and after scaling. Basically, it will give these figures:

# %%
import matplotlib.pyplot as plt

# `data_train` and `data_train_scaled` come from the earlier notebook cells.
fig, axs = plt.subplots(ncols=2, sharey=True, figsize=(6, 4))
axs[0].hist(data_train["age"], bins=15)  # original ages
axs[1].hist(data_train_scaled[:, 0], bins=15)  # scaled ages (first column)
axs[0].set_xlabel("Age")
axs[1].set_xlabel("Normalized Age")
axs[0].set_ylabel("Frequency")
fig.suptitle("Histogram of the age feature \nbefore and after normalization", y=1.05)

(I would not use the code above and instead do 2 separate figures to have more basic matplotlib code.)

The nice part is that we show that the distribution does not change: only the x-axis values change. We could also use the transformer to transform the minimum and maximum age on the x-axis, to explain how we get these new values in the scaled data. We can then easily emphasize that some values become negative.
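A minimal sketch of that idea, assuming a synthetic stand-in for the "age" column since the real `data_train` is not available here (the values below are made up). It maps the minimum and maximum through the fitted transformer, checks the result against the `(x - mean_) / scale_` formula, and compares histogram counts before and after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the "age" column: synthetic values, not adult_census.
rng = np.random.default_rng(0)
age = rng.normal(loc=38.0, scale=13.0, size=1000).reshape(-1, 1)

scaler = StandardScaler().fit(age)
age_scaled = scaler.transform(age)

# Map the x-axis extremes through the fitted transformer.
extremes = np.array([[age.min()], [age.max()]])
extremes_scaled = scaler.transform(extremes)
print(extremes.ravel(), "->", extremes_scaled.ravel())

# The transformer computes (x - mean_) / scale_, so any value below the
# mean ends up negative after scaling.
manual = (age - scaler.mean_) / scaler.scale_

# Same number of bins before and after: the bin edges move, the counts do not.
counts_before, edges_before = np.histogram(age, bins=15)
counts_after, edges_after = np.histogram(age_scaled, bins=15)
```

Because the minimum always lies below the mean, its scaled value is negative, which is the point quiz Q5 relies on.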

ArturoAmorQ · 2021-09-16T16:01:39Z

The idea of this PR and #446 is to show a figure similar to the one we show in Quiz M1.2 Q4, i.e., using two features to create a plot in the Cartesian plane. The goal is to improve the score for both Q4 (64.7%) and Q5 (40.7%).

I am not sure people will be able to see the connection between a histogram and said figures. The text explanation has proven insufficient to convey the concept, so I would assume this connection is not that easy to make.

It avoids having to explain anything regarding Gaussian blobs and will be more intuitive.

Using the data-set could be a good idea as long as we plot it in the Cartesian plane, but letting people play with the position of the centers is worth it in my opinion. Maybe we can find a workaround, like using the data-set and encouraging people to visit this example in the documentation, or just adding a fixed image as mentioned in #446. Thoughts on that?

glemaitre · 2021-09-16T16:29:14Z

The goal is to improve the score for both Q4 (64.7%) and Q5 (40.7%).

65% is fine. The second question is difficult because you need to think that features can become negative and this aspect was discussed in the course before. Thus, I am fine with adding basically the answer in the notebook with the data at hand.

I am not sure people will be able to see the connection between a histogram and said figures. The text explanation has proven insufficient to convey the concept, so I would assume this connection is not that easy to make.

Adding the new dataset and figure just allows people to answer the quiz without much thinking, but it is completely disconnected from the course. I would really not do that just for the sake of people getting the right answer. Instead, we should make explicit, on the data at hand, what we failed to explain at first.

Using the data-set could be a good idea as long as we plot it in the Cartesian plane

The histogram is useful there. It shows that there is indeed no change in the distribution. We can still plot things in the feature space, though. I could imagine a strip plot together with the marginal distribution on top.

but letting people play with the position of the centers is worth it in my opinion

IMO, it does not bring any value.

Maybe we can find a workaround, like using the data-set and encouraging people to visit this example in the documentation, or just adding a fixed image as mentioned in #446.

This example is quite complex, but it is worth linking to it and explaining that StandardScaler is only one option and that other preprocessors exist.

ArturoAmorQ · 2021-09-17T13:58:27Z

I committed a new illustration using the adult_census data-set to avoid creating an ad hoc data-set with make_blobs.

lesteve

A few comments

lesteve · 2021-09-17T15:00:58Z

python_scripts/02_numerical_pipeline_scaling.py

@@ -192,7 +192,23 @@
 data_train_scaled.describe()

 # %% [markdown]
-# We can easily combine these sequential operations with a scikit-learn
+# We can use a jointplot to visualize the histograms and scatterplot of any


I think that before this you can add a sentence saying that, in the previous cell, the mean of all the columns is close to 0 and the standard deviation of all the columns is close to 1, which is the purpose of StandardScaler.

Done, thanks for the comments!
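For reference, the property discussed in this thread can be checked numerically. The sketch below assumes two synthetic numerical columns (made-up values, not the adult_census data). It also shows why the standard deviation reported by pandas `describe()` is only close to 1: StandardScaler divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample one (`ddof=1`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two numerical columns (hypothetical values).
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.normal(loc=38.0, scale=13.0, size=500),  # "age"-like column
    rng.normal(loc=40.0, scale=12.0, size=500),  # "hours-per-week"-like column
])

data_scaled = StandardScaler().fit_transform(data)

# Per-column statistics after scaling.
means = data_scaled.mean(axis=0)
stds_population = data_scaled.std(axis=0)      # ddof=0, as used by StandardScaler
stds_sample = data_scaled.std(axis=0, ddof=1)  # ddof=1, as reported by describe()

print("means:", means)
print("std (ddof=0):", stds_population)
print("std (ddof=1):", stds_sample)
```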

python_scripts/02_numerical_pipeline_scaling.py

Fix INRIA#446

9f45230

ArturoAmorQ changed the title ~~Fix #446~~ Fix #446 (Show illustration of StandardScaler) Sep 13, 2021

glemaitre reviewed Sep 16, 2021

glemaitre changed the title ~~Fix #446 (Show illustration of StandardScaler)~~ Add illustration of the effect of scaling data Sep 16, 2021

Use adult_census dataset instead of blobs

e9934b4

lesteve reviewed Sep 17, 2021

Fix wording and PEP8 format

a78c2a4

lesteve merged commit f7d3890 into INRIA:master Sep 23, 2021

github-actions bot pushed a commit that referenced this pull request Sep 23, 2021

[ci skip] Add illustration of the effect of scaling data (#454) f7d3890

0e42004

ArturoAmorQ deleted the AddScaler branch September 24, 2021 13:08
