Add illustration of the effect of scaling data #454
Side note @ArturoAmorQ: instead of adding "Fixes #xxx" in the title, you can add "closes #xxx" in the description, i.e. the first message of the PR. The advantage is that issue #xxx will be closed automatically when we merge.
If we go with this synthetic example, these would be my comments. However, I will propose an alternative.
# ```

# %%
import matplotlib.pyplot as plt
I would not import `matplotlib` that early. I would delay the import and put it in the cell where we make the first plot.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# we generate a synthetic dataset "X" with gaussian blobs
I think that we should not have comments in the code itself. It would be better to write this as a real sentence in a markdown cell.
# we generate a synthetic dataset "X" with gaussian blobs
centers = [[0, 2], [3, 0.5]]
X, _ =make_blobs(n_samples=100, n_features=2, centers=centers,
Suggested change:
- X, _ =make_blobs(n_samples=100, n_features=2, centers=centers,
+ X, _ = make_blobs(n_samples=100, n_features=2, centers=centers,
centers = [[0, 2], [3, 0.5]]
X, _ =make_blobs(n_samples=100, n_features=2, centers=centers,
                 cluster_std=0.5, center_box=(1, 10.0),
                 shuffle=True, random_state=0)
At this stage, I think that we should plot the original dataset.
Then add some text to do the scikit-learn scaling and finally a new cell with the plotting of the scaling
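A minimal sketch of the ordering suggested above (plot the original data, scale it, then plot the scaled data), reusing the `make_blobs` parameters from the diff. `center_box` is left out on the assumption that it only matters when the centers are drawn at random; the variable names, figure sizes, and titles are illustrative choices, not part of the PR.

```python
# %%
from sklearn.datasets import make_blobs

# Same centers and spread as in the diff; center_box is omitted because the
# centers are given explicitly here (assumption, see the note above).
centers = [[0, 2], [3, 0.5]]
X, _ = make_blobs(n_samples=100, n_features=2, centers=centers,
                  cluster_std=0.5, shuffle=True, random_state=0)

# %%
import matplotlib.pyplot as plt

# First plot: the original dataset, before any scaling.
fig_raw, ax_raw = plt.subplots(figsize=(5, 4))
ax_raw.scatter(X[:, 0], X[:, 1])
ax_raw.set_xlabel("Feature 0")
ax_raw.set_ylabel("Feature 1")
ax_raw.set_title("Original data")

# %%
from sklearn.preprocessing import StandardScaler

# Each feature is centered on its mean and divided by its standard deviation.
X_scaled = StandardScaler().fit_transform(X)

# %%
# Second plot: same points, now expressed in standardized units.
fig_scaled, ax_scaled = plt.subplots(figsize=(5, 4))
ax_scaled.scatter(X_scaled[:, 0], X_scaled[:, 1])
ax_scaled.set_xlabel("Feature 0 (scaled)")
ax_scaled.set_ylabel("Feature 1 (scaled)")
ax_scaled.set_title("Data after StandardScaler")
```

Keeping the two plots in separate cells lets the explanatory text about the scaling step sit between them.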
The idea of this PR and #446 is to show a figure similar to the one in Quiz M1.2 Q4, i.e., using two features to create a plot in the Cartesian plane. The goal is to improve the scores for both Q4 (64.7%) and Q5 (40.7%). I am not sure people will be able to see the connection between a histogram and those figures. The text explanation alone has proven insufficient to convey the concept, so I would assume this connection is not that easy to make.
Using the data-set could be a good idea as long as we plot it in the Cartesian plane, but letting people play with the position of the centers is worth it in my opinion. Maybe we can find a workaround, like using the data-set and encouraging people to visit this example in the documentation, or just adding a fixed image as mentioned in #446. Thoughts on that?
65% is fine. The second question is difficult because you need to realize that features can become negative, and this aspect was discussed in the course before. Thus, I am fine with basically adding the answer in the notebook with the data at hand.
Adding the new dataset and figure just allows people to answer the quiz without much thinking, but it is completely disconnected from the course. I would really not do that just for the sake of people getting the right answer. Instead, we should make explicit, on the data at hand, what we failed to explain at first.
The histogram is useful there. It shows that the shape of the distribution indeed does not change. We can still plot things in the feature space, though. I could imagine a strip plot together with the marginal distribution on top.
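To back the point about the distribution with numbers, here is a small sketch, assuming a synthetic skewed feature as a stand-in for a real column (the exponential data and the `skewness` helper are hypothetical, not from the notebook). Standardizing shifts and rescales the values, so a shape statistic such as the skewness should come out the same before and after.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature standing in for a real dataset column.
feature = rng.exponential(scale=10.0, size=(1_000, 1))
scaled = StandardScaler().fit_transform(feature)


def skewness(values):
    """Third standardized moment: invariant to shifting and positive rescaling."""
    centered = values - values.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5


# Location and spread change...
print(f"mean: {feature.mean():.2f} -> {scaled.mean():.2f}")
print(f"std:  {feature.std():.2f} -> {scaled.std():.2f}")
# ...but the shape of the distribution does not.
print(f"skewness: {skewness(feature):.4f} -> {skewness(scaled):.4f}")
```

This is why the histograms before and after scaling look identical apart from the axis ticks.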
IMO, it does not bring any value.
This example is quite complex, but it is worth linking to it, explaining that
I committed a new illustration using the adult_census data-set to avoid creating an ad hoc data-set with make_blobs.
A few comments
@@ -192,7 +192,23 @@
data_train_scaled.describe()

# %% [markdown]
# We can easily combine these sequential operations with a scikit-learn
# We can use a jointplot to visualize the histograms and scatterplot of any
I think that before this you can add a sentence saying that, in the previous cell, the mean of all the columns is close to 0 and the standard deviation of all the columns is close to 1, which is `StandardScaler`'s purpose.
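For reference, a sketch of what that sentence would be pointing at, assuming synthetic numerical columns in place of the real training data (the column names echo adult_census, but the values and the `data_train` construction are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical numerical columns; the real notebook loads them from the dataset.
data_train = pd.DataFrame({
    "age": rng.normal(loc=38, scale=13, size=1_000),
    "hours-per-week": rng.normal(loc=40, scale=12, size=1_000),
})

scaler = StandardScaler()
data_train_scaled = pd.DataFrame(
    scaler.fit_transform(data_train), columns=data_train.columns
)

# The "mean" row is close to 0 and the "std" row is close to 1.
# describe() uses the sample standard deviation (ddof=1) while StandardScaler
# divides by the population one (ddof=0), hence "close to" rather than exactly 1.
print(data_train_scaled.describe().loc[["mean", "std"]])
```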
Done, thanks for the comments!
closes #446
Adds code to visualize the effect of StandardScaler on the adult_census dataset.