# -*- coding: utf-8 -*-
"""kernels_Logistic_Regression.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1kGisAtd4vNjYfwh_O4fJchs7eogbKmkk
# Kernel logistic regression
- We will implement kernel logistic regression de novo using numpy and scipy
- all required imports have been added for you in cell 1
- here is the workflow:
  - generate 200 points on the plane in an XOR configuration and visualize them
  - build a kernel logistic model for three cases: (1) all 200 points as landmarks, (2) landmarks chosen by kmeans clustering, (3) 4 strategically chosen landmarks
  - test the model in each case as follows:
    - create a grid of 50 x 50 test points on [-3,+3] x [-3,+3]
    - build a kernel representation of the 2500 test points
    - use the fitted logistic model to predict the probability of membership in class 1 for the 2500 points
    - reshape the prediction array into a 50x50 array for plotting
"""
# Commented out IPython magic to ensure Python compatibility.
import numpy as np
import pandas as pd
import sklearn
from scipy.spatial.distance import pdist, cdist, squareform
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from ipywidgets import interact
import matplotlib.pyplot as plt
# %matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 10.0) # set default size of plots
# function for plotting the decision boundaries
# Z = prediction array reshaped as 50x50, X and y which are the original training data,
# xx and yy are the grid coordinates for the test points
def plot_boundary(Z, X, y, xx, yy):
    image = plt.imshow(Z, interpolation='nearest', extent=(xx.min(), xx.max(), yy.min(), yy.max()),
                       aspect='auto', origin='lower', cmap=plt.cm.PuOr_r)
    contours = plt.contour(xx, yy, Z, levels=[0.5], linewidths=2, colors=['k'])
    plt.scatter(X[:, 0], X[:, 1], s=30, c=y, cmap=plt.cm.Paired, edgecolors=(0, 0, 0))
    plt.xticks(())
    plt.yticks(())
    plt.axis([-3, 3, -3, 3])
    plt.colorbar(image)
"""## Generate and visualize XOR data for training"""
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)
print(X.shape,y.shape)
plt.scatter(X[:, 0], X[:, 1], s=50, c=y, cmap=plt.cm.viridis, edgecolors=(0, 0, 0))
plt.show()
"""## Generate a grid of 50x50 points for testing"""
# build the testpoints
xx, yy = np.meshgrid(np.linspace(-3, 3, 50),np.linspace(-3, 3, 50))
test_points = np.vstack((xx.ravel(), yy.ravel())).T
print(xx.shape,yy.shape,test_points.shape)
"""## Do kernel regression using all the points as landmarks
- use the Gaussian kernel, where s is the kernel width:

  $\large e^{-\frac{\lVert x - x' \rVert^2}{2s^2}}$
- construct the kernel matrix K (hint: the function pdist in scipy.spatial.distance might be helpful). Use euclidean distance as your metric. Ignore the bias term (column of 1s) in the construction of K. So K will be of size 200 x 200
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html#scipy.spatial.distance.squareform (to convert pdist matrix to square)
- build a logistic regression model using K as your data matrix and y as your label vector (use defaults for the LogisticRegression() call).
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- construct the kernel matrix corresponding to the test points using the training set landmarks (see cdist)
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
- predict class probabilities for the test points (use .predict_proba() for the logistic classifier)
- reshape the predicted values array into a 50x50 grid
- plot the decision boundary using the given function
"""
##### START YOUR CODE
s=1.5 # kernel width (you can play with this parameter)
# build the kernel matrix on training data (about 2 lines of vectorized code)
eucDist_X = pdist(X=X, metric='euclidean')
# print(eucDist_X.shape) # Has m*(m-1)/2 pdist entries
sqrMat_X = squareform(eucDist_X)
Ktrain = np.exp(-1*(sqrMat_X)**2/(2*s*s))
# print(Ktrain.shape) # 200 x 200
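# Optional sanity check (not part of the assignment, shown on a small
# synthetic sample so it stands alone): a Gaussian kernel matrix built on a
# set of points should be symmetric, have ones on the diagonal (each point is
# at distance 0 from itself), and contain values in (0, 1]. Names prefixed
# with an underscore are throwaway demo variables.
_Xchk = np.random.RandomState(0).randn(5, 2)
_Kchk = np.exp(-squareform(pdist(_Xchk, metric='euclidean'))**2 / (2 * 1.5**2))
assert np.allclose(_Kchk, _Kchk.T)            # symmetric
assert np.allclose(np.diag(_Kchk), 1.0)       # unit diagonal
assert 0 < _Kchk.min() <= _Kchk.max() <= 1.0  # values in (0, 1]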
# build a logistic model on the kernel matrix (2 lines of code) using sklearn's LogisticRegression()
logreg = LogisticRegression().fit(Ktrain, y)
# construct the kernel representation of the test_points with the training set landmarks
# hint: consider using cdist in scipy.spatial.distance (about 2-3 lines of code)
sqrMat_Xtest = cdist(test_points, X, metric='euclidean')
Ktest = np.exp(-1*(sqrMat_Xtest)**2/(2*s*s))
# use your learned model to predict using the kernel representation.
# Store the predictions in array Z (1 line of code)
Z = logreg.predict_proba(Ktest)[:,1]
##### END YOUR CODE
# reshape Z into a 50x50 grid
Z = Z.reshape(xx.shape)
plot_boundary(Z,X,y,xx,yy)
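# Optional end-to-end check on an independently drawn XOR sample: with all
# 200 points as landmarks, the kernel logistic model should fit its training
# data almost perfectly. This is a self-contained sketch; the underscore-
# prefixed names are throwaway demo variables, not part of the assignment.
_rng2 = np.random.RandomState(42)
_X2 = _rng2.randn(200, 2)
_y2 = np.logical_xor(_X2[:, 0] > 0, _X2[:, 1] > 0)
_K2 = np.exp(-squareform(pdist(_X2, metric='euclidean'))**2 / (2 * 1.5**2))
_clf2 = LogisticRegression().fit(_K2, _y2)
print('training accuracy:', _clf2.score(_K2, _y2))  # typically close to 1.0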
"""## Kernel regression using clustering to select landmarks
- run Kmeans on the original data to build N clusters (N = 10 in the code below; you can vary this and study its impact)
- Use the cluster centers as landmarks for building the kernel representation. That is, construct the kernel matrix K (hint: the function cdist in scipy.spatial.distance might be helpful). Use euclidean distance as your metric. Ignore the bias term (column of 1s) in the construction of K. So K will be of size 200 x N
- build a logistic regression model using K as your data matrix and y as your label vector
- construct the kernel matrix corresponding to the test points using the training set landmarks (using cdist)
- predict class probabilities for the test points
- reshape the predicted values array into a 50x50 grid
- plot the decision boundary using the given function
"""
N = 10
kmeans = KMeans(n_clusters=N, random_state=0).fit(X)
plt.scatter(X[:, 0], X[:, 1], s=30, c=y, cmap=plt.cm.viridis, edgecolors=(0, 0, 0))
plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1], s=100, c='red' )
plt.show()
landmarks = kmeans.cluster_centers_
print(landmarks.shape)
##### START YOUR CODE
s = 1 # kernel width (play with this parameter)
# build the kernel matrix with the kmeans cluster centers (2 lines of code)
sqrMat_X = cdist(X, landmarks, metric='euclidean')
Ktrain = np.exp(-1*(sqrMat_X)**2/(2*s*s))
# build a logistic model on the kernel matrix (2 lines of code)
logreg = LogisticRegression().fit(Ktrain, y)
# build a kernel representation of the test points with the kmeans cluster centers (2 lines of code)
sqrMat_Xtest = cdist(test_points, landmarks, metric='euclidean')
Ktest = np.exp(-1*(sqrMat_Xtest)**2/(2*s*s))
# Predict probabilities on the testpoints (1 line of code)
Z = logreg.predict_proba(Ktest)[:,1]
##### END YOUR CODE
Z = Z.reshape(xx.shape)
plot_boundary(Z,X,y,xx,yy)
plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1], s=100, c='red' )
plt.show()
"""## Kernel regression with strategically chosen kernels
- Use the specified centers as landmarks for building the kernel representation. That is, construct the kernel matrix K (hint: the function cdist in scipy.spatial.distance might be helpful). Ignore the bias term (column of 1s) in the construction of K. So K will be of size 200 x 4.
- build a logistic regression model using K as your data matrix and y as your label vector
- construct the kernel matrix corresponding to the test points using the training set landmarks
- predict class probabilities for the test points
- reshape the predicted values array into a 50x50 grid
- plot the decision boundary using the given function
"""
centers = np.array([[-1,-1],[-1,1],[1,-1],[1,1]])
plt.scatter(X[:, 0], X[:, 1], s=30, c=y, cmap=plt.cm.viridis, edgecolors=(0, 0, 0))
plt.scatter(centers[:,0],centers[:,1], s=100, c='red' )
plt.show()
#### START YOUR CODE
s = 1.5 # kernel width (play with this parameter)
#build the kernel matrix with respect to the new landmarks (2 lines of code)
sqrMat_X = cdist(X, centers, metric='euclidean')
Ktrain = np.exp(-1*(sqrMat_X)**2/(2*s*s))
# build a logistic model on the kernel matrix (2 lines of code)
logreg = LogisticRegression().fit(Ktrain, y)
# build a kernel representation of the test points (2 lines of code)
sqrMat_Xtest = cdist(test_points, centers, metric='euclidean')
Ktest = np.exp(-1*(sqrMat_Xtest)**2/(2*s*s))
# predict on the testpoints (1 line of code)
Z = logreg.predict_proba(Ktest)[:,1]
#### END YOUR CODE
Z = Z.reshape(xx.shape)
plot_boundary(Z,X,y,xx,yy)
plt.scatter(centers[:,0],centers[:,1], s=100, c='red' )
plt.show()
"""# Experimenting with number of clusters and standard deviation"""
def LogisticRegression_StrategicLandmarks(s=1.5, N=20):
    plt.rcParams['figure.figsize'] = (7.0, 7.0)
    ## Strategic landmarks
    # build the kernel matrix with respect to the strategic landmarks
    sqrMat_X = cdist(X, centers, metric='euclidean')
    Ktrain = np.exp(-1*(sqrMat_X)**2/(2*s*s))
    # build a logistic model on the kernel matrix
    logreg = LogisticRegression().fit(Ktrain, y)
    # build a kernel representation of the test points
    sqrMat_Xtest = cdist(test_points, centers, metric='euclidean')
    Ktest = np.exp(-1*(sqrMat_Xtest)**2/(2*s*s))
    # predict probabilities on the test points
    Z = logreg.predict_proba(Ktest)[:, 1]
    Z = Z.reshape(xx.shape)
    plot_boundary(Z, X, y, xx, yy)
    plt.scatter(centers[:, 0], centers[:, 1], s=100, c='red')
    plt.title('Strategic Landmarks')
    plt.show()
    ## K Means cluster landmarks
    kmeans = KMeans(n_clusters=N, random_state=0).fit(X)
    landmarks = kmeans.cluster_centers_
    # build the kernel matrix with the kmeans cluster centers
    sqrMat_X = cdist(X, landmarks, metric='euclidean')
    Ktrain = np.exp(-1*(sqrMat_X)**2/(2*s*s))
    # build a logistic model on the kernel matrix
    logreg = LogisticRegression().fit(Ktrain, y)
    # build a kernel representation of the test points with the kmeans cluster centers
    sqrMat_Xtest = cdist(test_points, landmarks, metric='euclidean')
    Ktest = np.exp(-1*(sqrMat_Xtest)**2/(2*s*s))
    # predict probabilities on the test points
    Z = logreg.predict_proba(Ktest)[:, 1]
    Z = Z.reshape(xx.shape)
    plot_boundary(Z, X, y, xx, yy)
    plt.title("KMeans Landmarks")
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c='red')
    plt.show()
    ## All training points as landmarks
    # build the kernel matrix on the training data
    eucDist_X = pdist(X, metric='euclidean')  # has m*(m-1)/2 entries
    sqrMat_X = squareform(eucDist_X)
    Ktrain = np.exp(-1*(sqrMat_X)**2/(2*s*s))  # 200 x 200
    # build a logistic model on the kernel matrix
    logreg = LogisticRegression().fit(Ktrain, y)
    # kernel representation of the test points with the training set landmarks
    sqrMat_Xtest = cdist(test_points, X, metric='euclidean')
    Ktest = np.exp(-1*(sqrMat_Xtest)**2/(2*s*s))
    # predict probabilities on the test points
    Z = logreg.predict_proba(Ktest)[:, 1]
    # reshape Z into a 50x50 grid
    Z = Z.reshape(xx.shape)
    plot_boundary(Z, X, y, xx, yy)
    plt.title('All landmarks')
    plt.show()
interact(LogisticRegression_StrategicLandmarks, s=2.0, N=20)
"""
1. **All datapoints used as landmarks**: Having an abundance of landmarks gives the model enough freedom to find a fine decision boundary that wraps very closely around the classes. Of the three choices of landmarks, this model is the most prone to overfitting.
2. **KMeans cluster landmarks**: Although using cluster centroids as landmarks tackles the feature-explosion problem, it may not be ideal when there are only a few centroids. In our example, some clusters form between datapoints belonging to different classes. Such centroids are inefficient: they tend to fall on the decision boundary and fail to support high-confidence predictions, so the model has to learn which landmarks are more useful than others. Using a few more clusters can improve performance here, since the model gains greater flexibility to pick the helpful landmarks among the available centroids.
3. **Strategic landmarks**: Choosing landmarks with domain knowledge lets us feed in exactly the landmarks that contribute most to forming the decision boundary. In our XOR example, placing one landmark in each of the four quadrants makes it easy for the Gaussian kernel to form decision boundaries around these points, given the right amount of spread (s).
### how does the choice of kernel width affect the quality of the decision boundary learned?
- the smaller the spread, the sharper the decision boundaries (given enough landmarks the model overfits; this is a *high-variance model*)
- the greater the spread, the blurrier the decision boundary gets and the less confident the predictions become (the landmark spreads overlap with each other and the model underfits; this is a *high-bias model*)
"""