-
Notifications
You must be signed in to change notification settings - Fork 1
/
FinalProjJSE.Rmd
445 lines (314 loc) · 18.3 KB
/
FinalProjJSE.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
---
title: 'YOUR COMPANY NAME: Predicting Biodegradability'
output:
html_document:
df_print: paged
header-includes: \usepackage{color}
toc: yes
html_notebook:
theme: united
toc: yes
pdf_document:
toc: yes
---
# Report Background
This report was prepared for **ChemsRUs** by *your company name.*
This notebook is by *YOUR NAME HERE*
The consultants in our company are:
+ Name 1
+ Name 2
+ Name 3
+ Name 4
+ Name 5
This report embeds all of the R code necessary to produce the results described in this report.
*NOTE: THIS IS A TEMPLATE FOR THE REPORT WITH INSTRUCTIONS FROM ASSIGNMENTS INCLUDED FOR YOUR CONVENIENCE. **DELETE THE INSTRUCTIONS GIVEN IN ITALICS IN YOUR FINAL REPORT NOTEBOOK.** THE NOTEBOOK SHOULD BE AS YOU WOULD GIVE CHEMS-R-US. CODE FRAGMENTS ARE PROVIDED. YOU CAN ADD OR DELETE R CODE BLOCKS AS NECESSARY. THERE IS SOME SAMPLE CODE AT THE END OF THIS NOTEBOOK. IT SHOULD BE REMOVED BEFORE SUBMISSION. *
```{r, include=FALSE, set.seed(300)}
knitr::opts_chunk$set(cache = T)
# These will install required packages if they are not already installed
if (!require("ggplot2")) {
install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)
}
if (!require("knitr")) {
install.packages("knitr", dependencies = TRUE)
library(knitr)
}
if (!require("xtable")) {
install.packages("xtable", dependencies = TRUE)
library(xtable)
}
if (!require("pander")) {
install.packages("pander", dependencies = TRUE)
library(pander)
}
if (!require("devtools")) {
install.packages("devtools",dependencies = TRUE )
library(devtools)
}
if (!require("usethis")) {
install.packages("usethis" )
library(usethis)
}
#if (!require("e1071")) {
# install.packages("e1071" )
# library(e1071)
#}
if (!require("ggbiplot")) {
install_git("git://github.com/vqv/ggbiplot.git",dependencies = TRUE)
library("ggbiplot")
}
if (!require(reshape2)){
install.packages("reshape2", dependencies = TRUE)
library(reshape2)
}
if (!require(gridExtra)){
install.packages("gridExtra", dependencies = TRUE)
library(gridExtra)
}
if (!require(MASS)){
install.packages("MASS", dependencies = TRUE)
library(MASS)
}
if (!require(pROC)){
install.packages("pROC")
library(pROC)
}
if (!require(caret)){
install.packages("caret")
library(caret)
}
if (!require("glmnet")) {
install.packages("glmnet", dependencies = TRUE)
library(glmnet)
}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
*Provide an overview of your report*
# Data Description:
*Provide a basic description of the data. Describe the size of all the different data sets (number of attributes, number of points in each class). Do some preliminary data exploration. Provide some sort of summary of the data and describe/show why scaling is a good idea for this data set. Make a PCA plot colored by class for the data before and after scaling. Describe the data preparation. Describe why you did the data preparation*
*The data preparation should include:*
+ Read in the external train and external test datasets.
+ Scale the training data to have mean 0 and standard deviations 1.
+ Scale the test data by subtracting subtract the training mean and scaling by the training standard deviations.
+ Divide the external training set into an internal train and internal test set using an 80% and 20% split.
*From now on, references to train and test data refer to the scaled training and test data.*
Handy HINTS: The following code scales data in matrix `tr` and then applies the same scaling to matrix `tst`.
```{r}
# sc_tr <- scale(tr,center = TRUE, scale = TRUE) # scale tr
# means <- attr(sc_tr, 'scaled:center') # get the mean of the columns
# stdevs <- attr(sc_tr, 'scaled:scale') # get the std of the columns
# sc_tst <- scale(tst, center=means, scale=stdevs)#scale tst using the means and std of tr
```
# Baseline Results:
*Investigation of alternative models using all the features. ChemsRUs has asked you to evaluate how LDA and logistic regression performs on this problem using all the features. Divide the training data into 80% train and 20% test splits. Set seed(20) before you split so you entire team uses the same train and test splits. Train LDA and logistic regression on the training data and evaluate how well it does on the training data. Compute the balanced accuracy and the AUC for the train and test results. Compare the results. *
# Feature Selection Results
*Create an approach for selecting the relevant features. Each team member should come up with their own approach. Describe your approach. Describe the features that you selected. Create a PCA biplot comparing the two classes. Create a classifier using LDA and logistic regression using these features. Evaluate how they perform on the training and sets in terms of balanced accuracy and AUC. Compare your results. Discuss your findings.*
# Challenge Prediction
My challenge ID is .... with an AUC score of ..... for prediction and .... for feature selection.
*Pick your best shot classification and feature selection methods and then enter the contest. Provide your scores in the text. Discuss your results and the strengths and weaknesses of the different approaches. There should be a separate entry for each participant. Make sure that between all of your teams entries three different classification methods are used and three different feature selection methods are used. Using PCA with another method counds as a different classification method*
The contest can be found here:
<!-- https://competitions.codalab.org/competitions/21678?secret_key=c29154f0-213e-43ea-8e01-4eef00f9d123 -->
https://competitions.codalab.org/competitions/22460
Provide a csv file with your predictions of the biodegradibility of each data point in `chems_test.csv`. Chems-R-Us will use this to independently verify the quality of your results. These predictions should be given as a csv files with on column containing the prediction (`1` or `-1`) for each points in chemstest.csv. Provide a csv file with your prediction of which features are real. The feature predictions should be given as a csv files with on column containing the prediction (`1` or `0`) indicating if each of the 168 features should be included.
Create the files:
- `classification.csv`: Test target values (437 lines x 1 column)
- `selection.csv`: Solution indicating which variables are real and which are fake (168 lines x 1 column)
Your submission must be a zip archive containing the following files:
- `classification.csv`, your predicted labels for test dataset. It should include plus or minus one values, one for each test sample, representing the class label predictions.
- `selection.csv`, representing the features you selected as real or fake (ie `0` or `1`).
https://competitions.codalab.org/competitions/22460
**HINT:** The following R code can be used to generate a zip archive from the two files:
```{r}
# Writing output to submit to contest
# Make a prediction of all class -1. You will need to replace this with your prediction
# NOTE: nrow needs to be the length of the test set!!!
ypred<-matrix(-1,nrow=437,437,ncol=1)
#default is to predict all features
featurepred<-matrix(1,nrow=168,168,ncol=1)
# Here is a sample file for submission to the website
write.table(ypred,file = "classification.csv", row.names=F, col.names=F)
write.table(featurepred,file = "selection.csv", row.names=F, col.names=F)
system("zip my_test.csv.zip classification.csv selection.csv")
```
# Comparison of Methods
*Create a **table** comparing the results between your group members on the contest data as reported on the leader board. Create an ROC curve plot of the test classification results that includes all of the methods in one plot. Include a table made by creating a data frame that summarizes the results of each team member and then nicely printing it with the `kable()` function; **Do not include a screen shot of "raw" R output!! in your presentations** Include the number of team member, the number of features used, training set AUC, internal testing set AUC, classification AUC from challenge, and feature selection AUC challenges. Discuss which method performed best for feature selection, and which method worked best for prediction. What method would you recommend overall? Why?*
# Additional Analysis
*Provide additional analysis and/or visualizations that may be insightful to Chems-R-Us. Use some method in R not covered by prior labs. Use your imagination, extra credit for creativity here! Discuss the insights your analysis provides. **Each team member should provide their own analysis/visualization and discussion.** Be sure to title any figures! Comment your code so all can understand what you are doing. Feel free to use any R code from class or from the web.*
# Conclusion
*Provide a conclusion which summarizes your results briefly and adds any observations/suggestions that you have for Chems-R-Us about the data, model, or future work.*
# Notebook Grading Rubrics [REMOVE FROM FINAL NOTEBOOK]
The project will be graded using the following rubric.
+ (10 pts) Did the data description described both test and train? Did it explain data preparation.
+ (10 pts) Was the PCA analysis properly done and discussed?
+ (10 pts) Were the baseline results properly done, presented, and discussed?
+ (10 pts) Was the procedure for feature selection well thought-out, presented, described, and executed?
+ (10 pts) Was the procedure for constructing the final challenge predictive model well thought-out, described, and executed?
+ (10 pts) Did the individual enter the contest sucessfully and report their result. Were the results discussed?
+ (15 pts) Was the table comparing the groups results created. Was the ROC curve comparing the classification predictions of the groups results provided. Were the results well analyzed and discussed? Did the groups results include at least three different feature selection methods and three different prediction methods? Were the strengths and weaknesses of the the different approaches discussed?
+ (10 pts) Were the additional visualizations and analysis effective? Was a method not covered in class included? Were the results discussed?This grade will be an individual grade.
+ (10 pts) Was the procedure for the feature challenge well thought out and successfully done.
+ (5 pts) Was a conclusion provided that included a summary of primary findings and discussion of future work?
+ (10 pts) What is the grammatical quality and clarity of the written report? Did you communicate your results effectively in written form. Did you discuss results
+ (3 pts) Up to 3 points extra credit available for extra effort and very creative solutions
# Additional Sample Code [REMOVE FROM FINAL NOTEBOOK]
This code provides an example on how to use AUC to calculate training and test set accuracy. Note that AUC using the continuous predictions before they are thresholded to be `1` or `-1`. The famous Titanic dataset is used for this example.
```{r}
# Prepare Titanic data
cdata.df <-read.csv("~/MATP-4400/data/titanic3.csv")
cdata.df <-cdata.df[sample(nrow(cdata.df),400),]
T.df<-cdata.df[,c("survived","sex","age","parch","fare")]
T.df$sex<-as.numeric(as.factor(T.df$sex))
T.df<-T.df[complete.cases(T.df),]
survived<-as.factor(T.df$survived)
```
We add a fake feature to the dataset by making a copy of a feature and randomly permuting it. That way any relationship with survival is just by chance.
```{r}
# Add a "nonsense" feature by permuling the 2 feature
datasize<-nrow(T.df)
myperm<-sample(1:datasize, datasize, replace = FALSE )
Q<-cbind(T.df[,-1],fake=T.df[myperm,2])
labels<-as.factor(T.df$survived)
Q.matrix<-as.matrix(Q)
```
We split the code into 75% train and 25% split.
```{r}
#Split the data into training and testing sets
#ss will be the number of data in the training set
n <- nrow(Q)
ss<- ceiling(n*0.75)
train.perm <- sample(1:n,ss)
#The first column is just the sample label, we can discard it for statics (This is the -1 in the second position)
train <- Q[train.perm , ] #The training data is just the training rows
test <- Q[-train.perm, ] # Using -train gives us all rows except the training rows.
#The last column is the class label, which is 1 or -1, we can split it off as the class
trainclass <- survived[train.perm]
testclass <- survived[-train.perm]
rnorm(1)
```
Let's take a look at the features
```{r}
boxplot(train, main="Feature Distributions before Normalization")
rnorm(1)
```
Looks like we have scaling problems, so let's scale the data.
Note how the train is scaled first and then the test is scaled using the mean and standard deviation of the test data.
```{r}
sc_tr<- scale(train, center = TRUE, scale = TRUE) # scale Q,
# scale the test set while we are at it.
means <- attr(sc_tr, 'scaled:center') # get the mean of the columns
stdevs <- attr(sc_tr, 'scaled:scale') # get the std of the columns
sc_test <- scale(test, center=means, scale=stdevs)#scale test using the means and std of tr
boxplot(sc_tr, main="Feature Distributions after Normalization")
rnorm(1)
```
## Fit LDA
Now we fit an LDA model to the training data. We look at the training set accuracy by examining the AUC and the balanced accuracy.
```{r}
# Fit LDA model
fit <- lda(sc_tr,trainclass,prior=c(1,1)/2)
# Predict testing data
mypredictlda<-predict(fit,sc_test)
summary(mypredictlda)
# get ranking of data for use in AUC caluculation. The raninking is the scalar projections scalar projection of each data point on the weights. It is calculated by lda in the variable x.
ranking<-as.numeric(mypredictlda$x)
# Compute ROC results
myroc <- pROC::roc(testclass,ranking, levels=c('0','1'))
# compute AUC=Area under ROC curve
myauc <- pROC::auc(myroc)
myauc
# Calcuate confusion matrix to see balanced accuracy. mypredict$class ahs the classes found by LDA.
confusionMatrix(table(testclass,mypredictlda$class))
rnorm(1)
```
```{r echo=FALSE}
# This is hidden code to display the BA and AUC in the text below
cm <- confusionMatrix(table(testclass,mypredictlda$class))
myba <- cm$byClass["Balanced Accuracy"]
rnorm(1)
```
The testing balanced accuracy for LDA is `r myba` and the AUC is `r myauc`. We leave checking the training accuracy up to you.
Let's take a look at the LDA separating plane weights, to see how the features are used to predict biodegradability.
```{r}
fit$scaling
# get the coefficient for LDA coefficient for fake
fakecoefflda<-fit$scaling[5]
rnorm(1)
```
We can see that the fake feature has a small weight on it, 'r fakecoefflda', even though it is not associated with survival.
## Fit logistic regression model with Lasso
Now we fit the logistic regression model with Lasso
```{r}
# Fit logistic regression model
glmnetfit <- glmnet(sc_tr,trainclass,family="binomial",alpha=1,nlambda=100,standardize=FALSE)
# Predict testing data
#Slambda is the value of lambda that you would like to predict the response
slambda=.05
mypredictglmnet<-predict(object=glmnetfit,s=slambda,newx=sc_test)
summary(mypredictglmnet)
# get ranking of data for use in AUC caluculation. The raninking is the scalar projections scalar projection of each data point on the weights. It is calculated by lda in the variable x.
rankingglmnet<-as.numeric(mypredictglmnet)
# Compute ROC results
myrocGLMnet <- pROC::roc(testclass,rankingglmnet, levels=c('0','1'))
# compute AUC=Area under ROC curve
myaucGLMnet <- pROC::auc(myrocGLMnet)
myaucGLMnet
# Calcuate confusion matrix to see balanced accuracy. mypredict$class ahs the classes found by LDA.
confusionMatrix(data = as.factor(as.integer(rankingglmnet>0.5)), reference = testclass)
rnorm(1)
```
Now lets look at GLM classification weights for lambda=slambda.
```{r echo=FALSE}
glmweights<-coef(glmnetfit,s=slambda)
glmweights
rnorm(1)
```
Looking at the logistic regression with Lasso with slambda =1 we can see that the model only uses 'sex' and 'age' while maintain AUC of 0.7442 and balanced accuracy of .7460. Other values of slambda will give different results.
## Fit logistic regression model
Now we fit the logistic regression model
```{r}
#make a dataframe with labels since glm likes data frames
sc_train.df<-as.data.frame(sc_tr)
my.data<-cbind(trainclass,sc_train.df)
# Fit logistic regression model using glm
# Formula uses all of the features
glmfit <- glm(trainclass~.,data=my.data,family="binomial")
summary(glmfit)
# Predict testing data
mypredictlr<-predict(glmfit,as.data.frame(sc_test),type="response")
# Compute ROC results
myrocLR <- pROC::roc(testclass,mypredictlr, levels=c('0','1'))
# compute AUC=Area under ROC curve
myaucLR <- pROC::auc(myrocLR)
myaucLR
# Calcuate confusion matrix to see balanced accuracy
# This shows how to call it without using table first.
# Note we have to threhold mypredictlr to get classes.
confusionMatrix(data = as.factor(as.integer(mypredictlr>0.5)), reference = testclass)
rnorm(1)
```
```{r echo=FALSE}
# This is hidden code to display the BA and AUC in the text below
cmLR <- confusionMatrix(data = as.factor(as.integer(mypredictlr>0.5)), reference = testclass)
mybaLR <- cmLR$byClass["Balanced Accuracy"]
rnorm(1)
```
Now lets look at the GLM weights.
```{r}
glmfit$coefficients
# get the coefficient for LDA coefficient for fake
fakecoeffglm<-glmfit$coefficients["fake"]
rnorm(1)
```
From the output of `glm`, we can see that `fake` has a coefficient of `r fakecoeffglm`, but the p-value indicates that this feature is insignificant and could be removed.
The training balanced accuracy of logistic regression is `r mybaLR` and the AUC is `r myaucLR`. The logistic regression and LDA models have very similar performance. Of course the testing set accuracy is what really matters.
```{r}
# Plot ROC curves
ggroc(list(LDA = myroc, LR = myrocLR, LASSO=myrocGLMnet)) +
labs(color = "Method") +
ggtitle("Comparison of test accurcy on Chems-R-Us data")
rnorm(1)
```
**Sensitivity** (also called the *true positive rate* or the *recall*) measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of passengers that survived eople who are correctly identified as having survived).
**Specificity** (also called the *true negative rate*) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of patients who died who are correctly identified as not having died).
The ROC curves for LDA, GLMNET, and LR confirm that they have very similar performance since there are no large differences.